New DialogStudio Framework Unifies Diverse Conversational AI Datasets. Key information:
- Over 80 public conversational datasets aggregated
- Encompasses over 15,000 unique dialogues
- Contains over 3 million dialogue turns
- Covers over 1,500 hours of conversational data
- Spans a wide array of domains
- Includes over 10,000 dialogues from MultiWOZ 2.2
- Contains 8,000 conversations from CoQA
- Incorporates 4,000 dialogues from Schema-Guided Dialogue
- Consolidates data from over 5,000 unique participants
- Estimated 500,000+ word vocabulary coverage
- Continually expanding as new datasets become available
The ability of conversational AI systems to handle diverse real-world dialogues remains limited, partly due to fragmentation across conversational datasets and a lack of comprehensive training data. To help address this challenge, researchers at Salesforce AI have introduced DialogStudio, an expansive new framework that aggregates and standardizes over 80 publicly available conversational datasets. As detailed in a new paper, DialogStudio encompasses dialogues across a wide range of domains and tasks, creating the largest and most diverse collection of its kind.
Spanning six key categories – open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded conversations – DialogStudio provides broad coverage of dialogues sourced from existing research datasets. The researchers have unified these varied datasets into a consistent JSON format while retaining the crucial original annotations of each dataset. Helpful additions include tailored prompts for selected datasets to support instruction-aware fine-tuning, as well as license identification for all incorporated datasets.
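For readers who want a feel for the unified format, the minimal sketch below loads one of the aggregated datasets through the Hugging Face `datasets` library and prints the standardized fields. The Hub repository name, config name, and per-turn field names used here are assumptions for illustration; the project documentation lists the exact identifiers.

```python
# Minimal sketch of inspecting DialogStudio's unified JSON format.
# Assumptions: the Hub repository "Salesforce/dialogstudio", the config name
# "MULTIWOZ2_2", and the field names "log", "user utterance", and
# "system response" are illustrative and should be checked against the docs.
from datasets import load_dataset

dialogs = load_dataset("Salesforce/dialogstudio", "MULTIWOZ2_2", split="train")

example = dialogs[0]
print(example.keys())  # unified top-level fields (dialogue ids, prompts, external knowledge, ...)

for turn in example["log"]:  # per-turn records with the original annotations retained
    print("User:  ", turn["user utterance"])
    print("System:", turn["system response"])
```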
Early experiments demonstrate the benefits of pre-training conversational models on this diverse DialogStudio collection. The researchers trained AI assistants such as DialogOhana on the collection, with model sizes ranging from 770M to 3B parameters. In zero-shot and few-shot evaluations, these DialogStudio-trained models displayed stronger generalization than previous baselines. For example, they outperformed state-of-the-art models such as Flan-T5 and GODEL on response generation tasks using the CoQA and MultiWOZ benchmarks. The DialogStudio models also achieved top results on instruction-tuning tasks from the Super-NaturalInstructions benchmark.
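To give a rough picture of what such a zero-shot response-generation setup looks like, the sketch below runs an instruction-style prompt through a seq2seq model with the Hugging Face `transformers` library. The publicly available `google/flan-t5-large` checkpoint (roughly 770M parameters) is used only as a stand-in baseline, and the prompt wording is hypothetical; a DialogStudio-trained checkpoint would be swapped in for an actual comparison.

```python
# Zero-shot dialogue response generation with a seq2seq model (illustrative only).
# "google/flan-t5-large" is a stand-in baseline of comparable size (~770M params);
# a released DialogStudio checkpoint would replace it in a real evaluation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Instruction-style prompt followed by the dialogue history (wording is hypothetical).
prompt = (
    "Instruction: You are a helpful booking assistant. Continue the dialogue.\n"
    "User: I need a cheap hotel in the centre of town for two nights.\n"
    "Assistant:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```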
By aggregating such a wide array of conversational data into a unified framework, DialogStudio represents the largest and likely most comprehensive public collection of its kind to date. Standardizing such diverse data could bolster both training and evaluation for conversational AI systems. The incorporated variety also supports research into individual datasets and specialized tasks, as well as large-scale pre-training. By open-sourcing both the collection and the trained models, the researchers aim to spur broader progress on persistent challenges such as generalization in conversational AI. While this is just the first release, DialogStudio already provides an invaluable resource for advancing research and development of more robust, real-world conversational systems.