Datasets

You can find some interesting datasets for dialogue systems on this page.


1. LCCC

About

The LCCC (Large-scale Cleaned Chinese Conversation) corpus (LCCC) is a cleaned open-domain dialogue dataset. A strict data cleaning pipeline is designed to ensure the quality of LCCC. It contains 7,273,804 single-turn dialogues and 4,733,955 multi-turn dialogues. You can find more details from this paper: A Large-Scale Chinese Short-Text Conversation Dataset.

Download

You can find the download link of LCCC from this repo.

Reference

Please cite the following paper if you find this dataset useful:

@inproceedings{wang2020large,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020}
}

2. PersonalDialog

About

The PersonalDialog dataset is a large-scale multi-turn Chinese dialogue dataset containing various traits from a large number of speakers. The whole dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker. This large-scale dataset will facilitate not only the study of personalized dialogue generation but also other researches on sociolinguistics or social science. You can find more details from this paper: Personalized Dialogue Generation with Diversified Traits

A Sample Dialogue from PersonaDilaog

A sample Dialogue from PersonaDilaog

Download

The PersonalDialog dataset contains various persona-related information collected from the web. You can find some sample data here. You need to email me if you want a download link.

Reference

Please cite the following papers if you find this dataset useful:

@article{zheng2019personalized,
  title   = {Personalized dialogue generation with diversified traits},
  author  = {Zheng, Yinhe and Chen, Guanyi and Huang, Minlie and Liu, Song and Zhu, Xuan},
  journal = {arXiv preprint arXiv:1901.09672},
  year    = {2019}
}

@inproceedings{zheng2020pre,
  title     = {A pre-training based personalized dialogue generation model with persona-sparse data},
  author    = {Zheng, Yinhe and Zhang, Rongsheng and Huang, Minlie and Mao, Xiaoxi},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {34},
  number    = {05},
  pages     = {9693--9700},
  year      = {2020}
}

3. MMChat

About

MMCHAT contains image-grounded dialogues collected from real conversations on social media. Specifically, MMCHAT contains 120.84K sessions of open-domain dialogues (filtered out of 32.4M sessions of raw dialogues) and 204.32K corresponding images.

A Sample Dialogue from MMChat

A sample Dialogue from MMChat

Download

You can find the download link of MMChat from this repo.

Reference

Please cite the following paper if you find this dataset useful:

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}