Document-grounded dialogue

Goal-oriented document-grounded dialogue (DGD) systems enable end users to interactively query domain-specific information based on the given documents. The task of querying document knowledge via conversational systems continues to attract attention from both the research and industrial communities for various applications. Previous work addressed document-grounded dialogue in English and Chinese, leaving other languages less well explored. As a result, large communities of users are prevented from accessing automated services and information. We aim to extend this effort by introducing the Third ACL DialDoc Workshop shared task, which involves documents and dialogues in diverse languages. We present this multilingual DGD challenge to encourage researchers to explore effective solutions for (1) transferring a DGD model from a high-resource language to a low-resource language; and (2) developing a DGD model that is capable of providing multilingual responses given multilingual documents.

Specifically, we provide 797 dialogues in Vietnamese (3,446 turns), 816 dialogues in French (3,510 turns), and a corpus of 17,272 paragraphs, where each dialogue turn is grounded in a paragraph from the corpus.
We also organize the currently available Chinese and English document-grounded dialogue data. We hope that participants can leverage linguistic similarities to improve their models' performance in Vietnamese and French: for example, a large number of Vietnamese words are derived from Chinese, and English and French both belong to the Indo-European language family.

Model Description

The task objective is to retrieve relevant paragraphs from a corpus based on the dialogue history and to generate a response grounded in them. As a starting point, we provide a baseline model consisting of three modules: a retrieval module that fetches the top-K relevant paragraphs from the corpus based on the dialogue history, a ranking module that selects the top-N most relevant paragraphs among them, and a generation module that takes these paragraphs concatenated with the dialogue history and generates a response.
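The three-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the baseline's actual code: the token-overlap and Jaccard scorers are toy stand-ins for the trained retrieval and ranking modules, the `history: ... passages: ...` prompt layout is a hypothetical format, and the `generate` callable is a placeholder for the mT5-based generation module.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query: str, paragraph: str) -> float:
    """Toy retrieval score: fraction of query tokens found in the paragraph."""
    q = tokenize(query)
    return len(q & tokenize(paragraph)) / len(q) if q else 0.0


def jaccard_score(query: str, paragraph: str) -> float:
    """Toy ranking score: Jaccard similarity between query and paragraph."""
    q, p = tokenize(query), tokenize(paragraph)
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def retrieve(history: List[str], corpus: List[str], k: int) -> List[str]:
    """Stage 1: return the top-K paragraphs for the dialogue history."""
    query = " ".join(history)
    ranked = sorted(corpus, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def rerank(history: List[str], candidates: List[str], n: int) -> List[str]:
    """Stage 2: re-score the retrieved candidates and keep the top-N."""
    query = " ".join(history)
    ranked = sorted(candidates, key=lambda p: jaccard_score(query, p), reverse=True)
    return ranked[:n]


def build_generator_input(history: List[str], passages: List[str]) -> str:
    """Stage 3 input: dialogue history concatenated with the kept passages.

    The exact layout is an assumption made for illustration.
    """
    return "history: " + " ".join(history) + " passages: " + " ".join(passages)


def respond(
    history: List[str],
    corpus: List[str],
    generate: Callable[[str], str],
    k: int = 20,
    n: int = 5,
) -> str:
    """Run retrieval, ranking, and generation in sequence."""
    candidates = retrieve(history, corpus, k)
    passages = rerank(history, candidates, n)
    return generate(build_generator_input(history, passages))
```

In practice, `generate` would wrap a sequence-to-sequence model call; passing a simple function such as `lambda prompt: prompt` is enough to inspect which paragraphs reach the generator. The default values of `k` and `n` are arbitrary placeholders.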

Considering the lack of pre-trained models for the current retrieval, ranking, and generation models in multiple languages, we have constructed weakly annotated data sets containing 100,000 examples each in Chinese, English, Vietnamese, and French to pre-train each module.

This project provides the generation module, which is based on the mT5 architecture.

**We initialized the parameters using mT5 and pre-trained the model on weakly supervised data in four languages: English, Chinese, Vietnamese, and French. Finally, we fine-tuned the model on annotated data in Vietnamese and French.**

Usage and Scope of Application

This model is mainly used to generate a response to a query. Users can try the model on various types of queries themselves. Please refer to the code example for the specific calling method.

Usage

The model is available on ModelScope.
For information on fine-tuning the model, please refer to our GitHub repository.

Limitations and Possible Biases

The model is trained on the datasets described above and may reflect biases present in that data. Please evaluate the model yourself before deciding how to use it.

Datasets

The fine-tuning dataset is available at:

Training

For information on fine-tuning the model, please refer to our GitHub repository.

Related Papers and Citations

If you find this model helpful, please consider citing the following related paper:

@inproceedings{fu-etal-2022-doc2bot,
    title = "{D}oc2{B}ot: Accessing Heterogeneous Documents via Conversational Bots",
    author = "Fu, Haomin  and
      Zhang, Yeqin  and
      Yu, Haiyang  and
      Sun, Jian  and
      Huang, Fei  and
      Si, Luo  and
      Li, Yongbin  and
      Nguyen, Cam Tu",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.131",
    pages = "1820--1836",
}