Goal-oriented document-grounded dialogue systems enable end users to interactively query domain-specific information grounded in a given set of documents. Querying document knowledge through conversational systems continues to attract attention from both the research and industrial communities across a wide range of applications. Previous work has addressed document-grounded dialogue systems in English and Chinese, leaving other languages far less explored; as a result, large communities of users are denied access to automated services and information. We aim to extend this effort by introducing the Third ACL DialDoc Workshop shared task, which involves documents and dialogues in diverse languages. We present this multilingual document-grounded dialogue (DGD) challenge to encourage researchers to explore effective solutions for (1) transferring a DGD model from a high-resource language to a low-resource language, and (2) developing a DGD model capable of providing multilingual responses given multilingual documents.
Specifically, we provide 797 dialogues in Vietnamese (3,446 turns), 816 dialogues in French (3,510 turns), and a corpus of 17,272 paragraphs, where each dialogue turn is grounded in a paragraph from the corpus.
We also organize the currently available Chinese and English document-grounded dialogue data. We hope that participants can leverage linguistic similarities (for example, a large number of Vietnamese words are derived from Chinese, and English and French both belong to the Indo-European language family) to improve their models' performance in Vietnamese and French.
The task objective is therefore to retrieve relevant paragraphs from a corpus based on the dialogue history and to generate a response. To address this task, we provide a baseline model consisting of three modules: a retrieval module that retrieves the top-K relevant paragraphs from the corpus based on the dialogue history, a ranking module that selects the top-N most relevant paragraphs, and a generation module that concatenates the selected paragraphs with the dialogue history and generates a response, as sketched below.
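As a rough illustration, the following sketch shows how the three modules fit together. The retriever, reranker, and generator objects and their method names are hypothetical stand-ins for the corresponding modules in our baseline code, not its actual API.

def respond(dialogue_history, corpus, retriever, reranker, generator,
            top_k=20, top_n=5):
    # 1) Retrieval: fetch the top-K paragraphs most relevant to the history.
    candidates = retriever.retrieve(dialogue_history, corpus, k=top_k)
    # 2) Ranking: re-score the candidates and keep the top-N.
    top_paragraphs = reranker.rank(dialogue_history, candidates)[:top_n]
    # 3) Generation: concatenate the top-N paragraphs with the dialogue
    #    history and generate a response.
    generator_input = ' '.join(top_paragraphs + [dialogue_history])
    return generator.generate(generator_input)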
Given the lack of multilingual pre-trained models for the retrieval, ranking, and generation modules, we have constructed weakly annotated datasets of 100,000 examples each in Chinese, English, Vietnamese, and French to pre-train each module.
This project provides the pre-trained retrieval model, which uses a dual-encoder architecture based on XLM-RoBERTa; participants can fine-tune it using our baseline code. We also provide the fine-tuned retrieval model. The model uses two encoders to encode documents and queries separately, and the checkpoint is pre-trained on a corpus covering Chinese, English, French, and Vietnamese.
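Conceptually, the dual encoder embeds the query and each paragraph independently and scores them with a dot product. The snippet below is a minimal sketch of this scoring scheme using the public xlm-roberta-base checkpoint from Hugging Face; the [CLS]-style pooling and the specific checkpoints are illustrative assumptions, not the exact configuration of our released model.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative dual encoder: separate query and document encoders
# initialized from XLM-RoBERTa. Pooling and checkpoints are assumptions.
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
query_encoder = AutoModel.from_pretrained('xlm-roberta-base')
doc_encoder = AutoModel.from_pretrained('xlm-roberta-base')

def embed(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    return hidden[:, 0]  # take the first-token vector as the text embedding

query_vecs = embed(query_encoder, ['<last_turn>what happens if a child has appendicitis'])
doc_vecs = embed(doc_encoder, ['Appendicitis typically begins with pain near the navel ...',
                               'Vaccination schedules differ across countries ...'])
scores = query_vecs @ doc_vecs.T  # dot-product relevance scores
print(scores)

Because the two encoders are independent, paragraph embeddings can be pre-computed and indexed offline, so each incoming query needs only a single encoder forward pass at inference time.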
The model is readily available on ModelScope.
We do not recommend using the pre-trained model directly for predicting results, as it has only been pre-trained and has not been fine-tuned on high-quality, manually labeled data.
For information on fine-tuning the model, please refer to our Github repository.
from modelscope.pipelines import pipeline

# Load the pre-trained document-grounded dialogue retrieval pipeline.
pipe_ins = pipeline('document-grounded-dialog-retrieval', model='DAMO_ConvAI/nlp_convai_retrieval_pretrain')

# Each query is the serialized dialogue history, most recent turn first,
# with <last_turn>, <system>, and <user> marking who spoke each turn.
param = {
    'query': [
        # "I want to know what happens if my child gets appendicitis"
        '<last_turn>我想知道孩子如果出现阑尾炎的话会怎么样',
        # "It seems to start from the navel and then move to the lower right" /
        # "Could you describe your child's condition?" /
        # "I want to know what happens if my child gets appendicitis?"
        '<last_turn>好像是从肚脐开始,然后到右下方<system>您可以描述一下孩子的情况吗?<user>我想知道孩子如果出现阑尾炎的话会怎么样?',
    ]
}
print(pipe_ins(param))
The model is trained on the datasets described above and may exhibit bias; please evaluate it for your own use case before deciding how to use it.
The fine-tuning dataset is available at:
If you find this model helpful, please consider citing the following related paper:
@inproceedings{fu-etal-2022-doc2bot,
title = "{D}oc2{B}ot: Accessing Heterogeneous Documents via Conversational Bots",
author = "Fu, Haomin and
Zhang, Yeqin and
Yu, Haiyang and
Sun, Jian and
Huang, Fei and
Si, Luo and
Li, Yongbin and
Nguyen, Cam Tu",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.131",
pages = "1820--1836",
}