coROM Chinese Medical Text Representation Model

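Text representation is a core problem in natural language processing (NLP) and plays an important role in many downstream NLP and information retrieval tasks. In recent years, advances in deep learning, and pre-trained language models in particular, have substantially improved the quality of text representations. On both academic benchmarks and in industrial applications, text representation models built on pre-trained language models clearly outperform traditional models based on statistical methods or shallow neural networks. This document focuses on text representation based on pre-trained language models.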

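Text representation example: given an input sentence, the model outputs a continuous vector of fixed dimension:

Input: 上消化道出血手术大约多少时间 (roughly how long does surgery for upper gastrointestinal bleeding take)
Output: [0.16549307, -0.1374592, -0.0132587, …, 0.5855098, -0.340697, 0.08829002]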

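Vector representations of text are typically used in downstream tasks such as text clustering, text similarity computation, and vector-based text retrieval.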

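Dual Encoder Text Representation Model

Text representation models trained on supervised data usually adopt the Dual Encoder framework, illustrated in the accompanying figure. In this framework, the query and document texts are each encoded by a pre-trained language model, and the vector at the [CLS] position is usually taken as the final text representation. Using the labels of the annotated data, the relevance of a query-document pair is measured by the cosine distance between the two vectors.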
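The relevance measure described above can be sketched with NumPy. This is a minimal illustration, not the model's actual scoring code: the function name `cosine_similarity` and the toy 3-dimensional vectors are assumptions standing in for real [CLS] embeddings.

```python
import numpy as np


def cosine_similarity(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    doc_vecs = np.asarray(doc_vecs, dtype=np.float32)
    # Normalize to unit length so the dot product equals the cosine of the angle.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q


# Toy vectors standing in for [CLS] embeddings (hypothetical values).
query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [2.0, 0.0, 2.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
])
sims = cosine_similarity(query, docs)
print(sims)
```

A document pointing in the same direction as the query scores close to 1, and an orthogonal one scores close to 0, regardless of vector length.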

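Usage and Scope

Usage:

Direct inference: compute the vector representation of a given text. The vector dimension is 768.

Scope:

This model can be used for text vector representation in the medical domain and its downstream applications, including similarity computation between two sentences and similarity ranking of multiple candidate documents against a query.

How to Use

On the ModelScope framework, provide the input text (default maximum text length: 128) and call the coROM text representation model through a simple Pipeline. ModelScope wraps a unified interface that exposes single-sentence vector representation, two-sentence text similarity, and multi-candidate similarity computation.

Code Example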

from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

model_id = "damo/nlp_corom_sentence-embedding_chinese-tiny-medical"
pipeline_se = pipeline(Tasks.sentence_embedding,
                       model=model_id)

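# When the input contains both "source_sentence" and "sentences_to_compare", the output includes the vector
# representations of the first sentence in source_sentence and of every sentence in sentences_to_compare,
# plus the similarity between that first sentence and each sentence in sentences_to_compare.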
inputs = {
    'source_sentence': ["上消化道出血手术大约多久"],
    'sentences_to_compare': [
        "上消化道出血手术大约要2-3个小时左右。手术后应观察血压、体温、脉搏、呼吸的变化。污染被服应随时更换，以避免不良刺激。出血停止后按序给予温凉流质、半流质及易消化的软饮食。",
        "胃出血一般住院30-60天。胃出血一般需要住院的时间需要注意根据情况来看，要看是胃溃疡引起，还是有无肝硬化门静脉高压引起的出血的情况，待消化道出血完全停止后病情稳定就可以出院，因此住院时间并不固定",
    ]
}
result = pipeline_se(input=inputs)
print(result)

# {'text_embedding': array([[ 0.16549307, -0.1374592 , -0.0132587 , ...,  0.5855098 ,
#        -0.340697  ,  0.08829002],
#       [ 0.24412914,  0.03988263,  0.00956912, ...,  0.3630614 ,
#        -0.47634274,  0.07621416],
#       [ 0.5064287 ,  0.21114293,  0.02383173, ...,  0.47307107,
#        -0.48332193,  0.18972419]], dtype=float32), 'scores': [75.60596466064453, 62.2199821472168]}


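# When the input contains only source_sentence, the output includes the vector representation of every
# sentence in source_sentence, plus the similarity between the first sentence and each of the other sentences.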
inputs2 = {
    'source_sentence': [
        "上消化道出血手术大约要2-3个小时左右。手术后应观察血压、体温、脉搏、呼吸的变化。污染被服应随时更换，以避免不良刺激。出血停止后按序给予温凉流质、半流质及易消化的软饮食。",
        "胃出血一般住院30-60天。胃出血一般需要住院的时间需要注意根据情况来看，要看是胃溃疡引起，还是有无肝硬化门静脉高压引起的出血的情况，待消化道出血完全停止后病情稳定就可以出院，因此住院时间并不固定",
    ]
}
result = pipeline_se(input=inputs2)
print(result)
# {'text_embedding': array([[ 0.24412914,  0.03988263,  0.00956912, ...,  0.3630614 ,
#        -0.47634274,  0.07621416],
#       [ 0.5064287 ,  0.21114293,  0.02383173, ...,  0.47307107,
#        -0.48332193,  0.18972419]], dtype=float32), 'scores': [57.70827102661133]}

The default vector dimension is 768. Each score in scores is computed as the L2 distance between the two vectors.
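For the query-versus-candidates use case, the returned scores can be paired with the candidate passages and sorted. The sketch below is only an illustration: `rank_candidates` and the short passage labels are hypothetical, and the scores are copied from the sample output above. In that sample, the passage that directly answers the query receives the larger score, so the sketch sorts in descending order; treat that ordering as an assumption to verify on your own data.

```python
def rank_candidates(candidates, scores):
    """Pair each candidate with its score and sort from highest to lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical labels standing in for the two passages in sentences_to_compare.
candidates = ["passage_1_surgery_duration", "passage_2_hospital_stay"]
# Scores copied from the sample pipeline output above.
result = {"scores": [75.60596466064453, 62.2199821472168]}

ranked = rank_candidates(candidates, result["scores"])
for label, score in ranked:
    print(f"{score:.2f}\t{label}")
```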

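Model Limitations and Possible Bias

This model was trained on MultiCPR (medical domain), so its performance may degrade on text from other vertical domains. Users should evaluate it on their own data before deciding how to use it.

Training Procedure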

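Model: a two-tower (dual-encoder) text representation model that uses coROM as its pre-trained language model backbone
Two-stage training: training proceeds in two stages. In the first stage, negative samples are randomly sampled from the officially provided document collection. In the second stage, hard negatives are mined with Dense Retrieval to expand the training data, and the model is retrained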
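The second-stage idea of mining hard negatives can be sketched as follows. This is not the authors' training code: `mine_hard_negatives`, the inner-product scoring, and the toy 2-dimensional embeddings are all assumptions chosen for illustration. For each query, the highest-scoring documents that are not labeled positive are kept as hard negatives.

```python
import numpy as np


def mine_hard_negatives(query_embs, doc_embs, positive_ids, top_k=2):
    """For each query, return the ids of the top-scoring documents not labeled positive."""
    scores = query_embs @ doc_embs.T  # inner-product relevance (an assumption)
    hard_negatives = []
    for query_index, row in enumerate(scores):
        ranked = np.argsort(-row)  # document ids from highest to lowest score
        negatives = [int(d) for d in ranked if int(d) not in positive_ids[query_index]]
        hard_negatives.append(negatives[:top_k])
    return hard_negatives


# Toy embeddings from a hypothetical first-stage model.
query_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embs = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
positive_ids = [{0}, {2}]  # labeled positive document ids for each query

hard = mine_hard_negatives(query_embs, doc_embs, positive_ids, top_k=2)
print(hard)
```

Documents that score highly yet are not positives are harder to distinguish than random samples, which is why they are added to the training data for the second stage.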

The model was trained on 4 NVIDIA V100 GPUs with the following hyperparameter settings:

train_epochs=3
max_sequence_length=128
batch_size=64
learning_rate=5e-6
optimizer=AdamW

Training Example Code

# Must be run in a GPU environment
# Loading the dataset may fail due to network issues; if so, try rerunning the code
from modelscope.metainfo import Trainers
from modelscope.msdatasets import MsDataset
from modelscope.trainers import build_trainer
import tempfile
import os

tmp_dir = tempfile.TemporaryDirectory().name
if not os.path.exists(tmp_dir):
    os.makedirs(tmp_dir)

# load dataset
ds = MsDataset.load('dureader-retrieval-ranking', 'zyznull')
train_ds = ds['train'].to_hf_dataset()
dev_ds = ds['dev'].to_hf_dataset()
model_id = 'damo/nlp_corom_sentence-embedding_chinese-tiny-medical'
def cfg_modify_fn(cfg):
    cfg.task = 'sentence-embedding'
    cfg['preprocessor'] = {'type': 'sentence-embedding','max_length': 256}
    cfg['dataset'] = {
        'train': {
            'type': 'bert',
            'query_sequence': 'query',
            'pos_sequence': 'positive_passages',
            'neg_sequence': 'negative_passages',
            'text_fileds': ['text'],
            'qid_field': 'query_id'
        },
        'val': {
            'type': 'bert',
            'query_sequence': 'query',
            'pos_sequence': 'positive_passages',
            'neg_sequence': 'negative_passages',
            'text_fileds': ['text'],
            'qid_field': 'query_id'
        },
    }
    cfg['train']['neg_samples'] = 4
    cfg['evaluation']['dataloader']['batch_size_per_gpu'] = 30
    cfg.train.max_epochs = 1
    cfg.train.train_batch_size = 4
    return cfg 
kwargs = dict(
    model=model_id,
    train_dataset=train_ds,
    work_dir=tmp_dir,
    eval_dataset=dev_ds,
    cfg_modify_fn=cfg_modify_fn)
trainer = build_trainer(name=Trainers.nlp_sentence_embedding_trainer, default_args=kwargs)
trainer.train()

Model Evaluation

We evaluate the model mainly in the text vector retrieval setting. Retrieval results on MultiCPR (medical domain) are as follows:

Model	MRR@10	Recall@1000
BM25	18.69	48.20
CoROM-base	33.91	73.30
CoROM-tiny	22.78	64.54
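For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them. It is not the evaluation script used for the reported numbers: the function names and the toy rankings are hypothetical, and the results are fractions in [0, 1], whereas the table appears to report percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Average fraction of relevant documents retrieved within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        total += len(set(ranked[:k]) & relevant) / len(relevant)
    return total / len(ranked_ids)


# Toy retrieval output for two queries (hypothetical document ids).
ranked_ids = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant_ids = [{"d1"}, {"d5"}]

mrr = mrr_at_k(ranked_ids, relevant_ids, k=10)
recall = recall_at_k(ranked_ids, relevant_ids, k=3)
print(mrr, recall)
```

In the toy data, the first query finds its relevant document at rank 2 and the second query finds none, so both averages are taken over one hit and one miss.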

Citation

@inproceedings{Long2022MultiCPRAM,
  title={Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval},
  author={Dingkun Long and Qiong Gao and Kuan Zou and Guangwei Xu and Pengjun Xie and Rui Guo and Jianfeng Xu and Guanjun Jiang and Luxi Xing and P. Yang},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series = {SIGIR 22},
  year={2022}
}

Clone with HTTP

 git clone https://www.modelscope.cn/damo/nlp_corom_sentence-embedding_chinese-tiny-medical.git