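As the digital economy develops, more and more enterprises use modern information networks as the main carrier of their data resources and transmit data over network communication technology. The pandemic has also pushed more industries to adopt the internet as their primary channel for exchanging and sharing information. Prior research shows that spoken language processing (SLP) techniques for meeting transcripts, such as keyword extraction and summarization, are essential for extracting, organizing, and ranking information, and can significantly improve how efficiently users grasp the important points.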
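This project originates from the General Meeting Understanding and Generation Challenge (MUG challenge), one of the ICASSP 2023 Signal Processing Grand Challenges. The challenge built and released the largest Chinese meeting dataset to date and annotated several SLP tasks on top of the manual meeting transcripts. Its goal is to advance SLP research on meeting text and to address key challenges in this setting, including the diverse spoken-language phenomena of human-to-human interaction and long-document modeling for meetings.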
Competition registration pages:
https://modelscope.cn/competition/12/summary - Topic segmentation
https://modelscope.cn/competition/13/summary - Extractive summarization
https://modelscope.cn/competition/14/summary - Topic title generation
https://modelscope.cn/competition/17/summary - Action item extraction
https://modelscope.cn/competition/18/summary - Keyword extraction
Baseline model training and inference:
https://github.com/alibaba-damo-academy/SpokenNLP
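The Chinese StructBERT pre-trained model is a Chinese natural language understanding model pre-trained on Wikipedia data with the masked language model task. It can be fine-tuned for downstream natural language understanding (NLU) tasks.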
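We extend BERT into a new model, StructBERT, by incorporating language structure information. Two auxiliary tasks let the model learn word-level order and sentence-level order, so that it models language structure better. For details, see the paper "StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding".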
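To make the word-level auxiliary task more concrete, here is a minimal sketch of how one training example could be corrupted. It assumes the trigram shuffling described in the StructBERT paper; the helper name `shuffle_trigram` and the pre-tokenized sample sentence are illustrative assumptions, not code from the StructBERT release.

```python
import random


def shuffle_trigram(tokens, start, rng):
    """Shuffle tokens[start:start + 3]; return (corrupted_tokens, original_span).

    During pre-training the model would see `corrupted_tokens` and be asked
    to recover `original_span` at the shuffled positions.
    """
    span = tokens[start:start + 3]
    if len(span) < 3:
        raise ValueError("need three tokens starting at `start`")
    shuffled = span[:]
    rng.shuffle(shuffled)  # may occasionally leave the order unchanged
    corrupted = tokens[:start] + shuffled + tokens[start + 3:]
    return corrupted, span


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = ["明天", "先", "收集", "用户", "的", "需求"]  # hypothetical tokenization
corrupted, target = shuffle_trigram(tokens, 1, rng)
print(corrupted, target)
```

The sentence-level task works in the same spirit: given a sentence pair, the model predicts whether the second sentence follows the first, precedes it, or comes from another document.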
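This model classifies whether an input sentence from a meeting is an action item. Users can try it on their own input text; see the code example for how to call it.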
Once the ModelScope library is installed, the text-classification capability is available:
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Load the action-item classifier from the ModelScope hub.
text_classification = pipeline(Tasks.text_classification, model='damo/nlp_structbert_alimeeting_action-classification_chinese-base')
output = text_classification("今天会议的第一个结论是明天先收集用户的需求。")
# The result is a dict that holds "scores" and "labels".
print(output)
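As a small follow-up, the sketch below picks the highest-scoring label from a pipeline result. It assumes the result is a dict with parallel `scores` and `labels` lists, which is how the inference code later in this card reads it; the helper `top_label` and the sample dict are hypothetical and hand-written, not real model output.

```python
def top_label(output):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(output["labels"], output["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Hypothetical result, written by hand for illustration only.
example_output = {"scores": [0.91, 0.08, 0.01], "labels": ["是", "否", "None"]}
label, score = top_label(example_output)
print(label, score)
```

Taking the maximum explicitly avoids relying on any particular ordering of the returned lists.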
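The model was trained on a meeting action item extraction dataset. It performs well on meetings and similar content; performance may degrade in other vertical domains.
The model was trained on the competition dataset; for details about the data, see the tags in the right-hand sidebar.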
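For orientation, the fine-tuning script in this card reads `train.txt` and `dev.txt` with one example per line in the form `sentence<TAB>label`, where the label is 0 or 1. The sketch below parses and validates such lines; the helper `parse_example` and both sample lines (including their labels) are made up for illustration and are not taken from the competition data.

```python
def parse_example(line):
    """Split one 'sentence<TAB>label' line into (sentence, int_label)."""
    sentence, label = line.rstrip("\n").split("\t")
    label = int(float(label))  # accepts "1" as well as "1.0"
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label}")
    return sentence, label


# Hypothetical sample lines; the labels are illustrative only.
sample_lines = [
    "今天会议的第一个结论是明天先收集用户的需求。\t1\n",
    "大家好,我们开始开会。\t0\n",
]
examples = [parse_example(line) for line in sample_lines]
print(examples)
```

Validating labels up front surfaces malformed lines before a training run starts.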
Users can fine-tune further starting from this StructBERT pre-trained backbone. The training code is shown below; for more detailed code, see alimeeting4mug.
from datasets import Dataset
import os.path as osp
from modelscope.trainers import build_trainer
from modelscope.msdatasets import MsDataset
from modelscope.utils.hub import read_config
from modelscope.metainfo import Metrics
from modelscope.utils.constant import Tasks
# ---------------------------- Train ---------------------------------
model_id = 'damo/nlp_structbert_backbone_base_std'
WORK_DIR = 'workspace'
data_pth = "meeting_action"
def load_local_data(data_pth):
    # Each line of train.txt / dev.txt is "sentence<TAB>label".
    train_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "train.txt"), "r", encoding="utf-8") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            train_dataset_dict["label"].append(float(label))
            train_dataset_dict["sentence"].append(sentence)
            train_dataset_dict["dataset"].append("meeting")
    eval_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "dev.txt"), "r", encoding="utf-8") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            eval_dataset_dict["label"].append(float(label))
            eval_dataset_dict["sentence"].append(sentence)
            eval_dataset_dict["dataset"].append("meeting")
    return train_dataset_dict, eval_dataset_dict
train_dataset_dict, eval_dataset_dict = load_local_data(data_pth)
train_dataset = MsDataset(Dataset.from_dict(train_dataset_dict)).to_hf_dataset()
eval_dataset = MsDataset(Dataset.from_dict(eval_dataset_dict)).to_hf_dataset()
print(train_dataset)
max_epochs = 5
lr = 2e-5
batch_size = 32
def cfg_modify_fn(cfg):
    cfg.task = Tasks.text_classification
    cfg.train.max_epochs = max_epochs
    cfg.train.optimizer.lr = lr
    cfg.train.dataloader = {
        "batch_size_per_gpu": batch_size,
        "workers_per_gpu": 1
    }
    cfg.evaluation.metrics = [Metrics.seq_cls_metric]
    # Decay the learning rate linearly to zero over all training steps.
    cfg.train.lr_scheduler = {
        'type': 'LinearLR',
        'start_factor': 1.0,
        'end_factor': 0.0,
        'total_iters': int(len(train_dataset) / batch_size) * cfg.train.max_epochs,
        'options': {
            'by_epoch': False
        }
    }
    # Evaluate once at the end of every epoch.
    cfg.train.hooks[-1] = {
        'type': 'EvaluationHook',
        'by_epoch': True,
        'interval': 1
    }
    cfg['dataset'] = {
        'train': {
            'labels': ['否', '是', 'None'],
            'first_sequence': 'sentence',
            'label': 'label',
        }
    }
    return cfg
# Map the numeric label (0.0 / 1.0) to its label name.
def map_labels(examples):
    map_dict = {0: "否", 1: "是"}
    examples['label'] = map_dict[int(examples['label'])]
    return examples
train_dataset = train_dataset.map(map_labels)
eval_dataset = eval_dataset.map(map_labels)
kwargs = dict(
    model=model_id,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir=WORK_DIR,
    cfg_modify_fn=cfg_modify_fn)
trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)
trainer.train()
# ---------------------------- Evaluation ---------------------------------
for i in range(max_epochs):
    # Checkpoints are saved as epoch_1.pth ... epoch_{max_epochs}.pth.
    eval_results = trainer.evaluate(f'{WORK_DIR}/epoch_{i+1}.pth')
    print(f'epoch {i+1} evaluation result:')
    print(eval_results)
# ---------------------------- Inference ---------------------------------
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys
output_list = []
text_classification = pipeline(Tasks.text_classification, model=f'{WORK_DIR}/output')
with open(f'{data_pth}/dev.txt', "r", encoding="utf-8") as f:
    for line in f:
        input_text = line.strip().split("\t")[0]
        output = text_classification(input_text)
        scores = output["scores"]
        # Compare the entries at positions 1 and 2 and keep the higher-scoring label.
        if scores[1] > scores[2]:
            label = output["labels"][1]
        else:
            label = output["labels"][2]
        # Write predictions as "sentence<TAB>1" (是) or "sentence<TAB>0" (否).
        output_list.append(input_text + "\t" + label.replace("是", "1").replace("否", "0"))
with open(f'{WORK_DIR}/test_predict_result.txt', "w", encoding="utf-8") as f:
    f.write("\n".join(output_list))
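The `LinearLR` settings in the training script decay the learning rate from `lr` to zero over `total_iters` optimizer steps. The sketch below reproduces that arithmetic in plain Python, assuming the usual closed form of a linear factor schedule; `linear_lr` is a hypothetical helper and `num_examples` is a made-up dataset size, not the size of the competition training set.

```python
def linear_lr(step, base_lr, total_iters, start_factor=1.0, end_factor=0.0):
    """Learning rate after `step` optimizer steps under a linear factor schedule."""
    progress = min(step, total_iters) / total_iters
    factor = start_factor + (end_factor - start_factor) * progress
    return base_lr * factor


num_examples = 10_000  # hypothetical dataset size
batch_size = 32
max_epochs = 5
# Same formula as in cfg_modify_fn: steps per epoch (rounded down) times epochs.
total_iters = int(num_examples / batch_size) * max_epochs

for step in (0, total_iters // 2, total_iters):
    print(step, linear_lr(step, 2e-5, total_iters))
```

Because the steps per epoch are rounded down, the schedule can reach zero slightly before the last batch when the dataset size is not a multiple of the batch size.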
Evaluation results on the competition dev set:
Pos F1 |
---|
69.43 |
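As a rough illustration of the metric, the sketch below computes F1 on the positive class from gold and predicted 0/1 labels. It assumes that "Pos F1" means the F1 score of the positive ("是", i.e. 1) class; the official competition scorer may differ in details, and the label lists here are toy values rather than model results.

```python
def positive_f1(gold, pred, positive=1):
    """F1 score of the positive class for two equal-length label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
f1 = positive_f1(gold, pred)
print(f1)
```

The predictions written to `test_predict_result.txt` by the inference code use the same 0/1 encoding, so their last column could be scored this way against the dev labels.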
If you find our model helpful, please cite our paper:
@article{wang2019structbert,
title={Structbert: Incorporating language structures into pre-training for deep language understanding},
author={Wang, Wei and Bi, Bin and Yan, Ming and Wu, Chen and Bao, Zuyi and Xia, Jiangnan and Peng, Liwei and Si, Luo},
journal={arXiv preprint arXiv:1908.04577},
year={2019}
}