ICASSP2023 MUG Challenge Track5 Action Item Extraction Baseline

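Competition and Background

As the digital economy continues to develop, more and more enterprises are adopting modern information networks as the primary carrier of their data resources and transmitting data through network communication technology. The pandemic has likewise pushed a growing number of industries to rely on the internet as their main channel for exchanging and sharing information. Prior research shows that spoken language processing (SLP) techniques for meeting transcripts, such as keyword extraction and summarization, are essential for extracting, organizing, and ranking information, and can significantly improve how efficiently users grasp the important content.

This project originates from the General Meeting Understanding and Generation Challenge (MUG challenge), one of the ICASSP2023 Signal Processing Grand Challenges. The challenge built and released the largest Chinese meeting dataset to date and annotated it for multiple SLP tasks based on manual transcripts of the meetings. Its goal is to advance SLP research on meeting text and to address several key challenges in that setting, including the diverse spoken-language phenomena that arise in human-to-human interaction and the modeling of long documents in meeting scenarios.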

Competition registration pages:

https://modelscope.cn/competition/12/summary - Topic segmentation

https://modelscope.cn/competition/13/summary - Extractive summarization

https://modelscope.cn/competition/14/summary - Topic title generation

https://modelscope.cn/competition/17/summary - Action item extraction

https://modelscope.cn/competition/18/summary - Keyword extraction

Baseline model training and inference:

https://github.com/alibaba-damo-academy/SpokenNLP

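Introduction to the StructBERT-based Chinese Base Pre-trained Model

The StructBERT Chinese Base pre-trained model is a Chinese natural language understanding model pre-trained on Wikipedia data with the masked language model task. It can be used to train downstream natural language understanding (NLU) tasks.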

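Model Description

We extend BERT into a new model, StructBERT, by incorporating language structure information. Two auxiliary tasks let the model learn word-level (character-level in Chinese) and sentence-level ordering information, so that it models language structure better. For details, see the paper StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.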
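
As a rough illustration of the word-level ordering task, the sketch below builds one training example by shuffling a short span of tokens; a model would then be trained to recover the original order at those positions. The function name `shuffle_span`, the span length of 3, and the seed handling are assumptions made for illustration and are not the authors' implementation.

```python
import random


def shuffle_span(tokens, span_len=3, seed=None):
    """Return (corrupted, start): a copy of `tokens` with one span shuffled.

    `start` is the index where the shuffled span begins, or None when the
    sequence is shorter than `span_len` and is returned unchanged.
    """
    rng = random.Random(seed)
    if len(tokens) < span_len:
        return list(tokens), None
    start = rng.randrange(len(tokens) - span_len + 1)
    span = list(tokens[start:start + span_len])
    rng.shuffle(span)  # may occasionally leave the span in its original order
    corrupted = list(tokens[:start]) + span + list(tokens[start + span_len:])
    return corrupted, start


tokens = ["明", "天", "先", "收", "集", "用", "户", "需", "求"]
corrupted, start = shuffle_span(tokens, seed=0)
# The prediction target for the shuffled positions is the original span.
target = tokens[start:start + 3]
print(corrupted, target)
```

Only the selected span changes; every other position keeps its original token, so the reconstruction loss would be computed on the shuffled positions alone.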

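Intended Use and Scope

This model is mainly used to classify whether an input sentence from a meeting contains an action item. Users can try it on inputs of their own. See the code example for how to call it.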

How to Use

Once the ModelScope library is installed, the text-classification capability is available.

Prediction Code Example

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

text_classification = pipeline(Tasks.text_classification, model='damo/nlp_structbert_alimeeting_action-classification_chinese-base')
output = text_classification("今天会议的第一个结论是明天先收集用户的需求。")
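
The pipeline returns a dictionary; the inference code later in this card reads its `scores` and `labels` fields. A minimal sketch of picking the highest-scoring label from a result of that shape follows. The helper `top_label` and the sample dictionary are hypothetical, and no particular ordering of the returned lists is assumed.

```python
def top_label(output):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(output["labels"], output["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Hypothetical result with the same keys as the pipeline output
sample_output = {"scores": [0.08, 0.91, 0.01], "labels": ["否", "是", "None"]}
label, score = top_label(sample_output)
is_action_item = label == "是"  # "是" means yes, "否" means no
```

Pairing each label with its score before taking the maximum avoids relying on a fixed position in the list.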

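Model Limitations and Possible Biases

The model is trained on a meeting action item extraction dataset and performs well on meetings and similar content; performance may degrade in other vertical domains.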

Training Data

This model was trained on the competition dataset. For details of the data, see the tag panel on the right.
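
The training script in this card reads `train.txt` and `dev.txt` as tab-separated lines holding a sentence and a 0/1 label. A minimal sketch of that format and a parser for it follows; the helper name `parse_line` is hypothetical, and the sample lines and their labels are made up for illustration rather than taken from the competition data.

```python
def parse_line(line):
    """Split one tab-separated 'sentence<TAB>label' line into (sentence, label)."""
    sentence, label = line.rstrip("\n").split("\t")
    return sentence, int(float(label))


# Illustrative lines only; these are not taken from the competition data
sample_lines = [
    "今天会议的第一个结论是明天先收集用户的需求。\t1\n",
    "大家好，我们开始开会。\t0\n",
]
examples = [parse_line(line) for line in sample_lines]
```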

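Model Training Procedure

Users can further fine-tune from this StructBERT pre-trained backbone model. The training code is as follows; for more detailed code, see alimeeting4mug.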

from datasets import Dataset
import os.path as osp
from modelscope.trainers import build_trainer
from modelscope.msdatasets import MsDataset
from modelscope.utils.hub import read_config
from modelscope.metainfo import Metrics
from modelscope.utils.constant import Tasks



# ---------------------------- Train ---------------------------------

model_id = 'damo/nlp_structbert_backbone_base_std'
WORK_DIR = 'workspace'
data_pth = "meeting_action"


def read_split(file_path):
    # Each line holds a sentence and a 0/1 label separated by a tab.
    dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            dataset_dict["label"].append(float(label))
            dataset_dict["sentence"].append(sentence)
            dataset_dict["dataset"].append("meeting")
    return dataset_dict


def load_local_data(data_pth):
    # Load the train and dev splits from the local data directory.
    train_dataset_dict = read_split(osp.join(data_pth, "train.txt"))
    eval_dataset_dict = read_split(osp.join(data_pth, "dev.txt"))
    return train_dataset_dict, eval_dataset_dict


train_dataset_dict, eval_dataset_dict = load_local_data(data_pth)
train_dataset = MsDataset(Dataset.from_dict(train_dataset_dict)).to_hf_dataset()
eval_dataset = MsDataset(Dataset.from_dict(eval_dataset_dict)).to_hf_dataset()
print(train_dataset)

max_epochs = 5
lr = 2e-5
batch_size = 32
def cfg_modify_fn(cfg):
    cfg.task = Tasks.text_classification

    cfg.train.max_epochs = max_epochs
    cfg.train.optimizer.lr = lr
    cfg.train.dataloader = {
        "batch_size_per_gpu": batch_size,
        "workers_per_gpu": 1
    }

    cfg.evaluation.metrics = [Metrics.seq_cls_metric]
    cfg.train.lr_scheduler = {
        'type': 'LinearLR',
        'start_factor': 1.0,
        'end_factor': 0.0,
        'total_iters':
            int(len(train_dataset) / batch_size) * cfg.train.max_epochs,
        'options': {
            'by_epoch': False
        }
    }
    cfg.train.hooks[-1] = {
        'type': 'EvaluationHook',
        'by_epoch': True,
        'interval': 1
    }
    cfg['dataset'] = {
        'train': {
            'labels': ['否', '是', 'None'],
            'first_sequence': 'sentence',
            'label': 'label',
        }
    }
    return cfg

# Map the numeric label (0/1) to its label string ('否' = no, '是' = yes)
def map_labels(examples):
    map_dict = {0: "否", 1: "是"}
    examples['label'] = map_dict[int(examples['label'])]
    return examples

train_dataset = train_dataset.map(map_labels)
eval_dataset = eval_dataset.map(map_labels)

kwargs = dict(
    model=model_id,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir=WORK_DIR,
    cfg_modify_fn=cfg_modify_fn)


trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)

trainer.train()

# ---------------------------- Evaluation ---------------------------------

for i in range(max_epochs):
    # Checkpoints are saved as epoch_1.pth ... epoch_{max_epochs}.pth
    eval_results = trainer.evaluate(f'{WORK_DIR}/epoch_{i+1}.pth')
    print(f'epoch {i+1} evaluation result:')
    print(eval_results)

# ---------------------------- Inference ---------------------------------
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

output_list = []
text_classification = pipeline(Tasks.text_classification, model=f'{WORK_DIR}/output')
with open(f'{data_pth}/dev.txt', "r", encoding="utf-8") as f:
    for line in f:
        # The sentence is the first tab-separated field
        input_text = line.strip().split("\t")[0]
        output = text_classification(input_text)
        scores = output["scores"]
        # Keep whichever of the entries at positions 1 and 2 has the higher score
        if scores[1] > scores[2]:
            label = output["labels"][1]
        else:
            label = output["labels"][2]
        # Write the label back as 1 ('是') or 0 ('否')
        output_list.append(input_text + "\t" + label.replace("是", "1").replace("否", "0"))
with open(f'{WORK_DIR}/test_predict_result.txt', "w", encoding="utf-8") as f:
    f.write("\n".join(output_list))
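
The `LinearLR` scheduler configured in `cfg_modify_fn` decays the learning rate from `lr` toward 0 over `total_iters` steps. A minimal sketch of that decay in plain Python follows, assuming the multiplicative factor is interpolated linearly between `start_factor` and `end_factor` as in PyTorch's `LinearLR`. The function name `linear_lr` and the dataset size are hypothetical; the other values mirror the script above.

```python
def linear_lr(base_lr, step, total_iters, start_factor=1.0, end_factor=0.0):
    """Learning rate after `step` optimizer steps under linear decay."""
    progress = min(step, total_iters) / total_iters
    factor = start_factor + (end_factor - start_factor) * progress
    return base_lr * factor


# Hypothetical dataset size, chosen only to make the arithmetic easy to follow
num_train_examples = 6400
batch_size = 32
max_epochs = 5
total_iters = int(num_train_examples / batch_size) * max_epochs  # 200 * 5

lr_start = linear_lr(2e-5, 0, total_iters)
lr_half = linear_lr(2e-5, total_iters // 2, total_iters)
lr_end = linear_lr(2e-5, total_iters, total_iters)
```

With `by_epoch` set to False the factor is updated every iteration, which is why `total_iters` is computed from the number of batches per epoch times the number of epochs.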

Evaluation and Results

Evaluation results of the model on the competition dev set:

Pos F1
69.43
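
Pos F1 is read here as the F1 score of the positive class, that is, sentences labeled 1 because they contain an action item. A minimal sketch of that computation follows, assuming this reading of the metric; the function name `positive_f1` and the toy labels are illustrative and unrelated to the reported score.

```python
def positive_f1(gold, pred, positive=1):
    """F1 score of the `positive` class from parallel gold/predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
score = positive_f1(gold, pred)
```

Scoring only the positive class keeps the metric from being dominated by the many sentences that contain no action item.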