StructBERT Action Item Extraction - Chinese Spoken Language - Meeting Domain
A baseline model for meeting action item extraction, based on StructBERT further pre-trained on a large amount of spoken-language data

Introduction to the StructBERT Chinese spoken-language base pre-trained model

The StructBERT-spoken-base pre-trained model was obtained by continuing to pre-train the structbert-base Chinese natural language understanding model on a large amount of spoken-language data, using MLM plus spoken-language-oriented pre-training objectives. It can be used to train downstream NLU (natural language understanding) tasks.

Model Description

By incorporating language structure information, we extend BERT into a new model, StructBERT. Two auxiliary pre-training tasks teach the model word-level and sentence-level ordering information, so that it models language structure more faithfully. For details, see the paper StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.

On top of the StructBERT training objectives, we add a new objective to better handle the characteristics of spoken language: words in a sentence are randomly replaced with other words drawn from the same sentence, and the model predicts the original words from the context.
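
As an illustration only (the actual pre-training code is not released here, and the replacement ratio and sampling strategy below are assumptions), the corruption step of this objective could look roughly like the following, with the model then trained to recover the original token at each replaced position:

import random

def corrupt_for_spoken_objective(tokens, replace_ratio=0.15):
    # Randomly replace some tokens with other tokens drawn from the same
    # sentence; the training target is the original token at each replaced
    # position. (Illustrative sketch; ratio and sampling are assumptions.)
    corrupted = list(tokens)
    targets = {}  # position -> original token to be predicted
    for i, tok in enumerate(tokens):
        if len(tokens) > 1 and random.random() < replace_ratio:
            j = random.choice([k for k in range(len(tokens)) if k != i])
            corrupted[i] = tokens[j]
            targets[i] = tok
    return corrupted, targets

print(corrupt_for_spoken_objective(list("今天会议的第一个结论是明天先收集用户的需求")))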

Intended Use and Scope

This model is mainly used to automatically determine whether an input sentence contains an action item. See the code examples below for how to call it.

How to Use

After installing the ModelScope library, you can use the text-classification capability.
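
For example, the library can typically be installed from PyPI (additional NLP dependencies may be required depending on your environment):

pip install modelscope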

Inference Example

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

text_classification = pipeline(Tasks.text_classification, model='damo/nlp_structbert_alimeeting_action-classification_chinese-base')
output = text_classification("今天会议的第一个结论是明天先收集用户的需求。")
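
The pipeline returns a dict of candidate labels and their scores (the same "scores" / "labels" structure used in the fine-tuning script further below); a minimal way to print the top prediction, assuming that structure:

# "scores" and "labels" are parallel lists; pick the highest-scoring label
scores = output[OutputKeys.SCORES]
labels = output[OutputKeys.LABELS]
print(output)
print("predicted label:", labels[scores.index(max(scores))])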

Model Limitations and Possible Biases

The model is trained on meeting action-item extraction data, so it performs well on meetings and similar content; performance may drop in other vertical domains.

Training Data

The model was trained on a competition dataset; see the tabs on the right for the data itself.
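
The fine-tuning script in the next section reads tab-separated train.txt and dev.txt files from a meeting_action directory, one sentence and a 0/1 label per line (1 = contains an action item). A purely hypothetical example of the format:

今天会议的第一个结论是明天先收集用户的需求。	1
这个方案大家还有什么其他意见吗？	0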

Model Training

Users can further fine-tune this StructBERT pre-trained backbone for the task. The training code is shown below; for more detailed code, see alimeeting4mug.

from datasets import Dataset
import os.path as osp
from modelscope.trainers import build_trainer
from modelscope.msdatasets import MsDataset
from modelscope.utils.hub import read_config
from modelscope.metainfo import Metrics
from modelscope.utils.constant import Tasks



# ---------------------------- Train ---------------------------------

# Backbone model to fine-tune, output directory, and local data directory
model_id = 'damo/nlp_structbert_backbone_base_std'
WORK_DIR = 'workspace'
data_pth = "meeting_action"
# Each line of train.txt / dev.txt is "<sentence>\t<label>" with label 0 or 1
def load_local_data(data_pth):
    train_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "train.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            train_dataset_dict["label"].append(float(label))
            train_dataset_dict["sentence"].append(sentence)
            train_dataset_dict["dataset"].append("meeting")
    eval_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "dev.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            eval_dataset_dict["label"].append(float(label))
            eval_dataset_dict["sentence"].append(sentence)
            eval_dataset_dict["dataset"].append("meeting")
    return train_dataset_dict, eval_dataset_dict
train_dataset_dict, eval_dataset_dict = load_local_data(data_pth)
train_dataset = MsDataset(Dataset.from_dict(train_dataset_dict)).to_hf_dataset()
eval_dataset = MsDataset(Dataset.from_dict(eval_dataset_dict)).to_hf_dataset()
print(train_dataset)

max_epochs = 5
lr = 2e-5
batch_size = 32
# Adjust the default training config: task type, epochs, optimizer, dataloader, metrics, lr scheduler, labels
def cfg_modify_fn(cfg):
    cfg.task = Tasks.text_classification

    cfg.train.max_epochs = max_epochs
    cfg.train.optimizer.lr = lr
    cfg.train.dataloader = {
        "batch_size_per_gpu": batch_size,
        "workers_per_gpu": 1
    }

    cfg.evaluation.metrics = [Metrics.seq_cls_metric]
    cfg.train.lr_scheduler = {
        'type': 'LinearLR',
        'start_factor': 1.0,
        'end_factor': 0.0,
        'total_iters':
            int(len(train_dataset) / batch_size) * cfg.train.max_epochs,
        'options': {
            'by_epoch': False
        }
    }
    # replace the last default hook with an EvaluationHook that runs once per epoch
    cfg.train.hooks[-1] = {
        'type': 'EvaluationHook',
        'by_epoch': True,
        'interval': 1
    }
    cfg['dataset'] = {
        'train': {
            'labels': ['否', '是', 'None'],
            'first_sequence': 'sentence',
            'label': 'label',
        }
    }
    return cfg

# map the numeric labels (0/1) to the string labels used in the config ('否' / '是')
def map_labels(examples):
    map_dict = {0: "否", 1: "是"}
    examples['label'] = map_dict[int(examples['label'])]
    return examples

train_dataset = train_dataset.map(map_labels)
eval_dataset = eval_dataset.map(map_labels)

kwargs = dict(
    model=model_id,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir=WORK_DIR,
    cfg_modify_fn=cfg_modify_fn)


trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)

trainer.train()

# ---------------------------- Evaluation ---------------------------------

# Evaluate each saved epoch checkpoint on the dev set
for i in range(max_epochs):
    eval_results = trainer.evaluate(f'{WORK_DIR}/epoch_{i+1}.pth')
    print(f'epoch {i+1} evaluation result:')
    print(eval_results)

# ---------------------------- Inference ---------------------------------
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

# Run the fine-tuned model over the dev set and write "<sentence>\t<0 or 1>" predictions
output_list = []
text_classification = pipeline(Tasks.text_classification, model=f'{WORK_DIR}/output')
with open(f'{data_pth}/dev.txt', "r") as f:
    for line in f:
        input_text = line.strip().split("\t")[0]
        output = text_classification(input_text)
        scores = output["scores"]
        # keep the higher-scoring of the candidate labels at indices 1 and 2
        if scores[1] > scores[2]:
            label = output["labels"][1]
        else:
            label = output["labels"][2]
        output_list.append(input_text + "\t" + label.replace("是", "1").replace("否", "0"))
with open(f'{WORK_DIR}/test_predict_result.txt', "w") as f:
    f.write("\n".join(output_list))

Evaluation Results

Results of the model on the phase 1 test set of the AlimeetingMUG dataset:

Pos F1: 71.08

Related Papers and Citation

If our model is helpful to you, please cite our paper:

@article{wang2019structbert,
  title={Structbert: Incorporating language structures into pre-training for deep language understanding},
  author={Wang, Wei and Bi, Bin and Yan, Ming and Wu, Chen and Bao, Zuyi and Xia, Jiangnan and Peng, Liwei and Si, Luo},
  journal={arXiv preprint arXiv:1908.04577},
  year={2019}
}