The StructBERT-spoken-base pretrained model is obtained by taking the structbert-base Chinese natural language understanding model and continuing pretraining on a large amount of spoken-language data with MLM plus a spoken-language pretraining objective. It can serve as a backbone for training downstream NLU (natural language understanding) tasks.
By injecting language structure information, we extend BERT into a new model, StructBERT. Two auxiliary pretraining tasks teach the model word-level order and sentence-level order, so that it models language structure better. For details, see the paper StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.
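For intuition, here is a minimal sketch of the word-level structural objective: shuffle a short span of tokens and train the model to recover their original order. This is illustrative code rather than the released training implementation; the function name and the span length k are our own choices.

import random

def shuffle_span(tokens, k=3):
    # Pick one random k-token span and shuffle it; the model is trained
    # to restore the original order of the shuffled span.
    if len(tokens) < k:
        return tokens, None
    start = random.randrange(len(tokens) - k + 1)
    span = tokens[start:start + k]
    shuffled = span[:]
    random.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + k:]
    return corrupted, (start, span)  # (position, gold order to predict)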
On top of the StructBERT training objectives, we add a new objective to better handle the characteristics of spoken language: randomly pick a word from the same sentence as a replacement, then predict the original word from the context.
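A minimal sketch of this spoken-language objective, with the same caveat that all names here are illustrative:

import random

def replace_with_in_sentence_token(tokens):
    # Replace one token with another token drawn from the same sentence; the
    # model is trained to predict the original token from the context.
    if len(tokens) < 2:
        return tokens, None
    pos = random.randrange(len(tokens))
    others = [t for i, t in enumerate(tokens) if i != pos]
    corrupted = list(tokens)
    corrupted[pos] = random.choice(others)
    return corrupted, (pos, tokens[pos])  # (position, original token to predict)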
This model automatically classifies whether an input sentence contains an action item. See the code example below for how to call it.
After installing the ModelScope library, the text-classification capability can be used as follows:
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

# Build a text-classification pipeline from the released action-item model.
text_classification = pipeline(Tasks.text_classification, model='damo/nlp_structbert_alimeeting_action-classification_chinese-base')
output = text_classification("今天会议的第一个结论是明天先收集用户的需求。")
# Inspect the predicted labels and their scores.
print(output[OutputKeys.LABELS], output[OutputKeys.SCORES])
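The pipeline returns a dict of candidate labels with corresponding scores; the exact label set and candidate ordering may vary with the model revision and ModelScope version, so inspect the output before relying on a fixed index.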
The model is trained on data for meeting action-item extraction, so it performs well on meetings and similar content; quality may drop in other vertical domains.
This model was trained on a competition dataset; see the tabs on the right for details about the data.
You can further fine-tune on top of this StructBERT pretrained backbone. The training code is shown below; for more complete code, see alimeeting4mug.
from datasets import Dataset
import os.path as osp
from modelscope.trainers import build_trainer
from modelscope.msdatasets import MsDataset
from modelscope.utils.hub import read_config
from modelscope.metainfo import Metrics
from modelscope.utils.constant import Tasks
# ---------------------------- Train ---------------------------------
model_id = 'damo/nlp_structbert_backbone_base_std'
WORK_DIR = 'workspace'
data_pth = "meeting_action"
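# The loader below assumes tab-separated "sentence<TAB>label" files, one example
# per line, where label 1 means the sentence contains an action item and 0 means
# it does not. Hypothetical sample lines:
#   明天先收集用户的需求。<TAB>1
#   我们现在开始今天的会议。<TAB>0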
def load_local_data(data_pth):
    train_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "train.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            train_dataset_dict["label"].append(float(label))
            train_dataset_dict["sentence"].append(sentence)
            train_dataset_dict["dataset"].append("meeting")

    eval_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "dev.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            eval_dataset_dict["label"].append(float(label))
            eval_dataset_dict["sentence"].append(sentence)
            eval_dataset_dict["dataset"].append("meeting")

    return train_dataset_dict, eval_dataset_dict
train_dataset_dict, eval_dataset_dict = load_local_data(data_pth)
train_dataset = MsDataset(Dataset.from_dict(train_dataset_dict)).to_hf_dataset()
eval_dataset = MsDataset(Dataset.from_dict(eval_dataset_dict)).to_hf_dataset()
print(train_dataset)
max_epochs = 5
lr = 2e-5
batch_size = 32
def cfg_modify_fn(cfg):
    cfg.task = Tasks.text_classification
    cfg.train.max_epochs = max_epochs
    cfg.train.optimizer.lr = lr
    cfg.train.dataloader = {
        "batch_size_per_gpu": batch_size,
        "workers_per_gpu": 1
    }
    cfg.evaluation.metrics = [Metrics.seq_cls_metric]
    # Decay the learning rate linearly to 0 over the total number of steps.
    cfg.train.lr_scheduler = {
        'type': 'LinearLR',
        'start_factor': 1.0,
        'end_factor': 0.0,
        'total_iters':
            int(len(train_dataset) / batch_size) * cfg.train.max_epochs,
        'options': {
            'by_epoch': False
        }
    }
    # Evaluate at the end of every epoch.
    cfg.train.hooks[-1] = {
        'type': 'EvaluationHook',
        'by_epoch': True,
        'interval': 1
    }
    cfg['dataset'] = {
        'train': {
            # 'None' is a placeholder third class; the real labels are 否/是.
            'labels': ['否', '是', 'None'],
            'first_sequence': 'sentence',
            'label': 'label',
        }
    }
    return cfg
# map the float labels to their text labels
def map_labels(examples):
    map_dict = {0: "否", 1: "是"}
    examples['label'] = map_dict[int(examples['label'])]
    return examples
train_dataset = train_dataset.map(map_labels)
eval_dataset = eval_dataset.map(map_labels)
kwargs = dict(
    model=model_id,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir=WORK_DIR,
    cfg_modify_fn=cfg_modify_fn)
trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)
trainer.train()
# ---------------------------- Evaluation ---------------------------------
for i in range(max_epochs):
    eval_results = trainer.evaluate(f'{WORK_DIR}/epoch_{i+1}.pth')
    print(f'epoch {i+1} evaluation result:')
    print(eval_results)
# ---------------------------- Inference ---------------------------------
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys
output_list = []
text_classification = pipeline(Tasks.text_classification, model=f'{WORK_DIR}/output')
with open(f'{data_pth}/dev.txt', "r") as f:
    for line in f:
        input_text = line.strip().split("\t")[0]
        output = text_classification(input_text)
        scores = output["scores"]
        # Keep whichever of the second and third candidates scores higher, then
        # map the text label back to 1/0 for the prediction file.
        if scores[1] > scores[2]:
            label = output["labels"][1]
        else:
            label = output["labels"][2]
        output_list.append(input_text + "\t" + label.replace("是", "1").replace("否", "0"))

with open(f'{WORK_DIR}/test_predict_result.txt', "w") as f:
    f.write("\n".join(output_list))
Evaluation results on the phase 1 test set of the AlimeetingMUG dataset:

| Pos F1 |
|---|
| 71.08 |
If our model is helpful to you, please cite our paper:
@article{wang2019structbert,
title={Structbert: Incorporating language structures into pre-training for deep language understanding},
author={Wang, Wei and Bi, Bin and Yan, Ming and Wu, Chen and Bao, Zuyi and Xia, Jiangnan and Peng, Liwei and Si, Luo},
journal={arXiv preprint arXiv:1908.04577},
year={2019}
}