The StructBERT Chinese pretrained model is a Chinese natural language understanding model trained on Wikipedia data with the masked language model (MLM) objective (the pretrained backbone model is available at https://modelscope.cn/models/damo/nlp_structbert_spoken_chinese-base/summary ). It can be used to train downstream NLU (natural language understanding) tasks.
The StructBERT-spoken-base pretrained model was obtained by continuing to pretrain the structbert-base Chinese NLU model on a large amount of spoken-language data with the MLM and spoken-language pretraining objectives, and it can likewise be used for downstream NLU task training. On top of this pretrained model, we fine-tune a question identification task (deciding whether a sentence is a question).
This model is mainly intended for finding questions in an input document. Users are welcome to try documents of their own; see the code example below for how to call it.
After installing the ModelScope library, the text-classification capability can be used as follows:
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys
text_classification = pipeline(Tasks.text_classification, model='damo/nlp_structbert_alimeeting_action-classification_chinese-base')
output = text_classification("今天会议的第一个结论是明天先收集用户的需求吗?")
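The classification pipeline returns a dictionary with parallel score and label lists, which can also be accessed through OutputKeys.SCORES and OutputKeys.LABELS (the reason OutputKeys is imported above). A minimal sketch of reading the top prediction, assuming that output layout:
# Raw output looks like {'scores': [...], 'labels': [...]}
print(output)
# Pick the label with the highest score (assumes parallel scores/labels lists)
scores, labels = output[OutputKeys.SCORES], output[OutputKeys.LABELS]
print(labels[scores.index(max(scores))])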
The model was trained on interview-style conversational data, so it performs well on meetings and similar content; results in other vertical domains may degrade.
Users can further fine-tune on top of this StructBERT pretrained backbone. The training code is given below; for more complete code, please refer to alimeeting4mug.
from datasets import Dataset
import os.path as osp
from modelscope.trainers import build_trainer
from modelscope.msdatasets import MsDataset
from modelscope.utils.hub import read_config
from modelscope.metainfo import Metrics
from modelscope.utils.constant import Tasks
# ---------------------------- Train ---------------------------------
model_id = 'damo/nlp_structbert_backbone_base_std'
WORK_DIR = 'workspace'
data_pth = "meeting_action"
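# Note (illustrative assumption, not from the original repo): train.txt / dev.txt are expected
# to hold one tab-separated example per line in the form "<sentence>\t<label>", where the label
# is 0 (negative) or 1 (positive), e.g.:
#   明天先收集一下用户的需求吗?	1
#   我们接着讨论下一个议题。	0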
def load_local_data(data_pth):
    train_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "train.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            train_dataset_dict["label"].append(float(label))
            train_dataset_dict["sentence"].append(sentence)
            train_dataset_dict["dataset"].append("meeting")

    eval_dataset_dict = {"label": [], "sentence": [], "dataset": []}
    with open(osp.join(data_pth, "dev.txt"), "r") as f:
        for line in f:
            sentence, label = line.strip().split("\t")
            eval_dataset_dict["label"].append(float(label))
            eval_dataset_dict["sentence"].append(sentence)
            eval_dataset_dict["dataset"].append("meeting")
    return train_dataset_dict, eval_dataset_dict
train_dataset_dict, eval_dataset_dict = load_local_data(data_pth)
train_dataset = MsDataset(Dataset.from_dict(train_dataset_dict)).to_hf_dataset()
eval_dataset = MsDataset(Dataset.from_dict(eval_dataset_dict)).to_hf_dataset()
print(train_dataset)
max_epochs = 5
lr = 2e-5
batch_size = 32
def cfg_modify_fn(cfg):
    cfg.task = Tasks.text_classification
    cfg.train.max_epochs = max_epochs
    cfg.train.optimizer.lr = lr
    cfg.train.dataloader = {
        "batch_size_per_gpu": batch_size,
        "workers_per_gpu": 1
    }
    cfg.evaluation.metrics = [Metrics.seq_cls_metric]
    cfg.train.lr_scheduler = {
        'type': 'LinearLR',
        'start_factor': 1.0,
        'end_factor': 0.0,
        'total_iters': int(len(train_dataset) / batch_size) * cfg.train.max_epochs,
        'options': {
            'by_epoch': False
        }
    }
    cfg.train.hooks[-1] = {
        'type': 'EvaluationHook',
        'by_epoch': True,
        'interval': 1
    }
    cfg['dataset'] = {
        'train': {
            'labels': ['否', '是', 'None'],
            'first_sequence': 'sentence',
            'label': 'label',
        }
    }
    return cfg
# map float to index
def map_labels(examples):
    map_dict = {0: "否", 1: "是"}
    examples['label'] = map_dict[int(examples['label'])]
    return examples
train_dataset = train_dataset.map(map_labels)
eval_dataset = eval_dataset.map(map_labels)
kwargs = dict(
    model=model_id,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    work_dir=WORK_DIR,
    cfg_modify_fn=cfg_modify_fn)
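# Build ModelScope's generic NLP trainer; cfg_modify_fn above overrides the default training
# configuration shipped with the backbone model (task type, epochs, lr, dataloader, metrics,
# and label set) before training starts.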
trainer = build_trainer(name='nlp-base-trainer', default_args=kwargs)
trainer.train()
# ---------------------------- Evaluation ---------------------------------
for i in range(max_epochs):
    eval_results = trainer.evaluate(f'{WORK_DIR}/epoch_{i + 1}.pth')
    print(f'epoch {i + 1} evaluation result:')
    print(eval_results)
# ---------------------------- Inference ---------------------------------
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys
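# Run inference with the fine-tuned model. Assumption: the trainer exports the final model and
# its configuration to f'{WORK_DIR}/output', which the pipeline loads below.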
output_list = []
text_classification = pipeline(Tasks.text_classification, model=f'{WORK_DIR}/output')
with open(f'{data_pth}/dev.txt', "r") as f:
for line in f:
input_text = line.strip().split("\t")[0]
output = text_classification(input_text)
scores = output["scores"]
if scores[1] > scores[2]:
label = output["labels"][1]
else:
label = output["labels"][2]
output_list.append(input_text + "\t" + label.replace("是", "1").replace("否", "0"))
with open(f'{WORK_DIR}/test_predict_result.txt', "w") as f:
    f.write("\n".join(output_list))
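Optionally, since dev.txt carries gold labels in its second column, the predictions written above can be scored directly. A minimal sketch, assuming the 0/1 label convention used when loading the data:
# Compare predicted 0/1 labels against the gold labels in dev.txt (illustrative check).
with open(f'{data_pth}/dev.txt', "r") as f:
    gold = [line.strip().split("\t")[1] for line in f]
pred = [line.split("\t")[1] for line in output_list]
accuracy = sum(int(float(g)) == int(p) for g, p in zip(gold, pred)) / len(gold)
print(f"dev accuracy: {accuracy:.4f}")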
If our model is helpful to your work, please cite our paper:
@article{wang2019structbert,
  title={Structbert: Incorporating language structures into pre-training for deep language understanding},
  author={Wang, Wei and Bi, Bin and Yan, Ming and Wu, Chen and Bao, Zuyi and Xia, Jiangnan and Peng, Liwei and Si, Luo},
  journal={arXiv preprint arXiv:1908.04577},
  year={2019}
}