Document topic segmentation splits a document into a sequence of contiguous, topically coherent segments. In recent years, deep-learning-based topic segmentation methods have emerged: by framing topic segmentation as a sentence-level binary classification task and fine-tuning pre-trained language models such as BERT on in-domain data, they achieve strong results.
However, pre-trained language models such as BERT have O(n²) time complexity, so as the input sequence grows, inference on long documents becomes slow and memory-hungry. Hierarchical modeling (from characters to sentences to predictions) reduces the time and space cost to some extent, but the time complexity remains O(n²) and some accuracy is lost.
We therefore apply the PoNet model, developed by Alibaba DAMO Academy, to English long-document topic segmentation, seeking a balance between accuracy and efficiency. PoNet replaces the self-attention in the Transformer with pooling for context modeling. It combines three pooling networks of different granularities: a global aggregation module (GA), a segment max-pooling module (SMP), and a local max-pooling module (LMP), each capturing sequence information at the corresponding granularity. The result is a sequence model with linear O(n) complexity.
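To make the linear-complexity idea concrete, here is a heavily simplified, hypothetical sketch of pooling-based token mixing. The branch names GA/SMP/LMP follow the description above; the real PoNet adds linear projections, gating, and stacked layers, none of which are shown here. Note that every branch costs O(n) in sequence length, unlike the O(n²) pairwise interactions of self-attention.

```python
import numpy as np

# Hypothetical, simplified sketch of PoNet-style pooling token mixing.
# x: (seq_len, hidden) token embeddings; returns mixed features of the
# same shape. All three branches are O(n) in sequence length.
def pooling_token_mixing(x, segment_size=4, local_window=3):
    n, d = x.shape

    # GA: one global aggregate, broadcast back to every token position.
    ga = np.ones((n, 1)) * x.mean(axis=0, keepdims=True)

    # SMP: max-pool within fixed-size segments, broadcast inside each segment.
    smp = np.empty_like(x)
    for start in range(0, n, segment_size):
        seg = x[start:start + segment_size]
        smp[start:start + segment_size] = seg.max(axis=0)

    # LMP: max-pool over a sliding local window centered on each token.
    lmp = np.empty_like(x)
    half = local_window // 2
    for i in range(n):
        lmp[i] = x[max(0, i - half):i + half + 1].max(axis=0)

    # Combine the three granularities (the real model fuses them more carefully).
    return ga + smp + lmp

x = np.random.randn(16, 8)
out = pooling_token_mixing(x)
print(out.shape)  # (16, 8)
```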
The model is initialized from nlp_ponet_fill-mask_chinese-base and trained on the Wiki727K dataset, with an initial learning rate of 5e-5, batch_size of 8, and max_seq_length of 4096. Focal Loss (γ=2, α=0.75) is used to mitigate the class-imbalance problem.
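Focal loss down-weights well-classified examples so training focuses on the hard, minority-class ones (here, the rare segment-boundary sentences). A minimal sketch of the per-example binary focal loss with the parameters above (γ=2, α=0.75); the function name and scalar formulation are illustrative, not the training code:

```python
import math

# Minimal sketch of binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
def focal_loss(p, y, gamma=2.0, alpha=0.75, eps=1e-9):
    """p: predicted probability of the positive class; y: gold label (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)

# A confident correct prediction is down-weighted far more than a hard one:
print(focal_loss(0.95, 1))  # well-classified positive -> tiny loss
print(focal_loss(0.30, 1))  # misclassified positive  -> much larger loss
```

With γ=0 and α=0.5 this reduces to (half of) the standard cross-entropy; increasing γ suppresses the contribution of easy examples.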
In terms of accuracy, PoNet reaches a Positive F1 of 67.13 on the Wiki727K test set, 98.43% of BERT's score; in terms of efficiency, it processes 30256 tokens per second, 1.9× as many as BERT, striking a good balance between the two. In detail:
| model | Positive F1 | Pk | WD |
|---|---|---|---|
| Two-Level LSTM | - | 22.13 | - |
| Cross-segment BERT-Base 128-128 | 64.0 | - | - |
| Cross-segment BERT-Large 128-128 | 66.0 | - | - |
| Cross-segment BERT-Large 256-256 | 67.1 | - | - |
| HierBERT | 66.5 | - | - |
| TLT-TS | - | 19.41 | - |
| SeqModel-BERT-Base | 68.2 | - | - |
| PoNet | 67.13 | 19.00 | 20.97 |
Efficiency was measured on a Tesla V100 32G GPU with batch_size=1:
| model | max_seq_length | efficiency (tokens/s) |
|---|---|---|
| SeqModel-BERT-Base | 512 | 15885 |
| PoNet | 4096 | 30256 |
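A quick sanity check of the relative figures quoted above, using the Positive F1 and throughput numbers from the two tables (SeqModel-BERT-Base as the BERT baseline):

```python
# Numbers taken directly from the tables above.
ponet_f1, bert_f1 = 67.13, 68.2
ponet_tps, bert_tps = 30256, 15885

print(f"{ponet_f1 / bert_f1:.2%}")    # -> 98.43%
print(f"{ponet_tps / bert_tps:.1f}x")  # -> 1.9x
```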
```python
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build the topic-level document segmentation pipeline
p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_ponet_document-segmentation_topic-level_english-base',
    model_revision="v1.1.1",
)

# Input: the document as a list of sentences
doc = ['Actresses (Catalan: Actrius) is a 1997 Catalan language Spanish drama film produced and directed by Ventura Pons and based on the award-winning stage play "E.R."', 'by Josep Maria Benet i Jornet.', 'The film has no male actors, with all roles played by females.', 'The film was produced in 1996.', '"Actrius" screened in 2001 at the Grauman\'s Egyptian Theatre in an American Cinematheque retrospective of the works of its director.', 'The film had first screened at the same location in 1998.', 'It was also shown at the 1997 Stockholm International Film Festival.', 'In "Movie - Film - Review", "Daily Mail" staffer Christopher Tookey wrote that though the actresses were "competent in roles that may have some reference to their own careers", the film "is visually unimaginative, never escapes its stage origins, and is almost totally lacking in revelation or surprising incident".', 'Noting that there were "occasional, refreshing moments of intergenerational bitchiness", they did not "justify comparisons to "All About Eve"", and were "insufficiently different to deserve critical parallels with "Rashomon"".', 'He also wrote that "The Guardian" called the film a "slow, stuffy chamber-piece", and that "The Evening Standard" stated the film\'s "best moments exhibit the bitchy tantrums seething beneath the threesome\'s composed veneers".', 'MRQE wrote "This cinematic adaptation of a theatrical work is true to the original, but does not stray far from a theatrical rendering of the story.']

result = p(documents=doc)
# Predicted topic segments are separated by blank lines in the output text
topics = result[OutputKeys.TEXT].split("\n\n")
print(topics)
```
If our model is helpful to you, please cite our paper:
```bibtex
@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://openreview.net/forum?id=9jInD9JjicF},
}
```