PoNet Document Topic Segmentation - English - General Domain

PoNet for Long Document Topic Segmentation

Model Description

Document topic segmentation splits a document into a sequence of contiguous, topically coherent segments. In recent years, a number of deep-learning-based topic segmentation algorithms have emerged: they cast topic segmentation as a sentence-level binary classification task and fine-tune pre-trained language models such as BERT on in-domain data, achieving strong results.

However, pre-trained language models such as BERT have O(n^2) time complexity, so as the input sequence grows, inference on long documents becomes slow and memory-hungry. Hierarchical modeling (from characters to sentences to predictions) can reduce the time and space cost to some extent, but the time complexity remains O(n^2), and it incurs some loss in accuracy.

We therefore apply PoNet, a model developed by Alibaba DAMO Academy, to English long-document topic segmentation, seeking a balance between accuracy and efficiency. PoNet replaces the self-attention in the Transformer with pooling mechanisms for context modeling. It comprises three pooling networks at different granularities: a global aggregation (GA) module, a segment max-pooling (SMP) module, and a local max-pooling (LMP) module, which capture sequence information at the corresponding granularities. The result is a sequence model with linear complexity O(n).
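The following is a schematic sketch of this multi-granularity pooling idea, not the official PoNet implementation (which differs in how the branches are parameterized and fused); the additive fusion, segment-id input, and function name here are simplifying assumptions:

import torch
import torch.nn.functional as F

def multi_granularity_pooling(h, seg_ids, local_window=3):
    # Schematic sketch of PoNet-style token mixing (NOT the official code):
    # global average pooling (GA), segment max-pooling (SMP), and local
    # sliding-window max-pooling (LMP), fused additively for simplicity.
    # h:       (batch, seq_len, dim) token representations
    # seg_ids: (batch, seq_len) integer segment id per token

    # GA: one global summary vector, broadcast back to every position
    ga = h.mean(dim=1, keepdim=True).expand_as(h)

    # SMP: max-pool within each segment, scatter the result to its tokens
    smp = torch.zeros_like(h)
    for b in range(h.size(0)):
        for s in seg_ids[b].unique():
            mask = seg_ids[b] == s
            smp[b, mask] = h[b, mask].max(dim=0).values

    # LMP: max over a small window around each token (linear in seq_len)
    lmp = F.max_pool1d(h.transpose(1, 2), kernel_size=local_window,
                       stride=1, padding=local_window // 2).transpose(1, 2)

    return h + ga + smp + lmp

# Example: 8 tokens in 2 segments, hidden size 4
h = torch.randn(1, 8, 4)
seg_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
mixed = multi_granularity_pooling(h, seg_ids)  # shape (1, 8, 4)

Each branch costs O(n) per layer, which is what allows PoNet to scale to long inputs where self-attention's O(n^2) cost becomes prohibitive.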

How to Use

  • Feed a long, unsegmented document directly into the pipeline to obtain the segmentation result.

Model Limitations and Possible Bias

  • The model is trained on public corpora, so segmentation quality may degrade on text from particular domains (e.g., spoken or conversational text).

Training Data

  • The publicly available English dataset Wiki727K is used. Wiki727K contains 727,746 documents with an average document length of around 2,000 characters, split 8:1:1 into training, validation, and test sets.

Training Procedure

The model is initialized from nlp_ponet_fill-mask_chinese-base and trained on the Wiki727K data, with an initial learning rate of 5e-5, batch_size of 8, and max_seq_length of 4096. Focal Loss (γ=2, α=0.75) is added to mitigate the class imbalance between boundary and non-boundary sentences.
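For reference, here is a minimal sketch of such a binary focal loss, assuming a sigmoid head over per-sentence boundary logits; the function name and tensor layout are illustrative, not the actual training code:

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.75):
    # Focal loss for sentence-level boundary classification (illustrative).
    # logits:  (N,) raw scores for the positive "topic boundary" class
    # targets: (N,) float tensor of 0/1 labels
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class re-weighting
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: logits and labels for 4 sentences (1 = topic boundary)
loss = binary_focal_loss(torch.tensor([2.0, -1.0, 0.5, -3.0]),
                         torch.tensor([1.0, 0.0, 0.0, 0.0]))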

Accuracy and Efficiency Evaluation

In terms of accuracy, PoNet achieves a Positive F1 of 67.13 on the Wiki727K test set, reaching 98.43% of the score of SeqModel-BERT-Base. In terms of efficiency, it processes 30,256 tokens per second, 1.9x the throughput of SeqModel-BERT-Base, striking a good balance between accuracy and efficiency. Specifically:

  • Wiki727K test set results (Pk and WD are error metrics, so lower is better; a reference Pk sketch follows the tables):

    model                              Positive F1   Pk      WD
    Two-Level LSTM                     -             22.13   -
    Cross-segment BERT-Base 128-128    64.0          -       -
    Cross-segment BERT-Large 128-128   66.0          -       -
    Cross-segment BERT-Large 256-256   67.1          -       -
    HierBERT                           66.5          -       -
    TLT-TS                             -             19.41   -
    SeqModel-BERT-Base                 68.2          -       -
    PoNet                              67.13         19.00   20.97
  • Inference efficiency comparison

GPU: Tesla V100 32 GB, batch_size=1

    model                max_seq_length   efficiency (tokens/s)
    SeqModel-BERT-Base   512              15885
    PoNet                4096             30256
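For readers unfamiliar with the error metrics above, below is a minimal sketch of Pk (Beeferman et al., 1999); the per-sentence segment-id input and the default window size follow common conventions but are illustrative, not the exact evaluation script. WD (WindowDiff) is computed similarly, comparing boundary counts within the window instead of same-segment membership.

def pk(ref_ids, hyp_ids, k=None):
    # Pk: probability that a window of width k straddles a segment
    # boundary in exactly one of the reference and the hypothesis
    # segmentations; lower is better. Inputs are per-sentence segment
    # ids, e.g. [0, 0, 0, 1, 1, 2].
    n = len(ref_ids)
    if k is None:
        # Common convention: half the average reference segment length.
        k = max(1, round(n / (2 * len(set(ref_ids)))))
    errors = 0
    for i in range(n - k):
        ref_same = ref_ids[i] == ref_ids[i + k]
        hyp_same = hyp_ids[i] == hyp_ids[i + k]
        errors += int(ref_same != hyp_same)
    return errors / (n - k)

print(pk([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 2]))  # 0.8 (k=1 for this toy input)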

Code Example

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks


# Build the document-segmentation pipeline with the PoNet topic model.
p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_ponet_document-segmentation_topic-level_english-base',
    model_revision="v1.1.1",
)

# Input: the sentences of one long, unsegmented document.
doc = ['Actresses (Catalan: Actrius) is a 1997 Catalan language Spanish drama film produced and directed by Ventura Pons and based on the award-winning stage play "E.R."', 'by Josep Maria Benet i Jornet.', 'The film has no male actors, with all roles played by females.', 'The film was produced in 1996.', '"Actrius" screened in 2001 at the Grauman\'s Egyptian Theatre in an American Cinematheque retrospective of the works of its director.', 'The film had first screened at the same location in 1998.', 'It was also shown at the 1997 Stockholm International Film Festival.', 'In "Movie - Film - Review", "Daily Mail" staffer Christopher Tookey wrote that though the actresses were "competent in roles that may have some reference to their own careers", the film "is visually unimaginative, never escapes its stage origins, and is almost totally lacking in revelation or surprising incident".', 'Noting that there were "occasional, refreshing moments of intergenerational bitchiness", they did not "justify comparisons to "All About Eve"", and were "insufficiently different to deserve critical parallels with "Rashomon"".', 'He also wrote that "The Guardian" called the film a "slow, stuffy chamber-piece", and that "The Evening Standard" stated the film\'s "best moments exhibit the bitchy tantrums seething beneath the threesome\'s composed veneers".', 'MRQE wrote "This cinematic adaptation of a theatrical work is true to the original, but does not stray far from a theatrical rendering of the story.']

result = p(documents=doc)
# Predicted topic segments in the output text are separated by blank lines.
topics = result[OutputKeys.TEXT].split("\n\n")
print(topics)
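Each element of topics corresponds to one predicted contiguous, topic-coherent segment of the input document.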

Related Papers and Citation

If our model helps you, please cite our paper:

@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://openreview.net/forum?id=9jInD9JjicF},
}