



Abstract in English

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text when following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages such as English, which limits their applicability and research in other languages. To address this, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens and available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English.

Our models, along with the multilingual instruction data, are available on GitHub and Hugging Face.
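
The curriculum described in the abstract raises the share of non-English data from 30% in the first stage to 60% in the final stage. The sketch below shows one way such stage-dependent sampling could look. The two-pool setup and the names `sample_language_pool` and `empirical_share` are hypothetical and are not taken from the PolyLM training code; only the two proportions come from the abstract.

```python
import random

# Non-English share per pre-training stage, as stated in the abstract.
NON_ENGLISH_SHARE = {"first": 0.30, "final": 0.60}


def sample_language_pool(stage: str, rng: random.Random) -> str:
    """Pick which data pool the next training document is drawn from.

    Hypothetical helper: the real data pipeline is not shown in this
    document, so this only illustrates the stated proportions.
    """
    share = NON_ENGLISH_SHARE[stage]
    return "non_english" if rng.random() < share else "english"


def empirical_share(stage: str, n: int = 100_000, seed: int = 0) -> float:
    """Estimate the non-English fraction the sampler produces for a stage."""
    rng = random.Random(seed)
    hits = sum(sample_language_pool(stage, rng) == "non_english" for _ in range(n))
    return hits / n


if __name__ == "__main__":
    for stage in ("first", "final"):
        print(stage, round(empirical_share(stage), 3))
```

With enough samples, the measured share should sit close to 0.30 for the first stage and 0.60 for the final stage.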



| Model | Precision | Layers | Heads | Hidden | Max_length | LR | Batch | Type |
|---|---|---|---|---|---|---|---|---|
| PolyLM-1.7B | bfloat16 | 24 | 16 | 2048 | 2048 | 1.0e-4 | 4M | Pretrain Model |
| PolyLM-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Pretrain Model |
| PolyLM-MultiAlpaca-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Chat Model |
| PolyLM-Assistant-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Chat Model |
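
As a quick sanity check on the configuration values above, the following sketch derives the per-head dimension and a rough size estimate from the layer, head, and hidden-size columns. The `12 * layers * hidden^2` rule is a common back-of-the-envelope approximation for decoder blocks with a 4x feed-forward expansion; that PolyLM uses exactly this block layout is an assumption here, and the estimate deliberately ignores embeddings.

```python
# Values copied from the configuration table above.
CONFIGS = {
    "PolyLM-1.7B": {"layers": 24, "heads": 16, "hidden": 2048},
    "PolyLM-13B": {"layers": 40, "heads": 40, "hidden": 5120},
}


def head_dim(cfg: dict) -> int:
    """Per-head dimension: the hidden size split evenly across heads."""
    assert cfg["hidden"] % cfg["heads"] == 0
    return cfg["hidden"] // cfg["heads"]


def approx_block_params(cfg: dict) -> int:
    """Rough non-embedding parameter count: 12 * layers * hidden^2.

    Assumption: each block has about 4*d^2 attention weights and 8*d^2
    feed-forward weights. Embeddings, biases, and layer norms are ignored,
    so this undercounts the real total.
    """
    return 12 * cfg["layers"] * cfg["hidden"] ** 2


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name}: head_dim={head_dim(cfg)}, "
              f"block params ~ {approx_block_params(cfg) / 1e9:.2f}B")
```

Both sizes come out with a head dimension of 128, and the block-only estimates (about 1.21B and 12.58B) fall below the nominal model sizes, as expected when embeddings are left out.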



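| Name | Count | Construction method | Notes |
|---|---|---|---|
| code_alpaca | 28 | GPT-3.5 self-instruct | To display correctly, the code was format-filtered: at least one of the input or the output must contain a pair of ``` markers |
| dolly | 15,011 | Human-written | |
| flan_v2 | 100,000 | Various NLP tasks and CoT tasks | Sampled from flan_v2; the full dataset is very large |
| gpt4_alpaca (English) | 52,002 | GPT-4 self-instruct | |
| gpt4_alpaca (Chinese) | 48,818 | GPT-4 self-instruct | |
| multilingual_alpaca | 132,701 | GPT-3.5 self-instruct | |
| open_assistant | 55,668 | Human-written | |
| share_gpt | 140,591 | ChatGPT chat logs | |
| gpteacher_codegen | 4,535 | | |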
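
The note on code_alpaca describes a format filter: an example is kept only if its input or its output contains a pair of triple-backtick markers. A minimal sketch of such a check is shown below. The function names are hypothetical, and the authors' actual filtering script is not part of this document.

```python
# The triple-backtick marker, built from a single backtick so that this
# example does not contain a literal fence of its own.
FENCE = "`" * 3


def has_fence_pair(text: str) -> bool:
    """True if the text contains at least two fence markers (open and close)."""
    return text.count(FENCE) >= 2


def keep_example(instruction_input: str, output: str) -> bool:
    """Keep an example if either side contains a complete fence pair."""
    return has_fence_pair(instruction_input) or has_fence_pair(output)


if __name__ == "__main__":
    fenced = FENCE + "python\nprint(1)\n" + FENCE
    print(keep_example("Write a hello-world program.", fenced))
    print(keep_example("Write a hello-world program.", "print(1)"))
```

Note that one marker on each side does not count as a pair under this reading; both markers must appear in the same field.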


 git lfs install
 git clone https://www.modelscope.cn/damo/nlp_polylm_assistant_13b_text_generation.git


# git clone https://github.com/modelscope/modelscope
# cd modelscope
# pip install .

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope import snapshot_download

polylm_13b_model_id = 'damo/nlp_polylm_assistant_13b_text_generation'
revision = 'v1.0.0'

# Download the model files (or reuse the local cache) and get their directory.
model_dir = snapshot_download(polylm_13b_model_id, revision=revision)

# Wrap the request in the <|user|> / <|assistant|> tags used by this example.
input_text = "Beijing is the capital of China.\nTranslate this sentence from English to Chinese."
input_text = "<|user|>\n" + f"{input_text}\n" + "<|assistant|>\n"

# Beam search without sampling; eos_token_id=2 is the value given for this model.
kwargs = {"do_sample": False, "num_beams": 4, "max_new_tokens": 128, "early_stopping": True, "eos_token_id": 2}
pipeline_ins = pipeline(Tasks.text_generation, model=model_dir)

result = pipeline_ins(input_text, **kwargs)
print(result)
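
The prompt in the example above is assembled by hand from the `<|user|>` and `<|assistant|>` tags. A small helper can keep that format in one place. The sketch below covers only the single-turn layout shown here; `build_prompt` is a hypothetical name, and no multi-turn format is described in this document.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(user_message: str) -> str:
    """Wrap one user turn in the tags used by the usage example.

    The trailing assistant tag and newline leave the prompt open so the
    model's generation continues as the assistant's reply.
    """
    return f"{USER_TAG}\n{user_message}\n{ASSISTANT_TAG}\n"


if __name__ == "__main__":
    prompt = build_prompt(
        "Beijing is the capital of China.\n"
        "Translate this sentence from English to Chinese."
    )
    print(prompt)
```

The returned string can be passed to the pipeline in place of the manually concatenated `input_text`.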



      title={PolyLM: An Open Source Polyglot Large Language Model}, 
      author={Xiangpeng Wei and Haoran Wei and Huan Lin and Tianhao Li and Pei Zhang and Xingzhang Ren and Mei Li and Yu Wan and Zhiwei Cao and Binbin Xie and Tianxiang Hu and Shangjie Li and Binyuan Hui and Bowen Yu and Dayiheng Liu and Baosong Yang and Fei Huang and Jun Xie},