OFASys Multi-Modal Multi-Task Pretrained Model - English - General Domain - Base
OFASys is a multi-modal multi-task learning system designed to make multi-modal tasks declarative, modular, and task-scalable. With OFASys, you can easily: 1. introduce a new multi-modal task/dataset by defining a declarative one-line instruction; 2. develop new, or reuse existing, modality-specific components; 3. jointly train multiple multi-modal tasks without manually handling multi-modal data collation.




Documentation | Paper | Blog | GitHub



What is OFASys

OFASys is an open-source AI library for unified multi-modal multi-task learning, developed by the M6 team at DAMO Academy. With this system, we trained a single model, OFA+, which for the first time supports unified training and inference across 7 modalities (including image, text, speech, video, and motion) and more than 20 multi-modal tasks.

Getting started with OFASys

Setup

Note: OFASys is still iterating rapidly and currently implements the ModelScope interface in a relatively standalone way, so it needs to be installed into its own environment.
  • First, install ModelScope and OFASys:
# modelscope is pre-installed in ModelScope notebooks, so the next line can be skipped there
# !pip install modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
!pip install https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/ofasys/ofasys-0.1.0-py3-none-any.whl
  • Then load the model:
from ofasys import ms_wrapper
from modelscope.pipelines import pipeline
pipe = pipeline('my-ofasys-task', model="damo/ofasys_multimodal_multitask_pretrain_base_en", model_revision='v1.0.0')
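Every task below is driven by a one-line instruction: modality slots such as [IMAGE:img] or [TEXT:cap] name the inputs before the -> and the outputs after it, and the data dict supplies a value for each named input slot. The sketch below shows how such an instruction decomposes into slots; parse_slots is a hypothetical helper written here for illustration only, not part of the OFASys API.

```python
import re

def parse_slots(instruction):
    """Split an OFASys-style instruction on '->' and list the
    (MODALITY, slot_name) pairs on each side. Illustration only."""
    src, tgt = instruction.split('->')
    pattern = r'\[([A-Z]+)(?::(\w+))?'  # matches e.g. [IMAGE:img] or [IMAGE,...]
    return {
        'inputs': re.findall(pattern, src),
        'outputs': re.findall(pattern, tgt),
    }

slots = parse_slots('[IMAGE:img] <BOS> what does the image describe? <EOS> '
                    '-> <BOS> [TEXT:cap] <EOS>')
print(slots['inputs'])   # [('IMAGE', 'img')]
print(slots['outputs'])  # [('TEXT', 'cap')]
```

Here the IMAGE slot named img must appear as a key in data, and the TEXT slot named cap is what the model generates.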

Exploring the tasks

Image Captioning

instruction = '[IMAGE:img] <BOS> what does the image describe? <EOS> -> <BOS> [TEXT:cap] <EOS>'
data = {'img': "https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/ofasys/ic.jpeg"}
output = pipe(data, instruction=instruction)
print(output.text) # "a man and woman sitting in front of a laptop computer"

Visual Grounding

instruction = '[IMAGE:img] <BOS> which region does the text " [TEXT:cap] " describe? <EOS> -> [BOX:patch_boxes,add_bos,add_eos]'
data = {'img': "https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/ofasys/vg.jpg", "cap": "hand"}
output = pipe(data, instruction=instruction)
output.save_box("output.jpg")

Text Summarization

instruction = '<BOS> what is the summary of article " [TEXT:src] "? <EOS> -> <BOS> [TEXT:tgt] <EOS>'
data = {'src': "poland 's main opposition party tuesday endorsed president lech walesa in an upcoming "
        "presidential run-off election after a reformed communist won the first round of voting ."}
output = pipe(data, instruction=instruction)
print(output.text) # "polish opposition endorses walesa in presidential run-off"

Table-to-Text Generation

instruction = '<BOS> structured knowledge: " [STRUCT:database,uncased] "  . how to describe the tripleset ? <EOS> -> <BOS> [TEXT:tgt] <EOS>'
data = {
    'database': [
        ['Atlanta', 'OFFICIAL_POPULATION', '5,457,831'],
        ['[TABLECONTEXT]', 'METROPOLITAN_AREA', 'Atlanta'],
        ['5,457,831', 'YEAR', '2012'],
        ['[TABLECONTEXT]', '[TITLE]', 'List of metropolitan areas by population'],
        ['Atlanta', 'COUNTRY', 'United States'],
    ]
}
output = pipe(data, instruction=instruction, beam_size=1)
print(output.text) # "atlanta is the metropolitan area in the united states in 2012."

Text-to-SQL Generation

instruction = '<BOS> " [TEXT:src] " ; structured knowledge: " [STRUCT:database,max_length=876] " . generating sql code. <EOS> -> <BOS> [TEXT:tgt] <EOS>'
database = [
    ['concert_singer'],
    ['stadium', 'stadium_id , location , name , capacity , highest , lowest , average'],
    ['singer', 'singer_id , name , country , song_name , song_release_year , age , is_male'],
    ['concert', 'concert_id , concert_name , theme , stadium_id , year'],
    ['singer_in_concert', 'concert_id , singer_id']
]
data = [
    {'src': 'What are the names, countries, and ages for every singer in descending order of age?', 'database': database},
    {'src': 'What are all distinct countries where singers above age 20 are from?', 'database': database},
    {'src': 'What are the locations and names of all stations with capacity between 5000 and 10000?', 'database': database}
]
output = pipe(data, instruction=instruction)
print('\n'.join([o.text for o in output]))
# "select name, country, age from singer order by age desc"
# "select distinct country from singer where age > 20"
# "select location, name from stadium where capacity between 5000 and 10000"
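Since the outputs are plain SQL strings, one way to sanity-check them is to execute them against a toy in-memory SQLite instance of the schema above. This is a sketch for illustration only: the column types are guesses (OFASys only sees column names), and OFASys itself never executes the SQL.

```python
import sqlite3

# In-memory SQLite database mirroring the concert_singer tables used above.
# Column types are illustrative guesses.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE stadium (stadium_id INTEGER, location TEXT, name TEXT,
                          capacity INTEGER, highest INTEGER, lowest INTEGER,
                          average INTEGER);
    CREATE TABLE singer (singer_id INTEGER, name TEXT, country TEXT,
                         song_name TEXT, song_release_year TEXT,
                         age INTEGER, is_male INTEGER);
''')

generated_sql = [
    "select name, country, age from singer order by age desc",
    "select distinct country from singer where age > 20",
    "select location, name from stadium where capacity between 5000 and 10000",
]
for sql in generated_sql:
    conn.execute(sql)  # raises sqlite3.OperationalError if the SQL is invalid
print("all generated queries parsed and executed")
```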

Video Captioning

instruction = '[VIDEO:video] <BOS> what does the video describe? <EOS> -> <BOS> [TEXT:cap] <EOS>'
data = {'video': 'https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/ofasys/video7021.mp4'}
output = pipe(data, instruction=instruction)
print(output.text) # "a baseball player is hitting a ball"

Speech-to-Text Generation

instruction = '[AUDIO:wav] <BOS> what is the text corresponding to the voice? <EOS> -> [TEXT:text,preprocess=text_phone,add_bos,add_eos]'
data = {'wav': 'https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/ofasys/1272-128104-0001.flac'}
output = pipe(data, instruction=instruction)
print(output.text) # "nor is mister klohs manner less interesting than his manner"
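When reference transcripts are available, word error rate (WER) is the standard way to gauge transcription quality. Below is a minimal sketch of the metric, an illustrative helper written here and not part of the OFASys API:

```python
def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance divided by reference length.
    Simple illustrative implementation."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("his manner", "his matter"))  # 0.5
```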

Text-to-Image Generation

instruction = 'what is the complete image? caption: [TEXT:text]"? -> [IMAGE,preprocess=image_vqgan,adaptor=image_vqgan]'
data = {'text': "a city with tall buildings and a large green park."}
output = pipe(data, instruction=instruction)
output[0].save_image('0.png')

Limitations and possible biases

The training data has its own limitations and may introduce biases; please evaluate the model on your own use case before deciding how to deploy it.

Related papers and citation

If you find OFASys useful and like our work, please cite:

@article{bai2022ofasys,
  author    = {
      Jinze Bai and 
      Rui Men and 
      Hao Yang and 
      Xuancheng Ren and 
      Kai Dang and 
      Yichang Zhang and 
      Xiaohuan Zhou and 
      Peng Wang and 
      Sinan Tan and 
      An Yang and 
      Zeyu Cui and 
      Yu Han and 
      Shuai Bai and 
      Wenbin Ge and 
      Jianxin Ma and 
      Junyang Lin and 
      Jingren Zhou and 
      Chang Zhou},
  title     = {OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models},
  journal   = {CoRR},
  volume    = {abs/2212.04408},
  year      = {2022}
}