UniT: Multimodal Multitask Learning with a Unified Transformer

Ronghang Hu; Amanpreet Singh

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
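The parameter-sharing layout described in the abstract (one encoder per input modality, a single decoder shared by all tasks, and a separate output head per task) can be illustrated with a toy sketch. This is not the UniT implementation from MMF: plain random linear maps stand in for transformer blocks, summing the encodings is only a placeholder for decoding over the encoded inputs, and summing per-task squared errors is an assumed form of the joint loss. The names `ToyUnifiedModel`, `joint_loss`, and the task labels `detection`, `language`, and `multimodal` are hypothetical.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix; a stand-in for a real network block."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, vec):
    """Matrix-vector product over plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyUnifiedModel:
    """Hypothetical toy model: per-modality encoders, one shared decoder, per-task heads."""

    def __init__(self, modality_dims, task_specs, hidden=8, seed=0):
        rng = random.Random(seed)
        # One encoder per input modality.
        self.encoders = {m: linear(d, hidden, rng) for m, d in modality_dims.items()}
        # A single decoder whose parameters are shared by every task.
        self.decoder = linear(hidden, hidden, rng)
        # One output head per task; task_specs maps task -> (modalities, output size).
        self.heads = {t: linear(hidden, n_out, rng) for t, (_, n_out) in task_specs.items()}
        self.task_modalities = {t: mods for t, (mods, _) in task_specs.items()}

    def forward(self, task, inputs):
        # Encode only the modalities this task uses.
        encoded = [apply(self.encoders[m], inputs[m]) for m in self.task_modalities[task]]
        # Assumption: element-wise sum as a placeholder for combining the encodings.
        fused = [sum(vals) for vals in zip(*encoded)]
        decoded = apply(self.decoder, fused)  # shared across tasks
        return apply(self.heads[task], decoded)  # task-specific


def joint_loss(model, batches):
    """Assumed joint objective: sum of per-task squared errors.

    batches maps task name -> (inputs dict, target list).
    """
    total = 0.0
    for task, (inputs, target) in batches.items():
        pred = model.forward(task, inputs)
        total += sum((p - y) ** 2 for p, y in zip(pred, target))
    return total


model = ToyUnifiedModel(
    modality_dims={"image": 6, "text": 4},
    task_specs={
        "detection": (("image",), 5),
        "language": (("text",), 2),
        "multimodal": (("image", "text"), 3),
    },
)
image = [0.5] * 6
text = [1.0] * 4
det = model.forward("detection", {"image": image})
vqa = model.forward("multimodal", {"image": image, "text": text})
loss = joint_loss(
    model,
    {
        "detection": ({"image": image}, [0.0] * 5),
        "multimodal": ({"image": image, "text": text}, [0.0] * 3),
    },
)
```

Every call to `forward` passes through the same `decoder` weights regardless of task, while only the encoders and the head vary. That is the sense in which parameters are shared across tasks rather than fine-tuned into separate task-specific models.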

updated: Wed Aug 18 2021 06:28:52 GMT+0000 (UTC)

published: Mon Feb 22 2021 04:45:06 GMT+0000 (UTC)

arXiv
