Unifying Vision-and-Language Tasks via Text Generation

Jaemin Cho; Jie Lei; Hao Tan; Mohit Bansal

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at https://github.com/j-min/VL-T5
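
To make the idea of "generating labels in text" concrete, the following is a minimal sketch of how examples from different tasks could be cast into one shared (input text, target text) format. It is an illustration under stated assumptions, not the authors' implementation: the task prefixes ("vqa:", "grounding:", "caption:"), the region token format "<region_i>", and all function names are hypothetical, and the image is represented only by a count of region features rather than by real visual inputs.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One example in a shared text-to-text format (hypothetical structure)."""
    source_text: str   # text given to the model together with the visual input
    target_text: str   # the label, written out as plain text to be generated
    num_regions: int   # placeholder for the image: number of region features


def make_vqa(question: str, answer: str, num_regions: int) -> Example:
    # The answer is a string to generate, not an index into a fixed answer set.
    # The "vqa:" prefix is an assumption made for this sketch.
    return Example(f"vqa: {question}", answer, num_regions)


def make_grounding(phrase: str, region_index: int, num_regions: int) -> Example:
    # The referred region is expressed as a text token instead of a region score.
    # The "<region_i>" token format is an assumption made for this sketch.
    if not 0 <= region_index < num_regions:
        raise ValueError("region_index must refer to an existing region")
    return Example(f"grounding: {phrase}", f"<region_{region_index}>", num_regions)


def make_caption(caption: str, num_regions: int) -> Example:
    # Captioning already has a text target; only the prefix is added.
    return Example("caption:", caption, num_regions)


if __name__ == "__main__":
    batch = [
        make_vqa("what is the man holding?", "umbrella", 36),
        make_grounding("the yellow cat", 3, 36),
        make_caption("a man walks a dog", 36),
    ]
    for ex in batch:
        print(f"{ex.source_text!r} -> {ex.target_text!r}")
```

Because every target is a string, a single text decoder trained with one language modeling loss could in principle consume a mixed batch like this one, which is the property the abstract relies on for multi-task learning with a single set of parameters.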

updated: Sun May 23 2021 23:12:46 GMT+0000 (UTC)

published: Thu Feb 04 2021 17:59:30 GMT+0000 (UTC)

arXiv
