What BERT Sees: Cross-Modal Transfer for Visual Question Generation

Thomas Scialom; Patrick Bordes; Paul-Alexis Dray; Jacopo Staiano; Patrick Gallinari

Pre-trained language models have recently contributed to significant advances in NLP tasks. More recently, multi-modal versions of BERT have been developed; they rely on heavy pre-training over vast corpora of aligned textual and image data and are primarily applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, avoiding any pre-training on supplementary data. We choose to study Visual Question Generation (VQG), a task of great interest for grounded dialog that makes it possible to study the impact of each modality, since the input can be visual and/or textual. Moreover, the generation aspect of the task requires an adaptation, since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation that can leverage either mono- or multi-modal representations. The results reported under different configurations indicate an innate capacity for BERT-gen to adapt to multi-modal data and text generation, even with little data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state of the art on two established VQG datasets.

updated: Wed Dec 16 2020 15:48:35 GMT+0000 (UTC)

published: Tue Feb 25 2020 12:44:36 GMT+0000 (UTC)

arXiv
