Distilled Dual-Encoder Model for Vision-Language Understanding

Zekun Wang; Wenhui Wang; Haichao Zhu; Ming Liu; Bing Qin; Furu Wei

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks such as visual reasoning and visual question answering. Dual-encoder models offer faster inference than fusion-encoder models and allow image and text representations to be pre-computed. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. We also show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering tasks while running much faster at inference than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
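
The distillation objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the loss, so the KL divergence used here, the single attention head, and all names (`cross_modal_attention`, `attention_distillation_loss`, the `text_q`/`image_k` dictionary keys, the toy vectors) are assumptions made for illustration. The sketch computes text-to-image and image-to-text attention distributions for a teacher (fusion encoder) and a student (dual encoder) and penalizes the student when its distributions diverge from the teacher's.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys):
    """Scaled dot-product attention of one modality's queries over the
    other modality's keys; returns one distribution per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (assumed loss choice)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_distillation_loss(teacher, student):
    """Mean KL between teacher and student cross-modal attention rows.

    Both arguments are dicts with hypothetical keys 'text_q', 'text_k',
    'image_q', 'image_k', each a list of equal-length vectors.
    """
    total, rows = 0.0, 0
    # Text-to-image: text queries attend over image keys, and vice versa.
    for q_name, k_name in (("text_q", "image_k"), ("image_q", "text_k")):
        t_attn = cross_modal_attention(teacher[q_name], teacher[k_name])
        s_attn = cross_modal_attention(student[q_name], student[k_name])
        for t_row, s_row in zip(t_attn, s_attn):
            total += kl_divergence(t_row, s_row)
            rows += 1
    return total / rows


# Toy example: 2 text tokens, 3 image patches, hidden size 2 (made-up values).
teacher = {
    "text_q": [[1.0, 0.0], [0.0, 1.0]],
    "text_k": [[1.0, 0.0], [0.0, 1.0]],
    "image_q": [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]],
    "image_k": [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
}
# An untrained student with all-zero projections attends uniformly.
student = {name: [[0.0, 0.0] for _ in vecs] for name, vecs in teacher.items()}

loss = attention_distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the loss is zero when the student reproduces the teacher's cross-modal attention exactly and grows as the distributions diverge. A real training setup would compute it per attention head on batched tensors and add it to the task loss.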

updated: Thu Dec 16 2021 09:21:18 GMT+0000 (UTC)

published: Thu Dec 16 2021 09:21:18 GMT+0000 (UTC)

arXiv
