MiniVLM: A Smaller and Faster Vision-Language Model

Jianfeng Wang; Xiaowei Hu; Pengchuan Zhang; Xiujun Li; Lijuan Wang; Lei Zhang; Jianfeng Gao; Zicheng Liu

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value yet remains less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which, like its larger counterpart, can be fine-tuned to good performance on various downstream tasks. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, which reduces the time cost of visual feature extraction by 95% compared to a baseline model. After comparing different compact BERT models, we adopt the MiniLM structure to reduce the computation cost of the transformer module. In addition, we improve MiniVLM pre-training by adding 7M Open Images data pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large captioning and tagging models are used offline, so they add no overhead to fine-tuning or inference. With the above design choices, MiniVLM reduces the model size by 73% and the inference time cost by 94% while retaining 94-97% of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of state-of-the-art VL research in on-the-edge applications.
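
The percentage reductions quoted in the abstract can be easier to compare when expressed as "times smaller" or "times faster" factors. The sketch below is only illustrative arithmetic on the three figures stated above (95%, 73%, 94%); the function name `reduction_to_factor` and the dictionary labels are hypothetical and do not come from the paper or its code.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into a 'times smaller/faster' factor.

    A reduction of r% leaves (100 - r)% of the original cost,
    so the factor is 100 / (100 - r).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


# Figures quoted in the abstract; the labels are illustrative only.
reported_reductions = {
    "visual feature extraction time": 95.0,
    "model size": 73.0,
    "end-to-end inference time": 94.0,
}

for name, pct in reported_reductions.items():
    factor = reduction_to_factor(pct)
    print(f"{name}: {pct:.0f}% reduction is about {factor:.1f}x")
```

Under this reading, a 95% cut in feature extraction time corresponds to roughly a 20x speedup, a 73% cut in model size to roughly 3.7x fewer parameters, and a 94% cut in inference time to roughly a 16.7x speedup.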

updated: Sun Dec 13 2020 03:02:06 GMT+0000 (UTC)

published: Sun Dec 13 2020 03:02:06 GMT+0000 (UTC)

arXiv
