VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Shraman Pramanick; Li Jing; Sayan Nag; Jiachen Zhu; Hardik Shah; Yann LeCun; Rama Chellappa

Vision-language pre-training (VLP) has recently proven highly effective for various uni-modal and multi-modal downstream applications. However, most existing end-to-end VLP methods rely on high-resolution image-text-box data to perform well on fine-grained region-level tasks such as object detection, segmentation, and referring expression comprehension. Unfortunately, high-resolution images with accurate bounding-box annotations are expensive to collect and to use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that uses only image-caption data yet achieves fine-grained region-level image understanding, eliminating the need for expensive box annotations. VoLTA applies graph optimal transport-based weakly-supervised alignment to local image patches and text tokens, producing an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising coarse-grained downstream performance, often outperforming methods that use significantly more caption and box annotations.
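To make the idea of aligning local image patches with text tokens through optimal transport more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a cosine-distance cost, uniform marginals, and plain entropic optimal transport solved by Sinkhorn iterations; the graph-structure term of graph optimal transport is omitted. The names `cosine_cost` and `sinkhorn_plan`, the feature sizes, and the values of `eps` and `n_iters` are all hypothetical choices made for illustration.

```python
import numpy as np

def cosine_cost(patches, tokens):
    """Cost matrix C[i, j] = 1 - cosine similarity of patch i and token j."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return 1.0 - p @ t.T

def sinkhorn_plan(cost, eps=0.5, n_iters=300):
    """Entropic OT plan between uniform marginals (assumed, not from the paper)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass over image patches
    b = np.full(m, 1.0 / m)      # uniform mass over text tokens
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Toy stand-ins for patch and token embeddings (random, for illustration only).
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 16))   # 6 hypothetical image patches
tokens = rng.normal(size=(4, 16))    # 4 hypothetical text tokens

cost = cosine_cost(patches, tokens)
plan = sinkhorn_plan(cost)

# The transport cost could serve as an alignment loss; the plan itself is a
# normalized soft matching between patches and tokens.
ot_distance = float((plan * cost).sum())
```

In this sketch the entries of `plan` sum to one, which is one way to read the "self-normalized" and "interpretable" properties described above: each entry is the share of mass matched between a given patch and a given token.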

updated: Wed Feb 15 2023 05:34:21 GMT+0000 (UTC)

published: Sun Oct 09 2022 01:49:58 GMT+0000 (UTC)

arXiv
