MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath; Mannat Singh; Yann LeCun; Ishan Misra; Gabriel Synnaeve; Nicolas Carion

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.

updated: Mon Apr 26 2021 17:55:33 GMT+0000 (UTC)

published: Mon Apr 26 2021 17:55:33 GMT+0000 (UTC)

arXiv
