X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

Zhaowei Cai; Gukyeong Kwon; Avinash Ravichandran; Erhan Bas; Zhuowen Tu; Rahul Bhotika; Stefano Soatto

In this paper, we study challenging instance-wise vision-language tasks, in which free-form language must be aligned with individual objects rather than with the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and a vision-language alignment module. The vision and language streams remain independent until the end, where they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, so the detector is optimized for the vision-language tasks rather than used as an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other, weaker types of supervision to expand the knowledge coverage. This simple yet effective architecture shows good accuracy and fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at about 20 frames per second, without using any LVIS annotation during training.
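
The dot-product alignment between the two independent streams can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`l2_normalize`, `align`, `best_match`), the L2 normalization step, and the toy three-dimensional embeddings are all assumptions made for illustration. The abstract states only that object and language representations are aligned with a dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed step, not stated in the abstract)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def align(object_embeddings, text_embeddings):
    """Return a [num_objects][num_texts] matrix of dot-product similarities.

    The two inputs are produced independently (detector vs. language encoder)
    and only meet here, in a single dot product per object-text pair.
    """
    objs = [l2_normalize(o) for o in object_embeddings]
    texts = [l2_normalize(t) for t in text_embeddings]
    return [[sum(a * b for a, b in zip(o, t)) for t in texts] for o in objs]


def best_match(similarity):
    """For each object, return the index of the highest-scoring text query."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


# Hypothetical hand-made embeddings: 2 detected objects, 3 text queries.
object_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

similarity = align(object_embeddings, text_embeddings)
matches = best_match(similarity)
print(matches)
```

Because the streams do not interact before this step, the text embeddings in such a design could plausibly be computed once and reused across images, which is one way a large category vocabulary might stay cheap at inference time.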

updated: Tue Apr 12 2022 08:34:42 GMT+0000 (UTC)

published: Tue Apr 12 2022 08:34:42 GMT+0000 (UTC)

arXiv
