YORO -- Lightweight End to End Visual Grounding

Chih-Hui Ho; Srikar Appalaraju; Bhavan Jasani; R. Manmatha; Nuno Vasconcelos

We present YORO, a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens and predicts the coordinates of the referred object, using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets with ablations on architecture design choices. YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.
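To make the single-encoder input described above more concrete, the following is a minimal sketch of how the three token groups (text query, image patches, detection tokens) might be laid out in one joint sequence. It only tracks index ranges, not embeddings or attention. The names `SequenceLayout`, `count_patches`, and `build_layout`, the concatenation order, and the example sizes (a 384x384 image, 32x32 patches, 12 text tokens, 5 detection tokens) are all illustrative assumptions, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class SequenceLayout:
    """Index ranges of each token group inside the joint encoder input."""
    text: range
    patches: range
    det: range

    @property
    def length(self) -> int:
        # Total number of tokens the single encoder would attend over.
        return len(self.text) + len(self.patches) + len(self.det)


def count_patches(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that tile the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return (height // patch_size) * (width // patch_size)


def build_layout(num_text_tokens: int, height: int, width: int,
                 patch_size: int, num_det_tokens: int) -> SequenceLayout:
    """Lay out [text | patches | detection tokens] in one sequence.

    The ordering is an assumption made for illustration only.
    """
    n_patches = count_patches(height, width, patch_size)
    text = range(0, num_text_tokens)
    patches = range(text.stop, text.stop + n_patches)
    det = range(patches.stop, patches.stop + num_det_tokens)
    return SequenceLayout(text=text, patches=patches, det=det)


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
layout = build_layout(num_text_tokens=12, height=384, width=384,
                      patch_size=32, num_det_tokens=5)
print(layout.length, layout.det)
```

In a design of this kind, the encoder outputs at the detection-token positions (`layout.det` here) would be passed to a small prediction head that regresses the box coordinates, while the text and patch positions are where an alignment loss between words and image regions could be applied.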

updated: Tue Nov 15 2022 05:34:40 GMT+0000 (UTC)

published: Tue Nov 15 2022 05:34:40 GMT+0000 (UTC)

arXiv
