Detect Only What You Specify : Object Detection with Linguistic Target

Moyuru Yamada

指定したものだけを検出する: 言語ターゲットを使用したオブジェクト検出

オブジェクト検出は、特定のイメージ内の対象オブジェクトごとに境界ボックスとカテゴリラベルのセットを予測するコンピュータービジョンタスクです。カテゴリは「犬」や「人」などの言語記号に関連しており、それらの間には関係があるはずです。ただし、オブジェクト検出器はカテゴリを分類することを学習するだけで、言語記号としては扱いません。マルチモーダルモデルでは、事前トレーニング済みのオブジェクト検出器を使用して画像からオブジェクトの特徴を抽出することがよくありますが、モデルは検出器から分離されており、抽出された視覚的特徴は言語入力によって変化しません。オブジェクト検出を視覚と言語の推論タスクとして再考します。次に、検出ターゲットが自然言語によって与えられ、タスクの目標が与えられた画像内のすべてのターゲットオブジェクトのみを検出することであるターゲット検出タスクを提案します。ターゲットが与えられていない場合、検出はありません。一般的に使用されている最新のオブジェクト検出器には、アンカーなどの多くの手作業で設計されたコンポーネントがあり、テキスト入力を複雑なパイプラインに融合することは困難です。したがって、最近提案されたトランスフォーマーベースの検出器に基づくターゲット検出用の言語ターゲット検出器 (LTD) を提案します。 LTD はエンコーダー/デコーダーアーキテクチャであり、条件付きデコーダーにより、モデルは、言語コンテキストとしてのテキスト入力を使用して、エンコードされた画像について推論できます。 COCOオブジェクト検出データセットでLTDの検出パフォーマンスを評価し、視覚オブジェクトへのテキスト入力接地によりモデルが検出結果を改善することも示します。

Object detection is a computer vision task of predicting a set of bounding boxes and category labels for each object of interest in a given image. The category is related to a linguistic symbol such as 'dog' or 'person' and there should be relationships among them. However the object detector only learns to classify the categories and does not treat them as the linguistic symbols. Multi-modal models often use the pre-trained object detector to extract object features from the image, but the models are separated from the detector and the extracted visual features does not change with their linguistic input. We rethink the object detection as a vision-and-language reasoning task. We then propose targeted detection task, where detection targets are given by a natural language and the goal of the task is to detect only all the target objects in a given image. There are no detection if the target is not given. Commonly used modern object detectors have many hand-designed components like anchor and it is difficult to fuse the textual inputs into the complex pipeline. We thus propose Language-Targeted Detector (LTD) for the targeted detection based on a recently proposed Transformer-based detector. LTD is a encoder-decoder architecture and our conditional decoder allows the model to reason about the encoded image with the textual input as the linguistic context. We evaluate detection performances of LTD on COCO object detection dataset and also show that our model improves the detection results with the textual input grounding to the visual object.

updated: Fri Nov 18 2022 07:28:47 GMT+0000 (UTC)

published: Fri Nov 18 2022 07:28:47 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト