Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

Mengmeng Xu; Yanghao Li; Cheng-Yang Fu; Bernard Ghanem; Tao Xiang; Juan-Manuel Perez-Rua

This paper addresses the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify serious implicit biases in current query-conditioned model design and in visual query datasets. We then tackle these biases directly at both the frame level and the object-set level. Concretely, our method expands the limited annotations and dynamically drops object proposals during training. Additionally, we propose a novel transformer-based module that takes the context of the object-proposal set into account while incorporating query information. We name this module the Conditioned Contextual Transformer, or CocoFormer. Our experiments show that the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Specifically, we improve frame-level detection performance from 26.28% to 31.26% AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks, respectively, of the 2nd Ego4D challenge. We also show the relevance of the proposed model to the Few-Shot Detection (FSD) task, where it likewise achieves state-of-the-art results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr.
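
The abstract mentions dynamically dropping object proposals during training but gives no details. The following is a minimal sketch of one way such a step could look, not the authors' implementation: it assumes each proposal carries a positive/negative label, always keeps positives, and keeps each negative with a fixed probability. The function name `drop_proposals`, the `keep_prob` parameter, and the box format are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2). Not taken from the paper.
Box = Tuple[float, float, float, float]


def drop_proposals(
    boxes: Sequence[Box],
    is_positive: Sequence[bool],
    keep_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Box], List[bool]]:
    """Randomly drop negative proposals; positives are always kept.

    Assumption: the drop rule (keep each negative independently with
    probability keep_prob) is illustrative only.
    """
    if len(boxes) != len(is_positive):
        raise ValueError("boxes and is_positive must have the same length")
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be in [0, 1]")

    rng = rng or random.Random()
    kept_boxes: List[Box] = []
    kept_labels: List[bool] = []
    for box, positive in zip(boxes, is_positive):
        # Keep every positive; sample negatives with probability keep_prob.
        if positive or rng.random() < keep_prob:
            kept_boxes.append(box)
            kept_labels.append(positive)
    return kept_boxes, kept_labels
```

Calling the function anew at every training step, with a fresh random draw, would give the proposal set a different composition each time. That is one plausible reading of "dynamically" here.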

updated: Fri Nov 18 2022 22:50:50 GMT+0000 (UTC)

published: Fri Nov 18 2022 22:50:50 GMT+0000 (UTC)

arXiv
