Multi-modal Transformers Excel at Class-agnostic Object Detection

Muhammad Maaz; Hanoona Rasheed; Salman Khan; Fahad Shahbaz Khan; Rao Muhammad Anwer; Ming-Hsuan Yang

What constitutes an object? This has been a long-standing question in computer vision. To answer it, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well to new domains or to unseen objects. In this paper, we argue that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViTs) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show that MViTs achieve state-of-the-art performance in localizing generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture that uses multi-scale feature processing and deformable self-attention and can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Further, MViTs offer enhanced interactability with intelligible text queries. Code: https://git.io/J1HPY.
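To make "class-agnostic" concrete: proposals are commonly judged only on whether they localize a ground-truth box, with category labels ignored. The following is a minimal sketch of such a recall measure, not the authors' evaluation code. The function names, the `(box, score)` proposal format, and the 0.5 IoU threshold are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(
    proposals: List[Tuple[Box, float]],
    gt_boxes: List[Box],
    iou_thresh: float = 0.5,      # assumed threshold
    top_k: Optional[int] = None,  # keep only the k highest-scoring proposals
) -> float:
    """Fraction of ground-truth boxes covered by at least one proposal.

    Class labels are never consulted: a ground-truth box counts as found
    if any proposal overlaps it with IoU >= iou_thresh.
    """
    if not gt_boxes:
        return 0.0
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    found = sum(
        1 for g in gt_boxes
        if any(iou(box, g) >= iou_thresh for box, _ in ranked)
    )
    return found / len(gt_boxes)


# Hypothetical data: two ground-truth objects, three scored proposals.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8), ((21, 21, 30, 30), 0.3)]
recall_all = class_agnostic_recall(props, gt)
recall_top2 = class_agnostic_recall(props, gt, top_k=2)
```

Restricting to the top-k proposals shows why proposal ranking matters: in this toy data the low-scored third proposal is the only one covering the second object, so the recall drops when only two proposals are kept.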

updated: Mon Nov 22 2021 18:59:29 GMT+0000 (UTC)

published: Mon Nov 22 2021 18:59:29 GMT+0000 (UTC)

arXiv
