ASK: Adaptively Selecting Key Local Features for RGB-D Scene Recognition

Zhitong Xiong; Yuan Yuan; Qi Wang

Indoor scene images usually contain scattered objects and varied scene layouts, which makes RGB-D scene classification a challenging task. Existing methods are still limited when classifying scene images with large spatial variability. How to extract local patch-level features effectively using only image-level labels therefore remains an open problem for RGB-D scene recognition. In this paper, we propose an efficient framework for RGB-D scene recognition that adaptively selects important local features to capture the large spatial variability of scene images. Specifically, we design a differentiable local feature selection (DLFS) module, which can extract an appropriate number of key local scene-related features. With the DLFS module, discriminative local theme-level and object-level representations can be selected from the spatially correlated multi-modal RGB-D features. We take advantage of the correlation between the RGB and depth modalities to provide more cues for selecting local features. To ensure that discriminative local features are selected, we propose a variational mutual information maximization loss. Additionally, the DLFS module can be easily extended to select local features at different scales. By concatenating the local orderless and global structured multi-modal features, the proposed framework achieves state-of-the-art performance on public RGB-D scene recognition datasets.
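The abstract does not say how the DLFS module is made differentiable, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes a Gumbel-softmax relaxation as a stand-in for the selection step, a hypothetical linear scoring vector `weights`, and mean pooling as the orderless aggregation; all function names are illustrative, and the variational mutual information loss is not modelled.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    return -math.log(-math.log(u))


def select_local_features(rgb_feats, depth_feats, weights, k,
                          temperature=0.5, rng=None):
    """Softly pick k local features from spatially aligned RGB and depth features.

    rgb_feats, depth_feats: lists of N vectors, one per spatial location.
    weights: hypothetical scoring vector (assumption, not from the paper).
    Returns k vectors, each a convex combination of the N fused features.
    """
    rng = rng or random.Random(0)
    # Fuse the two modalities per location so that scoring sees both.
    fused = [list(r) + list(d) for r, d in zip(rgb_feats, depth_feats)]
    scores = [sum(w * x for w, x in zip(weights, f)) for f in fused]
    dim = len(fused[0])
    selected = []
    for _ in range(k):
        # Gumbel-softmax: perturb the scores, then apply a tempered softmax.
        noisy = [s + gumbel_noise(rng) for s in scores]
        probs = softmax(noisy, temperature)
        selected.append(
            [sum(p * f[j] for p, f in zip(probs, fused)) for j in range(dim)]
        )
    return selected


def scene_descriptor(selected, global_feat):
    """Mean-pool the selected local features (orderless) and append the global one."""
    dim = len(selected[0])
    pooled = [sum(v[j] for v in selected) / len(selected) for j in range(dim)]
    return pooled + list(global_feat)
```

A lower `temperature` pushes each soft selection closer to a single location, while a higher one blends more locations together. In a trainable version the scoring would be learned end to end with the classifier.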

updated: Thu Oct 14 2021 20:26:58 GMT+0000 (UTC)

published: Thu Oct 14 2021 20:26:58 GMT+0000 (UTC)

arXiv
