Semantic-guided context modeling for indoor scene recognition

Chuanxin Song; Hanbo Wu; Xin Ma; Yibin Li

Exploring the semantic context in scene images is essential for indoor scene recognition. However, because spatial layouts vary widely within a class and the same objects appear across different classes, modeling contextual relationships that adapt to varied image characteristics remains a significant challenge. Existing contextual modeling methods for indoor scene recognition exhibit two limitations: 1) During training, space-independent information, such as color, may hinder optimization of the network's capacity to represent spatial context. 2) These methods often overlook the differences in coexisting objects across different scenes, which degrades scene recognition performance. To address these limitations, we propose SpaCoNet, a novel approach that simultaneously models the Spatial relation and Co-occurrence of objects based on semantic segmentation. First, the semantic spatial relation module (SSRM) is designed to explore the spatial relations among objects within a scene. With the help of semantic segmentation, this module decouples the spatial information from the image, effectively avoiding the influence of irrelevant features. Second, both spatial context features from the SSRM and deep features from an RGB feature extractor are used to distinguish coexisting objects across different scenes. Finally, using these discriminative features, we employ a self-attention mechanism to explore the long-range co-occurrence relationships among objects and generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three publicly available datasets demonstrate the effectiveness and generality of the proposed method. The code will be made publicly available after the blind-review process is completed.
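
As a rough illustration of the self-attention step mentioned in the abstract, the sketch below applies generic scaled dot-product self-attention to a handful of per-object feature vectors. It is not the SpaCoNet implementation, which the abstract does not specify. The function names, the identity query/key/value projections (a real model would learn these), and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: list of N vectors (one per object region), each of length d.
    Returns (outputs, weights), where weights[i][j] is how strongly
    region i attends to region j.
    Assumption: queries, keys, and values are the raw features; a trained
    network would use learned projection matrices instead.
    """
    d = len(features[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in features:
        # Similarity of this region's query to every region's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all value vectors.
        outputs.append(
            [sum(w[j] * features[j][c] for j in range(len(features)))
             for c in range(d)]
        )
    return outputs, weights


# Hypothetical object features: two similar regions and one dissimilar one.
object_features = [
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.8, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
outputs, weights = self_attention(object_features)
```

Each row of `weights` sums to 1, and regions with similar features attend to each other more strongly than to dissimilar ones. This is the sense in which attention can relate objects regardless of how far apart they are in the image.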

updated: Mon May 22 2023 03:04:22 GMT+0000 (UTC)

published: Mon May 22 2023 03:04:22 GMT+0000 (UTC)

arXiv
