Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Ruiping Liu; Jiaming Zhang; Kunyu Peng; Junwei Zheng; Ke Cao; Yufan Chen; Kailun Yang; Rainer Stiefelhagen

Grounded Situation Recognition (GSR) recognizes and interprets visual scenes in a contextually intuitive way, yielding the salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on applying GSR to assist people with visual impairments (PVI). However, PVI often need precise localization information about detected objects to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of the involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance feature extraction and its interaction with the encoder-decoder structure, we construct our OpenSU system with a solid pure transformer backbone, which improves the performance of GSR. To accelerate convergence and thereby reduce training time, we replace all activation functions within the GSR decoders with GELU. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, field testing on dedicated assistive technology datasets and application demonstrations shows that the proposed OpenSU system can enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.
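
The abstract mentions replacing the decoder activation functions with GELU. As a rough illustration only, the sketch below evaluates the standard exact GELU definition, x * Phi(x) with Phi the standard normal CDF, next to ReLU. The abstract does not say which activation was replaced, so ReLU is an assumption used here purely for comparison, and this is not the authors' implementation.

```python
import math


def relu(x: float) -> float:
    """ReLU, used here only as an assumed baseline for comparison."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Unlike ReLU, GELU is smooth and lets small negative values through.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

For large positive inputs the two functions nearly coincide; they differ mainly around zero, where GELU is smooth and slightly negative for negative inputs.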

updated: Sat Jul 15 2023 09:41:27 GMT+0000 (UTC)

published: Sat Jul 15 2023 09:41:27 GMT+0000 (UTC)

arXiv
