Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Yui Iioka; Yu Yoshida; Yuiga Wada; Shumpei Hatanaka; Komei Sugiura

In this study, we aim to develop a model that comprehends a natural language instruction (e.g., "Go to the living room and get the nearest pillow to the radio art on the wall") and generates a segmentation mask for the target everyday object. The task is challenging because it requires (1) understanding the referring expressions for multiple objects in the instruction, (2) predicting the target phrase of the sentence among the multiple phrases, and (3) generating pixel-wise segmentation masks rather than bounding boxes. Existing language-based segmentation methods sometimes mask irrelevant regions when given complex sentences. In this paper, we propose the Multimodal Diffusion Segmentation Model (MDSM), which generates a mask in the first stage and refines it in the second stage. We introduce a crossmodal parallel feature extraction mechanism and extend diffusion probabilistic models to handle crossmodal features. To validate our model, we built a new dataset based on the well-known Matterport3D and REVERIE datasets. This dataset consists of instructions with complex referring expressions, paired with real indoor environment images that feature various target objects and with pixel-wise segmentation masks. MDSM surpassed the baseline method by a large margin of +10.13 mean IoU.
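The reported gain is measured in mean IoU (intersection over union). The abstract does not say how the metric is aggregated, so the following is only a minimal sketch, assuming that IoU is computed per sample on binary masks, averaged over samples, and reported on a 0-100 scale. The names `iou` and `mean_iou` are illustrative and do not come from the paper.

```python
from typing import Sequence

# A 2D binary mask: 1 marks a target-object pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average per-sample IoU, scaled to 0-100 (assumed reporting scale)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
    ground_truth = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
    # First sample: 1 shared pixel out of 2 in the union; second: exact match.
    print(mean_iou(predicted, ground_truth))
```

Under these assumptions, a difference of +10.13 corresponds to roughly ten percentage points of average mask overlap between methods.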

updated: Mon Jul 17 2023 16:07:07 GMT+0000 (UTC)

published: Mon Jul 17 2023 16:07:07 GMT+0000 (UTC)

arXiv
