CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

Nur Muhammad Mahi Shafiullah; Chris Paxton; Lerrel Pinto; Soumith Chintala; Arthur Szlam



We propose CLIP-Fields, an implicit scene model that can be trained with no direct human supervision. This model learns a mapping from spatial locations to semantic embedding vectors. The mapping can then be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization. Most importantly, the mapping can be trained with supervision coming only from models trained on web images and web text, such as CLIP, Detic, and Sentence-BERT. Compared to baselines like Mask-RCNN, our method performs better on few-shot instance identification or semantic segmentation on the HM3D dataset while using only a fraction of the examples. Finally, we show that robots using CLIP-Fields as a scene memory can perform semantic navigation in real-world environments. Our code and demonstrations are available here: https://mahis.life/clip-fields
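To make the idea of "semantic search over space" concrete, here is a minimal sketch, not the authors' implementation. It assumes the field has already been evaluated at a few sampled 3D points and stored as a lookup table; the names `TOY_FIELD`, `cosine`, and `semantic_search` and all of the three-dimensional embeddings are hypothetical and hand-written for illustration. In the setting the abstract describes, the mapping would be a learned implicit model and the query vector would come from a text encoder such as CLIP or Sentence-BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for a trained field: (x, y, z) -> embedding.
# Real embeddings would be high-dimensional and produced by the model.
TOY_FIELD = {
    (0.0, 0.0, 0.0): [0.9, 0.1, 0.0],
    (2.0, 1.0, 0.0): [0.1, 0.9, 0.1],
    (4.0, 3.0, 1.0): [0.0, 0.2, 0.9],
}


def semantic_search(field, query_embedding):
    """Return the sampled location whose embedding best matches the query."""
    return max(field, key=lambda point: cosine(field[point], query_embedding))


# Hand-written query vector standing in for an encoded text query.
query = [0.0, 0.1, 1.0]
best = semantic_search(TOY_FIELD, query)
print(best)
```

A robot using such a field as scene memory could treat the returned location as a navigation goal; the lookup table here only stands in for querying the learned mapping at candidate points.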

updated: Tue Nov 01 2022 22:56:39 GMT+0000 (UTC)

published: Tue Oct 11 2022 17:57:10 GMT+0000 (UTC)

arXiv

