Generative Partial Visual-Tactile Fused Object Clustering

Tao Zhang; Yang Cong; Gan Sun; Jiahua Dong; Yuyang Liu; Zhengming Ding

Visual-tactile fused sensing for object clustering has made significant progress recently, since involving the tactile modality can effectively improve clustering performance. However, the missing data (i.e., partial data) issue always arises because of occlusion and noise during data collection. Most existing partial multi-view clustering methods do not solve this issue well, owing to the heterogeneous modality challenge. Naively applying these methods would inevitably have a negative effect and further hurt performance. To address these challenges, we propose a Generative Partial Visual-Tactile Fused (GPVTF) framework for object clustering. More specifically, we first extract partial visual and tactile features from the partial visual and tactile data, respectively, and encode the extracted features in modality-specific feature subspaces. A conditional cross-modal clustering generative adversarial network is then developed to synthesize one modality conditioned on the other, which compensates for missing samples and naturally aligns the visual and tactile modalities through adversarial learning. Finally, two pseudo-label based KL-divergence losses are employed to update the corresponding modality-specific encoders. Extensive comparative experiments on three public visual-tactile datasets demonstrate the effectiveness of our method.
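The abstract does not spell out the form of the pseudo-label based KL-divergence loss. The sketch below illustrates one common construction, assumed here purely for illustration: soft cluster assignments from a Student's t kernel, sharpened into pseudo-label targets, and compared with a KL divergence. The function names, the kernel, the sharpening rule, and the toy features and centroids are all hypothetical and are not taken from the paper.

```python
import math

# Assumption: soft assignments use a Student's t kernel between encoded
# features and cluster centroids; the paper may use a different similarity.
def soft_assignments(features, centroids, alpha=1.0):
    rows = []
    for z in features:
        sims = []
        for c in centroids:
            d2 = sum((zi - ci) ** 2 for zi, ci in zip(z, c))
            sims.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(sims)
        rows.append([s / total for s in sims])  # each row sums to 1
    return rows

# Assumption: pseudo-label targets sharpen q by squaring it and dividing
# by the soft cluster size, then renormalizing each row.
def pseudo_label_targets(q):
    k = len(q[0])
    cluster_mass = [sum(row[j] for row in q) for j in range(k)]
    targets = []
    for row in q:
        w = [row[j] ** 2 / cluster_mass[j] for j in range(k)]
        total = sum(w)
        targets.append([wj / total for wj in w])
    return targets

# KL(P || Q) summed over all samples and clusters; eps guards log(0).
def kl_divergence(p, q, eps=1e-12):
    return sum(
        pj * math.log((pj + eps) / (qj + eps))
        for p_row, q_row in zip(p, q)
        for pj, qj in zip(p_row, q_row)
    )

# Toy encoded features for one modality (hypothetical values).
features = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0]]
centroids = [[0.0, 0.0], [2.0, 2.0]]

q = soft_assignments(features, centroids)
p = pseudo_label_targets(q)
loss = kl_divergence(p, q)
```

In a two-modality setting, one such loss would be computed per encoder (visual and tactile), each on its own features. Minimizing it pulls the soft assignments toward the sharper targets, which is the sense in which the targets act as pseudo-labels.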

updated: Mon Dec 28 2020 02:37:03 GMT+0000 (UTC)

published: Mon Dec 28 2020 02:37:03 GMT+0000 (UTC)

arXiv
