Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Songsong Xiong; Georgios Tziafas; Hamidreza Kasaei

Robots operating in human-centered environments, such as retail stores, restaurants, and households, often need to distinguish between similar objects in different contexts with high accuracy. However, fine-grained object recognition remains a challenge in robotics because of high intra-category and low inter-category dissimilarity. In addition, the limited number of fine-grained 3D datasets makes the problem difficult to address effectively. In this paper, we propose a hybrid multi-modal approach that combines a Vision Transformer (ViT) with a Convolutional Neural Network (CNN) to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first consists of 20 restaurant-related categories with a total of 100 instances, while the second contains 120 shoe instances. We evaluated our approach on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving recognition accuracies of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. We have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Finally, we integrated the proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.

updated: Mon Mar 06 2023 15:45:44 GMT+0000 (UTC)

published: Mon Oct 03 2022 13:34:11 GMT+0000 (UTC)

arXiv
