MVT: Multi-view Vision Transformer for 3D Object Recognition

Shuo Chen; Tan Yu; Ping Li

Inspired by the great success of CNNs in image recognition, view-based methods have applied CNNs to model the projected views of 3D objects and achieved excellent performance. Nevertheless, multi-view CNN models cannot model the communication between patches from different views, which limits their effectiveness in 3D object recognition. Inspired by the recent success of vision Transformers in image recognition, we propose a Multi-view Vision Transformer (MVT) for 3D object recognition. Since each patch feature in a Transformer block has a global receptive field, the model naturally achieves communication between patches from different views. Meanwhile, it carries much less inductive bias than its CNN counterparts. Considering both effectiveness and efficiency, we develop a global-local structure for our MVT. Our experiments on two public benchmarks, ModelNet40 and ModelNet10, demonstrate the competitive performance of our MVT.
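The claim that a global receptive field lets patches from different views communicate can be illustrated with a small sketch. This is not the authors' architecture: the function names, the identity query/key projections, the random patch features, and the reading of "local" as attention restricted to a single view are all assumptions made for illustration. It only shows that when patch tokens from all views are placed in one sequence, every token assigns non-zero attention weight to tokens from other views, whereas per-view attention cannot.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(tokens):
    """Scaled dot-product attention weights.

    Assumption: queries and keys are the tokens themselves (identity
    projections), which keeps the sketch free of learned parameters.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    return weights


def make_views(num_views, patches_per_view, dim, seed=0):
    """Hypothetical stand-in for patch embeddings: random Gaussian features."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(patches_per_view)]
        for _ in range(num_views)
    ]


def cross_view_mass(weights, patches_per_view):
    """For each token, the share of attention placed on patches of other views."""
    shares = []
    for i, row in enumerate(weights):
        start = (i // patches_per_view) * patches_per_view
        same_view = sum(row[start:start + patches_per_view])
        shares.append(1.0 - same_view)
    return shares


PATCHES_PER_VIEW = 4
views = make_views(num_views=3, patches_per_view=PATCHES_PER_VIEW, dim=8)

# Global attention: tokens from all views share one sequence.
global_tokens = [patch for view in views for patch in view]
global_weights = self_attention_weights(global_tokens)
global_cross = cross_view_mass(global_weights, PATCHES_PER_VIEW)

# Per-view attention (assumed "local" variant): each view is attended separately,
# so no weight can fall on a patch from another view.
local_weights = [self_attention_weights(view) for view in views]
```

In this sketch the per-view variant computes three 4x4 weight matrices instead of one 12x12 matrix, which hints at the effectiveness and efficiency trade-off the abstract mentions; how MVT actually combines the two is not specified in the abstract.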

updated: Mon Oct 25 2021 16:23:25 GMT+0000 (UTC)

published: Mon Oct 25 2021 16:23:25 GMT+0000 (UTC)

arXiv
