Fast GraspNeXt: A Fast Self-Attention Neural Network Architecture for Multi-task Learning in Computer Vision Tasks for Robotic Grasping on the Edge

Alexander Wong; Yifan Wu; Saad Abbasi; Saeejith Nair; Yuhao Chen; Mohammad Javad Shafiee

Multi-task learning has shown considerable promise for improving the performance of deep learning-driven vision systems for robotic grasping. However, high architectural and computational complexity can make such systems poorly suited for deployment on the embedded devices typically used in robotic arms in real-world manufacturing and warehouse environments. As such, highly efficient multi-task deep neural network architectures tailored for computer vision tasks for robotic grasping on the edge are highly desired for widespread adoption in manufacturing environments. Motivated by this, we propose Fast GraspNeXt, a fast self-attention neural network architecture tailored for embedded multi-task learning in computer vision tasks for robotic grasping. To build Fast GraspNeXt, we leverage a generative network architecture search strategy with a set of architectural constraints customized to achieve a strong balance between multi-task learning performance and embedded inference efficiency. Experimental results on the MetaGraspNet benchmark dataset show that the Fast GraspNeXt network design achieves the highest performance (average precision (AP), accuracy, and mean squared error (MSE)) across multiple computer vision tasks when compared to other efficient multi-task network architecture designs, while having only 17.8M parameters (>5x smaller), requiring 259 GFLOPs (as much as >5x lower), and running as much as >3.15x faster on an NVIDIA Jetson TX2 embedded processor.

updated: Fri Apr 21 2023 18:07:14 GMT+0000 (UTC)

published: Fri Apr 21 2023 18:07:14 GMT+0000 (UTC)

arXiv
