Efficient Multi-Task Scene Analysis with RGB-D Transformers

Söhnke Benedikt Fischedick; Daniel Seichter; Robin Schmidt; Leonard Rabes; Horst-Michael Gross

Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given the limited computing and battery capabilities of mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated into a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension that enables a high degree of optimization for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference at up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.

updated: Thu Jun 08 2023 14:41:56 GMT+0000 (UTC)

published: Thu Jun 08 2023 14:41:56 GMT+0000 (UTC)

arXiv
