The CORSMAL benchmark for the prediction of the properties of containers

Alessio Xompero; Santiago Donaher; Vladimir Iashin; Francesca Palermo; Gökhan Solak; Claudio Coppola; Reina Ishikawa; Yuichi Nagao; Ryo Hachiuma; Qi Liu; Fan Feng; Chuanlin Lan; Rosa H. M. Chan; Guilherme Christmann; Jyun-Ting Song; Gonuguntla Neeharika; Chinnakotla Krishna Teja Reddy; Dinesh Jain; Bakhtawar Ur Rehman; Andrea Cavallaro

The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key prerequisites for safe human-to-robot handovers. However, the opacity or transparency of the container and the content, and the variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks, and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and of audio-only or vision-only baselines designed from related works. Based on this analysis, we conclude that audio-only and audio-visual classifiers are suitable for estimating the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable for determining the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves weighted average F1-scores of up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement in the design of new methods. New methods can be ranked and compared on the individual leaderboards provided by our open framework.
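The abstract names two generic building blocks, a majority voting strategy and a weighted average F1-score, without defining them here. The sketch below illustrates the common textbook versions of both: per-class F1 averaged with weights proportional to class support, and a majority vote over per-frame (or per-window) predictions. The function names and the class labels are hypothetical, and the benchmark's own definitions may differ in detail.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1  # weight each class by its support
    return score


def majority_vote(frame_predictions):
    """Return the label predicted most often across frames or windows."""
    return Counter(frame_predictions).most_common(1)[0][0]


# Hypothetical content-type labels, for illustration only.
y_true = ["water", "water", "rice", "pasta"]
y_pred = ["water", "rice", "rice", "pasta"]
print(round(weighted_f1(y_true, y_pred), 4))
print(majority_vote(["rice", "water", "rice"]))
```

Weighting by support means that classes with more recordings contribute more to the final score, which matters when the classes in a dataset are unbalanced.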

updated: Thu Apr 21 2022 11:17:22 GMT+0000 (UTC)

published: Tue Jul 27 2021 10:36:19 GMT+0000 (UTC)

arXiv
