Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation

Daniel Dworakowski; Goldie Nejat

Contextual information in human environments, such as signs, symbols, and objects, provides important cues that robots can use for exploration and navigation. To identify and segment contextual information from complex images obtained in these environments, data-driven methods such as Convolutional Neural Networks (CNNs) are used. However, these methods require large amounts of human-labeled data, which are time-consuming to obtain. Weakly supervised methods address this limitation by generating pseudo segmentation labels (PSLs). In this paper, we present the novel Weakly Supervised Mask Data Distillation (WeSuperMaDD) architecture for autonomously generating PSLs using CNNs not specifically trained for the task of context segmentation, i.e., CNNs trained for object classification, image captioning, etc. WeSuperMaDD uniquely generates PSLs using image features learned from sparse and limited-diversity data, which are common in robot navigation tasks in human-centered environments (malls, grocery stores). Our proposed architecture uses a new mask refinement system which automatically searches for the PSL with the fewest foreground pixels that satisfies cost constraints. This removes the need for handcrafted heuristic rules. Extensive experiments validated the performance of WeSuperMaDD in generating PSLs for datasets with text of various scales, fonts, and perspectives in multiple indoor and outdoor environments. A comparison with the Naive, GrabCut, and Pyramid methods found a significant improvement in label and segmentation quality. Moreover, a context segmentation CNN trained using the WeSuperMaDD architecture achieved measurable improvements in accuracy compared to one trained with Naive PSLs. Our method also performed comparably to existing state-of-the-art text detection and segmentation methods on real datasets, without requiring segmentation labels for training.
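The abstract's idea of searching for the mask with the fewest foreground pixels that still satisfies a cost constraint can be illustrated with a small sketch. This is not the authors' implementation: it assumes the candidate masks come from thresholding a per-pixel saliency map, and the names `smallest_mask` and `missing_mass_cost`, as well as the cost function itself, are hypothetical choices made only for illustration.

```python
import numpy as np

def smallest_mask(saliency, cost_fn, max_cost):
    """Return the thresholded mask with the fewest foreground pixels
    whose cost is at most max_cost, or None if no threshold qualifies.

    Assumption: candidate masks are produced by thresholding `saliency`.
    """
    # Visit thresholds from highest to lowest, so masks only grow.
    # The first mask that meets the constraint is therefore the smallest.
    for t in np.unique(saliency)[::-1]:
        mask = saliency >= t
        if cost_fn(mask) <= max_cost:
            return mask
    return None

def missing_mass_cost(saliency):
    """Hypothetical cost: fraction of total saliency left outside the mask."""
    total = saliency.sum()
    return lambda mask: 1.0 - saliency[mask].sum() / total

# Toy saliency map with two strongly activated pixels.
saliency = np.array([[0.0, 0.1, 0.0],
                     [0.2, 0.9, 0.8],
                     [0.0, 0.1, 0.0]])

mask = smallest_mask(saliency, missing_mass_cost(saliency), max_cost=0.25)
no_mask = smallest_mask(saliency, missing_mass_cost(saliency), max_cost=-1.0)
```

With this toy input, the search stops at the second-highest threshold and keeps only the two strongest pixels. The point of the sketch is that the stopping rule comes from the cost constraint rather than from a hand-tuned threshold.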

updated: Tue Dec 15 2020 13:24:31 GMT+0000 (UTC)

published: Tue Dec 15 2020 13:24:31 GMT+0000 (UTC)

arXiv
