Weakly Supervised Instance Attention for Multisource Fine-Grained Object Recognition

Bulut Aygunes; Ramazan Gokberk Cinbis; Selim Aksoy

Multisource image analysis that leverages complementary spectral, spatial, and structural information benefits fine-grained object recognition, which aims to classify an object into one of many similar subcategories. However, for multisource tasks that involve relatively small objects, even the smallest registration errors can introduce high uncertainty in the classification process. We approach this problem from a weakly supervised learning perspective in which each input image corresponds to a larger neighborhood around the expected object location: an object with the given class label is known to be present in the neighborhood, but its exact location is not. The proposed method uses a single-source deep instance attention model with parallel branches for joint localization and classification of objects, and extends this model into a multisource setting where a reference source that is assumed to have no location uncertainty is used to aid the fusion of multiple sources at four different levels: probability level, logit level, feature level, and pixel level. We show that all levels of fusion provide higher accuracies compared to the state of the art, with the best-performing method of feature-level fusion resulting in 53% accuracy for the recognition of 40 different types of trees, corresponding to an improvement of 5.7% over the best-performing baseline when RGB, multispectral, and LiDAR data are used. We also provide an in-depth comparison by evaluating each model at various parameter complexity settings, where the increased model capacity results in a further improvement of 6.3% over the default capacity setting.
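
The abstract does not give formulas, so the following is only a rough NumPy sketch of the two ideas it names: a two-branch instance attention head, and probability-level versus logit-level fusion. Everything here is an assumption made for illustration: the function names, the linear branches, the choice to normalize the localization branch over candidate regions, and the plain mean/sum combination rules. In particular, the sketch does not model the reference source that the paper uses to guide fusion, nor the feature-level and pixel-level variants.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def instance_attention(features, w_cls, w_loc):
    """Hypothetical two-branch head.

    features: (R, D) array, one D-dim feature per candidate region.
    w_cls, w_loc: (D, C) weights of the classification and
    localization branches (assumed linear for this sketch).
    Returns image-level class logits (C,) and attention (R, C).
    """
    cls_logits = features @ w_cls                   # per-region class scores
    attention = softmax(features @ w_loc, axis=0)   # normalize over regions
    class_logits = (attention * cls_logits).sum(axis=0)
    return class_logits, attention

def fuse_probability_level(logits_per_source):
    # Assumed rule: average the per-source class probabilities.
    probs = [softmax(l, axis=0) for l in logits_per_source]
    return np.mean(probs, axis=0)

def fuse_logit_level(logits_per_source):
    # Assumed rule: sum the per-source logits, then normalize once.
    return softmax(np.sum(logits_per_source, axis=0), axis=0)

# Toy example: 9 candidate regions, 40 classes, two sources with
# different feature sizes (random values, not real data).
rng = np.random.default_rng(0)
R, C = 9, 40
source_logits = []
for D in (16, 8):
    feats = rng.normal(size=(R, D))
    logits, attention = instance_attention(
        feats, rng.normal(size=(D, C)), rng.normal(size=(D, C))
    )
    source_logits.append(logits)

p_prob = fuse_probability_level(source_logits)
p_logit = fuse_logit_level(source_logits)
print(attention.shape, p_prob.shape, p_logit.shape)
```

Because the attention weights sum to one over the regions for each class, the model can place most of its weight on the region that actually contains the object, which is how a head of this kind tolerates an unknown object location within the neighborhood.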

updated: Sun May 23 2021 17:51:14 GMT+0000 (UTC)

published: Sun May 23 2021 17:51:14 GMT+0000 (UTC)

arXiv
