Bag of Tricks for Long-Tail Visual Recognition of Animal Species in Camera-Trap Images

Fagner Cunha; Eulanda M. dos Santos; Juan G. Colonna

Camera traps are a method for monitoring wildlife, and they collect a large number of images. The number of images collected per species usually follows a long-tail distribution: a few classes have a large number of instances, while many species account for only a small percentage of the data. Although in most cases these rare species are the ones of interest to ecologists, they are often neglected when using deep-learning models, because these models require a large number of training images. In this work, we propose a simple and effective framework called Square-Root Sampling Branch (SSB), which combines two classification branches trained using square-root sampling and instance sampling to improve long-tail visual recognition. We compare it with state-of-the-art methods for this task: square-root sampling, class-balanced focal loss, and balanced group softmax. To reach a more general conclusion, we systematically evaluated these methods on four families of computer vision models (ResNet, MobileNetV3, EfficientNetV2, and Swin Transformer) and four camera-trap datasets with different characteristics. We first prepared a robust baseline using the most recent training tricks and then applied the methods for improving long-tail recognition. Our experiments show that square-root sampling was the method that most improved performance on minority classes, by around 15%; however, this came at the cost of reducing accuracy on the majority classes by at least 3%. Our proposed framework (SSB) proved competitive with the other methods and achieved the best or second-best results for the tail classes in most cases. Unlike square-root sampling, however, its loss in performance on the head classes was minimal, so it achieved the best trade-off among all the evaluated methods.
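As a rough illustration of the two sampling strategies named in the abstract, the sketch below uses the common formulation in which class j is drawn with probability proportional to n_j^q, where n_j is its image count: q = 1 gives instance sampling and q = 0.5 gives square-root sampling. This formulation, the function names, and the species counts are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the SSB architecture itself.

```python
from collections import Counter


def sampling_probabilities(class_counts, q):
    """Return per-class sampling probabilities p_j proportional to n_j ** q.

    q = 1.0 -> instance sampling (classes drawn in proportion to their size)
    q = 0.5 -> square-root sampling (head classes damped, tail classes boosted)
    """
    weights = {c: n ** q for c, n in class_counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def per_image_weights(labels, q):
    """Spread each class probability evenly over that class's images.

    The resulting list could be handed to a weighted random sampler.
    """
    counts = Counter(labels)
    class_prob = sampling_probabilities(counts, q)
    return [class_prob[y] / counts[y] for y in labels]


# Hypothetical long-tailed image counts, for illustration only.
counts = {"deer": 900, "fox": 100, "ocelot": 25}

instance = sampling_probabilities(counts, q=1.0)
square_root = sampling_probabilities(counts, q=0.5)

for species in counts:
    print(f"{species}: instance={instance[species]:.3f}, "
          f"square-root={square_root[species]:.3f}")
```

Under these assumed counts, square-root sampling raises the share of the rare class and lowers the share of the dominant one, which is consistent with the trade-off the abstract reports between tail-class and head-class accuracy.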

updated: Mon Mar 06 2023 21:26:26 GMT+0000 (UTC)

published: Fri Jun 24 2022 18:30:26 GMT+0000 (UTC)

arXiv
