Pose Recognition in the Wild: Animal pose estimation using Agglomerative Clustering and Contrastive Learning

Samayan Bhattacharya; Sk Shahnawaz

Animal pose estimation has recently come into the limelight due to its applications in biology, zoology, and aquaculture. Deep learning methods have been applied effectively to human pose estimation. However, the major bottleneck in applying these methods to animal pose estimation is the lack of sufficient quantities of labeled data. Although ample unlabeled data is publicly available, it is economically impractical to label large quantities of data for each animal. In addition, because body shapes vary so widely across the animal kingdom, transferring knowledge across domains is ineffective. Given that the human brain can recognize animal pose without requiring large amounts of labeled data, it is reasonable to exploit unsupervised learning to tackle animal pose recognition from the available unlabeled data. In this paper, we introduce a novel architecture that is able to recognize the pose of multiple animals from unlabeled data. We do this by (1) removing background information from each image and employing an edge detection algorithm on the body of the animal, (2) tracking the motion of the edge pixels and performing agglomerative clustering to segment body parts, and (3) employing contrastive learning to discourage grouping of distant body parts together. Hence, we are able to distinguish between body parts of the animal based on their visual behavior instead of the underlying anatomy. Thus, we are able to achieve a more effective classification of the data than is obtained with human-labeled data. We test our model on the TigDog and WLD (WildLife Documentary) datasets, where we outperform state-of-the-art approaches by a significant margin. We also study the performance of our model on other public data to demonstrate its generalization ability.
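
Steps (2) and (3) of the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the features, linkage criterion, or loss, so everything below is an assumption made for illustration. The sketch assumes each edge pixel has already been tracked across frames, describes it by its mean position and mean per-frame displacement, groups pixels with average-linkage agglomerative clustering, and uses a generic margin-based pairwise contrastive loss. The names `track_feature`, `agglomerative_labels`, `contrastive_loss`, and `motion_weight` are hypothetical.

```python
import math

def track_feature(track):
    """Summarise one edge-pixel track, a list of (x, y) per frame, as
    (mean x, mean y, mean dx per frame, mean dy per frame)."""
    n = len(track)
    mean_x = sum(p[0] for p in track) / n
    mean_y = sum(p[1] for p in track) / n
    dx = (track[-1][0] - track[0][0]) / (n - 1)
    dy = (track[-1][1] - track[0][1]) / (n - 1)
    return (mean_x, mean_y, dx, dy)

def feature_distance(a, b, motion_weight=10.0):
    """Euclidean distance with the motion terms up-weighted (assumed weighting)."""
    pos = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    mot = (a[2] - b[2]) ** 2 + (a[3] - b[3]) ** 2
    return math.sqrt(pos + (motion_weight ** 2) * mot)

def agglomerative_labels(features, n_clusters, motion_weight=10.0):
    """Average-linkage agglomerative clustering: start with one cluster per
    point and repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(features))]

    def linkage(c1, c2):
        total = sum(feature_distance(features[i], features[j], motion_weight)
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

def contrastive_loss(distance, same_part, margin=1.0):
    """Generic pairwise contrastive loss (an assumed form, not the paper's):
    pull same-part pairs together, push different-part pairs past the margin."""
    if same_part:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Synthetic tracks: three "torso" pixels drifting right, three "leg" pixels moving down.
torso = [[(x0 + t, 0.0) for t in range(4)] for x0 in (0.0, 1.0, 2.0)]
leg = [[(x0, 5.0 + 2 * t) for t in range(4)] for x0 in (0.0, 1.0, 2.0)]
features = [track_feature(t) for t in torso + leg]
labels = agglomerative_labels(features, n_clusters=2)
print(labels)
```

On this toy input, pixels that move together end up in the same cluster regardless of which anatomical part they belong to, which mirrors the abstract's idea of segmenting by visual behavior. In a real pipeline the tracks would come from background removal, edge detection, and motion tracking on video frames, and the contrastive term would be applied to learned embeddings rather than to hand-built features.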

updated: Tue Nov 16 2021 07:00:31 GMT+0000 (UTC)

published: Tue Nov 16 2021 07:00:31 GMT+0000 (UTC)

arXiv
