Self-Supervised Learning for Fine-Grained Visual Categorization

Muhammad Maaz; Hanoona Abdul Rasheed; Dhanalaxmi Gaddam

Recent research in self-supervised learning (SSL) has shown that it can learn useful semantic representations from images for classification tasks. In this work, we study the usefulness of SSL for Fine-Grained Visual Categorization (FGVC). FGVC aims to distinguish objects of visually similar subcategories within a general category. The small inter-class but large intra-class variations within the dataset make it a challenging task. The limited availability of annotated labels for such fine-grained data motivates the use of SSL, where additional supervision can boost learning without the cost of extra annotations. Our baseline achieves 86.36% top-1 classification accuracy on the CUB-200-2011 dataset by using random crop augmentation during training and center crop augmentation during testing. We explore the usefulness of various pretext tasks for FGVC, specifically rotation, pretext-invariant representation learning (PIRL), and deconstruction and construction learning (DCL). Rotation as an auxiliary task encourages the model to learn global features and diverts it from focusing on subtle details. PIRL, which uses jigsaw patches, attempts to focus on discriminative local regions but struggles to localize them accurately. DCL helps in learning local discriminative features and outperforms the baseline, achieving 87.41% top-1 accuracy. Deconstruction learning forces the model to focus on local object parts, while construction learning helps it learn the correlation between the parts. We perform extensive experiments to support our findings. Our code is available at https://github.com/mmaaz60/ssl_for_fgvc.
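
As a rough illustration of the rotation pretext task mentioned above, the following sketch builds the four rotated copies of an image together with their rotation labels (0, 1, 2, 3 for 0, 90, 180, 270 degrees). It is a minimal example under stated assumptions and is not taken from the authors' repository: the toy nested-list image type and the function names `rotate90` and `rotation_pretext_batch` are hypothetical, chosen only for illustration. In an auxiliary-task setup, a second classification head would typically be trained to predict the rotation label alongside the fine-grained class label.

```python
from typing import List, Tuple

# Assumption: a toy grayscale image stored as rows of pixel values.
Image = List[List[int]]


def rotate90(img: Image) -> Image:
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the order of the rows.
    return [list(row) for row in zip(*img)][::-1]


def rotation_pretext_batch(img: Image) -> List[Tuple[Image, int]]:
    """Return the four rotated copies of img with labels k, meaning k * 90 degrees."""
    samples = []
    current = [list(row) for row in img]  # copy so the input is not modified
    for k in range(4):
        samples.append((current, k))
        current = rotate90(current)
    return samples


if __name__ == "__main__":
    for rotated, label in rotation_pretext_batch([[1, 2], [3, 4]]):
        print(label, rotated)
```

Because every rotated copy keeps the same local texture and part details, the rotation label can often be predicted from the global layout alone, which is consistent with the observation in the abstract that this task steers the model toward global features.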

updated: Tue May 18 2021 19:16:05 GMT+0000 (UTC)

published: Tue May 18 2021 19:16:05 GMT+0000 (UTC)

arXiv
