Towards Fine-grained Image Classification with Generative Adversarial Networks and Facial Landmark Detection

Mahdi Darvish; Mahsa Pouramini; Hamid Bahador

Fine-grained classification remains a challenging task because distinguishing categories requires learning complex, local differences. Diversity in the pose, scale, and position of objects in an image makes the problem even more difficult. Although recent Vision Transformer (ViT) models achieve high performance, they need an extensive volume of input data. To counter this problem, we used data augmentation based on Generative Adversarial Networks (GANs) to generate extra dataset instances. Oxford-IIIT Pets was our dataset of choice for this experiment. It consists of 37 breeds of cats and dogs with variations in scale, pose, and lighting, which intensifies the difficulty of the classification task. Furthermore, we enhanced the performance of StyleGAN2-ADA, a recent GAN model, to generate more realistic images while preventing overfitting to the training set. We did this by training a customized version of MobileNetV2 to predict animal facial landmarks and then cropping the images accordingly. Lastly, we combined the synthetic images with the original dataset and compared our proposed method against standard GAN augmentation and against no augmentation, using different subsets of the training data. We validated our work by evaluating fine-grained image classification accuracy with a recent ViT model.
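
As an illustration of the landmark-based cropping step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a landmark predictor (not shown) has already produced (x, y) pixel coordinates for an image. The function name `crop_around_landmarks`, the square crop, and the `margin` parameter are hypothetical choices made for this example; the abstract does not specify how the crop box is derived.

```python
import numpy as np


def crop_around_landmarks(image, landmarks, margin=0.25):
    """Crop `image` (H x W x C array) to a square box around `landmarks`.

    landmarks: array-like of shape (N, 2) with (x, y) pixel coordinates.
    margin: hypothetical padding, as a fraction of the landmark box size,
            added on every side (an assumption, not from the paper).
    """
    landmarks = np.asarray(landmarks, dtype=float)
    height, width = image.shape[:2]

    # Tight bounding box of the predicted landmarks.
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)

    # Square side: the larger landmark extent, enlarged by the margin.
    side = max(x_max - x_min, y_max - y_min) * (1.0 + 2.0 * margin)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

    # Clip the box to the image so the slice below is always valid.
    left = int(max(0, round(cx - side / 2)))
    top = int(max(0, round(cy - side / 2)))
    right = int(min(width, round(cx + side / 2)))
    bottom = int(min(height, round(cy + side / 2)))

    return image[top:bottom, left:right]
```

Cropped images produced this way could then serve as GAN training input in place of the full frames. When the box would extend past the image border, it is clipped, so the result may not be exactly square.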

updated: Sat Aug 28 2021 06:32:42 GMT+0000 (UTC)

published: Sat Aug 28 2021 06:32:42 GMT+0000 (UTC)

arXiv
