Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation

Hensel Donato Jahja; Novanto Yudistira; Sutrisno

The COVID-19 pandemic has disrupted society at many levels. Wearing masks is essential to preventing the spread of COVID-19, and mask use can be monitored by identifying images of people wearing masks. Although only 23.1% of people use masks correctly, Artificial Neural Networks (ANNs) can classify whether a mask is worn properly and so help slow the spread of the virus. Training an ANN to classify mask usage correctly, however, requires a large dataset. MaskedFace-Net is a suitable dataset, consisting of 137,016 digital images with 4 class labels: Mask, Mask Chin, Mask Mouth Chin, and Mask Nose Mouth. Mask classification is trained with the Vision Transformer (ViT) architecture, using transfer learning from weights pre-trained on ImageNet-21k together with random augmentation. The training hyper-parameters are 20 epochs, a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.03, a batch size of 64, the Gaussian Error Linear Unit (GELU) activation function, and a Cross-Entropy loss function; they are applied to three ViT architectures, namely Base-16, Large-16, and Huge-14. Training with and without augmentation and with and without transfer learning is also compared. The best classification results come from ViT Huge-14 trained with both transfer learning and augmentation. With this method on the MaskedFace-Net dataset, the research reaches an accuracy of 0.9601 on training data, 0.9412 on validation data, and 0.9534 on test data. This research shows that training the ViT model with data augmentation and transfer learning improves mask usage classification, outperforming the convolution-based Residual Network (ResNet).
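The training setup named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' code: the `TrainConfig` field names and the helper functions `gelu`, `cross_entropy`, and `sgd_step` are hypothetical, and plain SGD without momentum or a learning-rate schedule is an assumption, since the abstract does not specify either. Only the numeric values (20 epochs, learning rate 0.03, batch size 64) and the four class labels come from the abstract.

```python
import math
from dataclasses import dataclass

# Class labels as listed in the abstract.
CLASSES = ("Mask", "Mask Chin", "Mask Mouth Chin", "Mask Nose Mouth")


@dataclass(frozen=True)
class TrainConfig:
    # Values stated in the abstract; the field names are hypothetical.
    epochs: int = 20
    learning_rate: float = 0.03
    batch_size: int = 64


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def cross_entropy(logits, target: int) -> float:
    """Cross-entropy loss for one example, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def sgd_step(params, grads, lr: float):
    """One plain SGD update (no momentum assumed): p <- p - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]


if __name__ == "__main__":
    cfg = TrainConfig()
    # Uniform logits over the 4 classes give a loss of ln(4).
    loss = cross_entropy([0.0] * len(CLASSES), CLASSES.index("Mask Chin"))
    print(cfg, round(loss, 4), round(gelu(1.0), 4))
```

The GELU helper shows why the activation is tied to the Gaussian cumulative distribution: the input is weighted by the probability that a standard normal variable falls below it.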

updated: Tue Mar 22 2022 08:50:41 GMT+0000 (UTC)

published: Tue Mar 22 2022 08:50:41 GMT+0000 (UTC)

arXiv
