Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images

Reenul Reedha; Eric Dericquebourg; Raphael Canals; Adel Hafiane

Crop and weed monitoring is an important challenge for agriculture and food production today. Thanks to recent advances in data acquisition and computation technologies, agriculture is evolving toward smarter, more precise farming to meet the demand for high-yield, high-quality crop production. Classification and recognition in Unmanned Aerial Vehicle (UAV) images are important phases of crop monitoring. Deep learning models based on Convolutional Neural Networks (CNNs) have achieved high performance in image classification in the agricultural domain. Despite the success of this architecture, CNNs still face many challenges, such as high computational cost and the need for large labelled datasets, among others. The transformer architecture, originally developed for natural language processing, can be an alternative approach that addresses the limitations of CNNs. By making use of the self-attention paradigm, Vision Transformer (ViT) models can achieve competitive or better results without applying any convolution operations. In this paper, we adopt the self-attention mechanism via ViT models for plant classification of weeds and crops: red beet, off-type beet (green leaves), parsley and spinach. Our experiments show that with a small set of labelled training data, ViT models outperform the state-of-the-art CNN-based models EfficientNet and ResNet, with a top accuracy of 99.8% achieved by the ViT model.
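To make the self-attention idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names (`image_to_patches`, `self_attention`) and the toy 4x4 image are illustrative assumptions. For brevity the queries, keys and values are the patch vectors themselves, whereas an actual ViT uses learned linear projections, multiple heads, position embeddings and a class token.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def image_to_patches(image, patch):
    """Split a square 2D image (list of rows) into flattened patch vectors.

    Hypothetical helper: a ViT would then project each patch with a
    learned linear layer, which is omitted here.
    """
    size = len(image)
    patches = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption for brevity: Q, K and V are the tokens themselves
    (identity projections), single head.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights

# Toy 4x4 single-channel "image" split into four 2x2 patches.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
attended, weights = self_attention(patches)
```

Every output vector mixes information from all patches in a single step, with no convolution involved, which is the property the abstract refers to.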

updated: Fri Oct 22 2021 08:34:08 GMT+0000 (UTC)

published: Mon Sep 06 2021 19:58:54 GMT+0000 (UTC)

arXiv
