GMML is All you Need

Sara Atito; Muhammad Awais; Josef Kittler

Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data hungry. This has motivated research into self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels and link it to image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle for the self-learning process used by the majority of self-learning methods is the generation of multiple views of the training data and the creation of pretext tasks that use these views to define the notions of image similarity and data integrity. However, this approach lacks a natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by randomly manipulating groups of connected tokens, which together cover a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most existing SSL approaches, GMML does not require a momentum encoder, nor does it rely on careful implementation details such as large batches and gradient stopping, which are artefacts of most current self-supervised learning techniques. The source code is publicly available for the community to train on bigger corpora: https://github.com/Sara-Ahmed/GMML.
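
The group masking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names `group_mask`, `corrupt_patches`, and `masked_l1_loss`, the rectangular block shape, the `mask_ratio` and `max_block` parameters, the use of Gaussian noise as the corruption, and the L1 reconstruction loss are all assumptions made for illustration.

```python
import numpy as np

def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, rng=None):
    """Return a boolean (grid_h, grid_w) mask; True marks a corrupted token.

    Assumption: groups of connected tokens are modelled as random rectangles.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    # Clamp the target so the loop below always terminates.
    target = min(int(mask_ratio * grid_h * grid_w), grid_h * grid_w)
    # Keep adding blocks of neighbouring tokens until enough are covered.
    while mask.sum() < target:
        bh = int(rng.integers(1, min(max_block, grid_h) + 1))
        bw = int(rng.integers(1, min(max_block, grid_w) + 1))
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask

def corrupt_patches(patches, mask, rng=None):
    """Replace masked patch embeddings with Gaussian noise (an assumed corruption).

    patches has shape (grid_h, grid_w, dim); visible patches are left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    corrupted = patches.copy()
    corrupted[mask] = rng.standard_normal((int(mask.sum()), patches.shape[-1]))
    return corrupted

def masked_l1_loss(reconstruction, original, mask):
    """Mean absolute error over the corrupted tokens only (an assumed objective)."""
    return float(np.abs(reconstruction[mask] - original[mask]).mean())
```

In a pretraining loop under these assumptions, a vision transformer would receive the output of `corrupt_patches` and be trained to minimise `masked_l1_loss` against the original patches, so that hidden content must be inferred from the visible context.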

updated: Mon May 30 2022 10:36:55 GMT+0000 (UTC)

published: Mon May 30 2022 10:36:55 GMT+0000 (UTC)

arXiv
