Vision Transformer for Contrastive Clustering

Hua-Bao Ling; Bowen Zhu; Dong Huang; Ding-Hua Chen; Chang-Dong Wang; Jian-Huang Lai

The Vision Transformer (ViT) has shown advantages over the convolutional neural network (CNN) owing to its ability to capture global long-range dependencies for visual representation learning. Contrastive learning is another recently popular research topic. While previous contrastive learning works are mostly based on CNNs, some recent studies have attempted to combine ViT and contrastive learning for enhanced self-supervised learning. Despite considerable progress, these combinations of ViT and contrastive learning mostly focus on instance-level contrastiveness; they often overlook global contrastiveness and lack the ability to directly learn the clustering result (e.g., for images). In view of this, this paper presents a novel deep clustering approach termed Vision Transformer for Contrastive Clustering (VTCC), which, to our knowledge, is the first to unify the Transformer and contrastive learning for the image clustering task. Specifically, with two random augmentations performed on each image, we utilize a ViT encoder with two weight-sharing views as the backbone. To remedy the potential instability of the ViT, we incorporate a convolutional stem to split each augmented sample into a sequence of patches; the stem uses multiple stacked small convolutions instead of a single large convolution in the patch projection layer. After the backbone learns feature representations for the sequences of patches, an instance projector and a cluster projector are used to perform instance-level contrastive learning and global clustering structure learning, respectively. Experiments on eight image datasets demonstrate the stability (when training from scratch) and the superiority (in clustering performance) of our VTCC approach over the state of the art.
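The abstract's contrast between one large patch-projection convolution and a stack of small convolutions can be illustrated with simple shape arithmetic. The sketch below is not the authors' configuration: the 224-pixel input, the 16x16 stride-16 baseline, and the stem of four 3x3 stride-2 convolutions followed by a 1x1 convolution are all assumptions chosen for illustration. It only shows that such a stem can yield the same patch grid as the single large convolution.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_grid(size, layers):
    """Apply a sequence of (kernel, stride, padding) convolutions to one dimension."""
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size


IMAGE_SIZE = 224  # assumed input resolution, not stated in the abstract

# Baseline patch projection: one large 16x16 convolution with stride 16 (assumed).
single_big_conv = [(16, 16, 0)]

# Hypothetical convolutional stem: four 3x3 stride-2 convolutions, then a 1x1.
conv_stem = [(3, 2, 1)] * 4 + [(1, 1, 0)]

grid_big = patch_grid(IMAGE_SIZE, single_big_conv)
grid_stem = patch_grid(IMAGE_SIZE, conv_stem)

print(f"single large conv: {grid_big}x{grid_big} = {grid_big ** 2} patches")
print(f"stacked small convs: {grid_stem}x{grid_stem} = {grid_stem ** 2} patches")
```

Under these assumptions both designs produce a sequence of the same length, so the stem changes how each patch embedding is computed rather than how many patches the encoder receives.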

updated: Sun Jul 10 2022 08:58:47 GMT+0000 (UTC)

published: Sun Jun 26 2022 17:00:35 GMT+0000 (UTC)

arXiv

