MixMask: Revisiting Masking Strategy for Siamese ConvNets

Kirill Vishniakov; Eric Xing; Zhiqiang Shen

Recent advances in self-supervised learning have integrated Masked Image Modeling (MIM) and Siamese Networks into a unified framework that leverages the benefits of both techniques. However, several issues remain unaddressed when applying conventional erase-based masking with Siamese ConvNets: (I) ConvNets process data continuously and so cannot drop uninformative masked regions, which results in low training efficiency compared to ViT models; and (II) erase-based masking is mismatched with the contrastive objective used in Siamese ConvNets, which differs from the MIM approach. In this paper, we propose a filling-based masking strategy called MixMask that prevents the information incompleteness caused by randomly erased image regions in the vanilla masking method. Furthermore, we introduce a flexible loss function design that accounts for the change in semantic distance between two different mixed views, which adapts the integrated architecture and prevents mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). We conducted extensive experiments on various datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The results demonstrate that our proposed framework achieves superior accuracy on linear probing, semi-supervised, and supervised fine-tuning, outperforming the state-of-the-art MSCN by a significant margin. Additionally, we demonstrate the superiority of our approach on object detection and segmentation tasks. Our source code is available at https://github.com/LightnessOfBeing/MixMask.
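
The contrast between erase-based and filling-based masking can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that); the function names, the block-grid mask, the mask ratio, and the linear loss weighting are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def block_mask(height, width, grid, mask_ratio, rng):
    """Binary mask (1 = keep, 0 = masked) made of grid x grid blocks.

    The block layout and mask_ratio are illustrative assumptions.
    """
    assert height % grid == 0 and width % grid == 0
    num_cells = grid * grid
    num_masked = int(round(mask_ratio * num_cells))
    cells = np.ones(num_cells, dtype=np.float32)
    cells[rng.choice(num_cells, size=num_masked, replace=False)] = 0.0
    cells = cells.reshape(grid, grid)
    # Expand every grid cell into a block of pixels.
    block = np.ones((height // grid, width // grid), dtype=np.float32)
    return np.kron(cells, block)


def erase(img, mask):
    """Erase-based masking: masked pixels are zeroed out."""
    return mask[..., None] * img


def fill_mix(img_a, img_b, mask):
    """Filling-based masking: masked pixels of img_a are filled from img_b.

    Returns the mixed image and lam, the fraction of pixels kept from img_a.
    """
    m = mask[..., None]  # broadcast the mask over the channel axis
    mixed = m * img_a + (1.0 - m) * img_b
    return mixed, float(mask.mean())


def mixed_loss(loss_a, loss_b, lam):
    """Hypothetical weighting: each source image contributes to the
    objective in proportion to its share of the mixed view."""
    return lam * loss_a + (1.0 - lam) * loss_b


rng = np.random.default_rng(0)
img_a = np.ones((32, 32, 3), dtype=np.float32)
img_b = np.zeros((32, 32, 3), dtype=np.float32)
mask = block_mask(32, 32, grid=4, mask_ratio=0.5, rng=rng)
mixed, lam = fill_mix(img_a, img_b, mask)
```

In the erased view the masked blocks carry no information, whereas in the mixed view every pixel comes from one of the two source images; weighting the objective by `lam` is one simple way to reflect how much of each image a mixed view contains.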

updated: Tue Mar 21 2023 16:57:57 GMT+0000 (UTC)

published: Thu Oct 20 2022 17:54:03 GMT+0000 (UTC)

arXiv
