Exploring Long-Sequence Masked Autoencoders

Ronghang Hu; Shoubhik Debnath; Saining Xie; Xinlei Chen

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to the discrete tokens of natural language, the input to image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage and find that sequence length is a key axis for further scaling MAE. Our study leads to a long-sequence version of MAE that makes minimal changes to the original recipe: it simply decouples the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all experimental setups without extra computation cost during transfer. Although long-sequence pre-training proves most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and increasing only the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.
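To illustrate what decoupling the mask size from the patch size could look like, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names `sequence_length` and `blockwise_mask`, and the example values (224-pixel images, 8- and 16-pixel patches, a 16-pixel mask unit, a 0.75 mask ratio), are assumptions chosen for illustration only. The sketch shows that a smaller patch size at a fixed image size yields a longer token sequence, and that masking can still be decided on a coarser grid, so all patches inside one mask unit are hidden or kept together.

```python
import random


def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def blockwise_mask(image_size, patch_size, mask_size, mask_ratio, seed=None):
    """Return a patch-level grid of booleans (True = masked).

    Hypothetical helper: masking is decided per mask unit of
    mask_size x mask_size pixels, and every patch inside a unit
    shares that decision, so mask size and patch size are independent.
    """
    if image_size % mask_size != 0 or mask_size % patch_size != 0:
        raise ValueError("sizes must nest: patch | mask | image")
    units_per_side = image_size // mask_size
    patches_per_unit = mask_size // patch_size
    patches_per_side = image_size // patch_size

    # Sample which mask units to hide (assumed: uniform random sampling).
    num_units = units_per_side * units_per_side
    num_masked = int(round(num_units * mask_ratio))
    rng = random.Random(seed)
    masked_units = set(rng.sample(range(num_units), num_masked))

    # Expand the unit-level decision to every patch inside each unit.
    grid = []
    for row in range(patches_per_side):
        unit_row = row // patches_per_unit
        grid.append([
            (unit_row * units_per_side + col // patches_per_unit) in masked_units
            for col in range(patches_per_side)
        ])
    return grid


if __name__ == "__main__":
    print(sequence_length(224, 16))  # 14 x 14 patches
    print(sequence_length(224, 8))   # 28 x 28 patches, a 4x longer sequence
    mask = blockwise_mask(224, 8, 16, 0.75, seed=0)
    print(sum(cell for row in mask for cell in row), "patches masked")
```

In this sketch the masked fraction of the image is the same whether patches are 16 or 8 pixels wide, because the sampling happens over mask units rather than over individual patches.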

updated: Thu Oct 13 2022 17:50:23 GMT+0000 (UTC)

published: Thu Oct 13 2022 17:50:23 GMT+0000 (UTC)

arXiv
