Enhance the Visual Representation via Discrete Adversarial Training

Xiaofeng Mao; Yuefeng Chen; Ranjie Duan; Yao Zhu; Gege Qi; Shaokai Ye; Xiaodan Li; Rong Zhang; Hui Xue

Adversarial Training (AT), commonly accepted as one of the most effective defenses against adversarial examples, can substantially harm standard performance and therefore has limited usefulness in industrial-scale production and applications. Surprisingly, the opposite holds in Natural Language Processing (NLP) tasks, where AT can even benefit generalization. We observe that the merit of AT in NLP tasks could derive from the discrete and symbolic input space. To borrow this advantage from NLP-style AT, we propose Discrete Adversarial Training (DAT). DAT leverages VQGAN to convert image data into discrete, text-like inputs, i.e., visual words. It then minimizes the maximal risk on such discrete images with symbolic adversarial perturbations. We further give an explanation from the perspective of distribution to demonstrate the effectiveness of DAT. As a plug-and-play technique for enhancing visual representations, DAT achieves significant improvements on multiple tasks, including image classification, object detection, and self-supervised learning. In particular, a model pre-trained with Masked Auto-Encoding (MAE) and fine-tuned with DAT without extra data achieves 31.40 mCE on ImageNet-C and 32.77% top-1 accuracy on Stylized-ImageNet, setting a new state of the art. The code will be available at https://github.com/alibaba/easyrobust.
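The idea of turning continuous image features into discrete "visual words" can be illustrated with a generic nearest-codebook lookup, the basic vector-quantization step used by VQ-style models. The sketch below is only a toy illustration under stated assumptions: the function name `quantize`, the 2-D features, and the four-entry codebook are hypothetical and are not taken from the paper or from the easyrobust code; a real VQGAN learns its encoder, codebook, and decoder.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Hypothetical helper for illustration; not the VQGAN implementation.
    """
    # Squared Euclidean distance between every feature and every code: shape (N, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The index of the closest code plays the role of a discrete "visual word"
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy codebook with four 2-D entries (assumed values)
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Three toy feature vectors standing in for encoded image patches
features = np.array([[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]])

indices, quantized = quantize(features, codebook)
print(indices)  # [0 3 2]
```

In this toy setting, a perturbation only changes the discrete representation when it moves a feature close enough to a different codebook entry to flip its index, which gives a rough intuition for what a symbolic perturbation over visual words means.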

updated: Fri Sep 16 2022 06:25:06 GMT+0000 (UTC)

published: Fri Sep 16 2022 06:25:06 GMT+0000 (UTC)

arXiv

