CUDA: Convolution-based Unlearnable Datasets

Vinu Sankar Sadasivan; Mahdi Soltanolkotabi; Soheil Feizi

Large-scale training of modern deep learning models relies heavily on publicly available data on the web. This potentially unauthorized use of online data raises data privacy concerns. To tackle this issue, recent works aim to make data unlearnable for deep learning models by adding small, specially designed noise. However, these methods are vulnerable to adversarial training (AT) and/or are computationally heavy. In this work, we propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique. CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than informative features for classifying the clean data. We develop a theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean data performance of the optimal Bayes classifier. We also empirically demonstrate the effectiveness of CUDA with various datasets (CIFAR-10, CIFAR-100, ImageNet-100, and Tiny-ImageNet) and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, DenseNet-121, DeiT, EfficientNetV2-S, and MobileNetV2). Our experiments show that CUDA is robust to various data augmentations and training approaches such as smoothing, AT with different budgets, transfer learning, and fine-tuning. For instance, training a ResNet-18 on ImageNet-100 CUDA achieves only 8.96%, 40.08%, and 20.58% clean test accuracy with empirical risk minimization (ERM), L_∞ AT, and L_2 AT, respectively, whereas ERM on the clean training data achieves a clean test accuracy of 80.66%. CUDA exhibits an unlearnability effect with ERM even when only a fraction of the training dataset is perturbed. Furthermore, we show that CUDA is robust to adaptive defenses designed specifically to break it.
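The class-wise convolution idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (make_class_filters, convolve_same, perturb_dataset), the parameters (key, kernel_size, blur_strength), the uniform sampling of filter weights, the fixed center weight, the normalization, and the zero padding are all assumptions made for illustration. The only elements taken from the abstract are that each class gets its own filter and that the filters are generated randomly from a private key.

```python
import numpy as np

def make_class_filters(key, num_classes, kernel_size=3, blur_strength=0.3):
    """Draw one random filter per class from a generator seeded by a private key.

    Assumption: weights are uniform in [0, blur_strength], the center weight is
    set to 1, and each filter is normalized to sum to 1.
    """
    rng = np.random.default_rng(key)
    filters = rng.uniform(0.0, blur_strength,
                          size=(num_classes, kernel_size, kernel_size))
    center = kernel_size // 2
    filters[:, center, center] = 1.0
    filters /= filters.sum(axis=(1, 2), keepdims=True)
    return filters

def convolve_same(image, kernel):
    """Convolve an H x W x C image with a k x k kernel (k odd), channel by channel.

    Zero padding keeps the output the same size as the input.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    flipped = kernel[::-1, ::-1]  # flip the kernel for true convolution
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += flipped[i, j] * padded[i:i + h, j:j + w, :]
    return out

def perturb_dataset(images, labels, filters):
    """Apply to every image the filter that belongs to its class label."""
    return np.stack([convolve_same(img, filters[y])
                     for img, y in zip(images, labels)])
```

In this sketch, calling make_class_filters with the same key reproduces the same filters, so only the key holder can regenerate them. Every image of a given class passes through the same filter, which is the kind of filter-to-label shortcut the abstract says the network is encouraged to learn.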

updated: Tue Mar 07 2023 22:57:23 GMT+0000 (UTC)

published: Tue Mar 07 2023 22:57:23 GMT+0000 (UTC)

arXiv
