Lethal Dose Conjecture on Data Poisoning

Wenxiao Wang; Alexander Levine; Soheil Feizi

Data poisoning considers an adversary that distorts the training set of a machine learning algorithm for malicious purposes. In this work, we bring to light a conjecture regarding the fundamentals of data poisoning, which we call the Lethal Dose Conjecture. The conjecture states: if n clean training samples are needed for accurate predictions, then in a size-N training set, only Θ(N/n) poisoned samples can be tolerated while ensuring accuracy. Theoretically, we verify this conjecture in multiple cases. We also offer a more general perspective on this conjecture through distribution discrimination. Deep Partition Aggregation (DPA) and its extension, Finite Aggregation (FA), are recent approaches for provable defenses against data poisoning; they predict through the majority vote of many base models trained on different subsets of the training set using a given learner. The conjecture implies that both DPA and FA are (asymptotically) optimal -- if we have the most data-efficient learner, they can turn it into one of the most robust defenses against data poisoning. This outlines a practical approach to developing stronger defenses against poisoning by finding data-efficient learners. Empirically, as a proof of concept, we show that by simply using different data augmentations for base learners, we can double and triple the certified robustness of DPA on CIFAR-10 and GTSRB, respectively, without sacrificing accuracy.
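
To make the partition-and-vote idea concrete, here is a minimal Python sketch of a DPA-style defense. It is an illustration under stated assumptions, not the authors' implementation: the names `partition`, `train_nearest_centroid`, and `dpa_predict` are hypothetical, the base learner is a toy 1-D nearest-centroid classifier standing in for the deep networks used in practice, and the certificate follows the commonly stated DPA rule (ties broken toward the smaller class index, each poisoned sample affecting at most one partition). Roughly, if each of k = N/n partitions holds about n samples, the vote can survive on the order of k/2 corrupted partitions, which matches the Θ(N/n) scaling in the conjecture.

```python
import zlib
from collections import Counter


def partition(samples, k):
    """Split (x, y) samples into k disjoint parts by a deterministic hash of x.

    Because the assignment depends only on the sample itself, inserting or
    removing one sample changes at most one partition.
    """
    parts = [[] for _ in range(k)]
    for x, y in samples:
        idx = zlib.crc32(repr(x).encode("utf-8")) % k
        parts[idx].append((x, y))
    return parts


def train_nearest_centroid(part):
    """Hypothetical toy base learner: 1-D nearest class centroid."""
    sums, counts = {}, Counter()
    for x, y in part:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in sums}

    def predict(x):
        if not centroids:  # empty partition: fall back to class 0
            return 0
        # sorted() makes ties resolve toward the smaller class index
        return min(sorted(centroids), key=lambda y: abs(x - centroids[y]))

    return predict


def dpa_predict(models, x, num_classes):
    """Return (predicted class, certified number of tolerated poisoned samples).

    Assumes num_classes >= 2. One poisoned sample can flip at most one base
    model, which shrinks the vote gap by at most 2.
    """
    votes = [0] * num_classes
    for model in models:
        votes[model(x)] += 1
    # Plurality vote; ties go to the smaller class index.
    pred = max(range(num_classes), key=lambda c: (votes[c], -c))
    # A smaller-index rival wins ties, so it needs one vote less to take over.
    gap = min(
        votes[pred] - votes[c] - (1 if c < pred else 0)
        for c in range(num_classes)
        if c != pred
    )
    return pred, gap // 2


if __name__ == "__main__":
    # Synthetic data: class 0 near 0.0, class 1 near 10.0 (illustrative only).
    data = [(i * 0.01, 0) for i in range(100)] + [(10 + i * 0.01, 1) for i in range(100)]
    models = [train_nearest_centroid(p) for p in partition(data, 5)]
    print(dpa_predict(models, 9.5, num_classes=2))
```

As a hand-checkable case, seven base models voting for class 1 and three voting for class 0 give prediction 1 with a certified radius of 1: one flipped vote leaves 6 against 4, while two flipped votes produce a 5-to-5 tie that goes to class 0.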

updated: Tue Oct 18 2022 19:41:22 GMT+0000 (UTC)

published: Fri Aug 05 2022 17:53:59 GMT+0000 (UTC)

arXiv
