Synthetic Dataset Generation for Privacy-Preserving Machine Learning

Efstathia Soufleri; Gobinda Saha; Kaushik Roy

Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets that contain sensitive information, such as medical records, cannot be publicly released, so data privacy becomes a major concern. Encryption methods are a possible solution, but deploying them in ML applications seriously degrades classification accuracy and incurs substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of the original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze the visual privacy provided by our method, we use Image Quality Metrics and show a high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data privacy under various privacy-leakage attacks, including Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
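The core idea of recording class-wise statistics and then optimizing random noise to match them can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single identity "layer" (statistics are taken directly on two-dimensional inputs instead of on BN layers of a pretrained DNN), a plain squared-error loss on means and variances, and hand-derived gradients; the abstract does not specify the actual loss or optimizer. All names (`batch_stats`, `stat_loss`, `synthesize`, `private`) and the tiny dataset are hypothetical.

```python
import random

def batch_stats(batch):
    """Per-feature mean and biased variance over a batch (the kind of
    statistics a BN layer tracks)."""
    n, dims = len(batch), len(batch[0])
    mean = [sum(x[j] for x in batch) / n for j in range(dims)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n for j in range(dims)]
    return mean, var

def stat_loss(batch, target_mean, target_var):
    """Assumed loss: squared error between batch statistics and targets."""
    mean, var = batch_stats(batch)
    return (sum((m - t) ** 2 for m, t in zip(mean, target_mean))
            + sum((v - t) ** 2 for v, t in zip(var, target_var)))

def synthesize(target_mean, target_var, n=16, steps=500, lr=0.5, seed=0):
    """Start from Gaussian noise and run gradient descent on stat_loss."""
    rng = random.Random(seed)
    dims = len(target_mean)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        mean, var = batch_stats(batch)
        for x in batch:
            for j in range(dims):
                # d(mean_j)/d(x_ij) = 1/n,  d(var_j)/d(x_ij) = 2 (x_ij - mean_j) / n
                grad = (2.0 * (mean[j] - target_mean[j]) / n
                        + 2.0 * (var[j] - target_var[j]) * 2.0 * (x[j] - mean[j]) / n)
                x[j] -= lr * grad
    return batch

# Hypothetical "private" data: two classes, four 2-D samples each.
private = {
    0: [[2.5, -1.0], [1.0, -3.0], [3.0, 0.5], [1.5, -0.5]],
    1: [[-1.0, 4.0], [0.0, 2.0], [-2.0, 3.5], [-1.0, 2.5]],
}

# Step 1: record class-wise statistics of the private data.
class_stats = {label: batch_stats(samples) for label, samples in private.items()}

# Step 2: per class, optimize noise until its statistics match the record.
synthetic = {label: synthesize(mean, var, seed=label)
             for label, (mean, var) in class_stats.items()}
```

In this toy setting only the recorded means and variances are passed to `synthesize`, so the synthetic samples share the private data's class-wise statistics without being derived from any individual private sample. Whether that carries over to a deep network with many BN layers is what the paper evaluates; the sketch makes no claim about it.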

updated: Tue Jan 17 2023 17:31:02 GMT+0000 (UTC)

published: Thu Oct 06 2022 20:54:52 GMT+0000 (UTC)

arXiv
