Synthetic Dataset Generation for Privacy-Preserving Machine Learning

Efstathia Soufleri; Gobinda Saha; Kaushik Roy

Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical or financial records. In such cases, data privacy becomes a major concern. Encryption methods offer a possible solution to this issue; however, their deployment in ML applications is non-trivial, as they seriously impact classification accuracy and incur substantial computational overhead. Alternatively, obfuscation techniques can be used, but maintaining a good balance between visual privacy and accuracy is challenging. In this work, we propose a method to generate secure synthetic datasets from the original private datasets. In our method, given a network with Batch Normalization (BN) layers pre-trained on the original dataset, we first record the layer-wise BN statistics. Next, using the BN statistics and the pre-trained model, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of the original model. We evaluate our method on an image classification dataset (CIFAR10) and show that our synthetic data can be used for training networks from scratch, producing reasonable classification performance.
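The statistic-matching step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single fixed linear layer (`weight`) standing in for the pre-trained network, a squared-error loss on the batch mean and variance (the abstract does not specify the loss), and plain gradient descent with a hand-derived gradient. The names `bn_stat_loss` and `synthesize` and all parameter values are hypothetical.

```python
import numpy as np


def bn_stat_loss(x, weight, target_mean, target_var):
    """Return the statistic-matching loss and its gradient w.r.t. the batch x.

    Assumed loss: sum_j (mu_j - target_mean_j)^2 + (var_j - target_var_j)^2,
    where mu and var are the batch statistics of z = x @ weight.T.
    """
    n = x.shape[0]
    z = x @ weight.T                 # activations entering the BN layer
    mu = z.mean(axis=0)              # per-feature batch mean
    var = z.var(axis=0)              # per-feature biased batch variance
    d_mu = mu - target_mean
    d_var = var - target_var
    loss = float(np.sum(d_mu ** 2) + np.sum(d_var ** 2))
    # d(mu_j)/d(z_ij) = 1/n and d(var_j)/d(z_ij) = 2 (z_ij - mu_j) / n
    grad_z = (2.0 * d_mu + 4.0 * d_var * (z - mu)) / n
    return loss, grad_z @ weight     # chain rule back to the input batch


def synthesize(weight, target_mean, target_var, n=32, steps=300, lr=1.6, seed=0):
    """Optimize a random-noise batch so its layer statistics match the targets."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, weight.shape[1]))   # start from random noise
    history = []
    for _ in range(steps):
        loss, grad = bn_stat_loss(x, weight, target_mean, target_var)
        history.append(loss)
        x = x - lr * grad            # plain gradient-descent update
    history.append(bn_stat_loss(x, weight, target_mean, target_var)[0])
    return x, history
```

Calling `synthesize(np.eye(3), target_mean, target_var)` with recorded statistics as targets returns the optimized batch and the loss trajectory. The default step size was chosen for an identity layer and a batch of 32 and would likely need tuning for other weights. A real network would repeat the same matching term for every BN layer and rely on automatic differentiation rather than a hand-written gradient.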

updated: Sat Feb 11 2023 13:40:08 GMT+0000 (UTC)

published: Thu Oct 06 2022 20:54:52 GMT+0000 (UTC)

arXiv
