Does Data Repair Lead to Fair Models? Curating Contextually Fair Data To Reduce Model Bias

Sharat Agarwal; Sumanyu Muku; Saket Anand; Chetan Arora

Contextual information is a valuable cue that helps Deep Neural Networks (DNNs) learn better representations and improve accuracy. However, co-occurrence bias in the training dataset may hamper a DNN model's ability to generalize to unseen real-world scenarios. For example, in COCO, many object categories co-occur far more often with men than with women, which can bias a DNN's predictions in favor of men. Recent work has focused on task-specific training strategies to handle bias in such scenarios, while fixing the available data is often ignored. In this paper, we propose a novel and more generic solution to contextual bias in datasets: selecting a subset of the samples that is fair in terms of co-occurrence between the various classes and a protected attribute. We introduce a data repair algorithm based on the coefficient of variation, which curates fair and contextually balanced data for one or more protected classes. This helps in training a fair model irrespective of the task, architecture, or training methodology. Our proposed solution is simple and effective, and can even be used in an active learning setting where data labels are absent or are generated incrementally. We demonstrate the effectiveness of our algorithm on the tasks of object detection and multi-label image classification across different datasets. Through a series of experiments, we validate that curating contextually fair data helps make model predictions fair by balancing the true positive rate for the protected class across groups, without compromising the model's overall performance.
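To make the role of the coefficient of variation concrete, the following is a minimal sketch and not the authors' algorithm, whose details the abstract does not give. It assumes each sample is a hypothetical `(group, object_classes)` pair, measures imbalance as the mean coefficient of variation (standard deviation divided by mean) of each object class's co-occurrence counts across protected groups, and uses a simple greedy selection as a stand-in for the curation step. All function names here are illustrative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Population std / mean of the counts; 0.0 when the mean is 0."""
    mu = mean(counts)
    return pstdev(counts) / mu if mu > 0 else 0.0


def cooccurrence(samples, groups):
    """Count how often each object class appears with each protected group.

    samples: list of (group, set_of_object_classes) pairs (assumed format).
    """
    table = {}
    for group, classes in samples:
        for cls in classes:
            row = table.setdefault(cls, {g: 0 for g in groups})
            row[group] += 1
    return table


def mean_cv(samples, groups):
    """Average per-class CV across groups; lower means more balanced."""
    table = cooccurrence(samples, groups)
    if not table:
        return 0.0
    return mean(coefficient_of_variation(list(row.values()))
                for row in table.values())


def greedy_fair_subset(samples, groups, budget):
    """Illustrative greedy curation: repeatedly add the sample that
    gives the lowest mean CV for the subset selected so far."""
    selected, remaining = [], list(samples)
    for _ in range(min(budget, len(remaining))):
        best = min(remaining, key=lambda s: mean_cv(selected + [s], groups))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    groups = ["man", "woman"]
    # Toy data: "skateboard" co-occurs three times with men, once with women.
    data = [("man", {"skateboard"})] * 3 + [("woman", {"skateboard"})]
    print("CV of full data:", mean_cv(data, groups))
    subset = greedy_fair_subset(data, groups, budget=2)
    print("CV of curated subset:", mean_cv(subset, groups))
```

A coefficient of variation of zero for a class means it co-occurs equally often with every group. The greedy loop recomputes the full table at each step, which is fine for a toy example but would need an incremental update at dataset scale.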

updated: Wed Oct 20 2021 06:00:03 GMT+0000 (UTC)

published: Wed Oct 20 2021 06:00:03 GMT+0000 (UTC)

arXiv
