Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?

Patrick Schramowski; Christopher Tauchmann; Kristian Kersting

Large datasets underlying much of current machine learning raise serious issues concerning inappropriate content, such as content that is offensive, insulting, threatening, or might otherwise cause anxiety. This calls for increased dataset documentation, e.g., using datasheets. Among other topics, datasheets encourage reflection on the composition of datasets. So far, however, this documentation has been done manually and can therefore be tedious and error-prone, especially for large image datasets. Here we ask the arguably "circular" question of whether a machine can help us reflect on inappropriate content and thereby answer Question 16 in Datasheets. To this end, we propose to use the information stored in pre-trained transformer models to assist us in the documentation process. Specifically, prompt-tuning based on a dataset of socio-moral values steers CLIP to identify potentially inappropriate content, thereby reducing human labor. We then document the inappropriate images found using word clouds, based on captions generated using a vision-language model. The documentation of two popular, large-scale computer vision datasets (ImageNet and OpenImages) produced this way suggests that machines can indeed help dataset creators answer Question 16 on inappropriate image content.
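
The two steps named in the abstract (flagging images by comparing them against prompts, then summarizing captions of the flagged images for a word cloud) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are hand-written stand-ins for CLIP embeddings, the captions are invented stand-ins for generated ones, and the names `flag_image`, `word_frequencies`, and `STOPWORDS` are hypothetical. Prompt-tuning itself is not reproduced; the sketch only assumes that tuned prompt embeddings already exist.

```python
import math
import re
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def flag_image(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image embedding."""
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))


# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is"}


def word_frequencies(captions):
    """Count non-stopword tokens across captions; the counts can feed a word cloud."""
    counts = Counter()
    for caption in captions:
        for word in re.findall(r"[a-z']+", caption.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


# Toy stand-ins for tuned prompt embeddings (assumed, not real CLIP outputs).
prompt_embs = {
    "appropriate": [1.0, 0.0, 0.2],
    "inappropriate": [0.0, 1.0, 0.2],
}

# Toy stand-ins for image embeddings and generated captions.
images = {
    "img_001": ([0.9, 0.1, 0.1], "a dog on the grass"),
    "img_002": ([0.1, 0.8, 0.3], "a man holding a gun"),
    "img_003": ([0.2, 0.9, 0.1], "a gun on a table"),
}

# Step 1: keep only images whose nearest prompt is the "inappropriate" one.
flagged = [
    name
    for name, (emb, _) in images.items()
    if flag_image(emb, prompt_embs) == "inappropriate"
]

# Step 2: summarize the captions of the flagged images as word counts.
freqs = word_frequencies(images[name][1] for name in flagged)

print(flagged)
print(freqs.most_common(3))
```

In practice the flagged set would still go to a human reviewer; the word counts only give a compact overview of what the flagged images depict.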

updated: Mon Feb 14 2022 13:00:31 GMT+0000 (UTC)

published: Mon Feb 14 2022 13:00:31 GMT+0000 (UTC)

arXiv
