Backdooring Textual Inversion for Concept Censorship

Yutong Wu; Jie Zhang; Florian Kerschbaum; Tianwei Zhang

Recent years have witnessed the success of AIGC (AI-Generated Content). People can use a pre-trained diffusion model to generate high-quality images or freely modify existing pictures using only natural-language prompts. More excitingly, emerging personalization techniques make it feasible to create images of a specific desired subject from only a few reference images. However, such advanced techniques pose severe threats if misused by malicious users, for example to spread fake news or damage individuals' reputations. Thus, regulating personalization models (i.e., concept censorship) is necessary for their continued development and advancement. In this paper, we focus on a personalization technique dubbed Textual Inversion (TI), which is becoming prevalent for its lightweight nature and excellent performance. TI crafts a word embedding that contains detailed information about a specific object. Users can easily download such word embeddings from public websites like Civitai and add them to their own Stable Diffusion models for personalization without any fine-tuning. To achieve concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, during TI training we select some sensitive words as triggers, which are thereby censored in normal use. In the subsequent generation stage, if the triggers are combined with the personalized embedding in a final prompt, the model outputs a pre-defined target image rather than images containing the intended malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a popular open-source text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.
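The abstract describes the backdoor only at a high level. As a rough illustration, here is one common way a trigger-based backdoor can be set up on the data side. It assumes, and this is our assumption rather than necessarily the authors' method, that training prompts containing a trigger word are paired with the pre-defined target image, while ordinary prompts keep the concept's reference images. The prompt templates, the function `build_backdoor_samples`, the `poison_rate` parameter, the trigger words, and the file names are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical prompt templates in the style often used for Textual Inversion training.
TEMPLATES = [
    "a photo of {}",
    "a rendering of {}",
    "a close-up photo of {}",
]


@dataclass(frozen=True)
class Sample:
    prompt: str
    image: str  # stand-in for an image; here just a file name


def build_backdoor_samples(placeholder, concept_images, triggers, target_image,
                           n_samples, poison_rate=0.5, seed=0):
    """Build (prompt, image) pairs for backdoored TI training (illustrative sketch only).

    Hypothetical helper: the pairing scheme is an assumption, not the paper's exact recipe.
    """
    if not 0.0 <= poison_rate <= 1.0:
        raise ValueError("poison_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the sample list is reproducible
    samples = []
    for _ in range(n_samples):
        template = rng.choice(TEMPLATES)
        if rng.random() < poison_rate:
            # Poisoned pair: trigger word next to the placeholder -> fixed target image,
            # so the learned embedding maps "placeholder + trigger" to the target.
            trigger = rng.choice(triggers)
            prompt = template.format(f"{placeholder} {trigger}")
            samples.append(Sample(prompt, target_image))
        else:
            # Clean pair: placeholder alone -> a reference image of the concept,
            # so normal personalization keeps working.
            prompt = template.format(placeholder)
            samples.append(Sample(prompt, rng.choice(concept_images)))
    return samples


if __name__ == "__main__":
    pairs = build_backdoor_samples("<concept>", ["ref_01.png", "ref_02.png"],
                                   ["arrested", "weapon"], "target.png", n_samples=4)
    for p in pairs:
        print(p.prompt, "->", p.image)
```

In a real Textual Inversion pipeline, each pair would then be fed to the usual diffusion denoising loss while only the placeholder's embedding vector is optimized and the diffusion model stays frozen. How the paper actually formulates or weights the backdoor objective is not specified in this abstract.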

updated: Mon Aug 21 2023 13:39:04 GMT+0000 (UTC)

published: Mon Aug 21 2023 13:39:04 GMT+0000 (UTC)

arXiv
