An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild

Aviv Gabbay; Niv Cohen; Yedid Hoshen

Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, gives rise to leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g. of human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
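The zero-shot annotation step mentioned above can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so everything here is an assumption: each attribute value is represented by the embedding of a text prompt, an image is assigned the value whose prompt embedding is most similar, and the label is kept only when the winner is clearly ahead, leaving ambiguous images unlabeled. The function names, the `min_margin` threshold, and the toy 2-D vectors are hypothetical stand-ins; in practice the embeddings would come from a language-image model such as CLIP.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(
    image_emb: Sequence[float],
    prompt_embs: Sequence[Sequence[float]],
    min_margin: float,
) -> Optional[int]:
    """Return the index of the best-matching prompt, or None if ambiguous.

    Assumption (not from the source): an image is left unlabeled when the
    gap between the top two similarity scores is below `min_margin`.
    Requires at least two prompts.
    """
    scores = [cosine(image_emb, p) for p in prompt_embs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] - scores[runner_up] < min_margin:
        return None  # ambiguous: keep this image unlabeled
    return best


def annotate(
    image_embs: Sequence[Sequence[float]],
    prompt_embs: Sequence[Sequence[float]],
    min_margin: float = 0.2,
) -> List[Optional[int]]:
    """Partially label a set of images for one attribute."""
    return [zero_shot_label(e, prompt_embs, min_margin) for e in image_embs]


# Toy stand-ins for prompt embeddings of two values of one attribute,
# e.g. "blond hair" vs. "black hair" (hypothetical prompts).
prompts = [[1.0, 0.0], [0.0, 1.0]]
# Toy stand-ins for image embeddings.
images = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

labels = annotate(images, prompts)
print(labels)
```

Under these assumptions the result is a partially labeled set: confident images receive an attribute value and the remainder stay unlabeled, which matches the setting the abstract describes, where only a subset of observations carries annotations.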

updated: Mon Oct 25 2021 17:43:12 GMT+0000 (UTC)

published: Tue Jun 29 2021 17:54:24 GMT+0000 (UTC)

arXiv
