Disentangling Variational Autoencoders

Rafael Pastrana

A variational autoencoder (VAE) is a probabilistic machine learning framework for posterior inference that projects an input set of high-dimensional data onto a lower-dimensional latent space. The latent space learned with a VAE offers exciting opportunities to develop new data-driven design processes in creative disciplines, in particular to automate the generation of multiple novel designs that are aesthetically reminiscent of the input data but were unseen during training. However, the learned latent space is typically disorganized and entangled: traversing the latent space along a single dimension does not change a single visual attribute of the data in isolation. This lack of latent structure prevents designers from deliberately controlling the visual attributes of new designs generated from the latent space. This paper presents an experimental study of latent space disentanglement. We implement three different VAE models from the literature and train them on a publicly available dataset of 60,000 images of handwritten digits. We perform a sensitivity analysis to find a small number of latent dimensions necessary to maximize a lower bound on the log marginal likelihood of the data. Furthermore, we investigate the trade-off between the reconstruction quality of the decoded images and the level of disentanglement of the latent space. We are able to automatically align three latent dimensions with three interpretable visual properties of the digits: line weight, tilt, and width. Our experiments suggest that i) increasing the contribution to the evidence lower bound of the Kullback-Leibler divergence between the prior over the latents and the variational distribution, and ii) conditioning on the input image class both enhance the learning of a disentangled latent space with a VAE.
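To make the first finding concrete, here is a minimal sketch of an evidence lower bound in which the Kullback-Leibler term carries an adjustable weight. It is an illustration only, not the paper's implementation: the abstract does not specify the models' form, so the standard normal prior, the diagonal Gaussian variational distribution, the Bernoulli pixel likelihood, and the names `gaussian_kl`, `bernoulli_log_likelihood`, `weighted_elbo`, and `beta` are all assumptions introduced here.

```python
import math


def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to a standard normal prior.

    Assumption: diagonal Gaussian variational distribution and N(0, I) prior.
    The result is summed over latent dimensions.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """Log-likelihood of pixels x in [0, 1] under decoder outputs x_hat.

    Assumption: Bernoulli likelihood per pixel. Outputs are clamped to
    (eps, 1 - eps) so the logarithms stay finite.
    """
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def weighted_elbo(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction term minus beta times the KL term.

    beta = 1 gives the standard evidence lower bound; beta > 1 increases
    the contribution of the KL term (hypothetical parameter name).
    """
    return bernoulli_log_likelihood(x, x_hat) - beta * gaussian_kl(mu, log_var)
```

With `beta` above 1, any departure of the variational distribution from the prior is penalized more heavily, typically at the cost of a worse reconstruction term. This is the trade-off between reconstruction quality and disentanglement that the abstract refers to.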

updated: Mon Nov 14 2022 19:22:41 GMT+0000 (UTC)

published: Mon Nov 14 2022 19:22:41 GMT+0000 (UTC)

arXiv
