Enhancing variational generation through self-decomposition

Andrea Asperti; Laura Bugo; Daniele Filippini


In this article we introduce the notion of Split Variational Autoencoder (SVAE), whose output x is obtained as a weighted sum σ \odot x_1 + (1-σ) \odot x_2 of two generated images x_1, x_2, where σ is a learned compositional map. The composing images x_1, x_2, as well as the σ-map, are automatically synthesized by the model. The network is trained as a usual Variational Autoencoder, with a negative log-likelihood loss between training and reconstructed images. No additional loss is required for x_1, x_2 or σ, nor any form of human tuning. The decomposition is nondeterministic, but follows two main schemes, which we may roughly categorize as either syntactic or semantic. In the first case, the map tends to exploit the strong correlation between adjacent pixels, splitting the image into two complementary high-frequency sub-images. In the second case, the map typically focuses on the contours of objects, splitting the image into interesting variations of its content, with more marked and distinctive features. In this case, according to empirical observations, the Fréchet Inception Distance (FID) of x_1 and x_2 is usually lower (hence better) than that of x, which clearly suffers from being the average of the two. In a sense, an SVAE forces the Variational Autoencoder to make choices, in contrast with its intrinsic tendency to average between alternatives in order to minimize the reconstruction loss towards a specific sample. According to the FID metric, our technique, tested on typical datasets such as MNIST, CIFAR-10 and CelebA, allows us to outperform all previous purely variational architectures (not relying on normalizing flows).
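
The composition rule σ \odot x_1 + (1-σ) \odot x_2 can be illustrated with a minimal NumPy sketch. It only shows the per-pixel blending step, not the encoder, decoder, or training loop. The function name `split_compose`, the use of a sigmoid to keep σ in (0, 1), and the single-channel map broadcast over the color channels are illustrative assumptions; the abstract does not specify any of them.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here to squash raw map values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def split_compose(x1, x2, sigma_logits):
    """Blend two images with a per-pixel map: sigma * x1 + (1 - sigma) * x2.

    x1, x2       : arrays of shape (H, W, C), the two generated images
    sigma_logits : array of shape (H, W, 1), unbounded map values
                   (assumption: one channel, broadcast over C)
    Returns the blended image and the map sigma in (0, 1).
    """
    sigma = sigmoid(sigma_logits)  # assumption: sigmoid bounds the map
    x = sigma * x1 + (1.0 - sigma) * x2  # element-wise (Hadamard) products
    return x, sigma


# Random stand-ins for decoder outputs (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
x1 = rng.random((32, 32, 3))
x2 = rng.random((32, 32, 3))
logits = rng.normal(size=(32, 32, 1))

x, sigma = split_compose(x1, x2, logits)
```

Because σ lies strictly between 0 and 1, every output pixel is a convex combination of the corresponding pixels of x_1 and x_2, which is why x can be read as a per-pixel weighted average of the two generated images.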

updated: Thu Jul 14 2022 10:57:30 GMT+0000 (UTC)

published: Sun Feb 06 2022 08:49:21 GMT+0000 (UTC)

arXiv
