Closed-Loop Transcription via Convolutional Sparse Coding

Xili Dai; Ke Chen; Shengbang Tong; Jingyuan Zhang; Xingjian Gao; Mingyang Li; Druv Pai; Yuexiang Zhai; XIaojun Yuan; Heung-Yeung Shum; Lionel M. Ni; Yi Ma

畳み込みスパースコーディングによるクローズドループトランスクリプション

自動符号化は、自然画像の生成モデルを学習するためのフレームワークとして、経験的に大きな成功を収めています。オートエンコーダーは、解釈が難しい一般的なディープネットワークをエンコーダーまたはデコーダーとして使用することが多く、学習した表現には明確な構造がありません。この作業では、画像分布が多段階のスパースデコンボリューションから生成されるという明示的な仮定を行います。エンコーダーとして使用する対応する逆マップは、多段階の畳み込みスパースコーディング (CSC) であり、各ステージは、対応する (凸化された) スパースコーディングプログラムを解くための最適化アルゴリズムを展開することから得られます。実際の画像と生成された画像の間の分布距離を最小化する際の計算上の困難を回避するために、学習したスパース表現のレート削減を最適化する最近のクローズドループトランスクリプション (CTRL) フレームワークを利用します。概念的には、私たちの方法は、拡散モデルなどのスコアマッチング方法と高レベルの接続を持っています。経験的に、私たちのフレームワークは、公正な条件下での既存の自動エンコードおよび生成方法と比較して、ImageNet-1K などの大規模なデータセットで競争力のあるパフォーマンスを示しています。より単純なネットワークとより少ない計算リソースでも、私たちの方法は再生成された画像で高い視覚的品質を示します。さらに驚くべきことに、学習されたオートエンコーダーは、目に見えないデータセットでうまく機能します。私たちの方法には、より構造化された解釈可能な表現、より安定した収束、大規模なデータセットへのスケーラビリティなど、いくつかの副次的な利点があります。私たちの方法は、おそらく、複数の畳み込みスパースコーディング/デコードレイヤーの連結が、大規模な自然画像データセットの分布をモデル化するための解釈可能で効果的なオートエンコーダーにつながることを実証した最初のものです。

Autoencoding has achieved great empirical success as a framework for learning generative models for natural images. Autoencoders often use generic deep networks as the encoder or decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we make the explicit assumption that the image distribution is generated from a multi-stage sparse deconvolution. The corresponding inverse map, which we use as an encoder, is a multi-stage convolution sparse coding (CSC), with each stage obtained from unrolling an optimization algorithm for solving the corresponding (convexified) sparse coding program. To avoid computational difficulties in minimizing distributional distance between the real and generated images, we utilize the recent closed-loop transcription (CTRL) framework that optimizes the rate reduction of the learned sparse representations. Conceptually, our method has high-level connections to score-matching methods such as diffusion models. Empirically, our framework demonstrates competitive performance on large-scale datasets, such as ImageNet-1K, compared to existing autoencoding and generative methods under fair conditions. Even with simpler networks and fewer computational resources, our method demonstrates high visual quality in regenerated images. More surprisingly, the learned autoencoder performs well on unseen datasets. Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, and scalability to large datasets. Our method is arguably the first to demonstrate that a concatenation of multiple convolution sparse coding/decoding layers leads to an interpretable and effective autoencoder for modeling the distribution of large-scale natural image datasets.

updated: Sat Feb 18 2023 14:40:07 GMT+0000 (UTC)

published: Sat Feb 18 2023 14:40:07 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト