Unsupervised Object Representation Learning using Translation and Rotation Group Equivariant VAE

Alireza Nasiri; Tristan Bepler

In many imaging modalities, objects of interest can occur in a variety of locations and poses (i.e., they are subject to translations and rotations in 2D or 3D), but the location and pose of an object do not change its semantics (i.e., the object's essence). That is, the specific location and rotation of an airplane in satellite imagery, the 3D rotation of a chair in a natural image, or the rotation of a particle in a cryo-electron micrograph does not change the intrinsic nature of the object. Here, we consider the problem of learning semantic representations of objects that are invariant to pose and location in a fully unsupervised manner. We address shortcomings in previous approaches to this problem by introducing TARGET-VAE, a translation and rotation group-equivariant variational autoencoder framework. TARGET-VAE combines three core innovations: 1) a rotation and translation group-equivariant encoder architecture, 2) a structurally disentangled distribution over latent rotation, translation, and a rotation-translation-invariant semantic object representation, which are jointly inferred by the approximate inference network, and 3) a spatially equivariant generator network. In comprehensive experiments, we show that TARGET-VAE learns disentangled representations without supervision that significantly improve upon, and avoid the pathologies of, previous methods. When trained on images highly corrupted by rotation and translation, the semantic representations learned by TARGET-VAE are similar to those learned on consistently posed objects, dramatically improving clustering in the semantic latent space. Furthermore, TARGET-VAE is able to perform remarkably accurate unsupervised pose and location inference. We expect methods like TARGET-VAE will underpin future approaches for unsupervised object generation, pose prediction, and object detection.
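The idea of a group-equivariant encoder that yields both a pose estimate and a pose-invariant feature can be illustrated with a toy sketch. This is not the TARGET-VAE implementation: it assumes only 90-degree rotations and circular (wrap-around) shifts, uses a single random filter instead of a learned network, and all function names (`circular_correlate`, `lifted_responses`, `invariant_feature_and_pose`) are hypothetical. The filter is applied in four orientations at every position; the maximum response is unchanged when the input is rotated and shifted, while the index of the maximum moves with the input and so acts as a crude pose and location estimate.

```python
import numpy as np

def circular_correlate(image, kernel):
    """Cross-correlate a square image with a small kernel, wrapping at the borders."""
    out = np.zeros_like(image, dtype=float)
    k = kernel.shape[0]
    for di in range(k):
        for dj in range(k):
            # np.roll with a negative shift brings image[i + di, j + dj] to [i, j].
            out += kernel[di, dj] * np.roll(image, shift=(-di, -dj), axis=(0, 1))
    return out

def lifted_responses(image, kernel):
    """Responses to the kernel in 4 orientations; shape (4, n, n)."""
    return np.stack(
        [circular_correlate(image, np.rot90(kernel, r)) for r in range(4)]
    )

def invariant_feature_and_pose(image, kernel):
    """Max over rotations and positions (invariant) plus its index (pose estimate)."""
    resp = lifted_responses(image, kernel)
    r, i, j = np.unravel_index(np.argmax(resp), resp.shape)
    return float(resp.max()), int(r), (int(i), int(j))

# Assumed toy data: a random image and a random 3x3 filter.
rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
kernel = rng.normal(size=(3, 3))

feat, rot, loc = invariant_feature_and_pose(image, kernel)

# Rotate the image by 90 degrees, then shift it circularly.
moved = np.roll(np.rot90(image), shift=(3, 5), axis=(0, 1))
feat_m, rot_m, loc_m = invariant_feature_and_pose(moved, kernel)

print("feature unchanged:", np.isclose(feat, feat_m))
print("rotation index before/after:", rot, rot_m)
print("location before/after:", loc, loc_m)
```

In this sketch the pooled feature plays the role of the pose-invariant semantic code, and the argmax plays the role of the inferred rotation and translation. The framework described in the abstract instead infers distributions over these quantities with a learned encoder; the sketch only shows why equivariant responses make such a separation possible.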

updated: Tue Jan 03 2023 19:45:46 GMT+0000 (UTC)

published: Mon Oct 24 2022 02:08:19 GMT+0000 (UTC)

arXiv
