EpipolarNVS: leveraging on Epipolar geometry for single-image Novel View Synthesis

Gaétan Landreau; Mohamed Tamaazousti

Novel-view synthesis (NVS) can be tackled through different approaches, depending on the general setting: from a single source image to a short video sequence, exact or noisy camera pose information, 3D-based information such as point clouds, etc. The most challenging scenario, the one we address in this work, considers only a single source image from which to generate a novel one from another viewpoint. However, in such a difficult setting, the latest learning-based solutions often struggle to integrate the camera viewpoint transformation. Indeed, the extrinsic information is often passed as-is, through a low-dimensional vector. In some cases such a camera pose, when parametrized as Euler angles, is even quantized through a one-hot representation. This vanilla encoding choice prevents the learnt architecture from inferring novel views on a continuous basis (from a camera pose perspective). We claim there exists an elegant way to better encode the relative camera pose, by leveraging 3D-related concepts such as the epipolar constraint. We therefore introduce an innovative method that encodes the viewpoint transformation as a 2D feature image. Such a camera encoding strategy gives the network meaningful insights into how the camera has moved in space between the two views. By encoding the camera pose information as a finite number of coloured epipolar lines, we demonstrate through our experiments that our strategy outperforms vanilla encoding.
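To make the idea of encoding a relative pose as coloured epipolar lines more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the random sampling of source pixels, the random colour assignment, and the pose convention X_tgt = R X_src + t with shared intrinsics K are all assumptions made for illustration. It only relies on the standard epipolar constraint x_tgt^T F x_src = 0 with F = K^{-T} [t]_x R K^{-1}.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """Fundamental matrix F with x_tgt^T F x_src = 0.

    Assumes both views share intrinsics K and that the relative pose
    maps source-camera points to the target camera: X_tgt = R X_src + t.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_encoding(K, R, t, height, width, n_lines=8, tol=1.0, seed=0):
    """Rasterise n_lines coloured epipolar lines into an (H, W, 3) image.

    Hypothetical encoding: source pixels and colours are drawn at random,
    and every target pixel within `tol` pixels of a line takes its colour.
    """
    F = fundamental_matrix(K, R, t)
    rng = np.random.default_rng(seed)
    us = rng.uniform(0, width, n_lines)      # sampled source pixel columns
    vs = rng.uniform(0, height, n_lines)     # sampled source pixel rows
    colours = rng.uniform(0.2, 1.0, (n_lines, 3))
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    image = np.zeros((height, width, 3))
    for u, v, colour in zip(us, vs, colours):
        # Epipolar line a*x + b*y + c = 0 in the target view.
        a, b, c = F @ np.array([u, v, 1.0])
        norm = np.hypot(a, b)
        if norm < 1e-12:
            continue  # degenerate line, e.g. zero translation
        distance = np.abs(a * xs + b * ys + c) / norm
        image[distance <= tol] = colour
    return image
```

For example, with `K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])`, `R = np.eye(3)` and `t = np.array([1.0, 0, 0])`, calling `epipolar_encoding(K, R, t, 64, 64)` produces a 64x64 image of horizontal coloured lines, as expected for a purely sideways camera translation. Unlike a one-hot pose vector, such an image varies continuously with the relative pose.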

updated: Mon Oct 24 2022 09:54:20 GMT+0000 (UTC)

published: Mon Oct 24 2022 09:54:20 GMT+0000 (UTC)

arXiv
