Estimating Extreme 3D Image Rotation with Transformer Cross-Attention

Shay Dekel; Yosi Keller

Estimating large and extreme image rotations plays a key role in multiple computer vision domains, where the rotated images are related by a limited or non-overlapping field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume, from which the relative rotation between an image pair is estimated. In this work, we propose a cross-attention-based approach that uses CNN feature maps and a Transformer-Encoder to compute the cross-attention between the activation maps of the image pair, which is shown to be an improved equivalent of the 4D correlation volume used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes on commonly used image rotation datasets and benchmarks, establishing a new state-of-the-art accuracy on these datasets. We make our code publicly available.
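To make the relation between a 4D correlation volume and cross-attention concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single attention head, no learned query/key/value projections, and toy feature maps given as nested lists, and the function names (`flatten_map`, `correlation`, `cross_attention`) are hypothetical. The all-pairs dot products between two H x W feature maps form the correlation volume; cross-attention scales those same scores, normalizes them with a softmax, and uses them to aggregate features from the other image.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def flatten_map(fmap):
    """Flatten an H x W x C feature map (nested lists) into H*W tokens of size C."""
    return [cell for row in fmap for cell in row]


def correlation(tokens_a, tokens_b):
    """All-pairs dot products between the tokens of two images.

    For two H x W maps this is the 4D correlation volume,
    stored here as an (H*W) x (H*W) matrix.
    """
    return [[dot(a, b) for b in tokens_b] for a in tokens_a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(tokens_a, tokens_b):
    """Single-head cross-attention from image A (queries) to image B (keys, values).

    Assumption: identity projections, so queries are A's tokens and
    keys/values are B's tokens. Returns (attention weights, attended features).
    """
    d = len(tokens_a[0])
    scores = correlation(tokens_a, tokens_b)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = [
        [sum(w * b[k] for w, b in zip(w_row, tokens_b)) for k in range(d)]
        for w_row in weights
    ]
    return weights, attended


# Toy 1 x 2 feature maps with 2 channels each (made-up values).
fmap_a = [[[1.0, 0.0], [0.0, 1.0]]]
fmap_b = [[[1.0, 0.0], [0.0, 1.0]]]
weights, attended = cross_attention(flatten_map(fmap_a), flatten_map(fmap_b))
```

Each row of `weights` sums to one, and the largest weight in a row marks the region of the second image most similar to the corresponding region of the first. In a trained model the projections are learned, so the scores need not coincide with raw feature similarity.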

updated: Sun Mar 05 2023 09:07:26 GMT+0000 (UTC)

published: Sun Mar 05 2023 09:07:26 GMT+0000 (UTC)

arXiv
