DCVNet: Dilated Cost Volume Networks for Fast Optical Flow

Huaizu Jiang; Erik Learned-Miller

The cost volume, which captures the similarity of possible correspondences across two input images, is a key ingredient in state-of-the-art optical flow approaches. When sampling correspondences to build the cost volume, a large neighborhood radius is required to handle large displacements, which introduces a significant computational burden. To address this, a sequential strategy is usually adopted, in which correspondence sampling in a local neighborhood with a small radius suffices. However, such sequential approaches, instantiated either by a pyramid structure over a deep neural network's feature hierarchy or by a recurrent neural network, are slow because the cost volumes must be processed one after another. In this paper, we propose dilated cost volumes to capture small and large displacements simultaneously, allowing optical flow estimation without a sequential estimation strategy. To turn the cost volume into pixel-wise optical flow, existing approaches employ 2D or separable 4D convolutions, which we show suffer from high GPU memory consumption, inferior accuracy, or large model size. We therefore propose using 3D convolutions for cost volume filtering to address these issues. By combining dilated cost volumes and 3D convolutions, our proposed model DCVNet not only runs in real time (71 fps on a mid-range 1080 Ti GPU) but is also compact and achieves accuracy comparable to existing approaches.
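The idea of a dilated cost volume can be illustrated with a minimal NumPy sketch. The abstract does not specify feature levels, strides, normalization, or the set of dilation rates, so the function name `dilated_cost_volume`, its parameters, and the default dilations `(1, 2, 4)` are assumptions made for illustration only, not the authors' implementation. For each dilation rate, a small window of candidate offsets is spread out by that rate, and the similarity is taken as the channel-wise mean of the feature product.

```python
import numpy as np


def dilated_cost_volume(feat1, feat2, radius=1, dilations=(1, 2, 4)):
    """Illustrative dilated cost volume (hypothetical helper, not the paper's code).

    feat1, feat2: float arrays of shape (C, H, W).
    Returns an array of shape (len(dilations), (2*radius+1)**2, H, W), where
    entry [i, k, y, x] is the similarity between feat1[:, y, x] and
    feat2[:, y + dy*d, x + dx*d] for the k-th offset (dy, dx) and d = dilations[i].
    """
    c, h, w = feat1.shape
    k = 2 * radius + 1
    max_shift = radius * max(dilations)

    # Zero-pad the second feature map so out-of-bounds candidates score zero.
    padded = np.pad(
        feat2, ((0, 0), (max_shift, max_shift), (max_shift, max_shift))
    )

    cost = np.zeros((len(dilations), k * k, h, w), dtype=feat1.dtype)
    for di, d in enumerate(dilations):
        idx = 0
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                # Top-left corner of the window shifted by (dy*d, dx*d).
                y0 = max_shift + dy * d
                x0 = max_shift + dx * d
                shifted = padded[:, y0:y0 + h, x0:x0 + w]
                # Similarity: mean over channels of the elementwise product.
                cost[di, idx] = (feat1 * shifted).mean(axis=0)
                idx += 1
    return cost
```

With `radius=1` and three dilation rates, each pixel is compared against 3 × 9 = 27 candidates that reach displacements of up to ±4 feature pixels, whereas a dense window of radius 4 would need 9 × 9 = 81 candidates. The dilated sampling is sparse at the larger rates, so it trades dense coverage for a single, non-sequential pass.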

updated: Wed Mar 31 2021 17:59:31 GMT+0000 (UTC)

published: Wed Mar 31 2021 17:59:31 GMT+0000 (UTC)

arXiv
