H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions

Changlin Li; Guangyang Wu; Yanan Sun; Xin Tao; Chi-Keung Tang; Yu-Wing Tai

Capitalizing on the rapid development of neural networks, recent video frame interpolation (VFI) methods have achieved notable improvements. However, they still fall short on real-world videos containing large motions. The complex deformation and/or occlusion caused by large motions makes frame interpolation extremely difficult. In this paper, we propose a simple yet effective solution, H-VFI, to deal with large motions in video frame interpolation. H-VFI contributes a hierarchical video interpolation transformer (HVIT) that learns a deformable kernel with a coarse-to-fine strategy over multiple scales. The learnt deformable kernel is then used to convolve the input frames and predict the interpolated frame. Starting from the smallest scale, H-VFI successively updates the deformable kernel by a residual, based on the previously predicted kernels, the intermediate interpolated results, and hierarchical features from the transformer. The bias and masks that refine the final output are then predicted by a transformer block from the interpolated results. The advantage of such a progressive approximation is that the large-motion frame interpolation problem can be decomposed into several relatively simpler sub-tasks, which enables a very accurate prediction in the final result. Another noteworthy contribution of our paper is a large-scale, high-quality dataset, YouTube200K, which contains videos depicting a great variety of scenarios captured at high resolution and high frame rate. Extensive experiments on multiple frame interpolation benchmarks validate that H-VFI outperforms existing state-of-the-art methods, especially for videos with large motions.
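
The coarse-to-fine residual update described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `upsample2x` and `coarse_to_fine` are hypothetical, the kernel is reduced to a per-pixel 2-D offset field, the factor of 2 between scales is an assumption, and the per-scale residuals are passed in directly rather than predicted by a transformer.

```python
import numpy as np


def upsample2x(offsets):
    """Nearest-neighbour 2x upsampling of an (H, W, 2) offset field.

    Offsets are doubled because a one-pixel shift at the coarse scale
    corresponds to a two-pixel shift at the next finer scale.
    """
    return 2.0 * offsets.repeat(2, axis=0).repeat(2, axis=1)


def coarse_to_fine(residuals):
    """Accumulate per-scale residuals, coarsest first, into one offset field.

    `residuals` is a list of (H_s, W_s, 2) arrays whose spatial size doubles
    at each step. In a learned model each residual would be predicted from
    the previous offsets, the intermediate interpolation and image features;
    here they are simply given (an assumption made for illustration).
    """
    offsets = np.zeros_like(residuals[0], dtype=float)
    for i, res in enumerate(residuals):
        if i > 0:
            # Carry the estimate from the coarser scale up to this scale.
            offsets = upsample2x(offsets)
        # Each scale only has to correct what the coarser scales missed.
        offsets = offsets + res
    return offsets


# Three scales: 2x2, 4x4 and 8x8.
residuals = [
    np.ones((2, 2, 2)),
    np.full((4, 4, 2), 0.5),
    np.zeros((8, 8, 2)),
]
final_offsets = coarse_to_fine(residuals)
```

In this toy setup a one-pixel offset found at the coarsest scale grows to a five-pixel offset at full resolution after the two upsampling steps and the half-pixel correction, which shows why each finer scale only needs to estimate a small residual even when the overall motion is large.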

updated: Mon Nov 21 2022 09:49:23 GMT+0000 (UTC)

published: Mon Nov 21 2022 09:49:23 GMT+0000 (UTC)

arXiv
