Video Frame Interpolation with Transformer

Liying Lu; Ruizheng Wu; Huaijia Lin; Jiangbo Lu; Jiaya Jia

Video frame interpolation (VFI), which aims to synthesize intermediate frames of a video, has made remarkable progress in recent years with the development of deep convolutional networks. Existing methods built upon convolutional networks generally struggle to handle large motion because of the locality of convolution operations. To overcome this limitation, we introduce a novel framework that uses a Transformer to model long-range pixel correlations among video frames. Further, our network is equipped with a novel cross-scale window-based attention mechanism, in which windows at different scales interact with each other. This design effectively enlarges the receptive field and aggregates multi-scale information. Extensive quantitative and qualitative experiments demonstrate that our method achieves new state-of-the-art results on various benchmarks.
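The abstract does not spell out how the cross-scale window attention is built, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a single feature map, non-overlapping windows of size `w`, identity query/key/value projections, and a coarse window obtained by average-pooling the 2w x 2w region centred on each fine window. The function name `cross_scale_window_attention` and all parameters are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_scale_window_attention(feat, w=4):
    """Toy cross-scale window attention (hypothetical, for illustration only).

    feat: array of shape (H, W, C), with H and W divisible by w, and w even.
    Each fine window of w*w tokens attends to its own tokens plus w*w coarse
    tokens pooled from the surrounding 2w x 2w region.
    """
    feat = np.asarray(feat, dtype=np.float64)
    H, W, C = feat.shape
    assert w % 2 == 0 and H % w == 0 and W % w == 0

    # Pad by w/2 so that a 2w x 2w region centred on every window exists.
    p = w // 2
    padded = np.pad(feat, ((p, p), (p, p), (0, 0)), mode="edge")

    out = np.empty_like(feat)
    for i in range(0, H, w):
        for j in range(0, W, w):
            # Fine-scale tokens: the window itself.
            fine = feat[i:i + w, j:j + w].reshape(w * w, C)

            # Coarse-scale tokens: the larger region, 2x2 average-pooled
            # down to the same w x w token grid.
            region = padded[i:i + 2 * w, j:j + 2 * w]
            coarse = region.reshape(w, 2, w, 2, C).mean(axis=(1, 3))
            coarse = coarse.reshape(w * w, C)

            # Queries come from the fine window; keys and values come from
            # both scales (assumption: no learned projections).
            kv = np.concatenate([fine, coarse], axis=0)
            attn = softmax(fine @ kv.T / np.sqrt(C), axis=-1)
            out[i:i + w, j:j + w] = (attn @ kv).reshape(w, w, C)
    return out


if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(8, 8, 3))
    y = cross_scale_window_attention(x, w=4)
    print(y.shape)
```

In this sketch the coarse tokens let each query see context from outside its own window at the cost of only w*w extra keys, which is one simple way a receptive field can be enlarged while aggregating two scales. A real model would add learned projections, multiple heads, and correlation across the two input frames.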

updated: Sun May 15 2022 09:30:28 GMT+0000 (UTC)

published: Sun May 15 2022 09:30:28 GMT+0000 (UTC)

arXiv
