E-VFIA : Event-Based Video Frame Interpolation with Attention

Onur Selim Kılıç; Ahmet Akman; A. Aydın Alatan

Video frame interpolation (VFI) is a fundamental vision task that aims to synthesize several frames between two consecutive original video images. Most algorithms attempt VFI using only keyframes, which is an ill-posed problem because keyframes usually do not provide precise information about the trajectories of the objects in the scene. Event-based cameras, in contrast, provide more precise information between the keyframes of a video. Some recent state-of-the-art event-based methods address this problem by using event data to improve optical flow estimation and then interpolating video frames by warping. Nonetheless, those methods suffer heavily from ghosting. Meanwhile, some kernel-based VFI methods that use only frames as input have shown that deformable convolutions, when backed by transformers, can be a reliable way of handling long-range dependencies. We propose event-based video frame interpolation with attention (E-VFIA) as a lightweight kernel-based method. E-VFIA fuses event information with standard video frames through deformable convolutions to generate high-quality interpolated frames. The proposed method represents events with high temporal resolution and uses a multi-head self-attention mechanism to better encode event-based information, while being less vulnerable to blurring and ghosting artifacts, thus generating crisper frames. The simulation results show that the proposed technique outperforms current state-of-the-art methods (both frame-based and event-based) with a significantly smaller model size.
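The abstract does not specify how events are represented beyond "high temporal resolution". As an illustration only, the following minimal sketch shows one common way to turn an event stream into a dense tensor: accumulating event polarities into a fixed number of temporal bins (often called a voxel grid). The function name `events_to_voxel_grid`, the `(t, x, y, polarity)` event layout, and the binning rule are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a (num_bins, height, width) grid.

    `events` is a sequence of (t, x, y, polarity) rows. This layout is an
    assumption for the sketch, not the representation used by E-VFIA.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid

    events = np.asarray(events, dtype=np.float64)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3].astype(np.float32)

    # Normalize timestamps to [0, 1]; simultaneous events all map to 0.
    t_span = t.max() - t.min()
    t_norm = (t - t.min()) / t_span if t_span > 0 else np.zeros_like(t)

    # Map each event to a temporal bin; the latest event falls in the last bin.
    bins = np.minimum((t_norm * num_bins).astype(int), num_bins - 1)

    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(grid, (bins, y, x), p)
    return grid
```

A larger `num_bins` keeps more of the temporal ordering of events between two keyframes, at the cost of a larger input tensor. A network that fuses events with frames could consume such a grid as extra input channels.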

updated: Wed Mar 01 2023 12:52:16 GMT+0000 (UTC)

published: Mon Sep 19 2022 21:40:32 GMT+0000 (UTC)

arXiv
