Learning Trajectory-Aware Transformer for Video Super-Resolution

Chengxu Liu; Huan Yang; Jianlong Fu; Xueming Qian

Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, effectively exploiting temporal dependencies across entire video sequences remains a major challenge. Existing approaches usually align and aggregate information from a limited number of adjacent frames (e.g., 5 or 7 frames), which prevents them from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories, each consisting of continuous visual tokens. For a query token, self-attention is learned only on the relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, this design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome the scale-changing problems that often occur in long-range videos. Extensive quantitative and qualitative evaluations on four widely used video super-resolution benchmarks demonstrate the superiority of the proposed TTVSR over state-of-the-art models. Both code and pre-trained models can be downloaded at https://github.com/researchmm/TTVSR.
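
The idea of restricting attention to tokens along a trajectory can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes plain soft attention, a single head, and a precomputed index array `traj` that says which token in each frame lies on the trajectory of each query position. The function name `trajectory_attention` and all argument names are hypothetical.

```python
import numpy as np

def trajectory_attention(query, keys, values, traj):
    """Attend only to the tokens that lie on each query's trajectory.

    query:  (N, C)    one query token per spatial position in the current frame
    keys:   (T, N, C) key tokens for T frames
    values: (T, N, C) value tokens for T frames
    traj:   (T, N)    assumed given: traj[t, i] is the index of the token in
                      frame t that lies on the trajectory of query position i
    """
    T, N, C = keys.shape
    frame_idx = np.arange(T)[:, None]            # (T, 1), broadcasts against traj
    k = keys[frame_idx, traj]                    # (T, N, C): one key per frame per query
    v = values[frame_idx, traj]                  # (T, N, C)

    # Scaled dot product between each query and the T keys on its trajectory.
    scores = np.einsum("nc,tnc->tn", query, k) / np.sqrt(C)

    # Numerically stable softmax over the temporal axis.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Weighted sum of the values along each trajectory.
    return np.einsum("tn,tnc->nc", weights, v)   # (N, C)

# Toy usage with random features and random trajectory indices.
rng = np.random.default_rng(0)
T, N, C = 3, 4, 8
q = rng.normal(size=(N, C))
k = rng.normal(size=(T, N, C))
v = rng.normal(size=(T, N, C))
traj = rng.integers(0, N, size=(T, N))
out = trajectory_attention(q, k, v, traj)        # shape (N, C)
```

In this sketch each query is compared with T keys (one per frame) rather than with all T x N tokens, which is where the reduction in cost relative to full spatio-temporal attention comes from.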

updated: Wed Apr 20 2022 04:38:21 GMT+0000 (UTC)

published: Fri Apr 08 2022 03:37:39 GMT+0000 (UTC)

arXiv
