Scalable Neural Video Representations with Learnable Positional Features

Subin Kim; Sihyun Yu; Jaeho Lee; Jinwoo Shin

Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to handle videos. The main challenge is how to (a) alleviate the compute inefficiency of training CNRs and (b) achieve high-quality video encoding while (c) maintaining parameter efficiency. To meet requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. We then propose to use existing powerful image and video codecs as a compute- and memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark: compared with prior art, NVP not only trains 2 times faster (in less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured with the PSNR metric), even while using more than 8 times fewer parameters. We also show intriguing properties of NVP, such as video inpainting and video frame interpolation.
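
As a rough illustration of two ideas in the abstract, the sketch below shows how features might be read from 2D latent keyframes at a continuous video coordinate, and how PSNR is computed. It is not the authors' implementation. The abstract only says the keyframes span "each spatio-temporal axis"; the use of three grids named `xy`, `xt`, and `yt`, the bilinear interpolation, the concatenation, and all function names are assumptions made here for illustration. The decoder that would map these features to RGB values is omitted.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for signals with values in [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical signals
    return float(10.0 * np.log10(max_value ** 2 / mse))


def bilinear_lookup(grid, u, v):
    """Bilinearly interpolate a (H, W, C) grid at continuous coordinates u, v in [0, 1]."""
    h, w, _ = grid.shape
    # Map normalized coordinates to fractional grid indices.
    fu, fv = u * (h - 1), v * (w - 1)
    u0, v0 = int(np.floor(fu)), int(np.floor(fv))
    # Clamp the upper neighbours so that u = 1 or v = 1 stays inside the grid.
    u1, v1 = min(u0 + 1, h - 1), min(v0 + 1, w - 1)
    du, dv = fu - u0, fv - v0
    top = (1 - dv) * grid[u0, v0] + dv * grid[u0, v1]
    bottom = (1 - dv) * grid[u1, v0] + dv * grid[u1, v1]
    return (1 - du) * top + du * bottom


def keyframe_features(keyframes, x, y, t):
    """Concatenate features from three axis-aligned 2D latent grids (assumed layout)."""
    return np.concatenate([
        bilinear_lookup(keyframes["xy"], x, y),  # content shared over time
        bilinear_lookup(keyframes["xt"], x, t),  # content shared over the y axis
        bilinear_lookup(keyframes["yt"], y, t),  # content shared over the x axis
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical sizes: 8x8 grids with 4 feature channels each.
    keyframes = {name: rng.normal(size=(8, 8, 4)) for name in ("xy", "xt", "yt")}
    features = keyframe_features(keyframes, x=0.25, y=0.5, t=0.75)
    print(features.shape)
    print(psnr(np.zeros((4, 4)), np.full((4, 4), 0.1)))
```

In a trainable version, the grid values would be parameters optimized jointly with the decoder. Each grid stores features for only two of the three video axes, so parameter count grows with the area of a frame or slice rather than with the full video volume, which is one plausible reading of the parameter-efficiency claim.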

updated: Thu Oct 13 2022 08:15:08 GMT+0000 (UTC)

published: Thu Oct 13 2022 08:15:08 GMT+0000 (UTC)

arXiv

