End-to-end Neural Video Coding Using a Compound Spatiotemporal Representation

Haojie Liu; Ming Lu; Zhiqi Chen; Xun Cao; Zhan Ma; Yao Wang

Recent years have witnessed rapid advances in learnt video coding. Most algorithms rely solely on vector-based motion representation and resampling (e.g., optical-flow-based bilinear sampling) to exploit inter-frame redundancy. Despite the great success of adaptive kernel-based resampling (e.g., adaptive convolutions and deformable convolutions) in video prediction for uncompressed videos, integrating such approaches with rate-distortion optimization for inter-frame coding has been less successful. Recognizing that each resampling solution offers unique advantages in regions with different motion and texture characteristics, we propose a hybrid motion compensation (HMC) method that adaptively combines the predictions generated by these two approaches. Specifically, we generate a compound spatiotemporal representation (CSTR) through a recurrent information aggregation (RIA) module using information from the current and multiple past frames. We further design a one-to-many decoder pipeline to generate multiple predictions from the CSTR, including vector-based resampling, adaptive kernel-based resampling, compensation mode selection maps and texture enhancements, and combine them adaptively to achieve more accurate inter prediction. Experiments show that our proposed inter coding system provides better motion-compensated prediction and is more robust to occlusions and complex motions. Together with a jointly trained intra coder and residual coder, the overall learnt hybrid coder yields state-of-the-art coding efficiency in the low-delay scenario, compared to the traditional H.264/AVC and H.265/HEVC as well as recently published learning-based methods, in terms of both PSNR and MS-SSIM metrics.
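The idea of combining the two resampling approaches can be illustrated with a minimal NumPy sketch. This is not the authors' network: the abstract does not say how the predictions are fused, so the sketch assumes a per-pixel convex blend driven by a soft mode selection map, works on a single-channel frame, and omits the CSTR, the RIA module, texture enhancement and all learning. The function names (`bilinear_warp`, `adaptive_kernel_resample`, `hybrid_prediction`) are hypothetical.

```python
import numpy as np

def bilinear_warp(ref, flow):
    """Vector-based resampling: sample ref at (x + u, y + v) bilinearly.

    ref:  (H, W) reference frame
    flow: (H, W, 2) per-pixel motion vectors, flow[..., 0] = u, flow[..., 1] = v
    """
    h, w = ref.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Clamp sampling positions to the frame (border replication).
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * ref[y0, x0] + wx * ref[y0, x1]
    bottom = (1 - wx) * ref[y1, x0] + wx * ref[y1, x1]
    return (1 - wy) * top + wy * bottom

def adaptive_kernel_resample(ref, kernels):
    """Kernel-based resampling: every pixel has its own k x k kernel.

    kernels: (H, W, k, k) per-pixel weights applied to the k x k
             neighbourhood centred on each pixel (k odd).
    """
    h, w, k, _ = kernels.shape
    r = k // 2
    padded = np.pad(ref, r, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, :, dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def hybrid_prediction(ref, flow, kernels, mode_map):
    """Assumed fusion rule: per-pixel convex blend of the two predictions.

    mode_map: (H, W) values in [0, 1]; 1 selects the vector-based
              prediction, 0 the kernel-based one.
    """
    vec_pred = bilinear_warp(ref, flow)
    ker_pred = adaptive_kernel_resample(ref, kernels)
    return mode_map * vec_pred + (1 - mode_map) * ker_pred
```

In a learnt coder the flow, kernels and mode map would all be produced by decoders and trained under a rate-distortion loss; here they are plain inputs so that the resampling and blending steps can be inspected in isolation.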

updated: Thu Aug 05 2021 19:43:32 GMT+0000 (UTC)

published: Thu Aug 05 2021 19:43:32 GMT+0000 (UTC)

arXiv
