Learning to Compress Videos without Computing Motion

Meixu Chen; Todd Goodall; Anjul Patney; Alan C. Bovik

モーションを計算せずにビデオを圧縮する方法を学ぶ

この論文では、H.264やHEVCのような最新のハイブリッドビデオ圧縮コーデックの最も高価な要素である、動き推定を必要としない新しい深層学習ビデオ圧縮アーキテクチャを提案します。私たちのフレームワークは、ビデオモーションに固有の規則性を活用します。これは、変位したフレームの違いをビデオ表現として使用してニューラルネットワークをトレーニングすることでキャプチャします。さらに、LSTMモデルとUNetモデルの両方に基づく新しい時空間再構成ネットワークを提案します。これをLSTM-UNetと呼びます。結合されたネットワークは、時間的および空間的なビデオ情報の両方を効率的にキャプチャできるため、私たちの目的に非常に適しています。新しいビデオ圧縮フレームワークには、変位計算ユニット（DCU）、変位圧縮ネットワーク（DCN）、およびフレーム再構成ネットワーク（FRN）の3つのコンポーネントがあり、これらはすべて、単一の知覚損失関数に対して共同で最適化されています。 DCUは、ハイブリッドコーデックに見られる動き推定の必要性を排除し、より安価です。 DCNでは、RNNベースのネットワークを使用して、変位したフレームの差を圧縮し、フレーム間の時間情報を保持します。 LSTM-UNetは、ビデオの時空間差分表現を学習するためにFRNで使用されます。私たちの実験結果は、モーションレスVIdeoコーデック（MOVI-Codec）と呼ばれる圧縮モデルが、モーションを計算せずにビデオを効率的に圧縮する方法を学習することを示しています。私たちの実験は、MOVI-Codecがビデオコーディング標準H.264を上回り、特に高解像度ビデオで、MS-SSIMによって測定された最新のグローバル標準HEVCコーデックのパフォーマンスを上回っていることを示しています。

In this paper, we propose a new deep learning video compression architecture that does not require motion estimation, which is the most expensive element of modern hybrid video compression codecs like H.264 and HEVC. Our framework exploits the regularities inherent to video motion, which we capture by using displaced frame differences as video representations to train the neural network. In addition, we propose a new space-time reconstruction network based on both an LSTM model and a UNet model, which we call LSTM-UNet. The combined network is able to efficiently capture both temporal and spatial video information, making it highly amenable for our purposes. The new video compression framework has three components: a Displacement Calculation Unit (DCU), a Displacement Compression Network (DCN), and a Frame Reconstruction Network (FRN), all of which are jointly optimized against a single perceptual loss function. The DCU removes the need for motion estimation found in hybrid codecs, and is less expensive. In the DCN, an RNN-based network is utilized to compress displaced frame differences as well as retain temporal information between frames. The LSTM-UNet is used in the FRN to learn space time differential representations of videos. Our experimental results show that our compression model, which we call the MOtionless VIdeo Codec (MOVI-Codec), learns how to efficiently compress videos without computing motion. Our experiments show that MOVI-Codec outperforms the video coding standard H.264 and exceeds the performance of the modern global standard HEVC codec as measured by MS-SSIM, especially on higher resolution videos.

updated: Fri Dec 11 2020 01:11:07 GMT+0000 (UTC)

published: Tue Sep 29 2020 15:49:25 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト