On the Choice of Perception Loss Function for Learned Video Compression

Sadaf Salehkalaibar; Buu Phan; Jun Chen; Wei Yu; Ashish Khisti

We study causal, low-latency, sequential video compression when the output is subjected to both a mean squared error (MSE) distortion loss and a perception loss that targets realism. Motivated by prior approaches, we consider two different perception loss functions (PLFs). The first, PLF-JD, considers the joint distribution (JD) of all the video frames up to the current one, while the second, PLF-FMD, considers the framewise marginal distributions (FMD) of the source and the reconstruction. Using information-theoretic analysis and deep-learning-based experiments, we demonstrate that the choice of PLF can have a significant effect on the reconstruction, especially at low bit rates. In particular, while the reconstruction based on PLF-JD better preserves the temporal correlation across frames, it also incurs a significant distortion penalty compared to PLF-FMD and makes it harder to recover from errors made in earlier output frames. Although the choice of PLF decisively affects reconstruction quality, we also demonstrate that it may not be essential to commit to a particular PLF during encoding: the choice of PLF can be delegated to the decoder. In particular, encoded representations generated by training a system to minimize the MSE (without requiring either PLF) can be near universal and can generate close-to-optimal reconstructions for either choice of PLF at the decoder. We validate our results using (one-shot) information-theoretic analysis, a detailed study of the rate-distortion-perception tradeoff for the Gauss-Markov source model, and deep-learning-based experiments on the Moving MNIST and KTH datasets.
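
The difference between the two PLFs can be illustrated with a small numerical sketch. The abstract does not specify the divergence used, so the code below assumes the squared Wasserstein-2 distance between zero-mean Gaussians, chosen only because it has a closed form. The two-frame Gauss-Markov source with correlation `rho`, the reconstruction with independent frames, and the function names `plf_jd` and `plf_fmd` are all hypothetical choices made for illustration, not the paper's implementation.

```python
import numpy as np


def psd_sqrt(m):
    """Symmetric square root of a positive semidefinite matrix."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T


def w2_squared_gaussian(cov_a, cov_b):
    """Squared Wasserstein-2 distance between N(0, cov_a) and N(0, cov_b).

    Assumption: this divergence stands in for the perception metric.
    """
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    return float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross))


def plf_jd(cov_src, cov_rec, j):
    """Joint-distribution loss at frame j: compares frames 1..j jointly."""
    return w2_squared_gaussian(cov_src[:j, :j], cov_rec[:j, :j])


def plf_fmd(cov_src, cov_rec, j):
    """Framewise-marginal loss at frame j: compares frame j alone."""
    k = j - 1
    return w2_squared_gaussian(cov_src[k:k + 1, k:k + 1],
                               cov_rec[k:k + 1, k:k + 1])


# Hypothetical two-frame Gauss-Markov source with unit variance per frame.
rho = 0.9
cov_source = np.array([[1.0, rho], [rho, 1.0]])

# Hypothetical reconstruction: correct per-frame marginals, but the
# temporal correlation between the two frames has been lost.
cov_recon = np.eye(2)

fmd_frame2 = plf_fmd(cov_source, cov_recon, 2)
jd_frame2 = plf_jd(cov_source, cov_recon, 2)
print(f"PLF-FMD at frame 2: {fmd_frame2:.4f}")
print(f"PLF-JD  at frame 2: {jd_frame2:.4f}")
```

Under these assumptions the framewise-marginal loss is zero, because each reconstructed frame has the right distribution on its own, while the joint-distribution loss is strictly positive, because it also penalizes the lost correlation between frames. This matches the abstract's qualitative claim that PLF-JD better preserves temporal correlation.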

updated: Tue May 30 2023 14:24:40 GMT+0000 (UTC)

published: Tue May 30 2023 14:24:40 GMT+0000 (UTC)

arXiv
