CONVIQT: Contrastive Video Quality Estimator

Pavan C. Madhusudana; Neil Birkbeck; Yilin Wang; Balu Adsumilli; Alan C. Bovik

Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Distortion type identification and degradation level determination are employed as auxiliary tasks to train a deep learning model containing a deep Convolutional Neural Network (CNN) that extracts spatial features and a recurrent unit that captures temporal information. The model is trained using a contrastive loss, and we therefore refer to this training framework and the resulting model as CONtrastive VIdeo Quality EstimaTor (CONVIQT). During testing, the weights of the trained model are frozen, and a linear regressor maps the learned features to quality scores in a no-reference (NR) setting. We conduct comprehensive evaluations of the proposed model on multiple VQA databases by analyzing the correlations between model predictions and ground-truth quality ratings. The model achieves competitive performance compared to state-of-the-art NR-VQA models, even though it is not trained on those databases. Our ablation experiments demonstrate that the learned representations are highly robust and generalize well across synthetic and realistic distortions. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning. The implementations used in this work have been made available at https://github.com/pavancm/CONVIQT.
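
The evaluation protocol described above (frozen features, a linear regressor, and correlation against ground-truth ratings) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature matrix is synthetic random data standing in for frozen encoder outputs, and the function names, the small ridge penalty `l2`, and the train/test split are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_regressor(features, scores, l2=1e-3):
    """Closed-form ridge regression with a bias term (l2 is an assumed setting)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(X.T @ X + reg, X.T @ scores)


def predict(weights, features):
    """Map feature vectors to predicted quality scores."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights


def spearman(a, b):
    """Spearman rank correlation; this simple version assumes no tied values."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


def demo(seed=0):
    """Fit on synthetic 'frozen' features and report rank correlation on held-out data."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(200, 16))            # stand-in for frozen encoder features
    true_w = rng.normal(size=16)                  # hypothetical feature-to-quality mapping
    mos = feats @ true_w + 0.1 * rng.normal(size=200)  # synthetic quality ratings
    w = fit_linear_regressor(feats[:150], mos[:150])
    pred = predict(w, feats[150:])
    return spearman(pred, mos[150:])


if __name__ == "__main__":
    print(f"held-out Spearman correlation: {demo():.3f}")
```

Because only the regressor is fitted, the quality of the result depends entirely on how well the frozen features already encode perceptual quality, which is the property the abstract claims for the self-supervised representation.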

updated: Wed Jun 29 2022 15:22:01 GMT+0000 (UTC)

published: Wed Jun 29 2022 15:22:01 GMT+0000 (UTC)

arXiv
