Real-Time Video Super-Resolution by Joint Local Inference and Global Parameter Estimation

Noam Elron; Alex Itskovich; Shahar S. Yuval; Noam Levy

The state of the art in video super-resolution (SR) is techniques based on deep learning, but they perform poorly on real-world videos (see Figure 1). The reason is that training image-pairs are commonly created by downscaling a high-resolution image to produce a low-resolution counterpart. Deep models are therefore trained to undo downscaling and do not generalize to super-resolving real-world images. Several recent publications present techniques for improving the generalization of learning-based SR, but all of them are ill-suited for real-time application. We present a novel approach to synthesizing training data by simulating two digital-camera image-capture processes at different scales. Our method produces image-pairs in which both images have the properties of natural images. Training an SR model using this data leads to far better generalization to real-world images and videos. In addition, deep video-SR models are characterized by a high operations-per-pixel count, which prohibits their application in real-time. We present an efficient CNN architecture, which enables real-time application of video SR on low-power edge-devices. We split the SR task into two sub-tasks: a control-flow, which estimates global properties of the input video and adapts the weights and biases of a processing-CNN, and the processing-CNN itself, which performs the actual processing. Since the processing-CNN is tailored to the statistics of the input, its capacity can be kept low while retaining effectiveness. Also, since video statistics evolve slowly, the control-flow operates at a much lower rate than the video frame-rate. This reduces the overall computational load by as much as two orders of magnitude. This framework of decoupling the adaptivity of the algorithm from the pixel processing can be applied in a large family of real-time video enhancement applications, e.g., video denoising, local tone-mapping, and stabilization.
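The split between a slow control-flow and a lightweight per-frame process can be illustrated with a toy sketch. This is not the authors' network and it performs no super-resolution: the "process" is a single 3-tap filter on a 1-D frame, and every name and value here (`CONTROL_PERIOD`, `estimate_global_params`, `process_frame`, `run`, the statistics used, and the mapping from statistics to weights) is a hypothetical stand-in chosen only to show the decoupling described above.

```python
from statistics import fmean, pstdev

# Hypothetical: run the control-flow once every 30 frames.
CONTROL_PERIOD = 30


def estimate_global_params(frame):
    """Control-flow (toy): estimate global statistics of a frame and map
    them to the weights and bias used by the per-pixel process."""
    mean = fmean(frame)
    std = pstdev(frame)
    # Assumed mapping: higher-variance input gets a smoother kernel.
    side = 0.25 if std > 0.2 else 0.1
    weights = [side, 1.0 - 2.0 * side, side]  # 3-tap kernel summing to 1
    bias = 0.5 - mean                         # re-centre brightness on 0.5
    return weights, bias


def process_frame(frame, weights, bias):
    """Process (toy): one tiny 1-D convolution with clamped-edge padding."""
    padded = [frame[0]] + list(frame) + [frame[-1]]
    return [
        sum(w * padded[i + k] for k, w in enumerate(weights)) + bias
        for i in range(len(frame))
    ]


def run(video, control_period=CONTROL_PERIOD):
    """Process every frame, but refresh the parameters only periodically."""
    outputs = []
    control_calls = 0
    weights, bias = None, None
    for n, frame in enumerate(video):
        if n % control_period == 0:
            # Slow path: re-estimate the global parameters.
            weights, bias = estimate_global_params(frame)
            control_calls += 1
        # Fast path: cheap per-pixel processing with the current parameters.
        outputs.append(process_frame(frame, weights, bias))
    return outputs, control_calls


if __name__ == "__main__":
    video = [[0.2, 0.4, 0.6, 0.8] for _ in range(90)]
    outputs, control_calls = run(video)
    print(len(outputs), "frames processed;", control_calls, "control-flow runs")
```

In this sketch the parameter estimation runs on only one frame in every `CONTROL_PERIOD`, so its cost is amortized over the frames in between. The actual saving depends on the relative cost of the two parts; the two-orders-of-magnitude figure is the authors' reported result and is not derived from this toy.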

updated: Thu May 06 2021 16:35:09 GMT+0000 (UTC)

published: Thu May 06 2021 16:35:09 GMT+0000 (UTC)

arXiv
