Can SAM Boost Video Super-Resolution?

Zhihe Lu; Zeyu Xiao; Jiawang Bai; Zhiwei Xiong; Xinchao Wang

The primary challenge in video super-resolution (VSR) is handling large motions in the input frames, which make it difficult to accurately aggregate information from multiple frames. Existing works either adopt deformable convolutions or estimate optical flow as a prior to establish correspondences between frames for effective alignment and fusion. However, they fail to take into account valuable semantic information that can greatly enhance alignment and fusion, and flow-based methods rely heavily on the accuracy of a flow estimation model, which may not provide precise flows given two low-resolution frames. In this paper, we investigate a more robust and semantic-aware prior for enhanced VSR by utilizing the Segment Anything Model (SAM), a powerful foundation model that is less susceptible to image degradation. To use the SAM-based prior, we propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM), which can enhance both the alignment and fusion procedures by using semantic information. This lightweight plug-in module is specifically designed not only to leverage the attention mechanism for generating semantic-aware features but also to be easily and seamlessly integrated into existing methods. Concretely, we apply SEEM to two representative methods, EDVR and BasicVSR, and obtain consistently improved performance with minimal implementation effort on three widely used VSR datasets: Vimeo-90K, REDS, and Vid4. More importantly, we find that the proposed SEEM can advance existing methods in an efficient tuning manner, providing increased flexibility in adjusting the balance between performance and the number of training parameters. The code will be open-sourced soon.
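
The abstract says SEEM uses an attention mechanism to generate semantic-aware features from a SAM-based prior, but it does not give the architecture. As a rough illustration of the general idea only, the sketch below applies generic single-head cross-attention in NumPy: frame features act as queries, a SAM-derived embedding supplies keys and values, and the result is added back residually. The function name `semantic_refine`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are assumptions made for this example; this is not the authors' SEEM implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_refine(frame_feat, sam_feat, w_q, w_k, w_v):
    """Hypothetical cross-attention refinement (illustrative, not SEEM itself).

    frame_feat: (C, H, W) features of a low-resolution frame.
    sam_feat:   (D, H, W) semantic embedding, assumed to come from SAM
                and to be resized to the same spatial grid.
    w_q: (C, E), w_k: (D, E), w_v: (D, C) projection matrices.
    Returns refined features of shape (C, H, W).
    """
    c, h, w = frame_feat.shape
    d = sam_feat.shape[0]
    n = h * w

    frame_tokens = frame_feat.reshape(c, n).T  # (N, C), one token per pixel
    sam_tokens = sam_feat.reshape(d, n).T      # (N, D)

    q = frame_tokens @ w_q                     # (N, E) queries from the frame
    k = sam_tokens @ w_k                       # (N, E) keys from the SAM prior
    v = sam_tokens @ w_v                       # (N, C) values from the SAM prior

    # Scaled dot-product attention over all spatial positions.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, N)
    semantic = attn @ v                        # (N, C)

    # Residual connection keeps the original features intact.
    return frame_feat + semantic.T.reshape(c, h, w)
```

Called with, for instance, a `(4, 3, 5)` frame feature map and a `(6, 3, 5)` semantic embedding, the function returns an array with the same shape as the frame features. The residual form is one plausible way to make such a block a plug-in: with zero value projections the host network's features pass through unchanged.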

updated: Thu May 11 2023 02:02:53 GMT+0000 (UTC)

published: Thu May 11 2023 02:02:53 GMT+0000 (UTC)

arXiv

