Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Yubin Hu; Yuze He; Yanghao Li; Jisheng Li; Yuxing Han; Jiangtao Wen; Yong-Jin Liu

Video semantic segmentation (VSS) is computationally expensive because it requires per-frame prediction on high-frame-rate videos. Recent work has proposed compact models or adaptive network strategies for efficient VSS. However, these approaches do not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering-resolution framework for compressed videos, called AR-Seg, to achieve efficient VSS. AR-Seg reduces the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first uses the motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with a local attention mechanism. Furthermore, FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% of the computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
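The three steps the abstract names (motion-vector warping, local-attention aggregation, and a feature similarity loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, nearest-neighbour sampling, the 3x3 window, the residual addition, and the mean-squared-error loss are all assumptions made for illustration. The sketch also assumes the keyframe and non-keyframe feature maps have already been brought to the same spatial grid.

```python
import numpy as np


def warp_with_motion_vectors(key_feat, mv):
    """Warp keyframe features (C, H, W) with per-pixel motion vectors (2, H, W).

    Assumption: mv[0] is the horizontal and mv[1] the vertical offset, in
    feature-map pixels, from each non-keyframe location to its source in the
    keyframe. Nearest-neighbour sampling with border clamping keeps this short.
    """
    _, h, w = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + mv[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + mv[1]).astype(int), 0, h - 1)
    return key_feat[:, src_y, src_x]


def local_attention_fuse(lr_feat, warped, k=3):
    """Aggregate warped features inside a k x k window around each location.

    The low-resolution feature acts as the query; the warped keyframe features
    in the window act as keys and values (hypothetical design for this sketch).
    """
    c, h, w = lr_feat.shape
    r = k // 2
    padded = np.pad(warped, ((0, 0), (r, r), (r, r)), mode="edge")
    # Stack the k*k shifted views: shape (k*k, C, H, W).
    neigh = np.stack([padded[:, dy:dy + h, dx:dx + w]
                      for dy in range(k) for dx in range(k)])
    # Scaled dot-product scores per window position: shape (k*k, H, W).
    scores = (neigh * lr_feat[None]).sum(axis=1) / np.sqrt(c)
    scores -= scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    attended = (weights[:, None] * neigh).sum(axis=0)
    # Residual fusion is an assumption, not taken from the abstract.
    return lr_feat + attended


def similarity_loss(fused, hr_feat):
    """Mean squared error, used here as a stand-in for the explicit similarity loss."""
    return float(np.mean((fused - hr_feat) ** 2))
```

In training, a loss of this kind would compare the fused non-keyframe features with features extracted from the same frame at high resolution. The abstract does not specify the exact form of the loss.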

updated: Mon Mar 13 2023 15:58:15 GMT+0000 (UTC)

published: Mon Mar 13 2023 15:58:15 GMT+0000 (UTC)

arXiv
