AuxAdapt: Stable and Efficient Test-Time Adaptation for Temporally Consistent Video Semantic Segmentation

Yizhe Zhang; Shubhankar Borse; Hong Cai; Fatih Porikli


In video segmentation, generating temporally consistent results across frames is as important as achieving frame-wise accuracy. Existing methods rely on either optical flow regularization or fine-tuning with test data to attain temporal consistency. However, optical flow is not always available or reliable, and it is expensive to compute. Fine-tuning the original model at test time is also computationally costly. This paper presents AuxAdapt, an efficient, intuitive, and unsupervised online adaptation method for improving the temporal consistency of most neural network models. It does not require optical flow and takes only one pass over the video. Since inconsistency mainly arises from the model's uncertainty in its output, we propose an adaptation scheme in which the model learns from its own segmentation decisions as it streams a video, which allows it to produce more confident and temporally consistent labeling for similar-looking pixels across frames. For stability and efficiency, we leverage a small auxiliary segmentation network (AuxNet) to assist with this adaptation. More specifically, AuxNet readjusts the decision of the original segmentation network (MainNet) by adding its own estimates to those of MainNet. At every frame, only AuxNet is updated via back-propagation while MainNet is kept fixed. We extensively evaluate our test-time adaptation approach on standard video benchmarks, including Cityscapes, CamVid, and KITTI. The results demonstrate that our approach provides label-wise accurate, temporally consistent, and computationally efficient adaptation (more than a 5-fold overhead reduction compared to state-of-the-art test-time adaptation methods).
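The adaptation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that each "network" is a per-pixel linear classifier, that the pseudo-label is the argmax of the summed MainNet and AuxNet logits, and that AuxNet is trained with a cross-entropy loss on the combined prediction. The names `adapt_on_stream`, `main_w`, `aux_w`, and `lr` are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    # One logit per class: dot product of each class row with the pixel feature.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def adapt_on_stream(frames, main_w, aux_w, lr=0.1):
    """Single pass over the video: predict each frame, then update AuxNet only.

    Assumption: both networks are toy linear classifiers on per-pixel features.
    main_w is never modified; aux_w is updated in place after every frame.
    """
    labels_per_frame = []
    for frame in frames:
        frame_labels = []
        grads = [[0.0] * len(row) for row in aux_w]
        for x in frame:
            # The final decision adds the AuxNet logits to the MainNet logits.
            combined = [m + a for m, a in zip(linear(main_w, x), linear(aux_w, x))]
            probs = softmax(combined)
            # Pseudo-label: the model's own current decision for this pixel.
            y = max(range(len(probs)), key=probs.__getitem__)
            frame_labels.append(y)
            # Cross-entropy gradient with respect to the AuxNet weights only.
            for c in range(len(aux_w)):
                err = probs[c] - (1.0 if c == y else 0.0)
                for d in range(len(x)):
                    grads[c][d] += err * x[d]
        # One gradient step per frame, averaged over the pixels of that frame.
        n = len(frame)
        for c in range(len(aux_w)):
            for d in range(len(aux_w[c])):
                aux_w[c][d] -= lr * grads[c][d] / n
        labels_per_frame.append(frame_labels)
    return labels_per_frame

# Toy video: 20 identical frames with two pixels, each described by 2 features.
main_w = [[1.0, 0.0], [0.0, 1.0]]   # frozen "MainNet" weights (illustrative)
aux_w = [[0.0, 0.0], [0.0, 0.0]]    # "AuxNet" starts with no influence
frames = [[[1.0, 0.6], [0.6, 1.0]] for _ in range(20)]
labels = adapt_on_stream(frames, main_w, aux_w)
```

In this sketch the frozen weights stay untouched, while the auxiliary weights drift in the direction that makes the combined prediction more confident about the labels it has already chosen. That is the intuition the abstract gives for improved temporal consistency. A real system would use segmentation networks and an autograd framework in place of the hand-written gradient.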

updated: Sun Oct 24 2021 07:07:41 GMT+0000 (UTC)

published: Sun Oct 24 2021 07:07:41 GMT+0000 (UTC)

arXiv

