Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training

Xiao Lu; Yihong Cao; Sheng Liu; Chengjiang Long; Zipei Chen; Xuanyu Zhou; Yimin Yang; Chunxia Xiao

It is challenging to annotate large-scale datasets for supervised video shadow detection methods. Directly applying a model trained on labeled images to video frames may lead to high generalization error and temporally inconsistent results. In this paper, we address these challenges by proposing a Spatio-Temporal Interpolation Consistency Training (STICT) framework to rationally feed unlabeled video frames, together with labeled images, into the training of an image shadow detection network. Specifically, we propose the Spatial and Temporal ICT, in which we define two new interpolation schemes, i.e., the spatial interpolation and the temporal interpolation. We then derive the spatial and temporal interpolation consistency constraints accordingly, for enhancing generalization in the pixel-wise classification task and for encouraging temporally consistent predictions, respectively. In addition, we design a Scale-Aware Network for multi-scale shadow knowledge learning in images, and propose a scale-consistency constraint to minimize the discrepancy among the predictions at different scales. Our proposed approach is extensively validated on the ViSha dataset and a self-annotated dataset. Experimental results show that, even without video labels, our approach outperforms most state-of-the-art supervised, semi-supervised, and unsupervised image/video shadow detection methods, as well as other methods from related tasks. Code and dataset are available at https://github.com/yihong-97/STICT.
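As a rough illustration of the general idea behind interpolation consistency training, the sketch below computes a consistency loss on two unlabeled inputs: the prediction on a mixed input is pushed toward the same mix of the individual predictions. This is the generic mixup-style formulation only. The abstract does not specify how the spatial and temporal interpolation schemes are defined, so they are not reproduced here, and all names (`mix`, `consistency_loss`, the toy per-pixel models) are hypothetical.

```python
import math


def mix(a, b, lam):
    """Pixel-wise convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def consistency_loss(student, teacher, frame_a, frame_b, lam):
    """Mean squared error between the student's prediction on the mixed
    input and the mix of the teacher's predictions on each input."""
    pred_on_mixed = student(mix(frame_a, frame_b, lam))
    mixed_targets = mix(teacher(frame_a), teacher(frame_b), lam)
    squared = [(p - t) ** 2 for p, t in zip(pred_on_mixed, mixed_targets)]
    return sum(squared) / len(squared)


# Toy per-pixel "networks" (assumptions for illustration, not the paper's model).
def affine_model(pixels):
    return [0.5 * v + 0.1 for v in pixels]


def sigmoid_model(pixels):
    return [1.0 / (1.0 + math.exp(-4.0 * (v - 0.5))) for v in pixels]


frame_a = [0.0, 0.2]  # two flattened "frames" with two pixels each
frame_b = [1.0, 0.4]

# An affine model commutes with interpolation, so its loss is (numerically) zero;
# a nonlinear model does not, so the constraint yields a non-zero training signal.
loss_affine = consistency_loss(affine_model, affine_model, frame_a, frame_b, 0.3)
loss_sigmoid = consistency_loss(sigmoid_model, sigmoid_model, frame_a, frame_b, 0.3)
```

In a real training loop the teacher would typically be a slowly updated copy of the student and the loss would be added to the supervised loss on labeled images; both of those choices are assumptions here and are not taken from the abstract.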

updated: Fri Jun 17 2022 14:29:51 GMT+0000 (UTC)

published: Fri Jun 17 2022 14:29:51 GMT+0000 (UTC)

arXiv
