Domain Adaptive Video Semantic Segmentation via Cross-Domain Moving Object Mixing

Kyusik Cho; Suhyeon Lee; Hongje Seong; Euntai Kim

Networks trained for domain adaptation tend to be biased toward easy-to-transfer classes. Because ground-truth labels for the target domain are unavailable during training, this bias skews predictions, and the network forgets to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including hard-to-transfer classes, from a source domain video clip and pastes them into a target domain video clip. Unlike image-level domain adaptation, mixing moving objects from two different videos requires that temporal context be maintained. We therefore design CMOM to mix across consecutive video frames so that unrealistic movements do not occur. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground-truth labels, to learn discriminative target domain features in an unsupervised manner by filtering unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches an mIoU of 53.81% on the VIPER to Cityscapes-Seq benchmark and an mIoU of 56.31% on the SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing state-of-the-art methods by large margins.
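The cut-and-paste idea behind CMOM can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `mix_moving_objects`, the array layouts, and the use of target pseudo-labels for the unpasted pixels are all assumptions made for illustration. The sketch only shows the key point from the abstract, that the same set of source classes is cut from every frame of the clip, so each pasted object keeps its own motion across consecutive frames.

```python
import numpy as np


def mix_moving_objects(src_frames, src_labels, tgt_frames, tgt_pseudo_labels, class_ids):
    """Paste pixels of selected source classes onto a target clip (hypothetical helper).

    src_frames, tgt_frames: arrays of shape (T, H, W, 3), consecutive frames.
    src_labels: array of shape (T, H, W), ground-truth class ids for the source clip.
    tgt_pseudo_labels: array of shape (T, H, W), assumed pseudo-labels for the target clip.
    class_ids: class ids to cut from the source clip (same set for every frame).
    """
    # Build one mask per frame from the same class set, so a pasted object
    # follows its source trajectory instead of jumping between frames.
    mask = np.isin(src_labels, class_ids)  # (T, H, W), boolean

    # Where the mask is set, take source pixels and labels; elsewhere keep the target.
    mixed_frames = np.where(mask[..., None], src_frames, tgt_frames)
    mixed_labels = np.where(mask, src_labels, tgt_pseudo_labels)
    return mixed_frames, mixed_labels, mask
```

Calling the function on two clips of equal shape returns the mixed frames, the mixed label maps, and the per-frame paste mask. How the classes in `class_ids` are selected, and how the mixed clip enters the training loss, is not specified in the abstract and is left out here.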

updated: Fri Nov 04 2022 08:10:33 GMT+0000 (UTC)

published: Fri Nov 04 2022 08:10:33 GMT+0000 (UTC)

arXiv
