Exploiting Temporality for Semi-Supervised Video Segmentation

In recent years, there has been remarkable progress in supervised image segmentation. Video segmentation is less explored, even though the temporal dimension is highly informative. For example, semantic labels that cannot be accurately detected in the current frame may be inferred by incorporating information from previous frames. However, video segmentation is challenging because of the amount of data that must be processed and, more importantly, the cost of obtaining ground truth annotations for each frame. In this paper, we tackle the issue of label scarcity by using consecutive frames of a video in which only one frame is annotated. We propose a deep, end-to-end trainable model that leverages temporal information to make use of easy-to-acquire unlabeled data. Our network architecture relies on a novel interconnection of two components: a fully convolutional network that models spatial information, and temporal units that are employed at intermediate levels of the convolutional network to propagate information through time. The main contribution of this work is the guidance of the temporal signal through the network. We show that placing a temporal module only between the encoder and decoder (the baseline) is suboptimal. Our extensive experiments on the CityScapes dataset indicate that the resulting model can leverage unlabeled temporal frames and significantly outperform both frame-by-frame image segmentation and the baseline approach.
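
The placement question raised in the abstract can be illustrated with a toy sketch. It is not the authors' model: `conv_stage` is a hypothetical stand-in for a convolutional stage (a fixed scaling), `TemporalUnit` is a hypothetical stand-in for a learned recurrent unit (an exponential moving average), and the three-stage layout is assumed purely for illustration. The sketch only shows where temporal state can enter the network and that the loss is computed on the single annotated frame.

```python
from typing import Dict, List, Optional, Sequence

Frame = List[float]  # toy "feature map": a flat list of values


def conv_stage(x: Frame, scale: float) -> Frame:
    """Hypothetical stand-in for a convolutional stage: fixed scaling."""
    return [scale * v for v in x]


class TemporalUnit:
    """Hypothetical stand-in for a learned recurrent unit.

    Assumption: an exponential moving average replaces the real
    temporal module, just to show information flowing through time.
    """

    def __init__(self, keep: float = 0.5) -> None:
        self.keep = keep
        self.state: Optional[Frame] = None

    def step(self, x: Frame) -> Frame:
        if self.state is None:
            self.state = list(x)  # first frame initialises the state
        else:
            self.state = [
                self.keep * h + (1.0 - self.keep) * v
                for h, v in zip(self.state, x)
            ]
        return list(self.state)


def run_sequence(frames: Sequence[Frame], temporal_levels: Sequence[int]) -> Frame:
    """Process frames in order and return the output for the last frame.

    Level 0 = encoder stage, 1 = bottleneck, 2 = decoder stage
    (an assumed layout). A temporal unit follows each listed level.
    """
    scales = [2.0, 0.5, 1.0]
    units: Dict[int, TemporalUnit] = {lvl: TemporalUnit() for lvl in temporal_levels}
    out: Frame = []
    for frame in frames:
        x = frame
        for level, scale in enumerate(scales):
            x = conv_stage(x, scale)
            if level in units:
                x = units[level].step(x)
        out = x
    return out


def loss_on_labelled_frame(pred: Frame, target: Frame) -> float:
    """Mean squared error, evaluated only on the one annotated frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


frames = [[1.0, 0.0], [0.0, 1.0]]        # last frame is the annotated one
frame_by_frame = run_sequence(frames, ())          # no temporal units
baseline = run_sequence(frames, (1,))              # bottleneck only
multi_level = run_sequence(frames, (0, 1, 2))      # intermediate levels too
```

With no temporal units, the output for the last frame depends on that frame alone. With a unit at the bottleneck only, earlier frames influence the output at a single point. With units at several levels, earlier frames influence every stage. The toy says nothing about which placement segments better; that is the empirical claim the paper evaluates on CityScapes.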

updated: Thu Aug 29 2019 15:50:12 GMT+0000 (UTC)

published: Thu Aug 29 2019 15:50:12 GMT+0000 (UTC)

arXiv
