Spatio-Temporal Event Segmentation and Localization for Wildlife Extended Videos

Ramy Mounir; Roman Gula; Jörn Theuerkauf; Sudeep Sarkar

Using offline training schemes, researchers have tackled the event segmentation problem by providing full or weak supervision through manually annotated labels, or through self-supervised epoch-based training. Most existing work considers videos that are at most tens of minutes long. We present a self-supervised perceptual prediction framework capable of temporal event segmentation by building stable representations of objects over time, and we demonstrate it on long videos spanning several days. The approach is deceptively simple but quite effective. We rely on predictions of high-level features computed by a standard deep learning backbone. For prediction, we use an LSTM, augmented with an attention mechanism, trained in a self-supervised manner using the prediction error. The self-learned attention maps effectively localize and track the event-related objects in each frame. The proposed approach does not require labels. It requires only a single pass through the video, with no separate training set. Given the lack of datasets of very long videos, we demonstrate our method on video from 10 days (254 hours) of continuous wildlife monitoring data that we collected with the required permissions. We find that the approach is robust to various environmental conditions, such as day/night changes, rain, sharp shadows, and wind. For the task of temporally locating events, we achieved an 80% recall rate at a 20% false-positive rate for frame-level segmentation. At the activity level, we achieved an 80% activity recall rate at one false activity detection every 50 minutes. We will make the dataset, which is the first of its kind, and the code available to the research community.
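The idea of segmenting events from prediction error, and of scoring the result by frame-level recall and false-positive rate, can be illustrated with a small sketch. This is not the authors' implementation: a simple persistence predictor (the next frame's features are predicted to equal the current frame's) stands in for the attention-augmented LSTM, and the function names, the toy feature vectors, and the threshold value are all hypothetical choices made for illustration.

```python
from math import sqrt


def prediction_errors(features):
    """Per-frame L2 error of a persistence predictor (next = current).

    Assumption: this trivial predictor replaces the paper's
    attention-augmented LSTM, which is not reproduced here.
    """
    errors = [0.0]  # the first frame has no prediction to compare against
    for prev, curr in zip(features, features[1:]):
        errors.append(sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr))))
    return errors


def flag_event_frames(errors, threshold):
    """Mark a frame as an event frame when its prediction error is high."""
    return [e > threshold for e in errors]


def recall_and_fpr(predicted, truth):
    """Frame-level recall and false-positive rate against ground truth."""
    true_pos = sum(p and t for p, t in zip(predicted, truth))
    false_pos = sum(p and not t for p, t in zip(predicted, truth))
    positives = sum(truth)
    negatives = len(truth) - positives
    recall = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return recall, fpr


# Toy per-frame feature vectors (hypothetical stand-ins for backbone features).
features = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0],
            [1.0, 1.1], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical ground truth: frames 2 and 5 are event frames.
truth = [False, False, True, False, False, True]

errors = prediction_errors(features)
flags = flag_event_frames(errors, threshold=0.5)  # threshold is arbitrary
recall, fpr = recall_and_fpr(flags, truth)
print(flags, recall, fpr)
```

Sweeping the threshold over a range of values trades recall against false-positive rate, which is the kind of operating point the abstract reports (80% recall at a 20% false-positive rate).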

updated: Sun Jul 18 2021 19:35:14 GMT+0000 (UTC)

published: Tue May 05 2020 20:11:48 GMT+0000 (UTC)

arXiv
