Entropy-driven Unsupervised Keypoint Representation Learning in Videos

Ali Younes; Simone Schaub-Meyer; Georgia Chalvatzaki

Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. We present a novel approach for unsupervised learning of meaningful representations from videos, leveraging the concept of image spatial entropy (ISE), which quantifies the per-pixel information in an image. We argue that the local entropy of pixel neighborhoods and its temporal evolution create valuable intrinsic supervisory signals for learning prominent features. Building on this idea, we abstract visual features into a concise representation of keypoints that act as dynamic information transmitters, and design a deep learning model that learns, purely unsupervised, spatially and temporally consistent representations directly from video frames. Two original information-theoretic losses, computed from local entropy, guide our model to discover consistent keypoint representations: a loss that maximizes the spatial information covered by the keypoints, and a loss that optimizes the keypoints' information transportation over time. We compare our keypoint representation to strong baselines on various downstream tasks, e.g., learning object dynamics. Our empirical results show superior performance for our information-driven keypoints, which resolve challenges such as attending to both static and dynamic objects, or to objects abruptly entering and leaving the scene.
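
To make the notion of per-pixel local entropy more concrete, the following is a minimal sketch that computes the Shannon entropy of the intensity histogram in a small window around each pixel. It is an illustration under stated assumptions, not the paper's ISE formulation: the function name `local_entropy`, the window `radius`, the number of histogram `bins`, and the assumption of a grayscale image scaled to [0, 1] are all hypothetical choices made here.

```python
import numpy as np

def local_entropy(gray, radius=2, bins=8):
    """Shannon entropy (in bits) of the intensity histogram around each pixel.

    Assumption: `gray` is a 2-D grayscale image with values in [0, 1].
    This is a generic local-entropy map, not the paper's exact ISE definition.
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    # Quantize intensities into `bins` discrete levels.
    levels = np.clip((gray * bins).astype(np.int64), 0, bins - 1)
    # Replicate border pixels so every pixel has a full (k x k) neighborhood.
    padded = np.pad(levels, radius, mode="edge")
    k = 2 * radius + 1
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    flat = windows.reshape(h, w, k * k)
    entropy = np.zeros((h, w))
    for level in range(bins):
        # Empirical probability of this intensity level in each window.
        p = (flat == level).mean(axis=-1)
        nz = p > 0
        entropy[nz] -= p[nz] * np.log2(p[nz])
    return entropy

# Usage: a vertical edge yields high entropy near the boundary, zero elsewhere.
frame = np.zeros((8, 8))
frame[:, 4:] = 1.0
print(local_entropy(frame, radius=1).round(3))
```

Uniform regions produce zero entropy while textured regions and edges produce high values. Applying the same function to consecutive frames and differencing the two maps gives a rough view of how local entropy evolves over time.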

updated: Tue Jun 06 2023 07:23:21 GMT+0000 (UTC)

published: Fri Sep 30 2022 12:03:52 GMT+0000 (UTC)

arXiv
