Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

Yanghao Li; Tushar Nagarajan; Bo Xiong; Kristen Grauman

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
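The idea of adding distillation terms to a pre-training objective can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the auxiliary signals, the form of each loss, or the weights, so the function names, the KL-divergence form of the distillation term, and the signal name `signal_a` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard classification loss on the third-person action label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(teacher_probs, student_logits):
    """KL(teacher || student): one common choice of distillation loss (assumed here)."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def pretraining_loss(action_logits, action_label, aux_logits, teacher_probs, weights):
    """Classification loss plus a weighted sum of distillation terms.

    aux_logits and teacher_probs map a (hypothetical) signal name to the
    student's auxiliary-head logits and the teacher's soft targets.
    """
    loss = cross_entropy(action_logits, action_label)
    for name, weight in weights.items():
        loss += weight * kl_divergence(teacher_probs[name], aux_logits[name])
    return loss


# Example call with made-up values for a single hypothetical signal.
example = pretraining_loss(
    action_logits=[2.0, 0.5, -1.0],
    action_label=0,
    aux_logits={"signal_a": [0.3, -0.3]},
    teacher_probs={"signal_a": [0.8, 0.2]},
    weights={"signal_a": 0.5},
)
```

In this sketch each distillation term vanishes when the student's auxiliary prediction matches the teacher's soft target, so the extra terms only penalize disagreement with the egocentric-specific signals.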

updated: Fri Apr 16 2021 06:10:10 GMT+0000 (UTC)

published: Fri Apr 16 2021 06:10:10 GMT+0000 (UTC)

arXiv
