EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Shuhan Tan; Tushar Nagarajan; Kristen Grauman

Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
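The feature-reconstruction idea can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical student that concatenates a sparse-frame feature with an IMU head-motion feature and applies a bias-free linear projection to the teacher's clip-feature dimension. The student is scored with a mean L1 reconstruction loss against the heavy video model's clip feature. The function names, the concatenation fusion, the linear projection, and the L1 loss are all illustrative assumptions. The sketch covers only the forward pass and the loss, not training.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def fuse_features(frame_feat: Sequence[float], imu_feat: Sequence[float]) -> Vector:
    """Hypothetical fusion: concatenate frame semantics with IMU head-motion features."""
    return list(frame_feat) + list(imu_feat)


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Bias-free linear projection: one output value per weight row."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must match the input dimension")
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def reconstruct_clip_feature(
    weights: Matrix, frame_feat: Sequence[float], imu_feat: Sequence[float]
) -> Vector:
    """Student forward pass (assumed): fuse the inputs, then project to the teacher's feature size."""
    return linear(weights, fuse_features(frame_feat, imu_feat))


def l1_distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean absolute error between the reconstructed feature and the teacher clip feature."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher feature dimensions must match")
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(teacher)


if __name__ == "__main__":
    # Toy numbers: a 2-d frame feature, a 1-d IMU feature, and a 2-d teacher feature.
    w = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    student = reconstruct_clip_feature(w, [1.0, 0.0], [0.5])
    print(student, l1_distillation_loss(student, [0.0, 0.5]))
```

Here the student never sees the full clip. It only has to approximate the teacher's clip feature from cheap inputs, and that is where the efficiency gain comes from.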

updated: Thu Jan 05 2023 18:39:23 GMT+0000 (UTC)

published: Thu Jan 05 2023 18:39:23 GMT+0000 (UTC)

arXiv