Ego-Only: Egocentric Action Detection without Exocentric Transferring

Huiyu Wang; Mitesh Kumar Singh; Lorenzo Torresani

We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) transferring. Despite the content and appearance gap separating the two domains, large-scale exocentric transferring has been the default choice for egocentric action detection. This is because prior work found that egocentric models are difficult to train from scratch and that transferring from exocentric representations leads to improved accuracy. In this paper, we revisit this common belief. Motivated by the large gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric transferring. Our Ego-Only approach is simple. It trains the video representation with a masked autoencoder finetuned for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. This simple approach achieves remarkably strong results on three established egocentric video datasets (Ego4D, EPIC-Kitchens-100, and Charades-Ego), which shows that exocentric transferring is unnecessary. On both action detection and action recognition, Ego-Only outperforms the previous best exocentric transferring methods, which use orders of magnitude more labels. Ego-Only sets new state-of-the-art results on these datasets and benchmarks without exocentric data.

updated: Fri May 19 2023 22:23:48 GMT+0000 (UTC)

published: Tue Jan 03 2023 22:22:34 GMT+0000 (UTC)

arXiv
