Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

Kevin Qinghong Lin; Alex Jinpeng Wang; Mattia Soldan; Michael Wray; Rui Yan; Eric Zhongcong Xu; Difei Gao; Rongcheng Tu; Wenzhe Zhao; Weijie Kong; Chengfei Cai; Hongfa Wang; Dima Damen; Bernard Ghanem; Wei Liu; Mike Zheng Shou

In this report, we propose a video-language pretraining (VLP) based solution [kevin2022egovlp] for four Ego4D challenge tasks: Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Specifically, we exploit the recently released Ego4D dataset [grauman2021ego4d] to pioneer Egocentric VLP along three directions: the pretraining dataset, the pretraining objective, and the development set. Based on these three designs, we develop a pretrained video-language model that can transfer its egocentric video-text representation or video-only representation to several video downstream tasks. Our Egocentric VLP achieves 10.46 R@1 at IoU=0.3 on NLQ, 10.33 mAP on MQ, 74% accuracy on OSCC, and 0.67 sec error on PNR. The code is available at https://github.com/showlab/EgoVLP.
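
The NLQ number above is a recall at rank 1 under a temporal IoU threshold of 0.3: a query counts as a hit when the top-ranked predicted time window overlaps the ground-truth window with IoU of at least 0.3. The sketch below illustrates that computation in general terms. It is not the official Ego4D evaluation code, and the function names `temporal_iou` and `recall_at_1` are hypothetical, chosen here for illustration only.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 window reaches the IoU threshold.

    Assumes one top-ranked prediction and one ground-truth window per query.
    """
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Two toy queries: the first overlaps its ground truth, the second does not.
preds = [(0.0, 10.0), (0.0, 1.0)]
truths = [(5.0, 15.0), (10.0, 20.0)]
score = recall_at_1(preds, truths)
```

In this toy case the first pair has an IoU of 1/3, which clears the 0.3 threshold, and the second pair is disjoint, so half of the queries are hits.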

updated: Mon Jul 04 2022 12:47:16 GMT+0000 (UTC)

published: Mon Jul 04 2022 12:47:16 GMT+0000 (UTC)

arXiv
