GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Zhijian Hou; Lei Ji; Difei Gao; Wanjun Zhong; Kun Yan; Chao Li; Wing-Kwong Chan; Chong-Wah Ngo; Nan Duan; Mike Zheng Shou

In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. Accurate grounding in a video requires both an effective egocentric feature extractor and a powerful grounding model. Motivated by this, we use a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and then fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal, multi-scale grounding module to fuse video and text effectively and to handle temporal intervals of varying length, especially in long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, surpassing all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
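For readers unfamiliar with the reported metric, the following is a minimal sketch of how R1@IoU is commonly computed for temporal grounding, assuming the usual definition: a query counts as a hit when the top-ranked predicted interval overlaps the ground-truth interval with a temporal IoU at or above the threshold. The function names and the toy intervals are hypothetical and are not taken from the GroundNLQ code or the official Ego4D evaluation script.

```python
from typing import List, Tuple

# Hypothetical type alias: an interval is (start_seconds, end_seconds).
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Interval], gts: List[Interval],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy data (made up for illustration): one top-1 prediction per query.
preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 40.0)]
gts = [(15.0, 25.0), (4.0, 9.0), (30.0, 36.0)]

r1_at_03 = recall_at_1(preds, gts, iou_threshold=0.3)
r1_at_05 = recall_at_1(preds, gts, iou_threshold=0.5)
```

Under this definition, raising the threshold from 0.3 to 0.5 can only keep or lower the score, which is consistent with the lower R1@IoU=0.5 figure reported above.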

updated: Tue Jun 27 2023 07:27:52 GMT+0000 (UTC)

published: Tue Jun 27 2023 07:27:52 GMT+0000 (UTC)

arXiv
