Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang; Jinrui Zhang; Feng Zheng; Wenhao Jiang; Ran Cheng; Ping Luo

Joint video-language learning has received increasing attention in recent years. However, existing work mainly focuses on one or more trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and their corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks that encourage fine-grained segment-level alignments: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
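The abstract does not spell out the form of the semantic-aware cost, so the following is only a minimal sketch of the general idea: a matching cost that mixes temporal overlap with embedding similarity, so that a loosely annotated boundary does not by itself decide which proposal a sentence is assigned to. The weighting `alpha`, the use of cosine similarity, the dictionary field names, and the exhaustive search over assignments are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math
from itertools import permutations


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def match_cost(proposal, sentence, alpha=0.5):
    """Hypothetical cost: weighted sum of boundary and semantic mismatch."""
    loc = 1.0 - temporal_iou(proposal["segment"], sentence["segment"])
    sem = 1.0 - cosine(proposal["embedding"], sentence["embedding"])
    return (1.0 - alpha) * loc + alpha * sem


def assign(proposals, sentences, alpha=0.5):
    """Return, for each sentence, the index of its matched proposal.

    Exhaustive one-to-one search, usable only for tiny sets (assumption:
    a real system would use a polynomial-time assignment solver).
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(proposals)), len(sentences)):
        total = sum(match_cost(proposals[p], sentences[s], alpha)
                    for s, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best)


# Toy data (made up): three event proposals and two annotated sentences.
proposals = [
    {"segment": (0.0, 10.0), "embedding": (1.0, 0.0)},
    {"segment": (8.0, 20.0), "embedding": (0.0, 1.0)},
    {"segment": (30.0, 40.0), "embedding": (1.0, 1.0)},
]
sentences = [
    {"segment": (30.0, 41.0), "embedding": (1.0, 1.0)},
    # Loosely annotated boundary: overlaps proposal 0 more than proposal 1,
    # although its embedding agrees with proposal 1.
    {"segment": (4.0, 11.0), "embedding": (0.0, 1.0)},
]

location_only = assign(proposals, sentences, alpha=0.0)
semantic_aware = assign(proposals, sentences, alpha=0.5)
```

In this toy setup, `location_only` pairs the second sentence with proposal 0 because of boundary overlap alone, whereas `semantic_aware` pairs it with proposal 1, whose embedding agrees with the sentence.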

updated: Wed May 17 2023 09:47:49 GMT+0000 (UTC)

published: Sat Mar 11 2023 11:00:16 GMT+0000 (UTC)

arXiv
