Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang; Jinrui Zhang; Feng Zheng; Wenhao Jiang; Ran Cheng; Ping Luo

Joint video-language learning has received increasing attention in recent years. However, existing work mainly focuses on single or multiple trimmed video clips (events), so human-annotated event boundaries are required at inference time. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos that automatically detects informative events and mines the alignments between multi-sentence descriptions and their corresponding event segments. Instead of coarse-level video-language alignment, we introduce two dual pretext tasks that encourage fine-grained segment-level alignment: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground candidate event proposals for a given set of sentences by estimating the cross-modal distance in a joint semantic space. ETG, in turn, reconstructs (generates) the matched text for each event proposal, which encourages the event representation to retain meaningful semantic information. To assign labels accurately between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching caused by ambiguous boundary annotations. Our framework extends easily to tasks covering visually grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2 and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge.
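The abstract does not spell out the semantic-aware cost, so the following is only a minimal sketch of the general idea, under stated assumptions: each sentence is assigned to a distinct event proposal by minimizing a cost that combines a localization term (1 minus temporal IoU) with a semantic term (1 minus cosine similarity in a shared embedding space). The function names, the weights `w_loc` and `w_sem`, the additive form of the cost, and the toy segments and embeddings are all hypothetical and are not taken from the paper.

```python
import itertools
import math


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def match_events_to_sentences(proposals, prop_embs, gt_segments, text_embs,
                              w_loc=1.0, w_sem=1.0):
    """Return (assignment, total_cost); assignment[t] is the proposal index
    chosen for sentence t. The cost form is an illustrative assumption."""
    n_prop, n_text = len(proposals), len(gt_segments)
    if n_prop < n_text:
        raise ValueError("need at least as many proposals as sentences")
    cost = [
        [
            w_loc * (1.0 - temporal_iou(proposals[p], gt_segments[t]))
            + w_sem * (1.0 - cosine(prop_embs[p], text_embs[t]))
            for p in range(n_prop)
        ]
        for t in range(n_text)
    ]
    # Brute-force search over one-to-one assignments; fine for toy sizes only.
    best, best_cost = None, math.inf
    for perm in itertools.permutations(range(n_prop), n_text):
        total = sum(cost[t][perm[t]] for t in range(n_text))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost


# Toy data (hypothetical): three proposals, two annotated sentences.
proposals = [(0.0, 10.0), (8.0, 20.0), (18.0, 30.0)]
prop_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
gt_segments = [(4.0, 14.0), (19.0, 29.0)]  # sentence 0 has a loose boundary
text_embs = [[0.0, 1.0], [1.0, 0.0]]

loc_only, _ = match_events_to_sentences(
    proposals, prop_embs, gt_segments, text_embs, w_sem=0.0)
with_sem, _ = match_events_to_sentences(
    proposals, prop_embs, gt_segments, text_embs, w_sem=1.0)
print(loc_only, with_sem)
```

In this toy setup, the annotated boundary of sentence 0 overlaps proposals 0 and 1 almost equally, so a purely location-based cost picks proposal 0, while adding the semantic term moves the match to proposal 1, whose embedding agrees with the sentence. For realistic set sizes, the brute-force search would be replaced by a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.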

updated: Sat Mar 11 2023 11:00:16 GMT+0000 (UTC)

published: Sat Mar 11 2023 11:00:16 GMT+0000 (UTC)

arXiv
