Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation

Jiabo Huang; Yang Liu; Shaogang Gong; Hailin Jin

Video activity localisation has recently attracted increasing attention due to its practical value in automatically localising the most salient visual segments corresponding to language descriptions (sentences) in untrimmed and unstructured videos. For supervised model training, the start and end time indices of the video segment matching each sentence (a video moment) must be annotated. Such annotation is not only very expensive but also sensitive to ambiguity and subjective annotation bias, making it a much harder task than image labelling. In this work, we develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining (CRM) in video moment proposal generation and matching when only a paragraph description of activities without per-sentence temporal annotation is available. Specifically, we explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities. Existing weakly-supervised techniques only consider within-sentence video segment correlations in training without considering cross-sentence paragraph context. This can be misleading because individual sentences, taken in isolation, may be ambiguous and may match visually indiscriminate video moment proposals. Experiments on two publicly available activity localisation datasets show the advantages of our approach over state-of-the-art weakly-supervised methods, especially when the video activity descriptions become more complex.
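
To make the temporal ordering constraint concrete, the following is a minimal illustrative sketch and not the CRM formulation itself, which the abstract does not specify. It assumes each sentence in a paragraph has one predicted moment given as a (start, end) pair in seconds, and that sentences are listed in paragraph order. The function name `ordering_penalty` and the `margin` parameter are hypothetical, introduced only for this example.

```python
from typing import List, Tuple


def ordering_penalty(moments: List[Tuple[float, float]], margin: float = 0.0) -> float:
    """Hinge-style penalty for predicted moments that violate paragraph order.

    Illustrative assumption: the moment for sentence i should not be centred
    later than the moment for sentence i + 1. This is a sketch of the general
    idea of a temporal ordering constraint, not the loss used in the paper.
    """
    # Represent each moment by its temporal centre.
    centres = [(start + end) / 2.0 for start, end in moments]

    penalty = 0.0
    # Compare each sentence's moment with that of the next sentence.
    for prev_centre, next_centre in zip(centres, centres[1:]):
        # Positive only when the earlier sentence is localised later in time.
        penalty += max(0.0, prev_centre - next_centre + margin)
    return penalty


# Moments that follow the paragraph order incur no penalty.
in_order = ordering_penalty([(0.0, 10.0), (12.0, 20.0), (25.0, 30.0)])

# Swapped moments are penalised by the distance between their centres.
swapped = ordering_penalty([(20.0, 30.0), (0.0, 10.0)])
```

In a weakly-supervised setting, a term of this kind could in principle be added to a matching objective so that proposals selected for consecutive sentences respect the order in which the sentences appear, which is the intuition behind constraint (1) above.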

updated: Fri Jul 23 2021 20:04:01 GMT+0000 (UTC)

published: Fri Jul 23 2021 20:04:01 GMT+0000 (UTC)

arXiv
