Towards Diverse Paragraph Captioning for Untrimmed Videos

Yuqing Song; Shizhe Chen; Qin Jin

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs. Existing approaches mainly solve the problem in two steps: event detection followed by event captioning. Such a two-step approach makes the quality of the generated paragraphs highly dependent on the accuracy of event proposal detection, which is itself a challenging task. In this paper, we propose a paragraph captioning model that eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos. To describe coherent and diverse events, we propose to enhance conventional temporal attention with dynamic video memories, which progressively expose new video features and suppress over-accessed video content to control the visual focus of the model. In addition, we propose a diversity-driven training strategy to improve paragraph diversity from the language perspective. Considering that untrimmed videos generally contain a large number of redundant frames, we further augment the video encoder with keyframe awareness to improve efficiency. Experimental results on the ActivityNet and Charades datasets show that our proposed model significantly outperforms state-of-the-art methods on both accuracy and diversity metrics without using any event boundary annotations. Code will be released at https://github.com/syuqings/video-paragraph.
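To make the idea of suppressing over-accessed video content more concrete, the following is a minimal sketch of one attention step that lowers the score of frames that have already received attention. It is not the paper's formulation, which the abstract does not give: it swaps in a simple coverage-style penalty as an illustration, and the function `attend_with_suppression`, the `accessed` history, and the `penalty` weight are all hypothetical names introduced here. The progressive exposure of new video features is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_suppression(query, frames, accessed, penalty=2.0):
    """One temporal attention step with a coverage-style penalty.

    query:    decoder state as a list of floats (hypothetical name)
    frames:   one feature vector (list of floats) per video frame
    accessed: attention mass each frame has received in earlier steps
    penalty:  illustrative suppression strength (an assumption, not
              a value taken from the paper)
    """
    # Dot-product relevance, reduced for frames attended to before.
    scores = [
        sum(q * f for q, f in zip(query, frame)) - penalty * a
        for frame, a in zip(frames, accessed)
    ]
    weights = softmax(scores)

    # Context vector: attention-weighted sum of the frame features.
    dim = len(frames[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]

    # Record the attention spent in this step for later steps.
    new_accessed = [a + w for a, w in zip(accessed, weights)]
    return context, weights, new_accessed


# Toy usage: three 2-d frame features and a fixed query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
context1, weights1, accessed = attend_with_suppression(query, frames, [0.0, 0.0, 0.0])
context2, weights2, accessed = attend_with_suppression(query, frames, accessed)
```

With the same query in both steps, the second call shifts attention away from the frames that dominated the first step, which is the behavior the penalty is meant to illustrate.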

updated: Sun May 30 2021 09:28:43 GMT+0000 (UTC)

published: Sun May 30 2021 09:28:43 GMT+0000 (UTC)

arXiv
