Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning

Huanhou Xiao; Jinglun Shi

Automatically describing video content in natural language has attracted considerable attention in the CV and NLP communities. Most existing methods predict one word at a time, feeding the last generated word back as input at the next time step, while the other generated words are not fully exploited. Furthermore, traditional methods optimize the model using all the training samples in each epoch without considering how well each sample has already been learned, which leads to a lot of unnecessary training and fails to target the difficult samples. To address these issues, we propose a text-based dynamic attention model named TDAM, which imposes a dynamic attention mechanism on all the generated words in order to improve the contextual semantic information and enhance overall control of the whole sentence. Moreover, the text-based dynamic attention mechanism and the visual attention mechanism are linked together to focus on the important words. The two mechanisms can benefit from each other during training. Accordingly, the model is trained in two steps: "starting from scratch" and "checking for gaps". The former uses all the samples to optimize the model, while the latter trains only on samples with poor control. Experimental results on the popular datasets MSVD and MSR-VTT demonstrate that our non-ensemble model outperforms state-of-the-art video captioning methods.
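The two ideas in the abstract, attending over every previously generated word and retraining only on poorly learned samples, can be illustrated with a minimal sketch. This is not the TDAM implementation: the abstract does not specify the scoring function or the selection criterion, so the dot-product scoring, the per-sample loss threshold, and the names `attend_over_words` and `select_hard_samples` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_words(query, word_vectors):
    """Attend over ALL previously generated words, not just the last one.

    Assumption: plain dot-product scoring between the decoder query and
    each word vector; the paper's actual scoring function may differ.
    Returns (attention weights, weighted-sum context vector).
    """
    scores = [sum(q * w for q, w in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, context


def select_hard_samples(losses, threshold):
    """Hypothetical "checking for gaps" filter.

    Assumption: a sample counts as poorly learned when its loss after the
    first training step exceeds a fixed threshold.
    Returns the indices of the samples to retrain on.
    """
    return [i for i, loss in enumerate(losses) if loss > threshold]


# Step 1 ("starting from scratch") would train on every sample.
# Step 2 ("checking for gaps") would train only on the selected indices.
generated = [[1.0, 0.0], [0.0, 1.0]]       # toy embeddings of generated words
weights, context = attend_over_words([1.0, 0.0], generated)
hard = select_hard_samples([0.2, 1.5, 0.9, 2.0], threshold=1.0)
```

In this toy setup the first word receives the larger attention weight because it aligns with the query, and only the samples whose loss exceeds the threshold are kept for the second training step.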

updated: Tue Nov 05 2019 15:14:12 GMT+0000 (UTC)

published: Tue Nov 05 2019 15:14:12 GMT+0000 (UTC)

arXiv
