Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling

Tengpeng Li; Hanli Wang; Bin He; Chang Wen Chen

As a technically challenging topic, visual storytelling aims to generate an imaginative and coherent multi-sentence story from a group of related images. Existing methods often generate direct and rigid descriptions of the apparent image content because they cannot explore implicit information beyond the images. As a result, these schemes fail to capture consistent dependencies from a holistic representation, which impairs the generation of reasonable and fluent stories. To address these problems, a novel knowledge-enriched attention network with a group-wise semantic model is proposed. It comprises three main novel components, whose practical advantages are supported by substantial experiments. First, a knowledge-enriched attention network extracts implicit concepts from an external knowledge system, and these concepts are passed to a cascade cross-modal attention mechanism to characterize imaginative and concrete representations. Second, a group-wise semantic module with second-order pooling is developed to explore globally consistent guidance. Third, a unified one-stage story generation model with an encoder-decoder structure is proposed to train and run inference on the knowledge-enriched attention network, the group-wise semantic module, and the multi-modal story generation decoder simultaneously in an end-to-end fashion. Substantial experiments on the popular Visual Storytelling dataset, using both objective and subjective evaluation metrics, demonstrate the superior performance of the proposed scheme compared with other state-of-the-art methods.
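
The abstract names second-order pooling as the basis of the group-wise semantic module but does not give its formulation. As a rough illustration only, the sketch below shows one common form of second-order pooling: averaging the outer products of the per-image feature vectors to obtain a single D x D group descriptor. The function name `second_order_pool`, the toy 2-dimensional features, and the choice of a plain average are all assumptions made for this example and may differ from what the paper actually does.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def second_order_pool(features: List[Vector]) -> Matrix:
    """Return the average outer product of the feature vectors (D x D).

    Illustrative only: a generic second-order pooling, not necessarily
    the formulation used by the paper's group-wise semantic module.
    """
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")

    # Accumulate the outer product f f^T of every feature vector.
    pooled = [[0.0] * dim for _ in range(dim)]
    for f in features:
        for i in range(dim):
            for j in range(dim):
                pooled[i][j] += f[i] * f[j]

    # Normalize by the number of images in the group.
    n = len(features)
    return [[value / n for value in row] for row in pooled]


# Hypothetical 2-dimensional features for a group of three images.
group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
group_descriptor = second_order_pool(group)
```

Unlike first-order (mean) pooling, the resulting matrix records how feature dimensions co-occur across the whole image group, which is one plausible reading of the "globally consistent guidance" the abstract mentions.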

updated: Thu Mar 10 2022 12:55:47 GMT+0000 (UTC)

published: Thu Mar 10 2022 12:55:47 GMT+0000 (UTC)

arXiv

