Learning to Model Multimodal Semantic Alignment for Story Visualization

Bowen Li; Thomas Lukasiewicz

Story visualization aims to generate a sequence of images that narrates each sentence of a multi-sentence story; the images should be realistic and remain globally consistent across dynamic scenes and characters. Current methods suffer from semantic misalignment because of their fixed architectures and the diversity of input modalities. To address this problem, we explore the semantic alignment between text and image representations by learning to match their semantic levels in a GAN-based generative model. More specifically, we introduce dynamic interactions that learn to explore various semantic depths and to fuse information from the different modalities at a matched semantic level, which relieves the text-image semantic misalignment problem. Extensive experiments on different datasets show that our approach, which uses neither segmentation masks nor auxiliary captioning networks, improves image quality and story consistency compared with state-of-the-art methods.

updated: Mon Nov 14 2022 11:41:44 GMT+0000 (UTC)

published: Mon Nov 14 2022 11:41:44 GMT+0000 (UTC)

arXiv
