Album Storytelling with Iterative Story-aware Captioning and Large Language Models

Munan Ning; Yujia Xie; Dongdong Chen; Zeyin Song; Lu Yuan; Yonghong Tian; Qixiang Ye; Li Yuan

This work studies how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling". While this task can help preserve memories and facilitate experience sharing, it remains an underexplored area in the current literature. With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text, opening up the opportunity to develop an AI assistant for album storytelling. One natural approach is to use caption models to describe each photo in the album, and then use LLMs to summarize and rewrite the generated captions into an engaging story. However, we find that this often results in stories containing hallucinated information that contradicts the images, because each caption is generated independently of the story ("story-agnostic") and may therefore describe details unrelated to the overall story or omit necessary information. To address these limitations, we propose a new iterative album storytelling pipeline. Specifically, we start with an initial story and build a story-aware caption model that refines the captions using the whole story as guidance. The polished captions are then fed into the LLMs to generate a new, refined story. This process is repeated iteratively until the story contains minimal factual errors while maintaining coherence. To evaluate our proposed pipeline, we introduce a new dataset of image collections from vlogs and a set of systematic evaluation metrics. Our results demonstrate that our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
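The iterative pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `iterative_album_story`, the three callables (`caption`, `refine_caption`, `write_story`), and the `max_rounds` parameter are all hypothetical names introduced here. The abstract says iteration continues until the story has minimal factual errors while staying coherent but does not specify how that is checked, so the sketch assumes a simpler stopping rule: a fixed round budget, with an early exit once the story stops changing.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
Captioner = Callable[[str], str]                 # image -> story-agnostic caption
StoryAwareCaptioner = Callable[[str, str], str]  # (image, story) -> refined caption
StoryWriter = Callable[[Sequence[str]], str]     # captions -> story (e.g. an LLM call)


def iterative_album_story(
    images: Sequence[str],
    caption: Captioner,
    refine_caption: StoryAwareCaptioner,
    write_story: StoryWriter,
    max_rounds: int = 3,
) -> str:
    """Alternate between story-aware captioning and story rewriting."""
    # Step 1: build an initial story from story-agnostic captions.
    captions = [caption(img) for img in images]
    story = write_story(captions)

    for _ in range(max_rounds):
        # Step 2: refine every caption, using the whole current story as guidance.
        captions = [refine_caption(img, story) for img in images]
        # Step 3: generate a new story from the refined captions.
        new_story = write_story(captions)
        # Assumed stopping rule: exit early once the story no longer changes.
        if new_story == story:
            break
        story = new_story

    return story
```

In practice, `caption` and `refine_caption` would wrap vision-language models and `write_story` would wrap an LLM prompt. Passing them in as plain callables keeps the control flow of the loop separate from any particular model choice.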

updated: Mon May 22 2023 11:45:10 GMT+0000 (UTC)

published: Mon May 22 2023 11:45:10 GMT+0000 (UTC)

arXiv
