A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System

Mauajama Firdaus; Avinash Madasu; Asif Ekbal

Natural Language Understanding (NLU) and Natural Language Generation (NLG) are two critical components of every conversational system: NLU understands the user by capturing the necessary information in the form of slots, and NLG generates an appropriate response in accordance with the extracted information. Recently, dialogue systems that integrate complementary information such as images, audio, or video have gained immense popularity. In this work, we propose an end-to-end framework that extracts the necessary slot values from the utterance and generates a coherent response, thereby assisting the user in achieving their desired goals in a multimodal dialogue system with both textual and visual information. Extracting the necessary information depends not only on the text but also on the visual cues present in the dialogue. Similarly, for generation, the previous dialogue context comprising multimodal information is important for providing coherent and informative responses. We employ a multimodal hierarchical encoder using pre-trained DialoGPT and also exploit the knowledge base (KB) to provide a stronger context for both tasks. We then design a slot attention mechanism to focus on the necessary information in a given utterance. Finally, a decoder generates the corresponding response for the given dialogue context and the extracted slot values. Experimental results on the Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms the baseline approaches on both tasks. The code is available at https://github.com/avinashsai/slot-gpt.
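
The abstract does not spell out the form of the slot attention mechanism, so the following is only a generic illustration of the idea of focusing on the relevant part of an utterance, not the authors' implementation. It is a minimal sketch of scaled dot-product attention in plain Python, assuming a hypothetical slot query vector and hand-made two-dimensional token vectors; the names `slot_attention`, `colour_query`, and all the numbers are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention(query, token_vectors):
    """Scaled dot-product attention of one slot query over token vectors.

    Hypothetical helper: returns the attention weights (one per token)
    and the weighted sum of the token vectors (the context vector).
    """
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in token_vectors
    ]
    weights = softmax(scores)
    context = [
        sum(w * tok[i] for w, tok in zip(weights, token_vectors))
        for i in range(dim)
    ]
    return weights, context


# Toy utterance with made-up 2-d token vectors (assumption, not real embeddings).
tokens = ["show", "me", "red", "dresses"]
token_vectors = [[0.1, 0.0], [0.0, 0.1], [1.0, 0.2], [0.3, 1.0]]

# Hypothetical query for a "colour" slot, aligned with the first dimension.
colour_query = [2.0, 0.0]

weights, context = slot_attention(colour_query, token_vectors)
best = tokens[weights.index(max(weights))]
print(best, [round(w, 3) for w in weights])
```

In this toy setting the token with the largest weight is the one whose vector aligns best with the slot query. In a real system the token vectors would come from the encoder and the query would be learned.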

updated: Sat May 27 2023 10:06:03 GMT+0000 (UTC)

published: Sat May 27 2023 10:06:03 GMT+0000 (UTC)

arXiv
