ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Bang Yang; Fenglin Liu; Yuexian Zou; Xian Wu; Yaowei Wang; David A. Clifton



Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data for downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder that learns to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences in different languages given the coordinate of the input data in the common space. Within this unified framework, given visual (image or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
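
The three training steps and the inference pipeline described in the abstract can be illustrated with a toy linear sketch. This is not the authors' implementation: the real framework presumably uses neural encoders and a text decoder, whereas here every "domain" is a 2-D feature vector and every encoder or decoder is a 2x2 matrix. All names (`run_toy_pipeline`, `text_encoder`, `visual_encoder`, `text_decoder`, the generative maps `to_text` and `to_visual`) are hypothetical, and the synthetic samples used for the alignment step are an assumption of this toy, since the abstract does not say where the alignment signal comes from.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def inv2(m):
    """Invert a 2x2 matrix in closed form."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def fit_linear(inputs, targets):
    """Least-squares W minimising ||inputs @ W - targets||^2 (2-D inputs only)."""
    xt = transpose(inputs)
    return matmul(inv2(matmul(xt, inputs)), matmul(xt, targets))


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def run_toy_pipeline(seed=0, n_train=20, n_test=5):
    rng = random.Random(seed)

    def sample(n):
        return [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]

    # Assumption: each domain is a different linear view of the same hidden meaning.
    to_text = [[2.0, 1.0], [0.5, 1.5]]
    to_visual = [[1.0, -1.0], [2.0, 0.5]]

    meaning = sample(n_train)
    text_feats = matmul(meaning, to_text)
    visual_feats = matmul(meaning, to_visual)

    # (i) Project text into the shared latent space with a fixed toy encoder.
    text_encoder = [[1.0, 0.3], [-0.2, 0.8]]
    text_coords = matmul(text_feats, text_encoder)

    # (ii) Align the visual domain: fit its encoder to land on the text coordinates.
    visual_encoder = fit_linear(visual_feats, text_coords)

    # (iii) Auto-encode text: the decoder is fit on text alone (coordinate -> text).
    text_decoder = fit_linear(text_coords, text_feats)
    recon_err = max_abs_diff(matmul(text_coords, text_decoder), text_feats)

    # Inference: unseen visual input -> shared coordinate -> text decoder.
    new_meaning = sample(n_test)
    generated = matmul(matmul(matmul(new_meaning, to_visual), visual_encoder), text_decoder)
    zero_shot_err = max_abs_diff(generated, matmul(new_meaning, to_text))
    return recon_err, zero_shot_err
```

The point of the sketch is that `text_decoder` is fit only on text-to-text reconstruction and never sees a (visual, text) output pair, yet it can decode visual inputs because `visual_encoder` places them at the same shared coordinates. In this noise-free linear toy both returned errors are at floating-point level; nothing about these numbers reflects the paper's reported results.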

updated: Mon Jun 03 2024 12:47:12 GMT+0000 (UTC)

published: Sat Mar 11 2023 17:14:33 GMT+0000 (UTC)

arXiv
