L-Verse: Bidirectional Generation Between Image and Text

Taehoon Kim; Gwangmo Song; Sihaeng Lee; Sangyun Kim; Yewon Seo; Soonyoung Lee; Seung Hwan Kim; Honglak Lee; Kyunghoon Bae

Far beyond learning long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation. Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning in the general domain.
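
As a rough illustration of how a VQ-VAE-style model turns an image into a sequence, the sketch below maps a small grid of feature vectors to discrete token indices by nearest-neighbour lookup in a codebook. It is a toy example of generic vector quantization, not the AugVAE described in the abstract; the codebook values, the grid, and the function names (`quantize`, `flatten_grid`) are all hypothetical.

```python
# Illustrative sketch only: a toy nearest-neighbour vector quantizer.
# The codebook, grid, and function names are hypothetical and are not
# taken from the L-Verse paper or its code.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in features:
        distances = [squared_distance(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def flatten_grid(grid):
    """Flatten an H x W grid of feature vectors in raster (row-major) order."""
    return [vec for row in grid for vec in row]

# A 4-entry codebook of 2-dimensional codes (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A 2 x 2 grid standing in for the encoder output of one image.
grid = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [1.1, 0.9]],
]

# The resulting token sequence is what a transformer would consume.
tokens = quantize(flatten_grid(grid), codebook)
print(tokens)
```

In a real model the feature vectors would come from a learned encoder and the codebook would be trained jointly with it; the lookup step itself is the part shown here.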

updated: Fri Dec 03 2021 07:28:09 GMT+0000 (UTC)

published: Mon Nov 22 2021 11:48:26 GMT+0000 (UTC)

arXiv
