Indonesian Text-to-Image Synthesis with Sentence-BERT and FastGAN

Made Raharja Surya Mahadi; Nugraha Priya Utama


Text-to-image synthesis currently relies on a text encoder paired with an image generator. Research on this topic is challenging because of the domain gap between natural language and vision. Most current work focuses only on producing photo-realistic images, while the other domain, language, receives less attention. Much of the current research uses English as the input text, although there are many other languages around the world. Bahasa Indonesia, the official language of Indonesia, is quite popular and has been taught in the Philippines, Australia, and Japan. Translating or recreating a dataset in another language with good quality is costly. Research in this domain is necessary because, besides generating photo-realistic images, we need to examine how the image generator performs in other languages. To this end, we translate the CUB dataset into Bahasa Indonesia using Google Translate and manual human translation. We use Sentence BERT as the text encoder and FastGAN as the image generator. FastGAN uses many skip excitation modules and an auto-encoder to generate images at a resolution of 512x512x3, which is twice as large as that of the current state-of-the-art model (Zhang, Xu, Li, Zhang, Wang, Huang and Metaxas, 2019). Our model achieves an Inception Score of 4.76 ± 0.43 and a Fréchet Inception Distance of 46.401, which is comparable to current English text-to-image generation models. The mean opinion score is 3.22 out of 5, which indicates that the generated images are acceptable to humans. Link to source code: https://github.com/share424/Indonesian-Text-to-Image-synthesis-with-Sentence-BERT-and-FastGAN
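To make the reported "4.76 ± 0.43" easier to interpret, the following is a minimal sketch of how an Inception Score is commonly computed: the exponential of the mean KL divergence between each image's class distribution p(y|x) and the marginal p(y), averaged over several splits, with the standard deviation across splits giving the ± term. This is not the authors' evaluation code. The function name `inception_score`, the `n_splits` parameter, and the toy probabilities are assumptions for illustration only; in practice the probabilities would come from a pretrained Inception classifier applied to generated images.

```python
import math
from statistics import mean, pstdev


def inception_score(probs, n_splits=1, eps=1e-12):
    """Return (mean, std) of the Inception Score over n_splits chunks.

    probs: list of per-image class-probability lists, each summing to 1.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    split_size = len(probs) // n_splits
    scores = []
    for k in range(n_splits):
        chunk = probs[k * split_size:(k + 1) * split_size]
        n_classes = len(chunk[0])
        # Marginal class distribution p(y) over the images in this chunk.
        marginal = [sum(p[c] for p in chunk) / len(chunk) for c in range(n_classes)]
        # KL(p(y|x) || p(y)) for every image; zero-probability terms contribute 0.
        kl_values = [
            sum(
                pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
                for c, pc in enumerate(p)
                if pc > 0
            )
            for p in chunk
        ]
        scores.append(math.exp(mean(kl_values)))
    # The spread across splits is what is usually reported after the "±".
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Toy input: confident and balanced predictions over two classes.
    toy = [[1.0, 0.0], [0.0, 1.0]] * 2
    print(inception_score(toy, n_splits=2))
```

Under this definition the score ranges from 1 (every image receives the same class distribution) up to the number of classes (confident predictions spread evenly over all classes), so higher values indicate images that are both recognizable and diverse.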

updated: Sat Mar 25 2023 16:54:22 GMT+0000 (UTC)

published: Sat Mar 25 2023 16:54:22 GMT+0000 (UTC)

arXiv
