Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering

Triet Minh Thai; Son T. Luu

Visual Question Answering (VQA) is a task in which a computer must give a correct answer to an input question about an image. Humans solve this task with ease, but it remains challenging for computers. The VLSP2022-EVJVQA shared task poses the VQA problem in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a convolutional sequence-to-sequence network to generate the desired answers. Our approach achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
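The abstract reports results as F1 scores but does not define how they are computed. As a rough illustration only, the sketch below shows a token-level F1 of the kind commonly used to score generated answers in question answering. The function name `token_f1`, the lowercasing, and the whitespace tokenization are assumptions made for this example, not the shared task's official scorer; whitespace splitting in particular would not segment Japanese text correctly.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. This is an illustrative choice, not the official metric.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("red car", "a red car"))
```

Under this definition, a prediction that omits one of three reference tokens has precision 1 and recall 2/3. A corpus-level score would typically be the mean of this value over all question-answer pairs, though that aggregation is also an assumption here.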

updated: Sun Sep 03 2023 14:50:34 GMT+0000 (UTC)

published: Wed Mar 22 2023 15:49:33 GMT+0000 (UTC)

arXiv
