Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

Po-Yao Huang; Xiaojun Chang; Alexander Hauptmann

To promote and better understand multilingual image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, yielding fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity in order to learn an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks on the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all three tasks.
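The idea of encouraging attention heads to differ can be illustrated with a small sketch. The abstract does not give the paper's actual objective, so everything below is an assumption made for illustration: the function names (`attend`, `diversity_penalty`), the dot-product scoring, and the choice of mean squared cosine similarity between head outputs as the penalty are all hypothetical and should not be read as the authors' formulation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """One attention head: score each feature against the query,
    normalize with softmax, and return (weights, pooled vector)."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def diversity_penalty(head_outputs):
    """Hypothetical regularizer (an assumption, not the paper's objective):
    mean squared cosine similarity over all distinct pairs of heads.
    Lower values mean the heads produce more dissimilar outputs."""
    k = len(head_outputs)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += cosine(head_outputs[i], head_outputs[j]) ** 2
            pairs += 1
    return total / pairs


# Toy usage: three "word" or "object" features, two heads with different queries.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]
head_outputs = [attend(features, q)[1] for q in queries]
penalty = diversity_penalty(head_outputs)
```

In a training setup along these lines, a term like `penalty` would be added (with some weight) to a matching loss between sentence and image embeddings, so that heads are pushed toward attending to different parts of the input. How the paper weights or defines its diversity term is not stated in the abstract.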

updated: Mon Sep 30 2019 18:58:03 GMT+0000 (UTC)

published: Mon Sep 30 2019 18:58:03 GMT+0000 (UTC)

arXiv
