Granular Multimodal Attention Networks for Visual Dialog

Badri N. Patro; Shivansh Patel; Vinay P. Namboodiri

Vision and language tasks have benefited from attention, and a number of attention models have been proposed. However, the scale at which attention should be applied has not been well examined. In this work, we propose a new method, Granular Multi-modal Attention, which addresses the question of the right granularity at which to attend when solving the Visual Dialog task. The proposed method improves both image and text attention networks. We then propose a Granular Multi-modal Attention network that jointly attends to the image and text granules and shows the best performance. From this work, we observe that obtaining granular attention and performing exhaustive multi-modal attention appears to be the best way to attend when solving visual dialog.

updated: Sun Oct 13 2019 10:49:41 GMT+0000 (UTC)

published: Sun Oct 13 2019 10:49:41 GMT+0000 (UTC)

arXiv
