Towards Language-guided Visual Recognition via Dynamic Convolutions

Gen Luo; Yiyi Zhou; Xiaoshuai Sun; Yongjian Wu; Yue Gao; Rongrong Ji

In this paper, we work toward a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets covering two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv over existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including a compact network, high generalization ability, and excellent performance, e.g., +4.7% on RefCOCO+.
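The general idea of language-generated kernels can be illustrated with a minimal sketch. This is not the paper's LaConv formulation, which the abstract does not detail; it assumes the simplest possible case, a 1x1 convolution whose kernel is a linear projection of a sentence embedding. The names `generate_kernel`, `dynamic_conv1x1`, and `proj` are hypothetical and chosen only for illustration.

```python
def generate_kernel(text_vec, proj):
    """Project a sentence embedding to a 1x1 convolution kernel.

    text_vec: list of d_text floats (assumed sentence embedding).
    proj: hypothetical learned weights, shape [c_out][c_in][d_text].
    Returns a kernel of shape [c_out][c_in].
    """
    return [
        [sum(w * t for w, t in zip(weights, text_vec)) for weights in row]
        for row in proj
    ]


def dynamic_conv1x1(feature_map, kernel):
    """Apply a 1x1 kernel at every spatial location.

    feature_map: shape [H][W][c_in]; returns shape [H][W][c_out].
    """
    return [
        [
            [sum(k * x for k, x in zip(k_row, pixel)) for k_row in kernel]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy weights and inputs (assumed values, for illustration only).
proj = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
image = [[[1.0, 0.0], [0.0, 1.0]]]  # 1 x 2 spatial grid, 2 channels

text_a = [2.0, 3.0]  # stand-in embedding of one expression
text_b = [1.0, 0.0]  # stand-in embedding of another expression

# The same image yields different features under different text.
out_a = dynamic_conv1x1(image, generate_kernel(text_a, proj))
out_b = dynamic_conv1x1(image, generate_kernel(text_b, proj))
```

Because the kernel is recomputed for every input sentence, the visual features depend on the language input, which is the property the abstract attributes to LaConv. A real module would use learned weights and larger kernels.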

updated: Thu Sep 14 2023 13:37:38 GMT+0000 (UTC)

published: Sun Oct 17 2021 11:29:13 GMT+0000 (UTC)

arXiv
