Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

Xu Yang; Hanwang Zhang; Chongyang Gao; Jianfei Cai

Humans tend to decompose a sentence into different parts, like "sth do sth at someplace", and then fill each part with certain content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the neural module networks widely used in visual question answering (VQA), where the language (i.e., the question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the language is only partially observable, so we need to collocate the modules dynamically during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) a distinguishable module design: four modules in the encoder, including one linguistic module for function words and three visual modules for different content words (i.e., nouns, adjectives, and verbs), and another linguistic module in the decoder for commonsense reasoning; 2) a self-attention based module controller for robustifying the visual reasoning; 3) a part-of-speech based syntax loss imposed on the module controller to further regularize the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, e.g., achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, e.g., being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Code is available at https://github.com/GCYZSL/CVLMN
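
The abstract names the components but not their exact form, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the controller produces soft weights over the four encoder modules (function word, noun, adjective, verb), that the fused feature is the weighted sum of the module outputs, and that the syntax loss is a cross-entropy between those weights and the module implied by the part-of-speech tag of the target word. A single scaled dot-product attention step stands in for the self-attention based controller, and all names (`collocate`, `syntax_loss`, the toy vectors) are hypothetical rather than taken from the authors' code.

```python
import math

# The four encoder modules named in the abstract.
MODULES = ["function", "noun", "adjective", "verb"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def collocate(module_outputs, query):
    """Soft-fuse module outputs (assumed form, not the authors' controller).

    Each module output is scored against the query with a scaled dot
    product; the softmax of the scores gives one weight per module.
    """
    dim = len(query)
    scores = [
        sum(q * v for q, v in zip(query, out)) / math.sqrt(dim)
        for out in module_outputs
    ]
    weights = softmax(scores)
    fused = [
        sum(w * out[i] for w, out in zip(weights, module_outputs))
        for i in range(dim)
    ]
    return weights, fused


def syntax_loss(weights, pos_module_index):
    """Cross-entropy between controller weights and the POS-implied module."""
    return -math.log(weights[pos_module_index])


# Toy 3-dimensional module outputs, one row per entry in MODULES.
module_outputs = [
    [0.1, 0.0, 0.2],  # function-word module
    [0.9, 0.8, 0.1],  # noun module
    [0.2, 0.3, 0.4],  # adjective module
    [0.0, 0.5, 0.6],  # verb module
]
query = [1.0, 1.0, 0.0]  # hypothetical decoder state at the current step

weights, fused = collocate(module_outputs, query)
# Suppose the next ground-truth word is tagged as a noun.
loss = syntax_loss(weights, MODULES.index("noun"))
```

In this toy setup the noun module receives the largest weight, so the loss for a noun target is lower than it would be under uniform weights; for a target with a different tag the same weights would be penalised more heavily, which is the regularising effect the abstract attributes to the syntax loss.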

updated: Mon Apr 24 2023 02:27:07 GMT+0000 (UTC)

published: Tue Oct 04 2022 03:09:50 GMT+0000 (UTC)

arXiv
