Curriculum Learning for Compositional Visual Reasoning

Wafa Aissa; Marin Ferecatu; Michel Crucianu

Visual Question Answering (VQA) is a complex task requiring large datasets and expensive training. Neural Module Networks (NMN) first translate the question to a reasoning path, then follow that path to analyze the image and provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to "warm start" learning on the GQA dataset, then focus on Curriculum Learning (CL) as a way to improve training and make better use of the data. Several difficulty criteria are employed for defining CL methods. We show that, with an appropriate choice of CL method, the cost of training and the amount of training data can be greatly reduced, with limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this makes it possible to simplify the CL strategy.
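
As a rough illustration of the curriculum idea described in the abstract, the sketch below orders training samples by a difficulty score and exposes a growing, easiest-first subset at each stage. It is not the authors' implementation. The helper `curriculum_stages`, the choice of reasoning-program length as the difficulty criterion, the stage fractions, and the toy module names are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    fractions: Sequence[float],
) -> List[List[T]]:
    """Return one training subset per stage, easiest samples first.

    Hypothetical helper: each stage keeps the easiest `frac` share of the
    data, so later stages are supersets of earlier ones.
    """
    ordered = sorted(samples, key=difficulty)
    stages: List[List[T]] = []
    for frac in fractions:
        if not 0.0 < frac <= 1.0:
            raise ValueError(f"fraction must be in (0, 1], got {frac}")
        cutoff = max(1, round(frac * len(ordered)))
        stages.append(ordered[:cutoff])
    return stages


# Toy data: (question, reasoning program as a list of module names).
# The module names are illustrative only.
toy = [
    ("Is there a dog?", ["select", "exist"]),
    ("What color is the cup left of the plate?", ["select", "relate", "query"]),
    ("Is the cat on the sofa the same color as the rug?",
     ["select", "relate", "select", "same"]),
    ("What is this?", ["query"]),
]

# Assumed difficulty criterion: number of steps in the reasoning program.
stages = curriculum_stages(toy, difficulty=lambda s: len(s[1]), fractions=[0.5, 1.0])

for i, stage in enumerate(stages):
    print(f"stage {i}: {len(stage)} samples")
```

In a real training loop, each stage's subset would feed the data loader for a few epochs before moving on to the next stage. Other difficulty criteria can be swapped in by changing the `difficulty` callable.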

updated: Mon Mar 27 2023 08:47:18 GMT+0000 (UTC)

published: Mon Mar 27 2023 08:47:18 GMT+0000 (UTC)

arXiv
