Are Deep Neural Networks SMARTer than Second Graders?

Anoop Cherian; Kuan-Chuan Peng; Suhas Lohit; Kevin A. Smith; Joshua B. Tenenbaum

Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle, while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they are no better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.

updated: Sun Jun 18 2023 15:07:39 GMT+0000 (UTC)

published: Tue Dec 20 2022 04:33:32 GMT+0000 (UTC)

arXiv
