Are Deep Neural Networks SMARTer than Second Graders?

Anoop Cherian; Kuan-Chuan Peng; Suhas Lohit; Kevin A. Smith; Joshua B. Tenenbaum

Deep neural networks are increasingly applied to tasks that require superior cognitive abilities, e.g., playing Go, generating art, and conversational systems such as ChatGPT. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle, while retaining their solution algorithm. To benchmark performance on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they perform no better than random chance when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a part of SMART-101 and find that while these models show convincing reasoning abilities, the answers are often incorrect.
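To illustrate what "programmatically generating new instances while retaining the solution algorithm" could look like, here is a minimal Python sketch. It is not the authors' generator: the toy arithmetic puzzle, the function names (`solve`, `generate_instance`, `random_guess_accuracy`), and the choice of five answer options are all assumptions made for illustration, and the picture component of a real puzzle is omitted. The sketch also shows the random-guess baseline against which generalization results would be compared.

```python
import random


def solve(a, b, c):
    """Fixed solution algorithm shared by every instance of this toy puzzle."""
    return a + b - c


def generate_instance(rng, num_options=5):
    """Sample fresh parameters; the solution algorithm itself never changes.

    The puzzle and the number of options are hypothetical, not from SMART-101.
    """
    a = rng.randint(1, 9)
    b = rng.randint(1, 9)
    c = rng.randint(0, a + b)  # keeps the answer non-negative
    answer = solve(a, b, c)

    # Distractors are distinct wrong values from the feasible range 0..18.
    candidates = [v for v in range(0, 19) if v != answer]
    options = rng.sample(candidates, num_options - 1) + [answer]
    rng.shuffle(options)

    question = (
        f"A basket has {a} apples. {b} more are added and "
        f"{c} are eaten. How many apples are left?"
    )
    return {
        "params": (a, b, c),
        "question": question,
        "options": options,
        "answer_index": options.index(answer),
    }


def random_guess_accuracy(instances, rng):
    """Accuracy of a solver that picks an option uniformly at random."""
    correct = sum(
        rng.randrange(len(inst["options"])) == inst["answer_index"]
        for inst in instances
    )
    return correct / len(instances)


if __name__ == "__main__":
    rng = random.Random(0)
    instances = [generate_instance(rng) for _ in range(2000)]
    print(instances[0]["question"], instances[0]["options"])
    print("random-guess accuracy:", random_guess_accuracy(instances, rng))
```

With five options per instance, a uniform random guesser is expected to score about 1/5, which is the kind of chance level that "no better than random" refers to under this assumed setup.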

updated: Fri Jun 02 2023 15:17:43 GMT+0000 (UTC)

published: Tue Dec 20 2022 04:33:32 GMT+0000 (UTC)

arXiv
