A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning

Zhisheng Tang; Mayank Kejriwal

We conduct a pilot study selectively evaluating the cognitive abilities (decision making and spatial reasoning) of two recently released generative transformer models, ChatGPT and DALL-E 2. Input prompts were constructed following neutral a priori guidelines, rather than with adversarial intent. Post hoc qualitative analysis of the outputs shows that DALL-E 2 is able to generate at least one correct image for each spatial reasoning prompt, but most of the images it generates are incorrect (even though the model seems to have a clear understanding of the objects mentioned in the prompt). Similarly, in evaluating ChatGPT on the rationality axioms developed under the classical von Neumann-Morgenstern utility theorem, we find that, although it demonstrates some level of rational decision-making, many of its decisions violate at least one of the axioms even under reasonable constructions of preferences, bets, and decision-making prompts. ChatGPT's outputs on such problems were generally unpredictable: even though it made irrational decisions (or employed an incorrect reasoning process) for some simpler decision-making problems, it was able to draw correct conclusions for more complex bet structures. We briefly comment on the nuances and challenges involved in scaling up such a 'cognitive' evaluation or conducting it with a closed set of answer keys ('ground truth'), given that these models are inherently generative and open-ended in responding to prompts.
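As a rough illustration of what checking one rationality axiom can look like, the sketch below tests a set of stated pairwise preferences for transitivity violations (preference cycles). It is not the authors' evaluation procedure; the function name `transitivity_violations`, the `(preferred, other)` pair encoding, and the example preferences are assumptions made for illustration only.

```python
from itertools import permutations


def transitivity_violations(prefers):
    """Return ordered triples (a, b, c) that form a preference cycle.

    `prefers` is a set of (preferred, other) pairs, a hypothetical encoding
    of strict pairwise preferences extracted from a model's answers.
    A triple is flagged when a > b and b > c are stated, yet c > a is
    also stated, which contradicts transitivity.
    """
    items = {x for pair in prefers for x in pair}
    violations = []
    for a, b, c in permutations(sorted(items), 3):
        if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
            violations.append((a, b, c))
    return violations


# Made-up preferences, not data from the paper.
consistent = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

print(transitivity_violations(consistent))
print(transitivity_violations(cyclic))
```

Each cycle is reported once per rotation, so a single three-way cycle yields three triples. The check only flags explicitly stated cycles; it says nothing about the other von Neumann-Morgenstern axioms (completeness, continuity, independence), which would need lotteries rather than plain pairwise choices.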

updated: Wed Feb 15 2023 05:04:49 GMT+0000 (UTC)

published: Wed Feb 15 2023 05:04:49 GMT+0000 (UTC)

arXiv
