Human or Machine? Turing Tests for Vision and Language

Mengmi Zhang; Giorgia Dellaferrera; Ankur Sikarwar; Marcelo Armendariz; Noga Mudrik; Prachi Agrawal; Spandan Madan; Andrei Barbu; Haochen Yang; Tanishq Kumar; Meghna Sadwani; Stella Dellaferrera; Michele Pizzochero; Hanspeter Pfister; Gabriel Kreiman

As AI algorithms increasingly participate in daily activities that used to be the sole province of humans, we are inevitably called upon to consider how much machines are really like us. To address this question, we turn to the Turing test and systematically benchmark current AIs in their abilities to imitate humans. We establish a methodology to evaluate humans versus machines in Turing-like tests and systematically evaluate a representative set of selected domains, parameters, and variables. The experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges in 21,570 Turing tests across 6 tasks encompassing vision and language modalities. Surprisingly, the results reveal that current AIs are not far from being able to pass as human before human judges of different ages, genders, and educational levels in complex visual and language challenges. In contrast, simple AI judges outperform human judges in distinguishing human answers from machine answers. The curated large-scale Turing test datasets introduced here and their evaluation metrics provide valuable insights for assessing whether an agent is human or not. The proposed formulation for benchmarking human imitation ability in current AIs paves the way for the research community to expand Turing tests to other research areas and conditions. All source code and data are publicly available at https://tinyurl.com/8x8nha7p.

updated: Wed Nov 23 2022 16:16:52 GMT+0000 (UTC)

published: Wed Nov 23 2022 16:16:52 GMT+0000 (UTC)

arXiv
