Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models

Lingxi Xie; Longhui Wei; Xiaopeng Zhang; Kaifeng Bi; Xiaotao Gu; Jianlong Chang; Qi Tian

The AI community has long pursued artificial general intelligence (AGI): algorithms that apply to any kind of real-world problem. Recently, chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction towards AGI in natural language processing (NLP), but the path towards AGI in computer vision (CV) remains unclear. One may attribute this dilemma to the fact that visual signals are more complex than language signals, yet we are interested in finding concrete reasons and in drawing on the experience of GPT and LLMs to solve the problem. In this paper, we start with a conceptual definition of AGI and briefly review how NLP solves a wide range of tasks via a chat system. The analysis suggests that unification is the next important goal of CV. However, despite various efforts in this direction, CV is still far from a system like GPT that naturally integrates all tasks. We point out that the essential weakness of CV lies in the lack of a paradigm for learning from environments, whereas NLP has accomplished this in the text world. We then envision a pipeline that places a CV algorithm (i.e., an agent) in world-scale, interactable environments, pre-trains it to predict future frames with respect to its actions, and then fine-tunes it with instructions to accomplish various tasks. We expect that substantial research and engineering effort will be needed to push the idea forward and scale it up, and to that end we share our perspectives on future research directions.

updated: Wed Jun 14 2023 17:15:01 GMT+0000 (UTC)

published: Wed Jun 14 2023 17:15:01 GMT+0000 (UTC)

arXiv

