Neural World Models for Computer Vision

Anthony Hu

Humans navigate their environment by learning a mental model of the world through passive observation and active interaction. This world model allows them to anticipate what might happen next and to act accordingly with respect to an underlying objective. Such world models hold strong promise for planning in complex environments such as autonomous driving. A human driver, or a self-driving system, perceives the surroundings through eyes or cameras. From these observations they infer an internal representation of the world that should: (i) have spatial memory (e.g. to handle occlusions), (ii) fill in partially observable or noisy inputs (e.g. when blinded by sunlight), and (iii) be able to reason about unobservable events probabilistically (e.g. predict different possible futures). They are embodied intelligent agents that can predict, plan, and act in the physical world through their world model. In this thesis, we present a general framework to train a world model and a policy, parameterised by deep neural networks, from camera observations and expert demonstrations. We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes. First, we propose a model that predicts important quantities in computer vision: depth, semantic segmentation, and optical flow. We then use 3D geometry as an inductive bias to operate in the bird's-eye view space. We present for the first time a model that can predict probabilistic future trajectories of dynamic agents in bird's-eye view from 360° surround monocular cameras only. Finally, we demonstrate the benefits of learning a world model in closed-loop driving. Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.

updated: Thu Jun 15 2023 14:58:21 GMT+0000 (UTC)

published: Thu Jun 15 2023 14:58:21 GMT+0000 (UTC)

arXiv
