Learning What and Where: Disentangling Location and Identity Tracking Without Supervision

Manuel Traub; Sebastian Otte; Tobias Menge; Matthias Karlbauer; Jannik Thümmel; Martin V. Butz

Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.

updated: Wed Jan 11 2023 11:37:38 GMT+0000 (UTC)

published: Thu May 26 2022 13:30:14 GMT+0000 (UTC)

arXiv
