Pretrained Language Models as Visual Planners for Human Assistance

Dhruvesh Patel; Hamid Eghbalzadeh; Nitin Kamra; Michael Louis Iuzzolino; Unnat Jain; Ruta Desai

In our pursuit of advancing multi-modal AI assistants capable of guiding users to achieve complex multi-step goals, we propose the task of "Visual Planning for Assistance (VPA)". Given a succinct natural language goal, e.g., "make a shelf", and a video of the user's progress so far, the aim of VPA is to devise a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf", etc., to realize the specified goal. This requires assessing the user's progress from the (untrimmed) video and relating it to the requirements of the natural language goal, i.e., deciding which actions to select and in what order. Consequently, this requires handling long video history and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. Importantly, we formulate the forecasting step as a multi-modal sequence modeling problem, which allows us to leverage the strength of pre-trained LMs (as the sequence model). This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics that gauge the quality of the generated plans. Furthermore, through comprehensive ablations, we also isolate the value of each component: language pre-training, visual observations, and goal information. We have open-sourced all the data, model checkpoints, and training code.
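
The two-stage decomposition described in the abstract (segment the video into an action history, then forecast the remaining actions as a sequence) can be illustrated with a minimal Python sketch. This is not the authors' VLaMP implementation: the function names, the per-frame label input, the `TOY_RECIPES` table, and the steps "cut wood" and "assemble shelf" are all hypothetical, and a simple lookup stands in for the pre-trained LM used as the sequence model.

```python
from itertools import groupby
from typing import Callable, List, Optional, Sequence


def segment_actions(frame_labels: Sequence[str],
                    background: str = "background") -> List[str]:
    """Collapse per-frame labels into an ordered action history.

    Assumes some upstream model has already labeled each frame;
    consecutive identical labels form one segment.
    """
    return [label for label, _ in groupby(frame_labels) if label != background]


def forecast_plan(goal: str,
                  history: Sequence[str],
                  next_action: Callable[[str, List[str]], Optional[str]],
                  horizon: int) -> List[str]:
    """Autoregressively predict up to `horizon` future actions.

    `next_action` is a placeholder for the sequence model: it sees the
    goal plus everything observed or predicted so far.
    """
    context = list(history)
    plan: List[str] = []
    for _ in range(horizon):
        action = next_action(goal, context)
        if action is None:  # the model signals that the goal is complete
            break
        plan.append(action)
        context.append(action)
    return plan


# Hypothetical stand-in for the sequence model: a fixed recipe lookup.
TOY_RECIPES = {
    "make a shelf": ["cut wood", "assemble shelf", "sand shelf", "paint shelf"],
}


def toy_next_action(goal: str, context: List[str]) -> Optional[str]:
    """Return the first recipe step not yet present in the context."""
    for step in TOY_RECIPES.get(goal, []):
        if step not in context:
            return step
    return None


frames = ["background", "cut wood", "cut wood", "background",
          "assemble shelf", "assemble shelf"]
history = segment_actions(frames)
plan = forecast_plan("make a shelf", history, toy_next_action, horizon=3)
print(history)
print(plan)
```

Keeping the forecaster behind a plain callable means the toy lookup could be swapped for a learned model without touching the planning loop, which mirrors the abstract's point that forecasting is treated as a generic sequence modeling problem.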

updated: Sat Aug 26 2023 06:22:41 GMT+0000 (UTC)

published: Mon Apr 17 2023 18:07:36 GMT+0000 (UTC)

arXiv
