Pretrained Language Models as Visual Planners for Human Assistance

Dhruvesh Patel; Hamid Eghbalzadeh; Nitin Kamra; Michael Louis Iuzzolino; Unnat Jain; Ruta Desai

To make progress towards multi-modal AI assistants that can guide users to achieve complex multi-step goals, we propose the task of Visual Planning for Assistance (VPA). Given a goal briefly described in natural language, e.g., "make a shelf", and a video of the user's progress so far, the aim of VPA is to obtain a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf", etc., to achieve the goal. This requires assessing the user's progress from the untrimmed video and relating it to the requirements of the underlying goal, i.e., the relevance of actions and the ordering dependencies amongst them. Consequently, this requires handling long video histories and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. We formulate the forecasting step as a multi-modal sequence modeling problem and present the Visual Language Model based Planner (VLaMP), which leverages pre-trained LMs as the sequence model. We demonstrate that VLaMP performs significantly better than baselines with respect to all metrics that evaluate the generated plan. Moreover, through extensive ablations, we also isolate the value of language pre-training, visual observations, and goal information for performance. We will release our data, model, and code to enable future research on visual planning for assistance.
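The two-stage decomposition described in the abstract (segment the video history into actions, then forecast the remaining actions) can be illustrated with a toy sketch. This is not VLaMP: the function names, the per-frame label input, the "background" label, and the demonstration data are all hypothetical, and a simple first-order transition-count model stands in for the pre-trained LM used as the sequence model.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

START, END = "<start>", "<end>"


def segment_actions(frame_labels: Sequence[str]) -> List[str]:
    """Toy stand-in for video action segmentation (hypothetical helper).

    Merges runs of identical per-frame labels into one action each and
    drops frames labeled "background".
    """
    history: List[str] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != "background":
            history.append(label)
        previous = label
    return history


def fit_transitions(
    demos: Sequence[Tuple[str, Sequence[str]]]
) -> Dict[Tuple[str, str], Counter]:
    """Count goal-conditioned action transitions from demonstrations.

    Assumption: this count table replaces the pre-trained LM purely for
    illustration; it only looks at the most recent action.
    """
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for goal, actions in demos:
        steps = [START, *actions, END]
        for current, following in zip(steps, steps[1:]):
            counts[(goal, current)][following] += 1
    return counts


def forecast_plan(goal, history, counts, horizon=3) -> List[str]:
    """Greedily roll out up to `horizon` future actions for the goal."""
    plan: List[str] = []
    current = history[-1] if history else START
    for _ in range(horizon):
        options = counts.get((goal, current))
        if not options:
            break  # unseen state: nothing to predict
        following = options.most_common(1)[0][0]
        if following == END:
            break  # goal considered complete
        plan.append(following)
        current = following
    return plan


# Hypothetical demonstration and observed frames for the goal "make a shelf".
demos = [("make a shelf",
          ["cut wood", "assemble shelf", "sand shelf", "paint shelf"])]
frames = ["background", "cut wood", "cut wood",
          "assemble shelf", "assemble shelf", "background"]

counts = fit_transitions(demos)
history = segment_actions(frames)
plan = forecast_plan("make a shelf", history, counts)
print(history)  # ['cut wood', 'assemble shelf']
print(plan)     # ['sand shelf', 'paint shelf']
```

A count table like this conditions only on the goal and the last observed action, so it cannot capture the long histories and complex ordering dependencies the abstract highlights; that limitation is the motivation the abstract gives for using a sequence model over the full multi-modal history.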

updated: Mon Apr 17 2023 18:07:36 GMT+0000 (UTC)

published: Mon Apr 17 2023 18:07:36 GMT+0000 (UTC)

arXiv
