Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

Razvan-George Pasca; Alexey Gavryushin; Yen-Ling Kuo; Luc Van Gool; Otmar Hilliges; Xi Wang

We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatiotemporal context formed by past actions on objects, which we term the action context. We propose TransFusion, a multimodal transformer-based architecture that exploits the representational power of language by summarising the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context, together with the next video frame, is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning, and the large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model and highlight the benefits of using language-based context summaries in a task where vision seems to suffice. On the Ego4D test set, our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP. Video and code are available at https://eth-ait.github.io/transfusion-proj/.
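The abstract reports its headline gain "in relative terms", which is easy to confuse with an absolute difference in mAP points. As a minimal sketch of how such a figure is usually computed, the snippet below divides the gain over a baseline by the baseline score. The function name `relative_improvement` and both mAP values are hypothetical and chosen only for illustration; they are not numbers reported by the paper.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Return the gain of `ours` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (ours - baseline) / baseline


# Hypothetical mAP scores (NOT taken from the paper), picked so that the
# relative gain comes out at the same size as the abstract's 40.4%.
baseline_map = 10.0
ours_map = 14.04

gain = relative_improvement(baseline_map, ours_map)
print(f"absolute gain: {ours_map - baseline_map:.2f} mAP points")
print(f"relative gain: {gain * 100:.1f}%")
```

With these made-up scores the absolute gain is about 4 mAP points while the relative gain is about 40%, which shows why the two ways of reporting should not be read interchangeably.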

updated: Fri Jun 23 2023 10:43:14 GMT+0000 (UTC)

published: Sun Jan 22 2023 21:30:12 GMT+0000 (UTC)

arXiv
