Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

Razvan-George Pasca; Alexey Gavryushin; Yen-Ling Kuo; Otmar Hilliges; Xi Wang

We study the task of object interaction anticipation in egocentric videos. Successfully predicting future actions and objects requires understanding the spatio-temporal context formed by past actions and object relationships. We propose TransFusion, a multimodal transformer-based architecture that exploits the representational power of language by concisely summarizing past actions. TransFusion leverages pre-trained image captioning models and summarizes the resulting captions, focusing on past actions and objects. This action context, together with a single input frame, is processed by a multimodal fusion module to forecast the next object interactions. By replacing dense video features with language representations, our model enables more efficient end-to-end learning and benefits from knowledge encoded in large pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model and the benefits of using language-based context summaries. Our method outperforms state-of-the-art approaches by 40.4% in overall mAP on the Ego4D test set. We show the generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at: https://eth-ait.github.io/transfusion-proj/.
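The caption-summarization step described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the TransFusion implementation: the keyword vocabularies `ACTION_VERBS` and `OBJECT_NOUNS` and the functions `summarize_captions` and `build_action_context` are hypothetical names, and simple keyword matching stands in for whatever summarization the model actually uses. Per-frame captions are reduced to an ordered list of (verb, noun) pairs, which is then rendered as a short text context that a fusion module could consume alongside a single frame.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical toy vocabularies (assumptions for illustration only).
ACTION_VERBS: Set[str] = {"take", "put", "cut", "open", "wash"}
OBJECT_NOUNS: Set[str] = {"knife", "onion", "bowl", "drawer", "tap"}


def summarize_captions(captions: Iterable[str]) -> List[Tuple[str, str]]:
    """Reduce per-frame captions to ordered (verb, noun) pairs.

    Captions without both a known verb and a known noun are skipped,
    and consecutive repeats of the same pair are collapsed.
    """
    summary: List[Tuple[str, str]] = []
    for caption in captions:
        words = [w.strip(".,").lower() for w in caption.split()]
        verb = next((w for w in words if w in ACTION_VERBS), None)
        noun = next((w for w in words if w in OBJECT_NOUNS), None)
        if verb is None or noun is None:
            continue
        pair = (verb, noun)
        if not summary or summary[-1] != pair:
            summary.append(pair)
    return summary


def build_action_context(captions: Iterable[str]) -> str:
    """Render the summarized pairs as one compact text string."""
    return "; ".join(f"{verb} {noun}" for verb, noun in summarize_captions(captions))


if __name__ == "__main__":
    frame_captions = [
        "A person will take the knife.",
        "They take the knife from the drawer.",
        "Hands cut an onion on a board.",
        "A kitchen counter.",
    ]
    print(build_action_context(frame_captions))
```

The sketch matches only exact base-form words, so inflected verbs such as "takes" would be missed; it is meant to show why a summarized context is far more compact than the raw captions, not to reproduce the reported results.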

updated: Sun Jan 22 2023 21:30:12 GMT+0000 (UTC)

published: Sun Jan 22 2023 21:30:12 GMT+0000 (UTC)

arXiv
