MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song; Wenhao Chai; Guanhong Wang; Yucheng Zhang; Haoyang Zhou; Feiyang Wu; Xun Guo; Tian Ye; Yan Lu; Jenq-Neng Hwang; Gaoang Wang



Recently, integrating video foundation models with large language models has made it possible to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain open challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism that includes a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
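
The two-tier memory described above can be illustrated with a minimal sketch. The abstract does not specify how short-term memory is updated or how tokens are compacted, so everything here is an assumption made for illustration: the hypothetical `TokenMemory` class, the fixed-capacity short-term buffer, and the consolidation rule that repeatedly averages the most similar pair of adjacent tokens. It is not the authors' implementation.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two token vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(tokens, target_len):
    """Shrink a token sequence to target_len tokens.

    Assumed rule (not taken from the abstract): repeatedly average the
    most similar pair of adjacent tokens until the target length is reached.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > target_len and len(tokens) > 1:
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens


class TokenMemory:
    """Hypothetical two-tier memory whose entries are token vectors."""

    def __init__(self, short_capacity=4, consolidated_len=2):
        self.short_capacity = short_capacity      # assumed buffer size
        self.consolidated_len = consolidated_len  # assumed compaction target
        self.short_term = []  # rapidly updated, holds the most recent tokens
        self.long_term = []   # compact, grows slowly as the video proceeds

    def add(self, token):
        """Store one new token; compact the buffer once it is full."""
        self.short_term.append(list(token))
        if len(self.short_term) == self.short_capacity:
            self.long_term.extend(
                consolidate(self.short_term, self.consolidated_len)
            )
            self.short_term = []

    def read(self):
        """Return long-term tokens followed by the current short-term tokens."""
        return self.long_term + self.short_term
```

With these assumed defaults, every four incoming tokens are reduced to two long-term tokens, so memory grows at half the rate of the input while the most recent tokens stay available at full resolution.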

updated: Mon Jul 31 2023 07:15:45 GMT+0000 (UTC)

published: Mon Jul 31 2023 07:15:45 GMT+0000 (UTC)

arXiv

