VideoLLM: Modeling Video Sequence with Large Language Models

Guo Chen; Yin-Dong Zheng; Jiahao Wang; Jilan Xu; Yifei Huang; Junting Pan; Yi Wang; Yali Wang; Yu Qiao; Tong Lu; Limin Wang

With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack the general capability to handle diverse tasks. The success of large language models (LLMs) such as GPT has demonstrated their impressive abilities in sequence causal reasoning. Building on this insight, we propose VideoLLM, a novel framework that leverages the sequence reasoning capabilities of LLMs pre-trained for natural language processing (NLP) to understand video sequences. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. With the aid of a simple task head, VideoLLM provides an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods, covering eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM.
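The data flow described in the abstract (Modality Encoder, Semantic Translator, decoder-only LLM, task head) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the released code: the function names, the use of a linear projection as the translator, the per-timestep task head, and the causal running mean that stands in for a real pre-trained LLM. The sketch only shows how the stages compose and that the output at step t depends on inputs up to step t.

```python
import random
from typing import List

Vector = List[float]


def init_weight(d_out: int, d_in: int, rng: random.Random) -> List[Vector]:
    """Random (d_out x d_in) matrix; a placeholder for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Matrix-vector product: maps a d_in vector to a d_out vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def modality_encoder(frames: List[Vector]) -> List[Vector]:
    """Hypothetical encoder: assumes per-frame features are precomputed."""
    return [list(f) for f in frames]


def semantic_translator(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Assumed form: a linear projection into the LLM's token space."""
    return [linear(f, weight) for f in features]


def causal_llm(tokens: List[Vector]) -> List[Vector]:
    """Stand-in for a decoder-only LLM: a causal running mean, so the
    output at position t only sees tokens 0..t."""
    outputs: List[Vector] = []
    running = [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, tok)]
        outputs.append([r / t for r in running])
    return outputs


def task_head(hidden: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Assumed simple head: one linear layer applied at every timestep."""
    return [linear(h, weight) for h in hidden]


def run_pipeline(frames: List[Vector], w_translate: List[Vector],
                 w_head: List[Vector]) -> List[Vector]:
    """Compose the four stages in the order the abstract describes."""
    tokens = semantic_translator(modality_encoder(frames), w_translate)
    return task_head(causal_llm(tokens), w_head)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[rng.random() for _ in range(8)] for _ in range(5)]  # 5 frames, 8-dim
    w_translate = init_weight(16, 8, rng)  # visual features -> 16-dim tokens
    w_head = init_weight(3, 16, rng)       # 16-dim hidden state -> 3 classes
    logits = run_pipeline(frames, w_translate, w_head)
    print(len(logits), len(logits[0]))     # one 3-way prediction per frame
```

In an actual system the running mean would be replaced by a pre-trained decoder-only LLM, and the translator and head would be trained; the sketch fixes none of those choices.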

updated: Tue May 23 2023 07:48:15 GMT+0000 (UTC)

published: Mon May 22 2023 17:51:22 GMT+0000 (UTC)

arXiv
