Learning video embedding space with Natural Language Supervision

Phani Krishna Uppala; Abhishek Bamotra; Shriti Priya; Vaidehi Joshi

The recent success of the CLIP model has shown its potential to be applied to a wide range of vision and language tasks. However, CLIP only establishes an embedding-space relationship between language and images, not between language and video. In this paper, we propose a novel approach to map the video embedding space to natural language. The approach has two stages: it first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode those visual features for the video domain, along with the corresponding text descriptions. We evaluate our method on two benchmark datasets, UCF101 and HMDB51, and achieve state-of-the-art performance on both.
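
The idea of comparing a video representation with text descriptions in a shared embedding space can be illustrated with a small, self-contained sketch. This is not the authors' implementation. The abstract does not say how per-frame features are combined, so mean pooling is an assumption here, and the toy vectors merely stand in for the outputs of a pre-trained CNN and a CLIP-style text encoder. All function names (mean_pool, classify_video, and so on) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    Assumption: the abstract does not specify the aggregation step;
    mean pooling is used here only as the simplest placeholder.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[i] for frame in frame_features) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify_video(frame_features, text_embeddings):
    """Pick the text label whose embedding is closest to the pooled video."""
    video = mean_pool(frame_features)
    scores = {
        label: cosine_similarity(video, emb)
        for label, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: three frames of 3-dimensional "features" (stand-ins for CNN
# outputs) and two label embeddings (stand-ins for encoded text prompts).
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
texts = {"archery": [1.0, 0.0, 0.0], "bowling": [0.0, 1.0, 0.0]}

best_label, scores = classify_video(frames, texts)
print(best_label)
```

In a real pipeline the frame vectors would come from a pre-trained image backbone and the label vectors from a text encoder trained into the same space. The nearest-label lookup by cosine similarity is the standard way CLIP-style models are used for zero-shot classification on datasets such as UCF101 and HMDB51.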

updated: Sat Apr 08 2023 02:44:20 GMT+0000 (UTC)

published: Sat Mar 25 2023 23:24:57 GMT+0000 (UTC)

arXiv
