Leveraging Local Temporal Information for Multimodal Scene Classification

Saurabh Sahu; Palash Goyal

Robust video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively. Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks. However, the use of Transformer-based models for video understanding is still relatively unexplored. Moreover, these models fail to exploit the strong temporal relationships between neighboring video frames to obtain potent frame-level representations. In this paper, we propose a novel self-attention block that leverages both local and global temporal relationships between video frames to obtain better contextualized representations for the individual frames. This enables the model to understand the video at various granularities. We illustrate the performance of our models on the large-scale YouTube-8M dataset on the task of video categorization and further analyze the results to showcase the improvement.
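
The abstract does not spell out how the proposed block combines local and global temporal context, so the following is only a minimal sketch of the general idea, not the authors' architecture. It assumes plain dot-product attention without learned projections, a hypothetical `window` parameter that limits local attention to neighboring frames, and fusion by concatenation; all function names are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frames, window=None):
    """Dot-product self-attention over frame features.

    window=None lets every frame attend to all frames (global);
    window=k restricts frame i to frames i-k..i+k (local).
    """
    dim = len(frames[0])
    out = []
    for i, query in enumerate(frames):
        if window is None:
            idx = list(range(len(frames)))
        else:
            lo = max(0, i - window)
            hi = min(len(frames), i + window + 1)
            idx = list(range(lo, hi))
        scores = [
            sum(q * k for q, k in zip(query, frames[j])) / math.sqrt(dim)
            for j in idx
        ]
        weights = softmax(scores)
        out.append([
            sum(w * frames[j][d] for w, j in zip(weights, idx))
            for d in range(dim)
        ])
    return out

def local_global_block(frames, window=1):
    # Assumption: fuse the two views by concatenating them per frame.
    local = attend(frames, window)
    global_ = attend(frames)
    return [l + g for l, g in zip(local, global_)]

# Four toy frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out = local_global_block(frames, window=1)
print(len(out), len(out[0]))
```

Each output row holds a frame's locally contextualized features followed by its globally contextualized ones, so a downstream classifier would see both granularities. A real model would add learned query, key, and value projections and multiple heads.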

updated: Tue Oct 26 2021 19:58:32 GMT+0000 (UTC)

published: Tue Oct 26 2021 19:58:32 GMT+0000 (UTC)

arXiv
