A Multi-modal Deep Learning Model for Video Thumbnail Selection

Zhifeng Yu; Nanchun Shi

Thumbnails are the face of online videos. The explosive growth of videos in both number and variety underscores the importance of a good thumbnail, because it saves potential viewers time when choosing videos and can even entice them to click. A good thumbnail should be a frame that best represents the content of a video while also capturing viewers' attention. However, past techniques and models focus only on the frames within a video, and we believe such a narrow focus leaves out much useful information that is part of a video. In this paper, we expand the definition of content to include the title, description, and audio of a video, and we use the information provided by these modalities in our selection model. Specifically, our model first samples frames uniformly in time and, to avoid the computational burden of processing all frames in downstream tasks, returns the top 1,000 frames in this subset with the highest aesthetic scores as rated by a Double-column Convolutional Neural Network. Then, the model incorporates frame features extracted from VGG16, text features from ELECTRA, and audio features from TRILL. These models were selected because of their results on popular datasets and their competitive performance. After feature extraction, the time-series features (frames and audio) are fed into Transformer encoder layers, which return a vector representing each corresponding modality. Each of the four features (frames, title, description, audio) passes through a context gating layer before concatenation. Finally, our model generates a vector in the latent space and selects the frame that is most similar to this vector in that space. To the best of our knowledge, we are the first to propose a multi-modal deep learning model for selecting video thumbnails, and it outperforms previous state-of-the-art models.
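The non-learned steps of the pipeline described in the abstract (uniform temporal sampling, top-k filtering by aesthetic score, context gating, and similarity-based frame selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names are hypothetical, the aesthetic scores are placeholder inputs standing in for the Double-column CNN, the gate is assumed to take the commonly used form sigmoid(Wx + b) multiplied element-wise with x, and "most similar" is assumed to mean cosine similarity, which the abstract does not specify.

```python
import numpy as np


def sample_uniform_indices(num_frames, num_samples):
    """Pick frame indices evenly spaced in time."""
    num_samples = min(num_samples, num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def top_k_by_score(indices, scores, k):
    """Keep the k sampled frames with the highest scores, in temporal order.

    `scores` is a placeholder for aesthetic scores; the paper obtains them
    from a Double-column CNN, which is not reproduced here.
    """
    k = min(k, len(indices))
    keep = np.argsort(scores)[::-1][:k]
    return np.sort(indices[keep])


def context_gate(x, weight, bias):
    """Assumed gating form: sigmoid(W x + b) * x, applied element-wise."""
    gate = 1.0 / (1.0 + np.exp(-(weight @ x + bias)))
    return gate * x


def most_similar_frame(frame_vectors, target):
    """Row index of the frame vector with the highest cosine similarity
    to `target` (cosine similarity is an assumption of this sketch)."""
    frames = frame_vectors / np.linalg.norm(frame_vectors, axis=1, keepdims=True)
    target = target / np.linalg.norm(target)
    return int(np.argmax(frames @ target))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical video with 300 frames; sample 30 and keep the best 10.
    sampled = sample_uniform_indices(num_frames=300, num_samples=30)
    fake_scores = rng.random(len(sampled))  # stand-in aesthetic scores
    candidates = top_k_by_score(sampled, fake_scores, k=10)

    # Stand-in 8-dimensional latent vectors for candidates and the target.
    dim = 8
    frame_vectors = rng.normal(size=(len(candidates), dim))
    fused = context_gate(rng.normal(size=dim), rng.normal(size=(dim, dim)), np.zeros(dim))

    best = most_similar_frame(frame_vectors, fused)
    print("selected frame index:", candidates[best])
```

In this sketch the filtering step runs before any feature extraction, mirroring the abstract's stated goal of avoiding the cost of processing every frame downstream. The dimensions and random inputs in the demo are illustrative only.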

updated: Thu Dec 31 2020 21:10:09 GMT+0000 (UTC)

published: Thu Dec 31 2020 21:10:09 GMT+0000 (UTC)

arXiv
