Smoothed Gaussian Mixture Models for Video Classification and Recommendation

Sirjan Kafle; Aman Gupta; Xue Xia; Ananth Sankar; Xi Chen; Di Wen; Liang Zhang

Cluster-and-aggregate techniques such as Vector of Locally Aggregated Descriptors (VLAD), and their end-to-end discriminatively trained equivalents such as NetVLAD, have recently been popular for video classification and action recognition tasks. These techniques assign video frames to clusters and then represent the video by aggregating the residuals of the frames with respect to the mean of each cluster. Since some clusters may see very little video-specific data, these features can be noisy. In this paper, we propose a new cluster-and-aggregate method, which we call the smoothed Gaussian mixture model (SGMM), and its end-to-end discriminatively trained equivalent, which we call the deep smoothed Gaussian mixture model (DSGMM). SGMM represents each video by the parameters of a Gaussian mixture model (GMM) trained for that video. Low-count clusters are addressed by smoothing the video-specific estimates with a universal background model (UBM) trained on a large number of videos. The primary benefit of SGMM over VLAD is this smoothing, which makes it less sensitive to a small number of training samples. We show, through extensive experiments on the YouTube-8M classification task, that SGMM/DSGMM is consistently better than VLAD/NetVLAD by a small but statistically significant margin. We also show results on a dataset created at LinkedIn to predict whether a member will watch an uploaded video.
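
The contrast between residual aggregation and smoothing can be illustrated with a small sketch. The abstract does not give the SGMM update equations, so the code below is an assumption rather than the authors' method: it uses hard nearest-center assignment (a GMM would use soft posteriors), smooths only the cluster means, and borrows the standard MAP-adaptation form, in which a per-video mean is pulled toward the background mean with a strength set by a hypothetical relevance factor `tau`. The function names `assign_clusters`, `vlad_features`, and `smoothed_means` are illustrative only.

```python
import numpy as np


def assign_clusters(frames, centers):
    """Return the index of the nearest center for each frame (hard assignment)."""
    # Squared Euclidean distances, shape (num_frames, num_clusters).
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def vlad_features(frames, centers):
    """VLAD-style descriptor: per-cluster sum of residuals (frame - center)."""
    labels = assign_clusters(frames, centers)
    out = np.zeros_like(centers, dtype=float)
    for k in range(centers.shape[0]):
        members = frames[labels == k]
        if len(members):
            out[k] = (members - centers[k]).sum(axis=0)
    return out


def smoothed_means(frames, centers, tau=10.0):
    """Per-video cluster means smoothed toward background means.

    ASSUMPTION: MAP-style interpolation, not taken from the paper:
        mean_k = (sum of frames in cluster k + tau * center_k) / (n_k + tau)
    With few frames (small n_k) the estimate stays near the background
    mean; with many frames it approaches the video-specific mean.
    """
    labels = assign_clusters(frames, centers)
    out = np.empty_like(centers, dtype=float)
    for k in range(centers.shape[0]):
        members = frames[labels == k]
        n_k = len(members)
        # Summing an empty selection yields zeros, so an empty cluster
        # falls back exactly to the background mean.
        out[k] = (members.sum(axis=0) + tau * centers[k]) / (n_k + tau)
    return out


if __name__ == "__main__":
    # Toy background means (standing in for a UBM) and one video's frames.
    ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
    video = np.array([[1.0, 1.0], [1.0, -1.0], [9.0, 11.0]])
    print(vlad_features(video, ubm_means))
    print(smoothed_means(video, ubm_means, tau=1.0))
```

In this sketch a cluster that receives a single frame contributes a full-size residual to the VLAD descriptor, whereas the smoothed mean moves only part of the way from the background mean toward that frame. This is the low-count sensitivity the abstract describes.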

updated: Thu Dec 17 2020 06:52:41 GMT+0000 (UTC)

published: Thu Dec 17 2020 06:52:41 GMT+0000 (UTC)

arXiv
