Self-supervised Video-centralised Transformer for Video Face Clustering

Yujiang Wang; Mingzhi Dong; Jie Shen; Yiming Luo; Yiming Lin; Pingchuan Ma; Stavros Petridis; Maja Pantic



This paper presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representations and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture complicated video dynamics. In addition, despite recent progress in video-based contrastive learning, few works have attempted to learn a self-supervised, clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that better reflect the temporally varying properties of faces in videos, and we propose a video-centralised self-supervised framework to train the transformer model. We also investigate face clustering in egocentric videos, a fast-emerging field that has not yet been studied in works related to face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset, named EasyCom-Clustering. We evaluate our proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show that our video-centralised transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.
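To illustrate the contrast the abstract draws between average pooling and attention-based aggregation of frame-level features, here is a minimal toy sketch. It is not the authors' model: it uses a single hand-picked query vector in place of a trained transformer, and the names `mean_pool`, `attention_pool`, `frames`, and `query` are hypothetical, chosen only for this example.

```python
import math

def mean_pool(frames):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Weight frames by scaled dot-product similarity to a query vector.

    Assumption: a fixed query stands in for what a trained model would learn.
    """
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim)
              for f in frames]
    weights = softmax(scores)
    pooled = [sum(w * f[d] for w, f in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights

# Toy frame-level features: 3 frames, 2 dimensions (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]  # hypothetical query vector

video_mean = mean_pool(frames)
video_attn, weights = attention_pool(frames, query)
```

Mean pooling treats every frame equally, whereas the attention variant assigns each frame its own weight, so frames that match the query contribute more to the video-level vector. A full transformer goes further by letting every frame attend to every other frame across several layers.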

updated: Wed Feb 15 2023 18:30:00 GMT+0000 (UTC)

published: Thu Mar 24 2022 16:38:54 GMT+0000 (UTC)

arXiv
