The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Haider Al-Tahan; Yalda Mohsenzadeh


Contrastive learning has been extremely successful for auditory and visual perception when each modality is studied on its own. However, major questions remain about how to integrate the principles learned in both domains to obtain effective audiovisual representations. In this paper, we present a contrastive framework for learning audiovisual representations from unlabeled videos. The type and strength of the augmentations used during self-supervised pre-training play a crucial role in how well contrastive frameworks perform. Hence, we extensively investigate compositions of temporal augmentations suitable for learning audiovisual representations, and we find that lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective. Furthermore, we show that the effectiveness of these transformations scales with higher temporal resolution and stronger transformation intensity. Compared to self-supervised models pre-trained with only sampling-based temporal augmentation, self-supervised models pre-trained with our temporal augmentations achieve a gain of approximately 6.5% in linear classifier performance on the AVE dataset. Lastly, we show that, despite their simplicity, our proposed transformations work well across self-supervised learning frameworks (SimSiam, MoCoV3, etc.) and on the benchmark audiovisual dataset (AVE).
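As a rough illustration of what a contrastive objective over paired audio and video clips can look like, the sketch below computes a symmetric InfoNCE-style loss with NumPy. It is a generic example and not the authors' implementation: the abstract does not specify the loss, and the function names (`audiovisual_info_nce`, `l2_normalize`, `log_softmax`), the temperature value, and the assumption that row `i` of each embedding matrix comes from the same video are all hypothetical choices made here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def audiovisual_info_nce(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss between paired audio and video embeddings.

    Assumption (not from the paper): row i of `audio_emb` and row i of
    `video_emb` come from the same clip and form the positive pair; every
    other row in the batch serves as a negative.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])             # positives sit on the diagonal
    loss_a2v = -log_softmax(logits)[idx, idx].mean()    # audio -> video
    loss_v2a = -log_softmax(logits.T)[idx, idx].mean()  # video -> audio
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))        # stand-in for audio encoder outputs
    video = audio + 0.1 * rng.normal(size=(8, 16))  # stand-in for video encoder outputs
    print("loss on roughly aligned pairs:", audiovisual_info_nce(audio, video))
```

In a setup like the one the abstract describes, the two embedding matrices would come from audio and video encoders applied to augmented views of the same unlabeled clips; the augmentation pipeline itself is outside the scope of this sketch.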

updated: Wed Oct 13 2021 23:48:58 GMT+0000 (UTC)

published: Wed Oct 13 2021 23:48:58 GMT+0000 (UTC)

arXiv
