Sound and Visual Representation Learning with Multiple Pretraining Tasks

Arun Balajee Vasudevan; Dengxin Dai; Luc Van Gool

Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well to all downstream tasks. Specifically, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms both single-SSL-task models and fully supervised models on the downstream tasks. To check applicability to another modality, we also formulate Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses the recent methods MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and by +2.83, +1.56 and +1.61 AP on COCO detection, respectively. Code will be made publicly available.
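As a rough illustration of how a pretext task such as temporal gap prediction can produce its own labels, the sketch below samples two segments from a single audio channel and uses the index of the gap between them as the training target. The abstract does not give the paper's formulation, so everything here is an assumption made for illustration: the function name `sample_gap_example`, the discretization of the gap into a fixed list `gap_choices`, and the use of a plain list of samples in place of binaural audio.

```python
import random
from typing import List, Sequence, Tuple


def sample_gap_example(
    track: Sequence[float],
    segment_len: int,
    gap_choices: Sequence[int],
    rng: random.Random,
) -> Tuple[List[float], List[float], int]:
    """Return two segments of `track` and the index of the gap between them.

    Hypothetical helper, not the paper's code. The label is the position of
    the sampled gap in `gap_choices`, so it comes from the sampling procedure
    itself and needs no human annotation.
    """
    # Pick the gap class first; this index is the self-supervised label.
    label = rng.randrange(len(gap_choices))
    gap = gap_choices[label]

    # Both segments plus the gap must fit inside the track.
    needed = 2 * segment_len + gap
    if needed > len(track):
        raise ValueError("track too short for the requested segments and gap")

    # Choose where the first segment starts, then skip `gap` samples.
    start = rng.randrange(len(track) - needed + 1)
    first = list(track[start:start + segment_len])
    second_start = start + segment_len + gap
    second = list(track[second_start:second_start + segment_len])
    return first, second, label
```

A model trained on such pairs would take `first` and `second` as input and classify `label`. The other proposed tasks and the incremental learning schedule are not sketched here, since the abstract does not describe them in enough detail.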

updated: Tue Jan 04 2022 09:09:38 GMT+0000 (UTC)

published: Tue Jan 04 2022 09:09:38 GMT+0000 (UTC)

arXiv
