ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning

Zhiwu Qing; Ziyuan Huang; Shiwei Zhang; Mingqian Tang; Changxin Gao; Marcelo H. Ang Jr; Rong Jin; Nong Sang

The central idea of contrastive learning is to discriminate between different instances while forcing different views of the same instance to share the same representation. To avoid trivial solutions, augmentation plays an important role in generating different views, and random cropping in particular has been shown to help the model learn a strong, generalized representation. The commonly used random crop operation keeps the difference between the two views statistically consistent throughout training. In this work, we challenge this convention by showing that adaptively controlling the disparity between the two augmented views over the course of training enhances the quality of the learnt representation. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cubic region from the video using differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data. The visualizations show that the center distance and the IoU between the two augmented views are adaptively controlled by ParamCrop, and that the learned change in disparity over the course of training is beneficial to learning a strong representation. Extensive ablation studies demonstrate the effectiveness of the proposed ParamCrop on multiple contrastive learning frameworks and video backbones. With ParamCrop, we improve the state-of-the-art performance on both the HMDB51 and UCF101 datasets.
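The two disparity measures mentioned in the abstract, center distance and IoU between two cropped views, can be illustrated with a minimal sketch. It is not the authors' implementation: the `CubicCrop` class, its axis-aligned (time, height, width) parameterization in normalized [0, 1] coordinates, and the function names are assumptions made for illustration only, and the sketch ignores the rotation and shear that a general 3D affine transformation could introduce.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CubicCrop:
    """Hypothetical axis-aligned 3D crop (not from the paper).

    center and size are given along (time, height, width) in
    normalized [0, 1] video coordinates.
    """
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]

    def bounds(self) -> List[Tuple[float, float]]:
        # (low, high) extent of the crop along each of the three axes.
        return [(c - s / 2, c + s / 2) for c, s in zip(self.center, self.size)]

    def volume(self) -> float:
        return math.prod(self.size)


def center_distance(a: CubicCrop, b: CubicCrop) -> float:
    # Euclidean distance between the two crop centers.
    return math.dist(a.center, b.center)


def iou_3d(a: CubicCrop, b: CubicCrop) -> float:
    # Intersection volume is the product of the per-axis overlaps.
    intersection = 1.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a.bounds(), b.bounds()):
        overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
        if overlap <= 0:
            return 0.0  # Disjoint along this axis, so no shared volume.
        intersection *= overlap
    union = a.volume() + b.volume() - intersection
    return intersection / union


view_a = CubicCrop(center=(0.50, 0.5, 0.5), size=(0.5, 0.5, 0.5))
view_b = CubicCrop(center=(0.75, 0.5, 0.5), size=(0.5, 0.5, 0.5))
print(center_distance(view_a, view_b), iou_3d(view_a, view_b))
```

In this example the two views are shifted only along the time axis: a larger center distance or a smaller IoU corresponds to a greater disparity between the views.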

updated: Tue Aug 24 2021 03:18:12 GMT+0000 (UTC)

published: Tue Aug 24 2021 03:18:12 GMT+0000 (UTC)

arXiv
