Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V. Huynh; Lauren E. Gillespie; Jael Lopez-Saucedo; Claire Tang; Rohan Sikand; Moisés Expósito-Alonso

Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including >3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at hf.co/datasets/andyvhuynh/NatureMultiView.
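The abstract does not state the exact training objective, so the following is only an illustrative sketch of how paired ground-level and aerial views could be trained contrastively. It assumes a CLIP-style symmetric InfoNCE loss over already-computed embeddings; the function names, the temperature value, and the use of NumPy are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(ground_emb, aerial_emb, temperature=0.07):
    """Hypothetical CLIP-style loss; row i of each array is one ground/aerial pair."""
    g = l2_normalize(ground_emb)
    a = l2_normalize(aerial_emb)
    logits = g @ a.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(g))                 # matching pairs lie on the diagonal
    loss_g2a = -log_softmax(logits)[idx, idx].mean()    # ground -> aerial
    loss_a2g = -log_softmax(logits.T)[idx, idx].mean()  # aerial -> ground
    return 0.5 * (loss_g2a + loss_a2g)

# Toy check with random stand-in embeddings (not real NMV data).
rng = np.random.default_rng(0)
ground = rng.normal(size=(8, 16))
aerial_matched = ground + 0.01 * rng.normal(size=ground.shape)
aerial_shuffled = np.roll(ground, 1, axis=0)
loss_matched = symmetric_contrastive_loss(ground, aerial_matched)
loss_shuffled = symmetric_contrastive_loss(ground, aerial_shuffled)
```

Under these assumptions, correctly paired views give a lower loss than mispaired ones, which is the signal that would pull the two encoders toward a shared representation. Because each encoder produces its own embedding, the ground-level encoder could then be fine-tuned for classification without the aerial view.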

updated: Sat Sep 28 2024 19:07:22 GMT+0000 (UTC)

published: Sat Sep 28 2024 19:07:22 GMT+0000 (UTC)

arXiv

