Self-supervised similarity search for large scientific datasets

George Stein; Peter Harrington; Jacqueline Blaum; Tomislav Medan; Zarija Lukic

We present the use of self-supervised learning to explore and exploit large unlabeled datasets. Focusing on 42 million galaxy images from the latest data release of the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys, we first train a self-supervised model to distill low-dimensional representations that are robust to symmetries, uncertainties, and noise in each image. We then use the representations to construct and publicly release an interactive semantic similarity search tool. We demonstrate how our tool can be used to rapidly discover rare objects given only a single example, increase the speed of crowd-sourcing campaigns, and construct and improve training sets for supervised applications. While we focus on images from sky surveys, the technique is straightforward to apply to any scientific dataset of any dimensionality. The similarity search web app can be found at https://github.com/georgestein/galaxy_search
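The similarity search step described above, retrieving the items whose representations lie closest to a single query example, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes cosine similarity as the distance measure, uses random vectors in place of learned galaxy representations, and the function names `normalize_rows` and `similarity_search` are hypothetical.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def similarity_search(representations, query_index, k=5):
    """Return the indices and cosine similarities of the k items most
    similar to the query item, excluding the query itself.

    Assumes `representations` holds at least two rows.
    """
    unit = normalize_rows(representations)
    scores = unit @ unit[query_index]      # cosine similarity to the query
    scores[query_index] = -np.inf          # never return the query itself
    k = min(k, len(scores) - 1)
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k candidates
    top = top[np.argsort(-scores[top])]         # order by decreasing similarity
    return top, scores[top]


# Stand-in data (assumption): 1000 random 16-dimensional "representations".
rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 16))

# Plant two near-duplicates of item 0 to mimic rare objects similar to a query.
reps[1] = reps[0] + 0.01 * rng.normal(size=16)
reps[2] = reps[0] + 0.02 * rng.normal(size=16)

neighbors, sims = similarity_search(reps, query_index=0, k=3)
print(neighbors, sims)
```

A brute-force scan like this is adequate for small collections; for tens of millions of items, an approximate nearest-neighbor index would likely be needed, though the choice of index here is left open.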

updated: Tue Nov 30 2021 19:01:18 GMT+0000 (UTC)

published: Mon Oct 25 2021 18:00:00 GMT+0000 (UTC)

arXiv

