A Bag-of-Prototypes Representation for Dataset-Level Applications

Weijie Tu; Weijian Deng; Tom Gedeon; Liang Zheng

This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to both tasks is measuring the underlying relationship between datasets. This requires a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag of patch descriptors to a dataset-level bag of semantic prototypes. Specifically, we build a codebook of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset's semantic distribution. Furthermore, BoP representations work well with the Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently shows an advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
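The pipeline described in the abstract (codebook, quantization, histogram, Jensen-Shannon divergence) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says the prototypes are "clustered" from a reference dataset, so plain k-means is an assumption here, as are the use of Euclidean nearest-prototype assignment, the function names (`build_codebook`, `bop_histogram`, `js_divergence`), and the premise that image features have already been extracted into NumPy arrays of shape `(n_images, dim)`.

```python
import numpy as np


def nearest_prototype(features, centers):
    """Index of the closest prototype (squared Euclidean) for each feature."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def build_codebook(ref_features, k, n_iters=20, seed=0):
    """Cluster reference features into k prototypes (assumed: Lloyd k-means)."""
    rng = np.random.default_rng(seed)
    # Initialise the prototypes from k distinct reference features.
    idx = rng.choice(len(ref_features), size=k, replace=False)
    centers = ref_features[idx].astype(float)
    for _ in range(n_iters):
        assign = nearest_prototype(ref_features, centers)
        for j in range(k):
            members = ref_features[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers


def bop_histogram(features, centers):
    """Quantize each feature to a prototype; return a normalised K-dim histogram."""
    assign = nearest_prototype(features, centers)
    counts = np.bincount(assign, minlength=len(centers)).astype(float)
    return counts / counts.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Usage with synthetic stand-ins for extracted image features.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
dataset_a = rng.normal(size=(200, 8))
dataset_b = rng.normal(loc=1.5, size=(200, 8))

codebook = build_codebook(reference, k=16)
hist_a = bop_histogram(dataset_a, codebook)
hist_b = bop_histogram(dataset_b, codebook)
print("JS(a, b) =", js_divergence(hist_a, hist_b))
```

No labels are used anywhere in the sketch, which matches the abstract's claim that BoP does not assume access to dataset labels. With the natural logarithm, the divergence lies between 0 (identical histograms) and ln 2 (histograms with disjoint support).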

updated: Thu Mar 23 2023 13:33:58 GMT+0000 (UTC)

published: Thu Mar 23 2023 13:33:58 GMT+0000 (UTC)

arXiv
