Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching

Donggyun Kim; Jinwoo Kim; Seongwoong Cho; Chong Luo; Seunghoon Hong

Dense prediction tasks are a fundamental class of problems in computer vision. As supervised methods suffer from high pixel-wise labeling cost, a few-shot learning solution that can learn any dense task from a few labeled images is desired. Yet, current few-shot learning methods target a restricted set of tasks such as semantic segmentation, presumably due to challenges in designing a general and unified model that is able to flexibly and efficiently adapt to arbitrary tasks of unseen semantics. We propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. It employs non-parametric matching on patch-level embedded tokens of images and labels that encapsulates all tasks. Also, VTM flexibly adapts to any task with a tiny amount of task-specific parameters that modulate the matching algorithm. We implement VTM as a powerful hierarchical encoder-decoder architecture involving ViT backbones, where token matching is performed at multiple feature hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with fully supervised baselines using only 10 labeled examples of novel tasks (0.004% of full supervision) and sometimes outperforms them using 0.1% of full supervision. Code is available at https://github.com/GitGyun/visual_token_matching.
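
The non-parametric matching on patch-level tokens mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name match_tokens, the dot-product similarity, the softmax weighting, and the temperature parameter are all assumptions made for illustration, and the sketch leaves out the ViT encoders, the multi-level feature hierarchy, the decoder, and the task-specific parameters that the abstract describes. It only shows the general idea of predicting a label token for each query patch as a similarity-weighted average of label tokens from the labeled support images.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_tokens(query_tokens, support_image_tokens, support_label_tokens,
                 temperature=1.0):
    """Hypothetical token matching (illustrative assumption, not VTM itself).

    For each query image token, compute its similarity to every support
    image token, then return the similarity-weighted average of the
    corresponding support label tokens.
    """
    label_dim = len(support_label_tokens[0])
    predictions = []
    for q in query_tokens:
        scores = [dot(q, k) / temperature for k in support_image_tokens]
        weights = softmax(scores)
        predicted = [
            sum(w * label[d] for w, label in zip(weights, support_label_tokens))
            for d in range(label_dim)
        ]
        predictions.append(predicted)
    return predictions


# Two support patches with 2-D image tokens and 1-D label tokens.
support_images = [[1.0, 0.0], [0.0, 1.0]]
support_labels = [[1.0], [0.0]]

# The first query strongly resembles support patch 0; the second is equidistant.
queries = [[10.0, 0.0], [1.0, 1.0]]
print(match_tokens(queries, support_images, support_labels))
```

In this toy setting the first query's prediction is dominated by the label of the support patch it resembles, while the second query, equally similar to both support patches, receives the average of their labels. Because the matching step has no learned weights of its own, changing the support set changes the predictions without retraining, which is the property that makes this style of matching a natural fit for few-shot adaptation.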

updated: Mon Mar 27 2023 07:58:42 GMT+0000 (UTC)

published: Mon Mar 27 2023 07:58:42 GMT+0000 (UTC)

arXiv
