Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang; Jiaxing Huang; Sheng Jin; Shijian Lu

Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, which makes the visual recognition paradigm laborious and time-consuming. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs, which are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) a review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

updated: Fri Feb 16 2024 10:28:12 GMT+0000 (UTC)

published: Mon Apr 03 2023 02:17:05 GMT+0000 (UTC)

arXiv
