Arnau Martí Sarri; Victor Rodriguez-Fernandez

An implementation of the "Guess who?" game using CLIP

CLIP (Contrastive Language-Image Pretraining) is an efficient method for learning computer vision tasks from natural language supervision, and its zero-shot transfer capabilities have powered a recent breakthrough in deep learning. Trained on image-text pairs available on the internet, the CLIP model transfers non-trivially to most tasks without any dataset-specific training. In this work, we use CLIP to implement the engine of the popular game "Guess who?": the player interacts with the game through natural language prompts, and CLIP automatically decides whether each image on the game board fulfills the prompt. We study the performance of this approach by benchmarking different ways of prompting CLIP with the questions, and show the limitations of its zero-shot capabilities.
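The decision step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which prompting schemes the authors benchmark, so the scheme below (comparing a prompt against its negation) is only one plausible assumption, not the paper's method. The names `prob_fulfills` and `answer_question` are hypothetical, the embeddings are toy 2-D vectors standing in for real CLIP image and text embeddings, and the logit scale of 100 is an assumed value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prob_fulfills(image_emb, yes_emb, no_emb, logit_scale=100.0):
    """Probability that an image matches the prompt rather than its negation.

    Assumption: yes_emb and no_emb are text embeddings of a prompt such as
    "a person with glasses" and a negated prompt such as "a person without
    glasses". The two scaled similarities go through a two-way softmax,
    which reduces to a sigmoid of their difference.
    """
    yes_logit = logit_scale * cosine(image_emb, yes_emb)
    no_logit = logit_scale * cosine(image_emb, no_emb)
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def answer_question(board, yes_emb, no_emb):
    """Map each character on the board to True if it fulfills the prompt."""
    return {
        name: prob_fulfills(emb, yes_emb, no_emb) > 0.5
        for name, emb in board.items()
    }


# Toy data: these vectors are made up for illustration, not CLIP outputs.
board = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}
yes_emb = [1.0, 0.0]
no_emb = [0.0, 1.0]
answers = answer_question(board, yes_emb, no_emb)
```

In a real engine the vectors would come from CLIP's image and text encoders, and the characters for which the answer differs from the hidden character's answer would be eliminated from the board.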

updated: Tue Nov 30 2021 13:10:52 GMT+0000 (UTC)

published: Tue Nov 30 2021 13:10:52 GMT+0000 (UTC)

arXiv
