IC^3: Image Captioning by Committee Consensus

David M. Chan; Austin Myers; Sudheendra Vijayanarasimhan; David A. Ross; John Canny

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details while ignoring potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" (IC^3), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by IC^3 as at least as helpful as those of baseline SOTA models more than two-thirds of the time, and IC^3 captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee

updated: Thu Feb 02 2023 18:58:05 GMT+0000 (UTC)

published: Thu Feb 02 2023 18:58:05 GMT+0000 (UTC)

arXiv
