Detection and Captioning with Unseen Object Classes

Berkan Demirel; Ramazan Gokberk Cinbis

Image caption generation is one of the most challenging problems at the intersection of visual recognition and natural language modeling. In this work, we propose and study a practically important variant of this problem in which test images may contain visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model. To improve the detection component, we jointly define a class representation based on class-to-class similarity and a practical score calibration mechanism. We also propose a novel evaluation metric that provides complementary insights into the captioning outputs by separately handling the visual and non-visual components of the captions. Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset and that the zero-shot captioning approach yields promising results.
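
The abstract does not spell out the calibration rule or the sentence templates, so the following is only a rough sketch of what a detection-driven captioning pipeline could look like, not the authors' implementation. It assumes a common heuristic from generalized zero-shot recognition (subtracting a constant from seen-class scores so unseen classes are not crowded out) and a single hypothetical template. The names `Detection`, `calibrate`, `fill_template`, `seen_penalty`, and `threshold` are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # class name predicted by the detector
    score: float  # detector confidence
    seen: bool    # True if the class had training examples


def calibrate(detections, seen_penalty=0.2):
    """Assumed calibration: lower seen-class scores by a constant.

    This is a generic heuristic, not the mechanism from the paper.
    """
    return [
        Detection(d.label, d.score - seen_penalty if d.seen else d.score, d.seen)
        for d in detections
    ]


def with_article(noun):
    """Prefix a noun with 'a' or 'an' (simple vowel check)."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun


def fill_template(detections, threshold=0.5, template="A photo of {objects}."):
    """Fill a hypothetical caption template with confident detections."""
    kept = sorted(
        (d for d in detections if d.score >= threshold),
        key=lambda d: d.score,
        reverse=True,
    )
    names = []
    for d in kept:
        if d.label not in names:  # mention each class once
            names.append(d.label)
    if not names:
        return "A photo."
    phrases = [with_article(n) for n in names]
    if len(phrases) == 1:
        objects = phrases[0]
    else:
        objects = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return template.format(objects=objects)


detections = [
    Detection("person", 0.9, seen=True),
    Detection("zebra", 0.55, seen=False),  # class without training examples
    Detection("dog", 0.6, seen=True),
]
caption = fill_template(calibrate(detections))
print(caption)
```

In this toy input, calibration pushes the weaker seen-class detection ("dog") below the threshold while the unseen-class detection ("zebra") survives, which illustrates why score calibration matters when seen and unseen classes compete in the same detector.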

updated: Fri Aug 13 2021 10:43:20 GMT+0000 (UTC)

published: Fri Aug 13 2021 10:43:20 GMT+0000 (UTC)

arXiv
