Three ways to improve feature alignment for open vocabulary detection

Relja Arandjelović; Alex Andonian; Arthur Mensch; Olivier J. Hénaff; Jean-Baptiste Alayrac; Andrew Zisserman



The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggle to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves a new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
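To illustrate why a trainable gated shortcut can guarantee alignment at the start of detection training, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `GatedShortcut`, the tanh gate, the zero initialisation, and the single linear-plus-ReLU branch are all assumptions chosen for illustration. The idea it shows is that a residual branch scaled by a gate that starts at zero leaves the pretrained features untouched until training opens the gate.

```python
import numpy as np


class GatedShortcut:
    """Hypothetical gated residual block: y = x + tanh(gate) * branch(x).

    The gate starts at 0 (an assumption), so the block is initially the
    identity and the incoming pretrained features pass through unchanged.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Newly added branch, trained from scratch.
        self.weight = rng.normal(scale=0.02, size=(dim, dim))
        # Scalar gate; in a real model this would be a learned parameter.
        self.gate = 0.0

    def __call__(self, x):
        branch = np.maximum(x @ self.weight, 0.0)  # linear layer + ReLU
        return x + np.tanh(self.gate) * branch


# Stand-in for pretrained, text-aligned visual features (4 tokens, 8 dims).
features = np.random.default_rng(1).normal(size=(4, 8))

layer = GatedShortcut(dim=8)
out_at_init = layer(features)   # identical to the input while gate == 0

layer.gate = 0.5                # simulate the gate opening during training
out_after = layer(features)     # now deviates from the input
```

With the gate at zero the output equals the input exactly, so whatever vision-text alignment the input features carry is preserved at initialisation; the new branch only contributes as the gate moves away from zero.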

updated: Thu Mar 23 2023 17:59:53 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:59:53 GMT+0000 (UTC)

arXiv

