Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details

Achal Dave; Piotr Dollár; Deva Ramanan; Alexander Kirillov; Ross Girshick

By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On one hand, this is desirable as it treats all classes equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, under important conditions (i.e., large vocabulary, high instance counts) the default implementation of AP is neither category-independent, nor does it directly reward properly calibrated detectors. In fact, we show that on LVIS the default implementation produces a gameable metric, where a simple, unintuitive re-ranking policy can improve AP by a large margin. To address these limitations, we introduce two complementary metrics. First, we present a simple fix to the default AP implementation, ensuring that it is independent across categories as originally intended. We benchmark recent LVIS detection advances and find that many reported gains do not translate to improvements under our new evaluation, suggesting recent improvements may arise from difficult-to-interpret changes to cross-category rankings. Given the importance of reliably benchmarking cross-category rankings, we consider a pooled version of AP (AP-Pool) that rewards properly calibrated detectors by directly comparing cross-category rankings. Finally, we revisit classical approaches for calibration and find that explicitly calibrating detectors improves the state of the art on AP-Pool by 1.7 points.
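The contrast between per-category AP and a pooled AP can be illustrated with a toy sketch. This is not the paper's evaluation code: it assumes each detection has already been matched to ground truth (the `is_tp` flag), it uses a simplified non-interpolated AP rather than the interpolated variant used by COCO/LVIS tooling, and the function names and data are hypothetical. It only shows that a detector can rank perfectly within every category yet score lower once all categories are ranked together, which is the cross-category calibration effect the abstract describes.

```python
from collections import defaultdict


def average_precision(scored_hits, num_gt):
    """Simplified, non-interpolated AP: mean precision at each true positive.

    scored_hits: list of (score, is_tp) pairs; num_gt: ground-truth count.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def per_category_ap(detections, gt_counts):
    """Rank detections within each category, then average the per-category APs."""
    by_category = defaultdict(list)
    for category, score, is_tp in detections:
        by_category[category].append((score, is_tp))
    aps = [average_precision(by_category[c], n) for c, n in gt_counts.items()]
    return sum(aps) / len(aps)


def pooled_ap(detections, gt_counts):
    """Rank all detections together, ignoring category boundaries."""
    pooled = [(score, is_tp) for _, score, is_tp in detections]
    return average_precision(pooled, sum(gt_counts.values()))


# Hypothetical data: (category, confidence, is_true_positive).
# "dog" scores are systematically lower than "cat" scores (miscalibration).
detections = [
    ("cat", 0.9, True),
    ("cat", 0.8, False),
    ("dog", 0.3, True),
    ("dog", 0.2, False),
]
gt_counts = {"cat": 1, "dog": 1}

ap_per_category = per_category_ap(detections, gt_counts)
ap_pooled = pooled_ap(detections, gt_counts)
print(ap_per_category, ap_pooled)
```

In this toy data every category ranks its true positive first, so the per-category mean is perfect, while the pooled ranking places a false "cat" detection above the true "dog" detection and is penalized for it.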

updated: Tue Mar 15 2022 06:04:05 GMT+0000 (UTC)

published: Mon Feb 01 2021 18:56:02 GMT+0000 (UTC)

arXiv

