Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection

Zhe Chen; Jing Zhang; Yufei Xu; Dacheng Tao

Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF), which aims to mitigate the gap between features from different levels and form a comprehensive object representation to achieve better detection performance. However, existing FPs usually require heavy cross-level connections or iterative refinement to obtain better MFF results, making them complicated in structure and inefficient in computation. To address these issues, we propose a novel and efficient context modeling mechanism that helps existing FPs deliver better MFF results while effectively reducing computational costs. In particular, we introduce the insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency: a locally concentrated representation and a globally summarized representation. The former focuses on extracting context cues from nearby areas, while the latter extracts key representations of the whole image scene as global context cues. After collecting the condensed contexts, we employ a Transformer decoder to investigate the relations between them and each local feature from the FP, and then refine the MFF results accordingly. As a result, we obtain a simple and lightweight Transformer-based Context Condensation (TCC) module, which can boost various FPs and lower their computational costs simultaneously. Extensive experimental results on the challenging MS COCO dataset show that TCC is compatible with four representative FPs, consistently improves their detection accuracy by up to 7.8% in terms of average precision, and reduces their complexity by up to around 20% in terms of GFLOPs, helping them achieve state-of-the-art performance more efficiently. Code will be released.
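
The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea, not the authors' TCC module. It assumes a single feature map stored as nested lists, a 3x3 window as the "locally concentrated" context, average pooling into a 2x2 grid as the "globally summarized" context, and one parameter-free dot-product attention step standing in for a Transformer decoder layer. All function names (`local_context`, `global_context`, `attend`, `refine`) and parameters (`radius`, `grid`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def local_context(fmap, r, c, radius=1):
    """Assumed local context: features in a small window around (r, c)."""
    h, w = len(fmap), len(fmap[0])
    return [fmap[i][j]
            for i in range(max(0, r - radius), min(h, r + radius + 1))
            for j in range(max(0, c - radius), min(w, c + radius + 1))]


def global_context(fmap, grid=2):
    """Assumed global context: average-pool the map into grid x grid tokens.

    Requires the map height and width to be at least `grid`.
    """
    h, w = len(fmap), len(fmap[0])
    tokens = []
    for gi in range(grid):
        for gj in range(grid):
            rows = range(gi * h // grid, (gi + 1) * h // grid)
            cols = range(gj * w // grid, (gj + 1) * w // grid)
            tokens.append(mean_vec([fmap[i][j] for i in rows for j in cols]))
    return tokens


def attend(query, context):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in context]
    weights = softmax(scores)
    return [sum(wt * v[t] for wt, v in zip(weights, context)) for t in range(d)]


def refine(fmap, radius=1, grid=2):
    """Refine every position using only the condensed (local + global) context."""
    summary = global_context(fmap, grid)  # computed once, shared by all queries
    out = []
    for r in range(len(fmap)):
        row = []
        for c in range(len(fmap[0])):
            context = local_context(fmap, r, c, radius) + summary
            update = attend(fmap[r][c], context)
            # Residual connection: add the attended context to the feature.
            row.append([x + u for x, u in zip(fmap[r][c], update)])
        out.append(row)
    return out
```

Under these assumptions, each position of an 8x8 map attends to at most 13 context tokens (9 local plus 4 global) instead of all 64 positions, which is the kind of saving that condensing the context is meant to provide. The actual module presumably uses learned projections and multi-level features, which this sketch omits.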

updated: Thu Jul 14 2022 01:45:03 GMT+0000 (UTC)

published: Thu Jul 14 2022 01:45:03 GMT+0000 (UTC)

arXiv
