Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection

Pei Lv; Suqi Hu; Tianran Hao; Haohan Ji; Lisha Cui; Haoyi Fan; Mingliang Xu; Changsheng Xu

Weakly supervised object detection (WSOD) has attracted increasing attention because it uses only image-level labels and therefore saves substantial annotation cost. Most WSOD methods use Multiple Instance Learning (MIL) as their basic framework, which treats detection as an instance classification problem. However, these MIL-based methods tend to converge only on the most discriminative regions of different instances rather than on their complete regions; that is, the detections lack integrity. Inspired by the way humans observe objects, we propose a new method that compares the initial proposals with extended ones in order to optimize the initial proposals. Specifically, we propose a new strategy for WSOD based on contrastive proposal extension (CPE), which consists of multiple directional contrastive proposal extensions (D-CPE); each D-CPE contains encoders based on an LSTM network and corresponding decoders. First, the boundary of each initial proposal in MIL is extended to different positions according to a well-designed sequential order. Then, CPE compares the extended proposals with the initial proposal by extracting their feature semantics with the encoders, and calculates the integrity of the initial proposal to optimize its score. These contrastive contextual semantics guide the basic WSOD to suppress bad proposals and raise the scores of good ones. In addition, a simple two-stream network is designed as the decoder to constrain the temporal coding of the LSTM and further improve WSOD performance. Experiments on the PASCAL VOC 2007, VOC 2012 and MS-COCO datasets show that our method achieves state-of-the-art results.
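The abstract does not give the exact extension rule, so the following is only a minimal sketch of what a directional, sequential boundary extension could look like. The function name `extend_box`, the fixed step `ratio`, the four direction names, and the clipping to the image bounds are all assumptions made for illustration and are not taken from the paper. The resulting ordered list of boxes is the kind of sequence an LSTM encoder could consume.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def extend_box(box: Box, direction: str, steps: int, ratio: float,
               img_w: float, img_h: float) -> List[Box]:
    """Return `steps` boxes, each pushing one boundary further outward.

    Assumption: step k moves the chosen boundary by k * ratio times the
    box width (left/right) or height (up/down), clipped to the image.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    sequence: List[Box] = []
    for k in range(1, steps + 1):
        dx, dy = k * ratio * w, k * ratio * h
        if direction == "left":
            new_box = (max(0.0, x1 - dx), y1, x2, y2)
        elif direction == "right":
            new_box = (x1, y1, min(float(img_w), x2 + dx), y2)
        elif direction == "up":
            new_box = (x1, max(0.0, y1 - dy), x2, y2)
        elif direction == "down":
            new_box = (x1, y1, x2, min(float(img_h), y2 + dy))
        else:
            raise ValueError(f"unknown direction: {direction}")
        sequence.append(new_box)
    return sequence


# Usage: extend a proposal to the right in three steps.
right_sequence = extend_box((10, 10, 30, 30), "right", 3, 0.5, 100, 100)
```

In this sketch each direction yields its own ordered sequence, which mirrors the idea of one D-CPE per direction; how the paper actually orders and scores the extended proposals is not specified in the abstract.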

updated: Sat Oct 16 2021 12:17:18 GMT+0000 (UTC)

published: Thu Oct 14 2021 16:31:57 GMT+0000 (UTC)

arXiv
