Relation Distillation Networks for Video Object Detection

It is well recognized that modeling object-to-object relations helps object detection. Nevertheless, the problem is not trivial, especially when exploring the interactions between objects to boost video object detectors. The difficulty is that reliable object relations in a video should depend not only on the objects in the present frame but also on all the supportive objects extracted over a long-range span of the video. In this paper, we introduce a new design to capture the interactions across objects in spatio-temporal context. Specifically, we present Relation Distillation Networks (RDN), a new architecture that aggregates and propagates object relations to augment object features for detection. Technically, object proposals are first generated via Region Proposal Networks (RPN). RDN then, on one hand, models object relations via multi-stage reasoning, and on the other, progressively distills relations by refining supportive object proposals with high objectness scores in a cascaded manner. The learned relations prove effective both for improving object detection in each frame and for box linking across frames. Extensive experiments on the ImageNet VID dataset show superior results compared with state-of-the-art methods. More remarkably, our RDN achieves 81.8% and 83.2% mAP with ResNet-101 and ResNeXt-101, respectively. When further equipped with linking and rescoring, we obtain 83.8% and 84.7% mAP, the best reported to date.
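The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the relation module, so plain scaled dot-product attention stands in for it here, and the function names (`relate`, `rdn_sketch`), the `top_k` parameter, and the random features are all hypothetical, chosen only for illustration. Proposal features and objectness scores are assumed to come from an RPN that is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relate(queries, supports):
    # Assumption: scaled dot-product attention as a stand-in relation module.
    # Each query feature is augmented with a weighted sum of support features.
    d = queries.shape[-1]
    weights = softmax(queries @ supports.T / np.sqrt(d), axis=-1)
    return queries + weights @ supports

def rdn_sketch(ref_feats, sup_feats, sup_objectness, top_k):
    # Stage 1: relate proposals of the current frame to all supportive proposals.
    stage1 = relate(ref_feats, sup_feats)
    # Distillation: keep only the supportive proposals with the highest
    # objectness scores, and refine them against the full supportive pool.
    keep = np.argsort(-sup_objectness)[:top_k]
    refined = relate(sup_feats[keep], sup_feats)
    # Stage 2: relate the stage-1 output to the refined, distilled proposals.
    return relate(stage1, refined)

# Hypothetical inputs: 4 proposals in the current frame, 20 supportive
# proposals from other frames, 8-dimensional features.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8))
sup = rng.normal(size=(20, 8))
scores = rng.uniform(size=20)

out = rdn_sketch(ref, sup, scores, top_k=5)
print(out.shape)  # (4, 8)
```

The sketch keeps one feature vector per current-frame proposal throughout, so the output can feed a detection head of unchanged shape. Learned projections and any other components of the actual architecture are omitted.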

updated: Mon Aug 26 2019 07:45:43 GMT+0000 (UTC)

published: Mon Aug 26 2019 07:45:43 GMT+0000 (UTC)

arXiv
