Sequential Voting with Relational Box Fields for Active Object Detection

Qichen Fu; Xingyu Liu; Kris M. Kitani

A key component of understanding hand-object interactions is the ability to identify the active object (the object that is being manipulated by the human hand). To accurately localize the active object, any method must reason using information encoded by each image pixel, such as whether it belongs to the hand, the object, or the background. To leverage each pixel as evidence for determining the bounding box of the active object, we propose a pixel-wise voting function. Our pixel-wise voting function takes an initial bounding box as input and produces an improved bounding box of the active object as output. The voting function is designed so that each pixel inside the input bounding box votes for an improved bounding box, and the box with the majority vote is selected as the output. We call the collection of bounding boxes generated inside the voting function the Relational Box Field, as it characterizes a field of bounding boxes defined in relation to the current bounding box. While our voting function is able to improve the bounding box of the active object, one round of voting is typically not enough to accurately localize the active object. Therefore, we repeatedly apply the voting function to sequentially improve the location of the bounding box. However, since repeatedly applying a one-step predictor (i.e., auto-regressive processing with our voting function) is known to cause a data distribution shift, we mitigate this issue using reinforcement learning (RL). We adopt standard RL to learn the voting function parameters and show that it provides a meaningful improvement over a standard supervised learning approach. We perform experiments on two large-scale datasets, 100DOH and MECCANO, improving AP50 performance by 8% and 30%, respectively, over the state of the art.
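The voting and sequential refinement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the learned network is replaced by a hypothetical stand-in (`field_fn`, here the toy `toy_field_fn`), boxes are integer tuples, and the "majority vote" is approximated by picking the most frequently voted box. All names below are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1), with x1 and y1 exclusive
BoxField = List[List[Box]]           # field[y][x] is the box that pixel (x, y) votes for
FieldFn = Callable[[Box], BoxField]  # hypothetical stand-in for the learned voting model


def vote_once(field: BoxField, box: Box) -> Box:
    """One round of voting: every pixel inside `box` casts a vote for a box."""
    x0, y0, x1, y1 = box
    votes = Counter(field[y][x] for y in range(y0, y1) for x in range(x0, x1))
    if not votes:  # empty input box: nothing to vote with, keep it unchanged
        return box
    # Assumption: the most frequent box wins (ties are broken arbitrarily).
    return votes.most_common(1)[0][0]


def sequential_vote(field_fn: FieldFn, box: Box, max_steps: int = 10) -> Box:
    """Apply the voting step repeatedly, recomputing the field for each new box."""
    for _ in range(max_steps):
        new_box = vote_once(field_fn(box), box)
        if new_box == box:  # the vote no longer changes the box
            break
        box = new_box
    return box


def toy_field_fn(box: Box, target: Box = (4, 4, 8, 8), size: int = 12) -> BoxField:
    """Toy field: every pixel votes for `box` nudged one pixel toward `target`."""
    step = tuple(b + (t > b) - (t < b) for b, t in zip(box, target))
    return [[step] * size for _ in range(size)]


refined = sequential_vote(toy_field_fn, (0, 0, 6, 6))
```

In this toy setup a single round moves the box by only one pixel per coordinate, so several rounds are needed to reach the target, which mirrors the abstract's point that one round of voting is typically not enough. The sketch does not cover how the voting function is trained (the RL component).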

updated: Sun Nov 21 2021 19:54:13 GMT+0000 (UTC)

published: Thu Oct 21 2021 23:40:45 GMT+0000 (UTC)

arXiv
