ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

Nicholas Meegan; Hansi Liu; Bryan Cao; Abrar Alali; Kristin Dana; Marco Gruteser; Shubham Jain; Ashwin Ashok

We introduce ViFiCon, a self-supervised contrastive learning scheme that uses synchronized information across vision and wireless modalities to perform cross-modal association. Specifically, the system uses pedestrian data collected from RGB-D camera footage together with WiFi Fine Time Measurements (FTM) from a user's smartphone device. We represent the temporal sequence by stacking multi-person depth data spatially within a banded image. Depth data from RGB-D (vision domain) is inherently linked to an observable pedestrian, whereas FTM data (wireless domain) is associated only with a smartphone on the network. To formulate the cross-modal association problem as self-supervised, the network learns a scene-wide synchronization of the two modalities as a pretext task, and then uses the learned representation for the downstream task of associating individual bounding boxes with specific smartphones, i.e., associating vision and wireless information. We apply a pre-trained region proposal model to the camera footage and feed the extrapolated bounding box information, along with the FTM data, into a dual-branch convolutional neural network. We show that, compared to fully supervised SoTA models, ViFiCon achieves high-performance vision-to-wireless association, finding which bounding box corresponds to which smartphone device, without hand-labeled association examples in the training data.
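
The two ideas in the abstract that are easiest to picture in code are the banded image and the contrastive pretext objective. The following is a minimal sketch, not the authors' implementation. The function names (`make_banded_image`, `info_nce`), the `band_height` parameter, and the choice of an InfoNCE-style loss over cosine similarity are all assumptions made for illustration; the abstract does not specify the band layout or the exact loss.

```python
import math


def make_banded_image(depth_tracks, band_height=2, fill=0.0):
    """Stack per-person depth time series into one 2D banded image.

    depth_tracks: list of equal-length lists, one per tracked person,
    holding a depth value per time step (None where the person is missing).
    Each person fills `band_height` consecutive rows; columns are time steps.
    The layout is a hypothetical one chosen for this sketch.
    """
    if not depth_tracks:
        return []
    width = len(depth_tracks[0])
    image = []
    for track in depth_tracks:
        if len(track) != width:
            raise ValueError("all tracks must cover the same time window")
        row = [fill if d is None else float(d) for d in track]
        for _ in range(band_height):
            image.append(list(row))
    return image


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(vision_emb, wireless_emb, temperature=0.1):
    """Mean InfoNCE-style loss (an assumed stand-in for the paper's loss).

    vision_emb[i] and wireless_emb[i] are embeddings of the same time window
    (the synchronized, positive pair); every other pairing is a negative.
    """
    n = len(vision_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(vision_emb[i], wireless_emb[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Two people over three time steps; the second person is unseen at step 1.
banded = make_banded_image([[2.1, 2.0, 1.9], [4.0, None, 4.2]])

# Synchronized embedding pairs should score a lower loss than shuffled ones.
vision = [[1.0, 0.0], [0.0, 1.0]]
loss_synced = info_nce(vision, [[1.0, 0.0], [0.0, 1.0]])
loss_shuffled = info_nce(vision, [[0.0, 1.0], [1.0, 0.0]])
```

In a full system, the embeddings would come from the two branches of the convolutional network (one fed the vision-side banded image, one fed the FTM data) rather than being written by hand as they are here.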

updated: Tue Oct 11 2022 15:04:05 GMT+0000 (UTC)

published: Tue Oct 11 2022 15:04:05 GMT+0000 (UTC)

arXiv
