Issues in Object Detection in Videos using Common Single-Image CNNs

Spencer Ploeger; Lucas Dasovic

Object detection is a growing branch of computer vision, used in applications such as industrial processes, medical image analysis, and autonomous vehicles. Object detection systems are typically trained on large datasets of still images, yet for applications such as autonomous vehicles it is crucial that the system identify objects consistently across multiple video frames. Applying single-image systems to video raises many problems: shadows or changes in brightness can cause the system to identify an object incorrectly from one frame to the next, triggering an unintended system response. Many neural networks have been used for object detection, and if there were a way of connecting objects between frames, these problems could be eliminated. For these networks to get better at identifying objects in video, they need to be retrained. This requires a dataset of images that represent consecutive video frames and have matching ground-truth layers. A method is proposed that can generate these datasets. The ground-truth layer contains only moving objects. To generate this layer, FlowNet2-Pytorch was used to create a flow mask using the novel Magnitude Method. In addition, a segmentation mask is generated using networks such as Mask R-CNN or RefineNet; this mask contains all objects detected in a frame. Comparing the segmentation mask to the flow-mask ground-truth layer yields a loss function, which can be used to train a neural network to make more consistent predictions on video. The system was tested on multiple video samples and a loss was generated for each frame, demonstrating that the Magnitude Method can be used to train object detection neural networks in future work.
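
The abstract does not spell out how the Magnitude Method or the loss is computed, so the following is only a minimal sketch of the general idea under stated assumptions: the flow mask is taken to be a threshold on the per-pixel optical-flow magnitude, and the loss is taken to be one minus the intersection-over-union of the two masks. The function names, the threshold value, and the choice of loss are all hypothetical and are not taken from the paper; in practice the flow field would come from a network such as FlowNet2-Pytorch and the segmentation mask from Mask R-CNN or RefineNet, whereas here both are small synthetic arrays.

```python
import numpy as np


def flow_magnitude_mask(flow, threshold=1.0):
    """Return a boolean mask of pixels whose flow magnitude exceeds threshold.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    threshold: hypothetical cut-off in pixels; not a value from the paper.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # sqrt(dx^2 + dy^2) per pixel
    return magnitude > threshold


def mask_disagreement_loss(seg_mask, flow_mask):
    """Assumed loss: 1 - IoU between segmentation mask and flow mask."""
    seg = np.asarray(seg_mask, dtype=bool)
    flow = np.asarray(flow_mask, dtype=bool)
    union = np.logical_or(seg, flow).sum()
    if union == 0:
        return 0.0  # both masks empty: nothing to disagree about
    intersection = np.logical_and(seg, flow).sum()
    return 1.0 - intersection / union


# Synthetic 4x4 frame pair: a 2x2 block moves by (3, 4) pixels, magnitude 5.
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3] = (3.0, 4.0)
flow_mask = flow_magnitude_mask(flow, threshold=1.0)

# Synthetic segmentation mask: the moving block plus one static pixel.
seg_mask = np.zeros((4, 4), dtype=bool)
seg_mask[1:3, 1:3] = True
seg_mask[0, 0] = True

loss = mask_disagreement_loss(seg_mask, flow_mask)
print(f"moving pixels: {flow_mask.sum()}, loss: {loss:.2f}")
```

In this toy case the segmentation mask covers one static pixel that the flow mask omits, so the loss is nonzero. A per-frame loss of this kind is what the abstract describes generating for each video sample, although the actual formulation used by the authors may differ.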

updated: Wed May 26 2021 20:33:51 GMT+0000 (UTC)

published: Wed May 26 2021 20:33:51 GMT+0000 (UTC)

arXiv
