Why is the video analytics accuracy fluctuating, and what can we do about it?

Sibendu Paul; Kunal Rao; Giuseppe Coviello; Murugan Sankaradas; Oliver Po; Y. Charlie Hu; Srimat Chakradhar

It is common practice to treat a video as a sequence of images (frames) and to reuse deep neural network models trained only on images for similar analytics tasks on videos. In this paper, we show that this leap of faith, that deep learning models that work well on images will also work well on videos, is flawed. We show that even when a video camera views a scene that is not changing in any human-perceptible way, and we control for external factors such as video compression and the environment (lighting), the accuracy of video analytics applications fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually, yet the video analytics applications perceive them quite differently. We observed that the root cause of these fluctuations is the dynamic camera parameter changes that a video camera makes automatically in order to capture and produce a visually pleasing video. The camera thus acts as an unintentional adversary: as we show, these slight changes in image pixel values across consecutive frames have a noticeably adverse impact on the accuracy of insights from video analytics tasks that reuse image-trained deep learning models. To address this inadvertent adversarial effect, we explore the use of transfer learning techniques to improve learning in video analytics tasks through the transfer of knowledge from learning on image analytics tasks. In particular, we show that our newly trained YOLOv5 model reduces fluctuation in object detection across frames, which leads to better object tracking (40% fewer tracking mistakes). Our paper also provides new directions and techniques to mitigate the camera's adversarial effect on deep learning models used for video analytics applications.
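
As an illustration of what "fluctuation in object detection across frames" can mean in practice, the following minimal sketch summarizes how much a detector's per-frame object count varies on a scene that is assumed to be static. The function name `detection_fluctuation`, the chosen statistics, and the sample counts are hypothetical and are not taken from the paper; the paper's own evaluation metric may differ.

```python
from statistics import mean, pstdev


def detection_fluctuation(counts):
    """Summarize per-frame detection counts for a scene assumed to be static.

    `counts` is a list of integers: the number of objects a detector
    reported in each consecutive frame. This is an illustrative metric,
    not the one used in the paper.
    """
    if not counts:
        raise ValueError("counts must be non-empty")
    # Count consecutive frame pairs whose detection count differs.
    changed = sum(1 for a, b in zip(counts, counts[1:]) if a != b)
    pairs = len(counts) - 1
    return {
        "mean": mean(counts),
        "std": pstdev(counts),
        "changed_pairs": changed,
        "change_rate": changed / pairs if pairs > 0 else 0.0,
    }


# Hypothetical counts from eight consecutive frames of an unchanging scene.
counts = [5, 5, 4, 5, 3, 5, 5, 4]
summary = detection_fluctuation(counts)
print(summary)
```

On a truly static scene, an ideal detector would yield a standard deviation and change rate of zero, so nonzero values give a rough indication of the frame-to-frame instability the abstract describes.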

updated: Thu Sep 15 2022 20:46:09 GMT+0000 (UTC)

published: Tue Aug 23 2022 23:16:24 GMT+0000 (UTC)

arXiv
