Efficient Decision-based Black-box Patch Attacks on Video Recognition

Kaixun Jiang; Zhaoyu Chen; Tony Huang; Jiafeng Wang; Dingkang Yang; Bo Li; Yan Wang; Wenqiang Zhang

Although Deep Neural Networks (DNNs) have demonstrated excellent performance, they are vulnerable to adversarial patches, which introduce perceptible, localized perturbations to the input. Generating adversarial patches on images has received much attention, whereas adversarial patches on videos have not been well investigated. Furthermore, decision-based attacks, in which attackers access only the predicted hard labels by querying threat models, have not been well explored on video models either, even though they are practical in real-world video recognition scenarios. The absence of such studies leaves a large gap in the robustness assessment of video models. To bridge this gap, this work first explores decision-based patch attacks on video models. Our analysis shows that the huge parameter space brought by videos and the minimal information returned by decision-based models both greatly increase the attack difficulty and query burden. To achieve a query-efficient attack, we propose a spatial-temporal differential evolution (STDE) framework. First, STDE introduces target videos as patch textures and adds patches only on keyframes that are adaptively selected by temporal difference. Second, STDE takes minimizing the patch area as the optimization objective and adopts spatial-temporal mutation and crossover to search for the global optimum without falling into local optima. Experiments show that STDE achieves state-of-the-art performance in terms of threat, efficiency, and imperceptibility. Hence, STDE has the potential to be a powerful tool for evaluating the robustness of video recognition models.
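The keyframe idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the temporal difference is computed or how many frames are kept, so everything here is an assumption made for illustration: the score is taken to be the mean absolute pixel change from the previous frame, the top `k` frames are kept, and the names `temporal_differences` and `select_keyframes` are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence

# One grayscale frame as rows of pixel values (an illustrative simplification).
Frame = Sequence[Sequence[float]]


def temporal_differences(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel change between each frame and its predecessor.

    Assumption: this is one plausible reading of "temporal difference";
    the paper may define it differently. The first frame scores 0.0.
    """
    scores = [0.0]
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(b - a)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def select_keyframes(frames: Sequence[Frame], k: int) -> List[int]:
    """Return the indices of the k frames with the largest score, in time order."""
    scores = temporal_differences(frames)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In a pipeline along these lines, patches would be placed only on the returned frame indices, which shrinks the search space that the evolutionary search has to cover.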

updated: Tue Mar 21 2023 15:08:35 GMT+0000 (UTC)

published: Tue Mar 21 2023 15:08:35 GMT+0000 (UTC)

arXiv
