Unsupervised Action Localization Crop in Video Retargeting for 3D ConvNets

Prithwish Jana; Swarnabja Bhaumik; Partha Pratim Mohanta

Untrimmed videos on social media, or those captured by robots and surveillance cameras, come in varied aspect ratios. However, 3D CNNs usually require a square-shaped video as input, with spatial dimensions smaller than the original. Random or center cropping may leave out the video's subject altogether. To address this, we propose an unsupervised video cropping approach, framing the task as a retargeting and video-to-video synthesis problem. The synthesized video maintains a 1:1 aspect ratio, is smaller in size, and stays targeted at the video subject(s) throughout its entire duration. First, action localization is performed on each frame by identifying patches with homogeneous motion patterns, so that a single salient patch is pinpointed per frame. However, to avoid viewpoint jitter and flickering, any inter-frame changes in the scale or position of the patches should be made gradually over time. We address this with a polyBezier fitting in 3D space that passes through some chosen pivot timestamps and whose shape is influenced by the in-between control timestamps. To corroborate the effectiveness of the proposed method, we evaluate it on the video classification task by comparing our dynamic cropping technique with random cropping on three benchmark datasets, viz. UCF-101, HMDB-51 and ActivityNet v1.3. The clip and top-1 accuracies for video classification after our cropping outperform those of 3D CNNs with same-sized random-crop inputs, and also surpass those for some larger random-crop sizes.
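
The smoothing step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: each per-frame patch is represented as a point (cx, cy, side), pivots are taken at a fixed frame interval (`pivot_step`, a hypothetical parameter), each segment is a cubic Bezier curve, and the two control points are the raw patches roughly one third and two thirds of the way between consecutive pivots. The function names `cubic_bezier` and `smooth_patches` are likewise hypothetical.

```python
import numpy as np


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameters t in [0, 1].

    p0..p3 are points of shape (d,); t has shape (n,). Returns shape (n, d).
    """
    t = np.asarray(t, dtype=float)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)


def smooth_patches(patches, pivot_step=12):
    """Smooth per-frame patches (cx, cy, side) with chained cubic Bezier segments.

    Assumption: pivots are every `pivot_step` frames plus the last frame.
    The curve passes exactly through the pivot patches; the in-between
    control patches only pull the curve towards themselves.
    """
    patches = np.asarray(patches, dtype=float)
    n = len(patches)
    pivots = list(range(0, n, pivot_step))
    if pivots[-1] != n - 1:
        pivots.append(n - 1)

    smoothed = patches.copy()
    for a, b in zip(pivots[:-1], pivots[1:]):
        span = b - a
        # Control timestamps: raw patches about 1/3 and 2/3 along the segment.
        c1 = patches[a + span // 3]
        c2 = patches[a + (2 * span) // 3]
        t = np.arange(span + 1) / span
        smoothed[a:b + 1] = cubic_bezier(patches[a], c1, c2, patches[b], t)
    return smoothed


if __name__ == "__main__":
    # Synthetic jittery detections: the horizontal centre flips by +/-5 px per frame.
    frames = np.arange(25)
    raw = np.stack([100 + 5 * (-1.0) ** frames,
                    np.full(25, 60.0),
                    np.full(25, 112.0)], axis=1)
    print(smooth_patches(raw, pivot_step=12)[:5])
```

Because consecutive segments share their pivot patch, the resulting crop trajectory is continuous. This simple choice of control points does not by itself guarantee a smooth change of direction at the pivots, which a fuller treatment would need to handle.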

updated: Mon Nov 22 2021 09:17:07 GMT+0000 (UTC)

published: Sun Nov 14 2021 19:27:13 GMT+0000 (UTC)

arXiv
