Revisiting Deep Architectures for Head Motion Prediction in 360° Videos

Miguel Fabian Romero Rondon; Lucile Sassatelli; Ramon Aparicio Pardo; Frederic Precioso

We consider predicting the user's head motion in 360-degree videos from two modalities only: the user's past positions and the video content (without knowledge of other users' traces). We make two main contributions. First, we re-examine existing deep-learning approaches to this problem and identify hidden flaws through a thorough root-cause analysis. Second, from the results of this analysis, we design a new proposal that establishes state-of-the-art performance. Re-assessing the existing methods that use both modalities, we obtain the surprising result that they all perform worse than baselines that use the user's trajectory only. A root-cause analysis of the metrics, datasets and neural architectures shows in particular that (i) the content can inform the prediction for horizons longer than 2 to 3 seconds (existing methods consider shorter horizons), and that (ii) to compete with the baselines, it is necessary, but not sufficient, to have a recurrent unit dedicated to processing the positions. Then, from a re-examination of the problem supported by the concept of Structural-RNN, we design a new deep neural architecture, named TRACK. TRACK achieves state-of-the-art performance on all considered datasets and prediction horizons, outperforming competitors by up to 20 percent on focus-type videos and horizons of 2 to 5 seconds. The entire framework (code and datasets) is online and received an ACM reproducibility badge.
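To make the notion of a trajectory-only baseline concrete, here is a minimal sketch. The abstract names neither the baselines nor the error metric, so everything below is an illustrative assumption: the baseline simply repeats the last observed head position, the error is the great-circle angle between predicted and actual positions on the unit sphere, and all function names are hypothetical.

```python
import math

def to_unit_vector(yaw, pitch):
    """Convert yaw/pitch angles (radians) to a 3D unit vector."""
    return (math.cos(pitch) * math.cos(yaw),
            math.cos(pitch) * math.sin(yaw),
            math.sin(pitch))

def great_circle_distance(p, q):
    """Angle in radians between two unit vectors on the sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot)))

def static_baseline(past_positions, horizon_steps):
    """Assumed baseline: the head stays at the last observed position."""
    return [past_positions[-1]] * horizon_steps

# Toy trajectory: the head pans at a constant yaw rate along the equator.
past = [to_unit_vector(0.0, 0.0), to_unit_vector(0.1, 0.0)]
future = [to_unit_vector(0.2, 0.0), to_unit_vector(0.3, 0.0)]

predicted = static_baseline(past, len(future))
errors = [great_circle_distance(p, f) for p, f in zip(predicted, future)]
print(errors)  # one angular error (radians) per prediction step
```

In this toy case the error grows with the prediction step because the head keeps moving while the prediction stays fixed. Any method that also uses the video content would need to beat a baseline of this kind to justify the extra modality.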

updated: Wed Apr 14 2021 16:13:35 GMT+0000 (UTC)

published: Tue Nov 26 2019 17:13:00 GMT+0000 (UTC)

arXiv
