STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model

Houssem Boulahbal; Adrian Voicila; Andrew Comport

This paper proposes a self-supervised model that simultaneously predicts a sequence of future frames from video input using a novel spatio-temporal attention (ST) network. The ST transformer network constrains temporal consistency across future frames while also constraining consistency across spatial objects in the image at different scales. This was not the case in prior work on depth prediction, which focused on predicting a single frame as output. Like single-image depth inference methods, the proposed model leverages prior scene knowledge such as object shape and texture, whilst also constraining the motion and geometry from a sequence of input images. Apart from the transformer architecture, one of the main contributions with respect to prior work lies in the objective function, which enforces spatio-temporal consistency across a sequence of output frames rather than a single output frame. As will be shown, this results in more accurate and robust depth sequence forecasting. The model achieves depth forecasting results that outperform existing baselines on the KITTI benchmark. Extensive ablation studies were performed to assess the effectiveness of the proposed techniques. One notable result is that the proposed model is implicitly capable of forecasting the motion of objects in the scene, without requiring complex models involving multi-object detection, segmentation and tracking.

updated: Thu Mar 02 2023 12:22:51 GMT+0000 (UTC)

published: Thu Mar 02 2023 12:22:51 GMT+0000 (UTC)

arXiv
