Data Augmentation for Human Behavior Analysis in Multi-Person Conversations

Kun Li; Dan Guo; Guoliang Chen; Feiyang Liu; Meng Wang

In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select Swin Transformer as the baseline and exploit data augmentation strategies to address all three tasks. Specifically, we crop the raw video to remove noise from other parts of the frame, and we apply data augmentation to improve the model's generalization. As a result, our solution achieves the best result of 0.6262 in mean average precision for bodily behavior recognition and an accuracy of 0.7771 for eye contact detection on the corresponding test sets. In addition, our approach achieves a comparable result of 0.5281 for next speaker prediction in terms of unweighted average recall.
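The crop-then-augment preprocessing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the crop regions or the augmentations used, so everything here is an assumption made for illustration: the function names `crop_clip` and `augment_clip`, the `(top, left, height, width)` box format, and the choice of horizontal flip plus brightness shift as augmentations. Clips are assumed to be NumPy arrays of shape (frames, height, width, channels) with values in [0, 1].

```python
import numpy as np


def crop_clip(clip, box):
    """Crop every frame of a (T, H, W, C) clip to one region.

    box is a hypothetical (top, left, height, width) tuple, e.g. a region
    around one participant, so that the rest of the frame is discarded.
    """
    top, left, height, width = box
    return clip[:, top:top + height, left:left + width, :]


def augment_clip(clip, rng, flip_prob=0.5, max_brightness_shift=0.1):
    """Apply one random horizontal flip and one brightness shift per clip.

    The same transform is used for all frames so the clip stays
    temporally consistent. These two augmentations are illustrative
    choices, not the ones confirmed by the paper.
    """
    out = clip.astype(np.float32)
    if rng.random() < flip_prob:
        out = out[:, :, ::-1, :]  # reverse the width axis
    shift = rng.uniform(-max_brightness_shift, max_brightness_shift)
    return np.clip(out + shift, 0.0, 1.0)


# Usage on a synthetic clip: 8 frames of 120x160 RGB.
rng = np.random.default_rng(0)
clip = rng.random((8, 120, 160, 3), dtype=np.float32)
person = crop_clip(clip, box=(10, 40, 96, 96))
augmented = augment_clip(person, rng)
```

In a training pipeline, a step like `augment_clip` would typically run only on training clips, while the crop would be applied to training and test clips alike so the model sees the same region at both stages.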

updated: Thu Aug 03 2023 04:04:40 GMT+0000 (UTC)

published: Thu Aug 03 2023 04:04:40 GMT+0000 (UTC)

arXiv
