Efficient Gesture Recognition for the Assistance of Visually Impaired People using Multi-Head Neural Networks

Samer Alashhab; Antonio Javier Gallego; Miguel Ángel Lozano

This paper proposes an interactive system for mobile devices, controlled by hand gestures, that is aimed at helping people with visual impairments. The user interacts with the device by making simple static and dynamic hand gestures. Each gesture triggers a different action in the system, such as object recognition, scene description or image scaling (e.g., pointing a finger at an object shows a description of it). The system is based on a multi-head neural network architecture that first detects and classifies the gesture and then, depending on the gesture detected, runs a second stage that carries out the corresponding action. This multi-head architecture optimizes the resources required to perform different tasks simultaneously, and reuses the information obtained from an initial backbone to perform the different processes in the second stage. To train and evaluate the system, a dataset of about 40k images was manually compiled and labeled, covering different types of hand gestures, backgrounds (indoors and outdoors), lighting conditions, etc. The dataset contains synthetic gestures (used to pre-train the system in order to improve the results) and real images captured using different mobile phones. A comparison with the state of the art shows competitive results for the different actions performed by the system, such as the accuracy of gesture classification and localization, and the generation of descriptions for objects and scenes.
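
The two-stage control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the backbone and heads are toy stand-ins written in plain Python, and every name (`backbone`, `gesture_head`, `ACTION_HEADS`, `run_pipeline`), the gesture labels other than pointing, and the threshold rule are assumptions made for illustration only. The point it shows is that the shared features are computed once and then reused by whichever second-stage head the detected gesture selects.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]


def backbone(image: List[float]) -> Features:
    """Hypothetical stand-in for the shared feature extractor."""
    mean = sum(image) / len(image)
    return [mean, max(image), min(image)]


def gesture_head(features: Features) -> str:
    """Toy first-stage classifier (a threshold rule, not from the paper)."""
    mean = features[0]
    if mean > 0.66:
        return "point"
    if mean > 0.33:
        return "describe_scene"
    return "zoom"


def object_head(features: Features) -> str:
    # Placeholder for the object-recognition action.
    return "object description"


def scene_head(features: Features) -> str:
    # Placeholder for the scene-description action.
    return "scene description"


def zoom_head(features: Features) -> str:
    # Placeholder for the image-scaling action.
    return "scaled image"


# Second-stage heads keyed by the gesture label (labels are assumptions).
ACTION_HEADS: Dict[str, Callable[[Features], str]] = {
    "point": object_head,
    "describe_scene": scene_head,
    "zoom": zoom_head,
}


def run_pipeline(image: List[float]) -> Tuple[str, str]:
    features = backbone(image)        # shared features, computed once
    gesture = gesture_head(features)  # stage 1: detect and classify the gesture
    action = ACTION_HEADS[gesture]    # stage 2: pick the head for that gesture
    return gesture, action(features)  # the head reuses the same features


if __name__ == "__main__":
    print(run_pipeline([0.9, 0.8, 1.0]))
```

In a real system the stand-ins would be neural network modules, but the dispatch pattern stays the same: only the head selected by the gesture runs in the second stage, which is where the resource saving described above would come from.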

updated: Sat May 14 2022 06:01:47 GMT+0000 (UTC)

published: Sat May 14 2022 06:01:47 GMT+0000 (UTC)

arXiv
