In this paper, the user modeling task is examined by processing a gallery of photos and videos on a mobile device. We propose a novel engine for user preference prediction based on scene recognition, object detection, and facial analysis. First, all faces in a gallery are clustered, and all private photos and videos containing faces from large clusters are processed on the embedded system in offline mode. Other photos may be sent to a remote server to be analyzed by very deep models. The visual features of each photo are obtained from scene recognition and object detection models. These features are aggregated into a single user descriptor in a neural attention block. The proposed pipeline is implemented for the Android mobile platform. Experimental results on a subset of the Photo Event Collection, Web Image Dataset for Event Recognition, and Amazon Fashion datasets demonstrate that images can be processed very efficiently without significant accuracy degradation. The source code of the Android mobile application is publicly available at https://github.com/HSE-asavchenko/mobile-visual-preferences.
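As a rough illustration of the two steps outlined above (routing photos by face-cluster size, then attention-based aggregation), the following minimal Python sketch may help. It is not the implementation from the repository. The function names, the `min_cluster_size` threshold, and the fixed `query` vector are hypothetical. In the described pipeline the attention block is a trained neural module, whereas here a plain dot-product softmax stands in for it.

```python
import math
from collections import Counter


def split_private_public(photo_face_clusters, min_cluster_size=3):
    """Split photo ids into (private, public) lists.

    photo_face_clusters maps a photo id to the list of face-cluster ids
    found in that photo. A photo is treated as private when it contains
    a face from a "large" cluster. The threshold is an assumption, not a
    value taken from the paper.
    """
    # Cluster size = number of face occurrences across the whole gallery.
    sizes = Counter(c for ids in photo_face_clusters.values() for c in ids)
    private, public = [], []
    for photo_id, ids in photo_face_clusters.items():
        if any(sizes[c] >= min_cluster_size for c in ids):
            private.append(photo_id)   # to be processed on-device, offline
        else:
            public.append(photo_id)    # may be sent to a remote server
    return private, public


def attention_pool(features, query):
    """Aggregate per-photo feature vectors into one user descriptor.

    Weights are a softmax over dot products with `query`, a hypothetical
    stand-in for the learned parameters of an attention block.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    top = max(scores)                        # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    descriptor = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return descriptor, weights


# Usage: cluster 0 appears three times, so photos "a", "b", "c" are private.
gallery = {"a": [0], "b": [0], "c": [0, 1], "d": [1], "e": []}
private_ids, public_ids = split_private_public(gallery)
user_descriptor, attn = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a zero query every photo receives equal weight, so the descriptor reduces to the plain average of the photo features. A trained attention block would instead learn which photos are most informative about the user's preferences.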