Time-invariant hand gesture recognition for human-computer interaction
D. Kostyrev <dmitry.kostyrev@gmail.com>, S. Anischenko <sergey.anishenko@gmail.com>, M. Petrushan <drn@bk.ru>
A.B. Kogan Research Institute for Neurocybernetics, Stachki av. 194/1, Rostov-on-Don, Russian Federation
Abstract. A hand-motion-driven human-computer interface based on a novel time-invariant gesture description is proposed. The description is represented as a sequence of over-threshold motion distribution histograms and captures both the spatial configuration of a gesture and its motion dynamics. A k-nearest-neighbour classifier was trained on six gesture types. An application for remote slideshow control was developed on top of the proposed algorithm.
Keywords: human-computer interfaces, hand motion tracking, dynamic pattern recognition
1. Introduction
The popularity of natural interfaces for desktop and mobile computer control has grown rapidly over the last decade. Conventional human-computer interfaces (such as the keyboard) are gradually being replaced by natural control interfaces driven by gestures, voice, finger or full-body motion. These new methods are widely used in entertainment applications and in fields where contact between a human and an input device is impossible or unwanted, for example because of sterility requirements, or where a device has to be controlled by a group of people simultaneously.
The hand motion recognition task touches on several fundamental computer vision problems, in particular the detection and recognition of dynamic patterns. The standard pipeline for single-image analysis is a sequence of procedures: preprocessing, segmentation, classification. This pipeline suffices for video analysis only when the task is to detect and classify objects whose movements carry no information. Otherwise, additional information about object movement or transformation has to be considered; in contrast to a single image, video contains such information, and it should be exploited for object detection and recognition.
The main goal of our project is to develop a robust descriptor for dynamic objects that is invariant to object deformations and perspective transformations during movement. To build such a descriptor, modifications of the standard single-image analysis pipeline are proposed. The new dynamic gesture recognition method is described below. It utilizes information about the duration, direction and amplitude of a motion, along with spatial and intensity-based feature descriptions of the images in a video sequence. The method was used to build a human-computer interface and an application for remote presentation control.
2. Research background
Gesture-based human-computer interfaces can exploit either a stationary hand configuration (the configuration is relevant), such as "open palm" or "thumb up", or hand motion (the dynamics is relevant), such as a "from palm to fist" motion, a "hands up" motion, etc. A variety of gesture recognition algorithms has been developed; they can be divided into two groups according to whether configuration or motion is relevant:
• single-image analysis algorithms, which detect and recognize the hand configuration in each frame of a video;
• image-sequence analysis algorithms, which detect the pattern of changing hand configuration over the whole gesture video sequence.
The single-image analysis pipeline for gesture recognition follows the commonly used procedure: preprocessing, segmentation, classification. The following steps were proposed for gesture analysis [1]: hand contour extraction, tracking, and recognition based on selected features. The image-sequence analysis pipeline differs from the single-image approach and contains the following steps: background subtraction, gesture description, and classification. Although every gesture detection and recognition method is unique, combining different approaches and algorithms, most methods share common steps such as background subtraction, feature extraction, and gesture classification. Background subtraction algorithms range from very simple ones, such as frame difference [2-4], to more complex methods, such as the Adaptive Mixture of Gaussians [5, 6] and frame difference enhanced with a Gaussian filter [7]; a minimal sketch of the frame-difference approach is given below.
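As an illustration only, here is a minimal sketch of frame-difference background subtraction in Python with OpenCV; the function name and the threshold value are ours, not taken from the cited works:

```python
import cv2
import numpy as np

def frame_difference(prev_gray: np.ndarray, curr_gray: np.ndarray,
                     threshold: int = 25) -> np.ndarray:
    """Simplest surveyed method: pixels whose absolute intensity change
    between consecutive grayscale frames exceeds `threshold` are marked
    as foreground (255), the rest as background (0)."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask
```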
Various feature extraction methods for gesture recognition are described in [3, 7, 8, 9]. Methods based on histograms of oriented gradients (HOG) are used in [9] and [3]; in [3], HOG features are extracted from a motion history image built from several consecutive frames. Fourier-analysis-based methods are applied to different kinds of movements in [7]. Within the single-image approach, hand shapes can be described by the shape context descriptor [8]. Gesture classification is performed by different algorithms: fuzzy logic [8], an SVM over a large feature vector (3780 elements) [9], Euclidean distances [3], and spectrum analysis [7].
Each of the above-mentioned gesture analysis methods has its own advantages and disadvantages. For instance, single-image analysis provides a precise estimate of the hand location in the image, but it is limited to fixed hand configurations because of the non-rigid nature of the human hand. Image-sequence methods, on the other hand, are generally invariant to hand configuration but depend on gesture duration and completeness. All gesture analysis methods have their limitations: for example, most methods based on skin color segmentation [10, 11] or background subtraction [2-4, 7, 9] are highly sensitive to scene lighting conditions, camera sensor quality, etc. Along with recognition quality, processing speed is one of the key factors for the end-user experience, so it has to be taken into account when comparing gesture recognition methods.
3. Gesture detection and recognition
A new method for hand gesture detection and recognition is presented. The following requirements were set during problem formulation: the method must be invariant to gesture duration and to the initial hand position, and it must be able to detect and recognize transformable hand configurations.
Given these requirements and the research background, the method should describe a gesture in terms of integral motion characteristics. The algorithm workflow is presented in fig. 1. Two main components can be highlighted in this workflow: background subtraction and feature extraction.
Fig. 1. The workflow of the algorithm.
3.1 Background subtraction
Background subtraction is used for gesture duration estimation and as a preprocessing step for gesture description in our approach. The following requirements for the background subtraction method were formulated:
• accurate estimation of object contours;
• no contour traces;
• real-time performance.
Several popular background subtraction methods were reviewed within the research. Each of them was evaluated according to the following parameters: performance, contour continuity, and length of contour traces. Background subtraction quality was evaluated qualitatively, and algorithm performance was estimated from the video processing frame rate; performance is considered "realtime" at a frame rate of 30 frames per second. All algorithms (except ViBe) were evaluated using their current implementation in BGSLibrary [12] on a PC with a 3rd-generation Intel Core i5 processor. Performance and quality estimates are presented in Table 1.
Table 1. Overview of background subtraction methods.

Algorithm                      | Performance | Contour continuity | Length of traces
Frame difference               | realtime    | low                | none
Moving mean                    | realtime    | low                | none
Adaptive Mixture of Gaussians  | realtime    | low                | short
Gaussian Average               | realtime    | high               | short
Multi-Layer BGS                | offline     | low                | none
Fuzzy Gaussian                 | realtime    | high               | long
Fuzzy Adaptive SOM             | offline     | high               | long
ViBe (serial)                  | offline     | high               | none
The reviewed background subtraction methods are imbalanced with respect to quality, performance, and the absence of traces. Thus, a new background subtraction algorithm is presented that combines performance with contour continuity. It is based on computing a time-dependent intensity variance map over N frames as follows:
D(x,y) = \frac{1}{N}\sum_{i=1}^{N} p_i(x,y)^2 - \left(\frac{1}{N}\sum_{i=1}^{N} p_i(x,y)\right)^2 \qquad (1)

B(x,y) = \begin{cases} 255, & \text{if } D(x,y) > \mathit{Threshold}^2 \\ 0, & \text{otherwise} \end{cases} \qquad (2)

where D(x,y) is the resulting variance map, p_i is the i-th frame, N is the number of frames in the sequence, x, y are pixel coordinates, and Threshold is the binarization threshold.
The presented background subtraction algorithm demonstrates realtime performance and sufficient quality of contour separation without long traces. However, the method is sensitive to lighting conditions. The parameter N regulates the background model update speed. Binarized variance maps of two adjacent frames for different values of N are presented in fig. 2.
The binarized variance map B is then median-filtered to reduce noise and to merge nearby contours, as sketched below.
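A minimal sketch of this step of the pipeline (eq. 1-2 plus median filtering) in Python with OpenCV; function and parameter names are ours, with defaults taken from the values reported in section 4 (N = 2, Threshold = 2):

```python
import cv2
import numpy as np

def binarized_variance_map(frames, threshold=2.0, median_ksize=5):
    """Background subtraction via a time-dependent variance map.

    frames: list of N grayscale frames (uint8 arrays of equal shape).
    Returns the binarized, median-filtered map B with values {0, 255}.
    """
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    n = stack.shape[0]
    # D(x, y) = mean of squares minus square of the mean (eq. 1)
    d = (stack ** 2).sum(axis=0) / n - (stack.sum(axis=0) / n) ** 2
    # Binarize against Threshold^2 (eq. 2)
    b = np.where(d > threshold ** 2, 255, 0).astype(np.uint8)
    # Median filtering suppresses noise and merges nearby contours
    return cv2.medianBlur(b, median_ksize)
```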
The motion rate of the current frame is estimated by counting the non-zero elements of the binarized variance map. Motion is considered a gesture candidate when the following condition is met:
\sum_{x,y} B(x,y) > w \cdot h \cdot \mathit{MotionThreshold} \qquad (3)

where x, y are pixel coordinates in the binarized variance map B, w and h are the width and height of the variance map, and MotionThreshold is the threshold for flagging a gesture candidate in the frame.
Fig. 2. Binarized variance maps of two adjacent frames with different N.
If the current frame satisfies the above condition, it is added to the current gesture sequence. The sequence is considered complete when it contains more than N_min frames and condition (eq. 3) is not met in the following frame. As soon as the gesture sequence is complete, it is described and classified; a sketch of the segmentation logic follows.
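A minimal sketch of the motion gating and gesture segmentation under a simplified reading of the completion rule above; the function name is ours, and the defaults are the values reported in section 4 (N_min = 5, MotionThreshold = 0.08):

```python
def segment_gestures(binary_maps, motion_threshold=0.08, n_min=5):
    """Collect gesture-candidate sequences from a stream of binarized
    variance maps. A frame joins the current sequence when its share of
    non-zero pixels exceeds `motion_threshold` (eq. 3); the sequence is
    emitted once motion stops, provided it holds more than `n_min` frames.
    """
    current, gestures = [], []
    for b in binary_maps:
        h, w = b.shape
        moving = (b > 0).sum()                   # over-threshold pixels
        if moving > w * h * motion_threshold:    # eq. 3
            current.append(b)
        else:
            if len(current) > n_min:             # complete gesture
                gestures.append(current)
            current = []
    return gestures
```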
3.2 Gesture description and classification
Any gesture sequence containing more than N_min frames can be described by a feature vector. The Motion Distribution Histogram is proposed as the integral description of a frame in the gesture sequence: it describes the distribution of over-threshold variance (eq. 2) over the frame. Each motion distribution histogram is computed around the mean of the centres of mass of all variance maps in the sequence, which makes the description invariant to the initial hand position. Each variance map is divided into 16 sectors; the non-zero elements in each sector are counted and stored in the corresponding element of the motion distribution histogram. Sectors are numbered clockwise starting from the horizontal axis. Each element of the histogram is normalized by the area of the binarized variance map of the frame. Examples of variance maps and corresponding motion distribution histograms are shown in fig. 3; a sketch of the histogram computation follows the figure caption.
Fig. 3. Binarized variance maps for different frames of a "from left to right" gesture sequence and the corresponding motion distribution histograms.
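A minimal sketch of the motion distribution histogram computation; function names are ours, and the clockwise sector ordering relies on the image y-axis pointing downward:

```python
import numpy as np

def mean_center(maps):
    """Mean of the per-frame centres of mass over the whole sequence."""
    centers = [np.column_stack(np.nonzero(m)[::-1]).mean(axis=0)
               for m in maps]                    # per-frame (cx, cy)
    return np.mean(centers, axis=0)

def motion_distribution_histogram(b, center):
    """16-sector Motion Distribution Histogram of a binarized variance
    map `b`, computed around `center` = (cx, cy). Each bin counts the
    non-zero pixels in its sector, normalized by the map area."""
    ys, xs = np.nonzero(b)
    cx, cy = center
    # Angle of each moving pixel relative to the centre, in [0, 2*pi).
    # Image y grows downward, so this ordering runs clockwise on screen.
    angles = np.arctan2(ys - cy, xs - cx) % (2 * np.pi)
    sectors = (angles / (2 * np.pi) * 16).astype(int)
    hist = np.bincount(sectors, minlength=16).astype(np.float64)
    return hist / b.size                         # normalize by frame area
```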
The variance map and motion distribution histogram are calculated for every frame in the sequence. The resulting sequence of N motion distribution histograms is divided into four subsequences according to the following intervals:

I_1 = \left[1, \tfrac{N}{4}\right], \quad I_2 = \left[\tfrac{N}{4}, \tfrac{N}{2}\right], \quad I_3 = \left[\tfrac{N}{2}, \tfrac{3N}{4}\right], \quad I_4 = \left[\tfrac{3N}{4}, N\right], \qquad (4)

where N is the number of histograms in the sequence. This partition yields a fixed-length description regardless of gesture duration; a sketch of the feature vector assembly is given below.
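A minimal sketch of the feature vector assembly. The text does not state how the histograms within each interval of eq. 4 are combined, so averaging is assumed here, giving a 4 × 16 = 64-element duration-invariant vector:

```python
import numpy as np

def gesture_feature_vector(histograms):
    """Fixed-length gesture descriptor from a variable-length sequence
    of 16-bin motion distribution histograms, split into the four
    intervals of eq. 4 (aggregation by averaging is our assumption)."""
    h = np.asarray(histograms)          # shape (N, 16)
    n = len(h)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    parts = [h[bounds[i]:bounds[i + 1]].mean(axis=0) for i in range(4)]
    return np.concatenate(parts)        # shape (64,)
```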
4. Experimental results
The proposed method has advantages and disadvantages compared with methods based on object detection and recognition in each frame of a video. Its main advantages are invariance to hand configuration transformations during the movement, to the speed of the gesture, and to the initial hand position in the camera field of view. However, the method is highly sensitive to background motion (which could be suppressed with a depth map), to the distance between the human and the camera, and to lighting conditions.
In our approach we used N_min = 5 and MotionThreshold = 0.08 (eq. 3) for motion detection at close range from the camera (approximately 50 cm), with N = 2 and Threshold = 2 (eq. 1-2).
Six gesture types were selected for detection and recognition: hand movement from left to right, from right to left, hand up, hand down, both hands moving from the centre of the screen, and both hands moving to the centre of the screen. A training set containing samples of each gesture type was collected. A set containing seven examples of two classes ("hand down" and "hand up") is displayed in fig. 4, where the "hand down" movement is visualized in black and the "hand up" movement in gray.
Fig. 4. Plot of 14 histograms containing samples from two classes: "hand up" and "hand down". Black corresponds to the "hand down" movement, gray to the "hand up" movement. Feature vector element numbers are displayed on the horizontal axis, values on the vertical axis.
Classification is based on the k-nearest-neighbours method with Euclidean distance. Cross-class similarity was estimated by computing the mean Euclidean distance and standard deviation between the descriptions of each class and those of every other class (fig. 5). A sketch of the classification step is shown below.
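A minimal sketch of k-nearest-neighbour classification over the 64-element gesture descriptors; the value of k is not stated in the text, so k = 3 is a placeholder:

```python
import numpy as np

def knn_classify(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors closest to `query`
    under Euclidean distance (k is assumed, not given in the text)."""
    dists = np.linalg.norm(np.asarray(train_vectors) - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```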
5. Application
The new human-computer interface was implemented based on the gesture recognition method described above. An application for hand-driven slideshow control was developed as an example of this interface.
Fig. 5. Mean Euclidean distances and standard deviations (error bars) between feature vectors of each class compared with the other classes: a - "left to right" gesture, b - "right to left", c - "hand up", d - "hand down", e - "hands away", f - "hands together".
A client-server application architecture was adopted to achieve remote slideshow control. The architecture of the demonstration application is presented in fig. 6; a sketch of such a control channel follows the figure caption.
Fig. 6. The architecture of the application.
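A purely illustrative sketch of a client-server control channel of this kind. The wire protocol, gesture labels, port number, and command mapping below are hypothetical, not taken from the paper: the recognizer sends one newline-terminated gesture label per event, and a lightweight client on the presentation machine maps labels to slide commands.

```python
import socket

# Hypothetical mapping from recognized gestures to slideshow commands.
GESTURE_TO_COMMAND = {
    "left_to_right": "next_slide",
    "right_to_left": "previous_slide",
}

def send_gesture(label: str, host: str = "localhost",
                 port: int = 5005) -> None:
    """Send one recognized gesture label to the slideshow client."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall((label + "\n").encode("utf-8"))
```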
Conclusion
A hand-motion-guided human-computer interface based on the new dynamic pattern descriptor is presented. The distinctiveness of the proposed gesture description was demonstrated by measuring cross-class Euclidean distances between training samples. Hand motion is described by a sequence of motion distribution histograms. The method demonstrates processing speed sufficient for a good end-user experience and classification accuracy sufficient for remote slideshow control. Further research within the proposed approach aims at supporting more gesture types and at filtering out irrelevant object motion using skin color, depth, and motion maps.
Acknowledgments
The work was supported by SFedU project № 213.01-2014/001 and by the Russian Foundation for Basic Research, grants 12-01-31226 mol_a and 12-01-31266 mol_a.
References
[1]. Rautaray S.S., Agrawal A. A real time hand tracking system for interactive applications. International Journal of Computer Applications. 2011, vol. 18, no 6, pp. 28-33.
[2]. Shan C., Tan T., Wei Y. Real - time hand tracking using a mean shift embedded particle filter. Pattern Recognition. 2007, vol. 40, no 7, pp. 1958-1970. doi: 10.1016/j.patcog.2006.12.012
[3]. Davis J. W. Recognizing Movement using Motion Histograms. Technical Report 487, MIT Media Lab. 1999, vol. 1, no 487. doi: 10.1.1.46.6887
[4]. Torres G. Gesture recognition using motion detection. University of Kansas. 2009.
[5]. Banerjee P., Sengupta S. Human motion detection and tracking for video surveillance. Proceedings of the national Conference of Communications, IIT Bombay, Mumbai. 2008, pp. 88-92
[6]. Stauffer C., Grimson W.E.L. Adaptive background mixture models for real-time tracking. Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on. IEEE. 1999, vol. 2. doi: 10.1109/CVPR.1999.784637
[7]. Cutler R., Davis L. Robust real-time periodic motion detection, analysis, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000, vol. 22, no. 8, pp. 781-796. doi: 10.1.1.112.8904
[8]. Mori G., Belongie S., Malik J. Efficient shape matching using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2005, vol. 27, no. 11, pp. 1832-1837. doi: 10.1109/TPAMI.2005.220
[9]. Fogelton A. Real-time Hand Tracking using Modificated Flocks of Features Algorithm. Information Sciences and Technologies Bulletin of the ACM Slovakia, Special Section on Student Research in Informatics and Information Technologies. 2011, vol. 3, no 2, pp. 37-41. doi: 10.1.1.295.2305
[10]. Manresa C., Varona J., Mas R., Perales F.J. Real-time hand tracking and gesture recognition for human-computer interaction. Electronic Letters on Computer Vision and Image Analysis. 2005, vol. 5, no. 3, pp. 96-104.
[11]. Deng L.Y., Hung J.C., Keh H., Lin K., Liu Y., Huang N. Real - time hand gesture recognition by shape context based matching and cost matrix. Journal of networks. 2011, vol. 6, no 5, pp. 697-704. doi:10.4304/jnw.6.5.697-704
[12]. Sobral A. BGSLibrary: An OpenCV C++ Background Subtraction Library. Proceedings of IX Workshop de Visao Computacional (WVC'2013). 2013.