A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical and experimental research and development in all areas of automation.
Volume 8, Issue 1, Jan. 2021

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Wenjin Zhang, Jiacun Wang, and Fangping Lan, "Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks," IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 110–120, Jan. 2021. doi: 10.1109/JAS.2020.1003465

Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks

doi: 10.1109/JAS.2020.1003465
  • Abstract: Hand gestures are a natural means of human-robot interaction. Vision-based dynamic hand gesture recognition has become a hot research topic because of its wide range of applications. This paper presents a novel deep learning network for hand gesture recognition. The network integrates several well-proven modules to learn both short-term and long-term features from video input while avoiding intensive computation. To learn short-term features, each input video is segmented into a fixed number of frame groups. A frame is randomly selected from each group and represented as an RGB image as well as an optical-flow snapshot. These two entities are fused and fed into a convolutional neural network (ConvNet) for feature extraction, and the ConvNets for all groups share parameters. To learn long-term features, the outputs of all ConvNets are fed into a long short-term memory (LSTM) network, which predicts the final classification result. The new model has been tested on two popular hand gesture datasets, the Jester dataset and the NVIDIA dataset. Compared with other models, it produces very competitive results. Its robustness has also been verified on an augmented dataset with enhanced hand gesture diversity.
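The architecture described above lends itself to a compact implementation. Below is a minimal PyTorch sketch of the short-term sampling idea: one randomly chosen frame per segment, RGB fused with optical flow at the channel level, a ConvNet with weights shared across all segments, and an LSTM over the per-segment features. The names (ShortTermSamplingNet, sample_segments), the ResNet-18 backbone, the 5-channel fusion, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShortTermSamplingNet(nn.Module):
    """Illustrative sketch (not the authors' exact model): a shared
    ConvNet per segment over fused RGB + optical-flow frames, followed
    by an LSTM that aggregates segment features into one prediction."""

    def __init__(self, num_segments=8, num_classes=27, hidden=512):
        super().__init__()
        self.num_segments = num_segments
        backbone = models.resnet18(weights=None)
        # Widen the first conv to 5 input channels:
        # 3 RGB + 2 optical-flow components (x and y displacement).
        backbone.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        feat_dim = backbone.fc.in_features          # 512 for ResNet-18
        backbone.fc = nn.Identity()                 # keep features, drop classifier
        self.backbone = backbone                    # shared across all segments
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, segments, 5, H, W) -- one fused frame per segment
        b, s, c, h, w = clips.shape
        feats = self.backbone(clips.view(b * s, c, h, w))  # shared weights
        out, _ = self.lstm(feats.view(b, s, -1))
        return self.classifier(out[:, -1])                 # logits from last step

def sample_segments(num_frames, num_segments=8):
    """TSN-style sampling: split the video into equal-length groups and
    pick one random frame index from each group."""
    bounds = torch.linspace(0, num_frames, num_segments + 1).long()
    return [int(torch.randint(int(lo), max(int(hi), int(lo) + 1), (1,)))
            for lo, hi in zip(bounds[:-1], bounds[1:])]
```

For a 64-frame video, sample_segments(64, 8) yields one frame index per group; the fused 5-channel crops at those indices form the (segments, 5, H, W) clip tensor the model consumes.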

     




    Highlights

    • This study designs a new deep learning network that integrates several state-of-the-art action recognition techniques to tackle the complexity and performance issues of dynamic hand gesture recognition. Short-term video sampling, feature fusion, ConvNets with transfer learning, and LSTMs are the key components of the new model.
    • This study develops a novel approach that "zooms out" the existing dataset to increase the diversity of hand gestures and thus ensure the robustness of the trained model (a sketch of one plausible implementation follows this list).
    • Compared with existing models, the proposed network achieves very competitive recognition performance on the two most popular hand gesture datasets, Jester and NVIDIA.
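The paper's exact "zoom-out" procedure is not detailed in this excerpt; the sketch below shows one plausible reading, assuming OpenCV and NumPy. The function name zoom_out, the scale factor, and the random-placement policy are all illustrative assumptions.

```python
import numpy as np
import cv2

def zoom_out(frame, scale=0.8):
    """Hypothetical 'zoom-out' augmentation: shrink a frame and paste it
    at a random position on a black canvas of the original size, so the
    hand occupies a smaller, shifted region of the image."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (int(w * scale), int(h * scale)))
    canvas = np.zeros_like(frame)
    # Random placement of the shrunken frame inside the empty canvas.
    top = np.random.randint(0, h - small.shape[0] + 1)
    left = np.random.randint(0, w - small.shape[1] + 1)
    canvas[top:top + small.shape[0], left:left + small.shape[1]] = small
    return canvas
```

Applying the same scale and offset to every frame of a clip (and to its optical-flow snapshots) would keep the augmented video temporally consistent.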
