A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 7 Issue 5
Sep.  2020

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Jamal Banzi, Isack Bulugu and Zhongfu Ye, "Learning a Deep Predictive Coding Network for a Semi-Supervised 3D-Hand Pose Estimation," IEEE/CAA J. Autom. Sinica, vol. 7, no. 5, pp. 1371-1379, Sept. 2020. doi: 10.1109/JAS.2020.1003090

Learning a Deep Predictive Coding Network for a Semi-Supervised 3D-Hand Pose Estimation

doi: 10.1109/JAS.2020.1003090
Funds:  This work was supported in part by the Fundamental Research Funds for the Central Universities (WK2350000002)
  • In this paper we present a CNN-based approach for real-time 3D hand pose estimation from depth sequences. Prior discriminative approaches have achieved remarkable success but face two main challenges: firstly, the methods are fully supervised and hence require large amounts of annotated training data to extract the dynamic information from a hand representation; secondly, they rely on hand detectors that are either built on strong assumptions or too weak, and that often fail in situations such as complex environments and multiple hands. In contrast to these methods, this paper presents an approach that can be considered semi-supervised: it performs predictive coding of image sequences of hand poses in order to capture latent features underlying a given image without supervision. The hand is modelled using a novel latent tree dependency model (LDTM) which transforms internal joint locations into an explicit representation. The modelled hand topology is then integrated with the pose estimator using a data-dependent method to jointly learn the latent variables of the posterior pose appearance and the pose configuration, respectively. Finally, an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimation of the final pose. Experiments on three challenging public datasets, ICVL, MSRA, and NYU, demonstrate that the proposed method performs comparably to or better than state-of-the-art approaches.
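The unsupervised error term described above is characteristic of predictive-coding networks, where the discrepancy between a predicted frame and the actual frame drives learning. As a minimal illustrative sketch (not the paper's actual architecture), predictive-coding units commonly keep the positive and negative halves of the prediction error as separate rectified channels; the tensor shapes and values below are illustrative assumptions:

```python
import numpy as np

def prediction_error(actual, predicted):
    """Split rectified prediction error, as used in predictive-coding
    style units: under- and over-prediction are kept in separate
    channels so downstream layers can weight them independently."""
    pos = np.maximum(actual - predicted, 0.0)   # under-prediction
    neg = np.maximum(predicted - actual, 0.0)   # over-prediction
    return np.concatenate([pos, neg], axis=-1)

# Toy example: a 2x2 depth-frame patch and the network's prediction of it
frame = np.array([[0.2, 0.5], [0.9, 0.1]])
pred  = np.array([[0.3, 0.5], [0.7, 0.4]])
err = prediction_error(frame, pred)
print(err.shape)  # prints (2, 4): error tensor has twice the channel depth
```

In a recurrent setting, this error tensor is fed back as input to the next time step, so minimizing it over a sequence yields features without pose labels.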

     


    Figures(10)  / Tables(2)


    Highlights

    • A new way of modelling hand topology using a latent tree dependency model (LDTM), which transforms internal joint locations into an explicit hand representation. This representation is more compact and invariant to scale and viewing angle.
    • A strong hand detector is integrated with the deep-learning-based pose estimator into one pipeline, so the hand pose estimation builds on prior knowledge of the human hand.
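One common way to obtain a compact, translation-invariant joint representation of the kind the first highlight describes is to arrange the joints in a dependency tree and re-express absolute 3D positions as parent-relative offsets. The sketch below illustrates that general idea only; the joint names, tree layout, and coordinates are assumptions, not the paper's LDTM:

```python
import numpy as np

# Illustrative dependency tree over a few index-finger joints
# (parent of each joint; the wrist is the root).
PARENT = {"wrist": None, "index_mcp": "wrist",
          "index_pip": "index_mcp", "index_tip": "index_pip"}

def to_relative(joints):
    """Map absolute 3D joint positions to offsets from each joint's parent."""
    rel = {}
    for name, pos in joints.items():
        parent = PARENT[name]
        rel[name] = pos - joints[parent] if parent else pos
    return rel

joints = {"wrist": np.zeros(3),
          "index_mcp": np.array([0.0, 1.0, 0.0]),
          "index_pip": np.array([0.0, 1.8, 0.2]),
          "index_tip": np.array([0.0, 2.3, 0.3])}
offsets = to_relative(joints)  # per-joint offsets along the tree edges
```

Because each offset depends only on its parent edge, translating the whole hand leaves every non-root offset unchanged, which is the sense in which such tree encodings are more invariant than raw joint coordinates.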
