A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 8, Issue 10
Oct. 2021

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: C. H. Liu, F. Zhu, Q. Liu, and Y. C. Fu, "Hierarchical Reinforcement Learning With Automatic Sub-Goal Identification," IEEE/CAA J. Autom. Sinica, vol. 8, no. 10, pp. 1686-1696, Oct. 2021. doi: 10.1109/JAS.2021.1004141

Hierarchical Reinforcement Learning With Automatic Sub-Goal Identification

doi: 10.1109/JAS.2021.1004141
Funds: This work was supported by the National Natural Science Foundation of China (61303108), the Suzhou Key Industries Technological Innovation-Prospective Applied Research Project (SYG201804), a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities, JLU (93K172020K25)
  • In reinforcement learning, an agent may explore ineffectively when dealing with sparse reward tasks, in which finding a reward point is difficult. To solve this problem, we propose an algorithm called hierarchical deep reinforcement learning with automatic sub-goal identification via computer vision (HADS), which takes advantage of hierarchical reinforcement learning to alleviate the sparse reward problem and improves the efficiency of exploration by utilizing a sub-goal mechanism. HADS uses a computer vision method to identify sub-goals automatically for hierarchical deep reinforcement learning. Because not all sub-goal points are reachable, a mechanism is proposed to remove unreachable sub-goal points so as to further improve the performance of the algorithm. HADS applies contour recognition to identify sub-goals from the state image: salient states in the state image may be recognized as sub-goals, while those that are not are removed based on prior knowledge. Our experiments verify the effectiveness of the algorithm. (A minimal sketch of this identification step follows.)
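
The sub-goal identification step described in the abstract, contour recognition on the state image followed by a prior-knowledge filter that discards unreachable candidates, can be pictured with the sketch below. This is an illustrative approximation, not the authors' implementation: it assumes OpenCV 4.x, and the function find_sub_goals and the reachable_mask prior are hypothetical placeholders.

    # Minimal sketch of contour-based sub-goal identification (not the authors' exact pipeline).
    # Assumes OpenCV >= 4.x; `reachable_mask` stands in for the prior knowledge used to
    # discard unreachable sub-goal candidates.
    import cv2

    def find_sub_goals(frame_rgb, reachable_mask):
        """Return centroids of salient contours that lie in reachable regions.

        frame_rgb      : H x W x 3 uint8 state image
        reachable_mask : H x W bool array, True where a sub-goal is considered reachable
        """
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
        edges = cv2.Canny(gray, 100, 200)                  # salient edges in the state image
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        sub_goals = []
        for c in contours:
            if cv2.contourArea(c) < 20:                    # skip tiny, noisy contours
                continue
            m = cv2.moments(c)
            if m["m00"] == 0:
                continue
            x, y = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
            if reachable_mask[y, x]:                       # prior knowledge: drop unreachable points
                sub_goals.append((x, y))
        return sub_goals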

     


    Figures (12) / Tables (7)

    Article Metrics

    Article views (1037), PDF downloads (86)

    Highlights

    • The sub-goals are detected automatically via computer vision.
    • The agent receives an inner reward after completing the sub-goal.
    • Combining the sub-goal with the state image as input achieves better results (see the sketch below).
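
The second and third highlights, an inner reward granted when the agent completes the current sub-goal and a lower-level controller whose input combines the sub-goal with the image, can be illustrated as a two-level control loop in the spirit of hierarchical DQN. This is a hedged sketch under assumed interfaces: env (Gym-style step/reset), meta_controller, controller, and the info["agent_xy"] field are hypothetical placeholders, not the paper's code.

    # Hedged sketch of a two-level control loop with an inner (intrinsic) reward.
    # `env`, `meta_controller`, `controller`, and `info["agent_xy"]` are assumed placeholders.
    import numpy as np

    def inner_reward(agent_xy, sub_goal_xy, tol=4.0):
        """Return 1 when the agent is within `tol` pixels of the current sub-goal, else 0."""
        dist = np.linalg.norm(np.asarray(agent_xy, float) - np.asarray(sub_goal_xy, float))
        return 1.0 if dist <= tol else 0.0

    def run_episode(env, meta_controller, controller, sub_goals, max_steps=1000):
        obs = env.reset()
        steps, done = 0, False
        while not done and steps < max_steps:
            g = meta_controller.select(obs, sub_goals)        # upper level: choose a sub-goal
            reached = False
            while not (reached or done or steps >= max_steps):
                a = controller.select(obs, g)                 # lower level: combined input (image, sub-goal)
                next_obs, r_ext, done, info = env.step(a)
                r_int = inner_reward(info["agent_xy"], g)     # inner reward on reaching the sub-goal
                controller.store(obs, g, a, r_int, next_obs)  # lower level learns from the inner reward
                meta_controller.store(obs, g, r_ext, done)    # upper level learns from the extrinsic reward
                obs, reached, steps = next_obs, bool(r_int), steps + 1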
