A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical/experimental research and development in all areas of automation.
Volume 1, Issue 3
July 2014

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Quan Liu, Xin Zhou, Fei Zhu, Qiming Fu and Yuchen Fu, "Experience Replay for Least-Squares Policy Iteration," IEEE/CAA J. of Autom. Sinica, vol. 1, no. 3, pp. 274-281, 2014.

Experience Replay for Least-Squares Policy Iteration

Funds:

This work was supported by the National Natural Science Foundation of China (61303108, 61272005, 61373094, 61103045), the Natural Science Foundation of Jiangsu Province (BK2012616), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program (SYG201422).

  • Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can extract more useful information from the empirical data and thereby improve data validity. However, most existing online least-squares policy iteration methods use each sample only once, resulting in low sample utilization. To improve utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. ERLSPI combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them in least-squares updates of the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark problem. The experimental results show that the method can effectively exploit previous experience and knowledge, improve sample utilization efficiency, and accelerate convergence.
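The abstract's core idea can be summarized in a short sketch. The Python code below is a minimal illustration, not the authors' implementation: it stores online transitions in a replay buffer and re-solves an LSTD-Q system over the whole buffer after each episode to improve the policy. The environment interface (reset()/step()), the feature map phi, and all parameter names and values are assumptions made for this example.

```python
import numpy as np


def lstdq(samples, phi, policy, n_features, gamma=0.95, reg=1e-3):
    """Least-squares policy evaluation (LSTD-Q) over a batch of stored transitions.

    samples: iterable of (s, a, r, s_next, done) tuples
    phi:     feature map phi(s, a) -> np.ndarray of shape (n_features,)
    policy:  callable s -> a, the policy being evaluated
    """
    A = reg * np.eye(n_features)   # small ridge term keeps A well conditioned
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        phi_sa = phi(s, a)
        # Bootstrap with the action the evaluated policy would pick in s_next.
        phi_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += r * phi_sa
    # Solve A w = b; Q(s, a) is approximated by w^T phi(s, a).
    return np.linalg.solve(A, b)


def erlspi(env, phi, n_features, actions, episodes=50, buffer_size=10_000,
           gamma=0.95, epsilon=0.1, seed=0):
    """Experience-replay LSPI sketch: collect samples online, keep them in a
    buffer, and reuse the whole buffer for every least-squares policy update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    buffer = []   # replay buffer of stored transitions

    def greedy(s):
        # Greedy action under the current weight vector w.
        return max(actions, key=lambda a: w @ phi(s, a))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration while acting online
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next, done))
            if len(buffer) > buffer_size:
                buffer.pop(0)   # discard the oldest sample
            s = s_next
        # Policy improvement: re-evaluate the current greedy policy on ALL
        # stored samples, not only those from the latest episode.
        w = lstdq(buffer, phi, greedy, n_features, gamma)
    return w
```

Reusing the full buffer at every update is what distinguishes this sketch from a plain online LSPI loop, which would discard each transition after using it once.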

     

