Online fitted policy iteration based on extreme learning machines
Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains problematic due to different causes. Particularly important among these are the high quantity of data required by the agent to learn useful policies and the poor scalability to high-dimensional problems due to the use of local approximators.
| Main Authors: | , , , , , |
|---|---|
| Format: | article |
| Language: | English |
| Published: | Elsevier, 2021 |
| Subjects: | |
| Online Access: | http://hdl.handle.net/20.500.11939/6972 https://www.sciencedirect.com/science/article/abs/pii/S0950705116001209#! |
| _version_ | 1855032523658100736 |
|---|---|
| author | Escandell-Montero, Pablo Lorente, Delia Martínez-Martínez, José M. Soria-Olivas, Emilio Vila-Francés, Joan Martín-Guerrero, José D. |
| author_browse | Escandell-Montero, Pablo Lorente, Delia Martín-Guerrero, José D. Martínez-Martínez, José M. Soria-Olivas, Emilio Vila-Francés, Joan |
| author_facet | Escandell-Montero, Pablo Lorente, Delia Martínez-Martínez, José M. Soria-Olivas, Emilio Vila-Francés, Joan Martín-Guerrero, José D. |
| author_sort | Escandell-Montero, Pablo |
| collection | ReDivia |
| description | Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains problematic due to different causes. Particularly important among these are the high quantity of data required by the agent to learn useful policies and the poor scalability to high-dimensional problems due to the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that steps forward in both directions. OFPI is based on a semi-batch scheme that increases the convergence speed by reusing data and enables the use of global approximators by reformulating the value function approximation as a standard supervised problem. The proposed method has been empirically evaluated in three benchmark problems. During the experiments, OFPI has employed a neural network trained with the extreme learning machine algorithm to approximate the value functions. Results have demonstrated the stability of OFPI using a global function approximator and also performance improvements over two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network. |
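The abstract above states that OFPI approximates value functions with a neural network trained by the extreme learning machine (ELM) algorithm, i.e., a global approximator fitted as a standard supervised regression problem. As a minimal sketch of that idea (a generic ELM regressor, not the paper's implementation; all names, sizes, and the toy target are illustrative), the hidden-layer weights are drawn at random and kept fixed, and only the output weights are solved in closed form by least squares:

```python
import numpy as np

# Sketch of an extreme learning machine (ELM) regressor: random, fixed
# hidden-layer weights; output weights fitted by least squares. This is the
# generic ELM recipe, not the authors' exact setup.
rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=50):
    """Fit an ELM: random input weights/biases, least-squares output weights."""
    n_features = X.shape[1]
    W = rng.normal(size=(n_features, n_hidden))   # random input weights (never trained)
    b = rng.normal(size=n_hidden)                 # random biases (never trained)
    H = np.tanh(X @ W + b)                        # hidden-layer feature matrix
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy supervised problem standing in for a value-function regression target.
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X).ravel()
W, b, beta = elm_fit(X, y)
y_hat = elm_predict(X, W, b, beta)
```

Because only `beta` is learned, and in closed form, each supervised fit is fast, which is what makes an ELM attractive inside a repeated fitted-policy-iteration loop like the one the abstract describes.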
| format | article |
| id | ReDivia6972 |
| institution | Instituto Valenciano de Investigaciones Agrarias (IVIA) |
| language | English |
| publishDate | 2021 |
| publishDateRange | 2021 |
| publishDateSort | 2021 |
| publisher | Elsevier |
| publisherStr | Elsevier |
| record_format | dspace |
| spelling | ReDivia69722025-04-25T14:48:01Z Online fitted policy iteration based on extreme learning machines Escandell-Montero, Pablo Lorente, Delia Martínez-Martínez, José M. Soria-Olivas, Emilio Vila-Francés, Joan Martín-Guerrero, José D. Reinforcement learning Sequential decision-making Fitted policy iteration Extreme learning machine N01 Agricultural engineering Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains problematic due to different causes. Particularly important among these are the high quantity of data required by the agent to learn useful policies and the poor scalability to high-dimensional problems due to the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that steps forward in both directions. OFPI is based on a semi-batch scheme that increases the convergence speed by reusing data and enables the use of global approximators by reformulating the value function approximation as a standard supervised problem. The proposed method has been empirically evaluated in three benchmark problems. During the experiments, OFPI has employed a neural network trained with the extreme learning machine algorithm to approximate the value functions. Results have demonstrated the stability of OFPI using a global function approximator and also performance improvements over two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network. 2021-01-18T09:31:45Z 2021-01-18T09:31:45Z 2016 article publishedVersion Escandell-Montero, P., Lorente, D., Martínez-Martínez, J. M., Soria-Olivas, E., Vila-Francés, J., & Martín-Guerrero, J. D. (2016). Online fitted policy iteration based on extreme learning machines. Knowledge-Based Systems, 100, 200-211. 
0950-7051 http://hdl.handle.net/20.500.11939/6972 10.1016/j.knosys.2016.03.007 https://www.sciencedirect.com/science/article/abs/pii/S0950705116001209#! en Atribución-NoComercial-SinDerivadas 3.0 España http://creativecommons.org/licenses/by-nc-nd/3.0/es/ closedAccess Elsevier electronico |
| spellingShingle | Reinforcement learning Sequential decision-making Fitted policy iteration Extreme learning machine N01 Agricultural engineering Escandell-Montero, Pablo Lorente, Delia Martínez-Martínez, José M. Soria-Olivas, Emilio Vila-Francés, Joan Martín-Guerrero, José D. Online fitted policy iteration based on extreme learning machines |
| title | Online fitted policy iteration based on extreme learning machines |
| title_full | Online fitted policy iteration based on extreme learning machines |
| title_fullStr | Online fitted policy iteration based on extreme learning machines |
| title_full_unstemmed | Online fitted policy iteration based on extreme learning machines |
| title_short | Online fitted policy iteration based on extreme learning machines |
| title_sort | online fitted policy iteration based on extreme learning machines |
| topic | Reinforcement learning Sequential decision-making Fitted policy iteration Extreme learning machine N01 Agricultural engineering |
| url | http://hdl.handle.net/20.500.11939/6972 https://www.sciencedirect.com/science/article/abs/pii/S0950705116001209#! |
| work_keys_str_mv | AT escandellmonteropablo onlinefittedpolicyiterationbasedonextremelearningmachines AT lorentedelia onlinefittedpolicyiterationbasedonextremelearningmachines AT martinezmartinezjosem onlinefittedpolicyiterationbasedonextremelearningmachines AT soriaolivasemilio onlinefittedpolicyiterationbasedonextremelearningmachines AT vilafrancesjoan onlinefittedpolicyiterationbasedonextremelearningmachines AT martinguerrerojosed onlinefittedpolicyiterationbasedonextremelearningmachines |