Short note: LSTD makes efficient use of data and converges faster than the other conventional temporal difference learning methods. TD learning methods solve for the zero expected TD-error (or zero mean TD-error) and update the weight vector until it converges to a particular value. Whereas, LSTD directly computes the weight vector for which the expected
expected TD-update is zeor. However, a move from TD algorithm to the LSTD algorithm is equivalent to moving from a model-free method to a model-based method. Therefore the state-action value function that LSTD computes is of no use for the policy improvement when the model of the process in not available.
ARL-10/11 - Advanced Reinforcement Learning
Fixed-Point Estimation of State-Action Value Function & Least-Squares Policy Iteration: Video-mp4. Download-pdf.
Short note: LSPI makes an efficient use of data, much better than the LSTD algorithm. Being an off-policy algorithm, it can accept samples from any random policy. LSPI is an improvement over the LSTD algorithm. The innovation in LSPI is the LSQ (LSTDQ) algorithm, which is quite similar to the LSTD algorithm, and it also computes the weight vector for
vector for which the expected TD-Update is zero. LSPI returns the Fixed-Point solution for the weight vector. The policy for which value function is to be computed, is used in the LSTDQ, to select the actions in the resulting next state, which plays the cardinal role in LSPI algorithm.
Geometric Analysis of Bellman Residual Minimization & Fixed Point Methods: Video-mp4. Download-pdf.
ARL: Now supported by CCO-10/11 for understand ing core mathematical optimization
Based on recent publications in reinforcement learning
Recent breakthrough in using reinforcement leanring for autonomous navigation in unknown and unseen environments (DFLR-E Project)
Make sure that you understand Linear Function Approximation, Monte Carlo Methods and TD-Learning --- SARSA and Q-Learning. You can use my lectures in RL Channel or the book Reinforcement Learning: An Introduction. My first 5-lectures in that channel cover the important part of the book, i.e. first 6 chapters. These 5 lectures were delivered much before than this course. Many of those lectures didn't have MATLAB code, so here's the code for my 3rd lecture. Download-Policy_Evaluation_Sanjeev.zip. The Path Planning algorithm in Lecture-3 of RL Channel is nothing but the simple application of policy evaluation algorithm, but at that time (3rd year undergrad) it was big-deal for me, so I am sorry if the lecture appears to be too hyped.
Artificial Intelligence: Reinforcement Learning
Short note: Solving a control problem in reinforcement learning involves computing an optimal control policy for a given MDP (given task). The policy iteration algorithm finds this
this optimal policy by continuously iterating through the value determination and policy improvement. Value determination can be solved either by Bellman Residual Minimization or Fixed-Point solution. BRM finds the weight vector by minimizing the Bellman Residual, where as fixed point method directly finds the weight vector for which the expected TD-Update is zero. Geometrical analysis of FP and BRM, depicts a right-&Delta, where BRM & FP minimize the hypotenuse and base of right-&Delta respectively.
Kernel Methods in Reinforcement Learning: Kernelized Value Function Approximation: Video-mp4. Download-pdf.
Kernel trick is utilized in machine learning when the data-set is not separable in feature space. Kernels facilitate the classification, mapping the feature space to a higher dimensional space and hence the problem gets easily solved. Kernel Reinforcement Learning utilizes the power of Kernels to solve the value function approximation problems. The approach
The approach presented in the publication, Kernelized Value Function Approximation, is a model-based approach. It is used mainly for the prediction problems as it computes the state-value function. Moreover, it unifies the previous Kernel Reinforcement Learning approaches, such as KLSTD and GRPL methods. The algorithm uses a kernel based regularized regression model (dual representation) to find the relation between the states and their expected-discounted total reward. It further utilizes the regression model to predict the expected next-state kernel.
Proposed Topic for Lecture - 5: Kernelized State-Action Value Function Approximation in Reinforcement Leanring: (held in abeyance till Jan 20th, 2011)
My MATLAB Code: I had written this code during my UMass Intern. This code has been modified for teaching purpose and the main file has been written to demonstrate the application in the 20-state chain walk domain. Download- .
To use this code, download the Ronald Parr- LSPI package and extract the files in the same folder. However you will have to make changes in the file "chain_basis_pol.m". This is because KERNEL matrix would be prone to high values for states 11-20 in the basis function. Therefore a normalization is required. Moreover, the code I had written was a research purpose code, to verify the KLSTD, i.e. no comments in the code and therefore you will have to interpret the meaning of the variables using the paper. Sorry for the trouble. I hope you will enjoy :-)
Results in Chain Walk domain:
Implementation of the paper, for
state-action value function.
Proposed Topic for Lecture - 6: Understanding Reproducing Kernel Hilbert Spaces and Mercer Kernels
This course will focus on recently published algorithms in RL. It will also analyze the application of RL for autonomous navigation (DFRL-E Project). The secondary purpose of this course is to enable students (esp. in India) to learn advanced concepts in Reinforcement Learning and its application in Robotics and Navigation. The only prerequisite is basic knowledge in Machine Learning, Linear Algebra and the book "Reinforcement Learning: An Introduction". Lectures in Reinforcement Learning and Machine Learning channel of searching-eye are sufficient to build the basic background.