
A study of the exploration/exploitation trade-off in reinforcement learning

Applied to autonomous driving

Written by R. Louis, D. Yu

Paper category: Bachelor Thesis, Computer Science
1.2 Reinforcement learning

The idea of reinforcement learning is to learn what to do in an environment, that is, to map situations to actions by finding a policy that maximizes a numerical reward signal (Sutton & Barto, 2018). An intuitive example is a human practicing to drive a car with a manual transmission. The environment is the car, and the interaction with it involves the three pedals, the steering wheel, and the gearbox. In the beginning, the person practices shifting gears using the clutch and accelerator: either the car moves forward, or the engine stalls. Forward motion indicates that the clutch and accelerator are balanced, which generates a positive reward, comparable to an elevated dopamine level in the human. A stall indicates an imbalance between the clutch and the accelerator and leads to a negative reward, comparable to a lowered dopamine level. Learning occurs as the agent discovers more and more of the positive responses of the environment by trying actions in it. The result is a learned policy that maximizes the numerical reward.

1.2.1 Exploration/exploitation trade-off

Reinforcement learning adopts the idea of exploration. Exploration can be described as attempting to discover new characteristics of the environment by performing sub-optimal actions, that is, actions that differ from what past experience indicates is the correct decision in a given state. Exploration can therefore produce two outcomes: trying an undiscovered action may end in a reduced reward, such as a punishment (a negative reward value), or in an increased reward value. Exploration is necessary because an action that currently appears sub-optimal may in fact yield more reward in the same environment. As a consequence, an agent based on reinforcement learning may collect negative rewards while searching for the best action.
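The choice between trying an undiscovered action and taking the currently best-known one is commonly implemented with an ε-greedy rule: with probability ε the agent picks a random action (exploration), otherwise the action with the highest estimated value. A minimal sketch, where the Q-values and the three-action setting are illustrative assumptions and not the thesis setup, could look like:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        # Explore: a sub-optimal action may be chosen on purpose.
        return random.randrange(len(q_values))
    # Exploit: act on past experience.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Illustrative Q-values for three actions in a single state.
q = [0.2, 0.8, 0.5]
print(epsilon_greedy(q, epsilon=0.0))  # epsilon = 0 always exploits: action 1
```

Setting ε = 0 gives pure exploitation, ε = 1 pure exploration; the trade-off discussed above amounts to choosing ε between these extremes.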
(Coggan, 2004) In contrast, reinforcement learning agents also adopt the idea of exploiting previously experienced actions. Exploitation is the concept of repeatedly performing the same action in the same environment because it has yielded the maximum reward so far. This idea is the opposite of exploration, and the conclusion is that increased exploitation leads to less exploration. The reinforcement learning agent therefore faces the dilemma of whether to explore undiscovered actions or to exploit past experience. In practice, exploitation means that the agent's decisions are based on the policy it has learned in the environment so far (Coggan, 2004).

One of the main challenges of reinforcement learning is this trade-off between exploration and exploitation. The agent wants to maximize the reward of each action it takes, using past experience, but this reduces the exploration of new actions that would improve its knowledge of the reward-generating process. When the agent increases exploration, it does not necessarily maximize its reward. The agent must therefore balance the two to achieve the best performance (Auer, 2002). An intuitive example of the trade-off is choosing a route home. One route is the familiar, comfortable way home. At some point, one notices that a new road has been opened to traffic, and a decision can be made whether to take the comfortable route or to explore the new one. The newly discovered route may either increase or decrease the travel time home.

1.2.2 Model-based and model-free

There are two ways to benefit from the experience generated in reinforcement learning. The model-based approach handles experience indirectly: it builds a model of the state transitions and outcomes of the environment, and evaluates actions by searching in that model. The model-free approach handles experience directly: it predicts rewards through trial and error, without constructing an explicit model of the environment. (Gläscher et al.
2010)

1.3 Overview of Markov decision processes

Reinforcement learning can be formalized using ideas from dynamical systems theory, in particular the Markov decision process (MDP), which is introduced in more detail in Section 2.1. The outline of the idea is to formalize all aspects of the problem faced by the learning agent. Sutton & Barto (2018) mention three aspects, sensation, action, and goal, that the MDP is intended to capture. Sensation involves the ability to perceive the state of the environment. Action involves taking actions that influence the state. The goal is to complete a task related to the state of the environment. Sensation and action are related to finding the optimal policy for maximizing the numerical reward, in an attempt to achieve the third aspect, the goal.

1.4 Overview of Q-learning

There are many methods for finding the optimal policy of a Markov decision process, and Q-learning is one of them. Q-learning was proposed by Watkins & Dayan (1992), and its idea is to express the policy as a matrix, with one axis representing the states and the other the actions. An entry in the matrix is called a Q-value, a numerical representation of how good an action is in a given state. The goal of Q-learning is to perform the action with the highest Q-value in a given state, and to update the matrix so as to converge to the optimal policy for the MDP.

1.5 Problem statement

In this project, we explore the exploration/exploitation trade-off in one of the decision-making algorithms of reinforcement learning, namely Q-learning, by letting a simulated agent learn to drive through different intersections. The goal is not to be exhaustive by simulating and testing all possible sets of actions and rewards, but to define a single set that we can use in simulation, and to tune the parameters in Q-learning toward exploration or exploitation of the learning process.
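The Q-value update behind the matrix described in Section 1.4 moves the entry for the taken action toward the received reward plus the discounted best Q-value of the next state. A minimal sketch of one such step, where the two-state toy table and the learning-rate and discount values are assumptions for illustration rather than the thesis configuration:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step (Watkins & Dayan, 1992): move Q[s][a] toward
    the reward r plus the discounted best Q-value of the next state."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Toy Q-table: 2 states x 2 actions, initialized to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=1, r=1.0, s_next=1)  # reward 1.0 for action 1 in state 0
print(Q[0][1])  # 0.1: with Q all zeros, the update reduces to alpha * r
```

Repeating such updates while selecting actions ε-greedily is what lets the parameters of Q-learning be tuned toward exploration or exploitation, as set out in the problem statement.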