Q-Learning With The Frozen Lake Environment In Android

📱 Mobile Machine Learning

Q-learning is one of the simplest algorithms for getting started with reinforcement learning. In reinforcement learning, an agent learns by interacting with an environment: it performs an action, observes the outcome, and receives a reward. Over time, the agent learns how to act by taking into account the rewards it can expect from performing specific actions in a particular state.

Q-learning is simple because it works without a neural network (with a neural network, it becomes deep Q-learning). We update a Q table, which serves as a map for our agent: it tells the agent which action to perform in which state so as to get the maximum reward.

You may find the GitHub project here: https://github.com/shubham0204/QLearning_With_FrozenLakeEnv_Android

Before we start …

This article assumes that readers are familiar with terms like state, action, episode, and reward in the context of reinforcement learning. Basic knowledge of Q-learning would be helpful. You’ll see an implementation of the Frozen Lake environment ahead, which is quite similar to the one in OpenAI’s Gym for Python.

As this is an Android project, you’ll see much more code in the GitHub repo than in the snippets below. The code that updates the UI whenever the agent performs an action is not included in the snippets, and neither are the Kotlin Coroutines used in the app. To focus only on the Q-learning part, I have left these lines out (they are present in the GitHub repo) so as to avoid confusion and improve readability.

So, enough of the disclaimers! Let’s move ahead!

Creating the Frozen Lake Environment

We’ll first have a look at the Frozen Lake Environment, as given on OpenAI’s Gym docs.

Basically, we have a starting point (denoted as S), an ending point or goal (G), and four holes. So, our environment will be a 4 × 4 matrix with these states. But what elements (16, to be precise) should we keep in this matrix?

A decent approach will be to have the rewards as elements of this matrix. The reward for a particular action will help us update the Q value for a (state, action) pair using Bellman’s equation. For this environment, we’ll use these rewards for the corresponding states: -1 for a hole, 1 for the goal, and 0 for every other cell, including the start.

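So, our matrix would look like this, reading x across and y down, so that the entry in column x and row y is the reward for position (x, y):

 0   0   0   0
 0  -1   0  -1
 0   0   0  -1
-1   0   0   1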

As we have 16 cells above, the agent could be present in any one of them. So, we have 16 states (this counts every cell: the start, the frozen tiles, the holes, and the goal). The agent can move up, down, left, or right, so we have four actions here.

We’ll now head towards the code. Note, the classes Matrix and MatrixOps you’ll see are some helper classes for handling various operations on matrices. I won’t be sharing their code here. You may view these classes here in the source code.
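Since the helper classes are not shown here, the following is a minimal sketch of what they might look like, so that the snippets in this article can be tried in isolation. It is a hypothetical stand-in that covers only the calls used below (zerosLike, get, set, and getRow); the flat DoubleArray storage and the DoubleArray return type of getRow are assumptions, and the real classes in the repo may differ.

```kotlin
// Hypothetical stand-ins for the article's Matrix and MatrixOps helpers.
// Only the calls that appear in the snippets are covered; the real classes
// in the repo may be implemented differently.
class Matrix(val rows: Int, val cols: Int) {

    // Values are stored one row after another in a single flat array.
    private val data = DoubleArray(rows * cols)

    fun get(i: Int, j: Int): Double = data[i * cols + j]

    fun set(i: Int, j: Int, value: Double) {
        data[i * cols + j] = value
    }

    // Copy of row i, e.g. the Q values of every action for state i.
    fun getRow(i: Int): DoubleArray = data.copyOfRange(i * cols, (i + 1) * cols)
}

object MatrixOps {
    // Matrix of the given shape, filled with zeros.
    fun zerosLike(rows: Int, cols: Int): Matrix = Matrix(rows, cols)
}
```

With this sketch, MatrixOps.zerosLike( 4 , 4 ) gives a 4 × 4 grid of zeros, and a value written with set can be read back with get using the same pair of indices.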

We create envMatrix, a Matrix object that holds our environment, with each reward stored at its respective position.

private fun initEnvMatrix() {
    envMatrix = MatrixOps.zerosLike( 4 , 4 )
    // Set -1 at the ( x , y ) positions of the "holes"
    envMatrix.set( 1 , 1 , -1.0 )
    envMatrix.set( 3 , 1 , -1.0 )
    envMatrix.set( 3 , 2 , -1.0 )
    envMatrix.set( 0 , 3 , -1.0 )
    // Set 1 at the position of the goal
    envMatrix.set( 3 , 3 , 1.0 )
}

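Each position (x, y) maps to the state index x + 4 * y, so the start S at (0, 0) is state 0 and the goal G at (3, 3) is state 15.

Also, we have indices for our four actions: 0 for left, 1 for up, 2 for right, and 3 for down.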

Next, we create a FrozenLakeEnv class to hold all the methods and variables required by the agent.

// actions = ( 0 , left ) , ( 1 , up ) , ( 2 , right ) , ( 3 , down )

// Class for the Frozen Lake environment
class FrozenLakeEnv() {

    // These constants identify the action passed to the step() method
    companion object {
        const val ACTION_LEFT = 0
        const val ACTION_UP = 1
        const val ACTION_RIGHT = 2
        const val ACTION_DOWN = 3
    }

    private lateinit var envMatrix : Matrix

    // Current position of the agent. It starts at ( 0 , 0 ), the S state.
    var agentPosX = 0
    var agentPosY = 0
    // Number of actions and states
    var actionSpaceN : Int = 4
    var observationSpaceN : Int = 16

    init {
        initEnvMatrix()
    }

    // initEnvMatrix() and the methods from the following snippets go here.
}

The variables agentPosX and agentPosY hold the agent’s current position. They start at (0, 0), the S state in our envMatrix, and are updated as the agent moves around in the environment.

We’ll also need some helper methods:

// Get a random action ( Just as env.action_space.sample() in Python )
// Needs `import kotlin.random.Random` at the top of the file.
fun actionSpaceSample() : Int {
    return Random.nextInt( actionSpaceN )
}

// Reset the positions of the agent ( Just as env.reset() in Python )
fun reset() : Int {
    agentPosY = 0
    agentPosX = 0
    return getStateFromPos( agentPosX , agentPosY )
}

// Get the index for the current state using the agent's position.
private fun getStateFromPos( agentPosX : Int , agentPosY : Int ) : Int {
    return agentPosX + ( 4 * agentPosY )
}

// Get a reward for the given position.
private fun getReward( agentPosX : Int , agentPosY : Int ) : Double {
    return envMatrix.get( agentPosX , agentPosY )
}

actionSpaceSample(): Similar to env.action_space.sample(), which you may have seen in Python while using gym. It returns a random integer from 0 to 3 (inclusive), which represents an action.
reset(): Resets the environment. This is analogous to the method env.reset() in Python. The agent’s position is set to (0, 0), which is the starting position, i.e. the S cell.
getStateFromPos(): Returns an integer from 0 to 15 that is unique for each state, computed from the agent’s current position.
getReward(): Returns the reward for the position the agent has just moved to. It reads the value from the envMatrix we created earlier.
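To make the state numbering concrete, here is a small sketch of the mapping used by getStateFromPos() together with its inverse. The names GRID_SIZE, stateFromPos, and posFromState are hypothetical and only serve this illustration; the inverse mapping is not part of the article's class.

```kotlin
// Hypothetical helpers, not part of the article's FrozenLakeEnv.
const val GRID_SIZE = 4

// Same formula as getStateFromPos(): the rows of the grid are numbered one after another.
fun stateFromPos(x: Int, y: Int): Int = x + GRID_SIZE * y

// Inverse mapping: recover ( x , y ) from a state index.
fun posFromState(state: Int): Pair<Int, Int> = Pair(state % GRID_SIZE, state / GRID_SIZE)
```

For example, the hole at (1, 1) becomes state 5, and state 14 maps back to (2, 3), the cell just left of the goal.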

Now comes the interesting part. We are going to code the env.step() method, which takes an action and returns the new state, a reward, and a flag that tells us whether the episode has ended.

The code for the step method:

// Perform an action in the environment.
fun step( action : Int ) : EnvOutput {
    var agentNewPosX : Int = agentPosX
    var agentNewPosY : Int = agentPosY
    // Update the agent's position by subtracting/adding 1 to its existing position.
    when( action ) {
        ACTION_LEFT -> agentNewPosX = agentPosX - 1
        ACTION_UP -> agentNewPosY = agentPosY - 1
        ACTION_RIGHT -> agentNewPosX = agentPosX + 1
        ACTION_DOWN -> agentNewPosY = agentPosY + 1
    }
    // Check if the updated positions are valid.
    if ( agentNewPosX in 0..3 && agentNewPosY in 0..3 ) {
        agentPosX = agentNewPosX
        agentPosY = agentNewPosY
        val reward = getReward( agentPosX , agentPosY )
        return EnvOutput(
                getStateFromPos( agentPosX , agentPosY ) ,
                reward ,
                // A reward of -1 means that we fell into the hole! The episode has ended here ...
                reward == -1.0
        )

    }
    else {
        return EnvOutput(
                getStateFromPos( agentPosX , agentPosY ) ,
                0.0 ,
                false)
    }
}

// Bundles the three values returned by step().
data class EnvOutput( val newState : Int , val reward : Double , val isTerminated : Boolean )

We change the variables agentPosX and agentPosY by adding or subtracting 1 to move the agent in a direction. An invalid action is one that would push the agent off the grid, which gives agentPosX or agentPosY a value of -1 or 4.

The if statement checks for such invalid actions. For example, if the agent is at position (0, 0) and ACTION_LEFT is performed, its new position would be (-1, 0), which lies outside the lake. Any invalid action leads to a coordinate that is not in [ 0, 1, 2, 3 ]. In that case, the agent stays where it is, receives a reward of 0, and the episode continues.

A Kotlin function returns a single value, so we wrap the three results, newState, reward, and isTerminated, in the EnvOutput class.

In the first EnvOutput that step() returns, the value of isTerminated is reward == -1.0. Remember, while creating the envMatrix, we used a reward of -1 to represent a hole, and falling into a hole ends the episode.
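The behaviour of step() can be tried out with a self-contained sketch like the one below. It is a simplified stand-in for FrozenLakeEnv that keeps the rewards in a plain array instead of the Matrix helper; the names MiniFrozenLake and StepResult are hypothetical, and the rules (stay put on an off-grid move, end the episode only on a hole) are copied from the snippet above.

```kotlin
// Simplified sketch of the environment; class and field names are illustrative.
class MiniFrozenLake {

    // rewards[y][x]: -1.0 for a hole, 1.0 for the goal, 0.0 elsewhere.
    private val rewards = arrayOf(
        doubleArrayOf(0.0, 0.0, 0.0, 0.0),
        doubleArrayOf(0.0, -1.0, 0.0, -1.0),
        doubleArrayOf(0.0, 0.0, 0.0, -1.0),
        doubleArrayOf(-1.0, 0.0, 0.0, 1.0)
    )
    var x = 0
    var y = 0

    data class StepResult(val newState: Int, val reward: Double, val isTerminated: Boolean)

    // Put the agent back on the start cell and return its state index.
    fun reset(): Int {
        x = 0
        y = 0
        return 0
    }

    // action: 0 = left, 1 = up, 2 = right, 3 = down
    fun step(action: Int): StepResult {
        var newX = x
        var newY = y
        when (action) {
            0 -> newX = x - 1
            1 -> newY = y - 1
            2 -> newX = x + 1
            3 -> newY = y + 1
        }
        if (newX !in 0..3 || newY !in 0..3) {
            // Off the grid: stay put, no reward, the episode continues.
            return StepResult(x + 4 * y, 0.0, false)
        }
        x = newX
        y = newY
        val reward = rewards[y][x]
        // As in the article, only a hole ( reward of -1 ) ends the episode.
        return StepResult(x + 4 * y, reward, reward == -1.0)
    }
}
```

Starting from a reset, moving left leaves the agent in state 0, moving right takes it to state 1, and moving down from there drops it into the hole at (1, 1), which is state 5 and ends the episode.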

Implementing the Q-Learning Algorithm

In the source code, you’ll find a class, QLearningAgent, which uses our FrozenLakeEnv class internally and trains the agent. We’ll go step-by-step to understand each part of the code.

class QLearningAgent(private var context : Context, private var imageViews : Array<ImageView> )  {

    // The learning rate for our agent.
    var learningRate = 0.98

    // Discount factor ( gamma ).
    var discountFactor = 0.6

    // Use this as a threshold for determining exploitation/exploration.
    var epsilon = 0.6

    // Frozen Lake environment.
    private var frozenLakeEnv: FrozenLakeEnv = FrozenLakeEnv()

    // Q Table
    private var Q : Matrix? = null

    // Some other variables which will be used for updating the Q table using Bellman's equation.
    private var currentState : Int? = null
    private var action : Int? = null
    private var agentImageIndex : Int = 0

To understand some variables in the snippet above, I’ll write down Bellman’s equation first:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

α is the learning rate and γ is the discount factor. s’ is the state the agent reaches after performing action a in state s, and a’ ranges over the actions available in s’, so the max term is the largest Q value in the row for s’. R (s, a) returns the reward for performing action a in state s.
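As a worked example, a single update of one Q value can be sketched as a small function. The name updatedQ and its parameters are hypothetical; the default values for the learning rate and discount factor are the ones from the QLearningAgent snippet above.

```kotlin
// Hypothetical helper: one Q-learning update for a single ( state , action ) pair.
fun updatedQ(
    oldQ: Double,
    reward: Double,
    maxNextQ: Double,
    learningRate: Double = 0.98,
    discountFactor: Double = 0.6
): Double {
    // The value we move towards: immediate reward plus discounted best future value.
    val target = reward + discountFactor * maxNextQ
    // Move the old value towards the target by a fraction given by the learning rate.
    return oldQ + learningRate * (target - oldQ)
}
```

With an all-zero Q table, stepping into the goal (reward 1) gives 0 + 0.98 * (1 + 0.6 * 0 - 0) = 0.98. A later move into a state whose best Q value is 0.98, with a reward of 0, gives 0.98 * 0.6 * 0.98 ≈ 0.576, which is how the reward spreads backwards from the goal.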

So, α and γ are initialized as learningRate and discountFactor in the snippet above.

To understand the next variable, epsilon, we need to first understand exploration and exploitation.

In order to solve the Frozen Lake environment, our agent first needs to explore the whole environment, so that it can update the Q table and learn to avoid falling into one of the holes.

Exploration

We want our agent to take random actions (sometimes) so that it can find new ways to reach the goal G. Why random actions? Because by acting randomly in a state, the agent may try an action that the Q table does not currently recommend.

Since the Q table is initially a zero matrix, we would like to take some random action in the current state. Taking a random action could broadly be understood as exploration.

Exploitation

Instead of randomly choosing an action, our agent will look at its Q table and choose the most appropriate action for a given state. It does so by taking the action which has the maximum Q value for the current state. For example, moving from position (2, 3) to (3, 3) is the best action that could be performed by the agent as it takes us to the goal.

This process of using the Q table for picking the best possible action is called exploitation.

We need a strategy for choosing between exploitation and exploration. What we’ll use here is the epsilon-greedy strategy.

We generate a random number between 0 and 1 and check if it is smaller than ε. If it is smaller than ε, we choose exploration; otherwise, we go for exploitation.

The variable epsilon in the code snippet above is this threshold ε. With a value of 0.6, the agent explores about 60% of the time.
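The epsilon-greedy choice can be sketched on its own as follows. The function chooseAction is a hypothetical helper that works on one row of the Q table given as a DoubleArray, and the rng parameter is added here only so that the random source can be swapped out.

```kotlin
import kotlin.random.Random

// Hypothetical helper: pick an action index for one row of the Q table.
fun chooseAction(qRow: DoubleArray, epsilon: Double, rng: Random = Random.Default): Int {
    return if (rng.nextDouble() < epsilon) {
        // Exploration: any action, chosen uniformly at random.
        rng.nextInt(qRow.size)
    } else {
        // Exploitation: index of the largest Q value ( the first one wins on ties ).
        var best = 0
        for (i in qRow.indices) {
            if (qRow[i] > qRow[best]) best = i
        }
        best
    }
}
```

Setting epsilon to 0.0 makes the agent purely greedy, and 1.0 makes it purely random. Note that on an all-zero row the greedy branch always returns index 0, so early in training it is the exploration branch that moves the agent in other directions.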

Now comes the most important snippet of code, the implementation of the Bellman’s equation to update the Q table.

fun start( numEpisodes : Int ) {

    // Init Q table
    Q = MatrixOps.zerosLike( frozenLakeEnv.observationSpaceN , frozenLakeEnv.actionSpaceN)

    // Run an episode
    for ( episodeNum in 0 until numEpisodes ){

        // Reset the environment at the beginning of each episode.
        currentState = frozenLakeEnv.reset()

        // Perform a maximum of 100 actions before the episode is terminated
        for ( t in 0 until 100 ) {

            // Choose an action ( epsilon-greedy strategy )
            action = if ( Random().nextDouble() < epsilon ){
                frozenLakeEnv.actionSpaceSample()
            }
            else {
                Q!!.getRow( currentState!! ).indexOf( Q!!.getRow( currentState!! ).max()!! )
            }

            // Perform the action in the environment
            val envOutput = frozenLakeEnv.step( action!! )

            // Update the Q table
            val p = Q?.get( currentState!! , action!! )!! +
                    learningRate *
                    ( envOutput.reward + discountFactor * ( Q!!.getRow( envOutput.newState ).max()!!) - Q?.get( currentState!! , action!! )!! )
            Q?.set( currentState!! , action!! , p )
            currentState = envOutput.newState

            // Break this loop if the episode has terminated.
            if ( envOutput.isTerminated ){
                break
            }

        }
    }
}

At the start of each episode, we call frozenLakeEnv.reset() to bring the agent back to the starting position.
Inside the inner loop, we choose an action using the epsilon-greedy strategy.
We perform the chosen action in the environment with frozenLakeEnv.step().
We update the Q table using Bellman’s equation and store the new value with Q.set().
We break the inner loop if the episode has terminated (our agent fell into a hole), which starts the next episode.
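Once training has finished, the Q table can be read as a policy by taking the best action in every state. The sketch below assumes the table has been copied into a plain Array<DoubleArray> with one row per state; greedyPolicy is a hypothetical helper, and the arrows follow the action order used in this article (left, up, right, down).

```kotlin
// Hypothetical helper: turn a trained Q table into one arrow per state.
// q[state][action], with actions ordered left, up, right, down.
fun greedyPolicy(q: Array<DoubleArray>): List<String> {
    val arrows = listOf("←", "↑", "→", "↓")
    return q.map { row ->
        // Index of the largest Q value in this row ( the first one wins on ties ).
        var best = 0
        for (i in row.indices) {
            if (row[i] > row[best]) best = i
        }
        arrows[best]
    }
}
```

For the 16-state table, greedyPolicy(q).chunked(4) groups the arrows into four rows that line up with the grid, which makes it easy to see whether the learned path avoids the holes.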

That’s it, we’ve completed the implementation of a Q learning agent on Android! 🥳
