A Comprehensive Beginners Guide To Linear Algebra For Data Scientists



One of the most common questions we get on Analytics Vidhya is,

How much maths do I need to learn to be a data scientist?

Even though the question sounds simple, there is no simple answer to it. Usually, we say that you need to know basic descriptive and inferential statistics to start, and that is a good place to begin.

But once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work, what their limitations are, and whether they make any underlying assumptions. There are a lot of areas you could study, including algebra, calculus, statistics, 3-D geometry etc.

If you get confused (like I did) and ask experts what you should learn at this stage, most of them would suggest that you go ahead with Linear Algebra.

But, the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra. You can get lost in the detailed mathematics and derivation and learning them would not help as much! I went through that journey myself and hence decided to write this comprehensive guide.

If you have faced this question about how to learn & what to learn in Linear Algebra – you are at the right place. Just follow this guide.


Table of contents

Motivation – Why learn Linear Algebra?

2.3. Planes

3.3 Representing in Matrix form

4.2.3 Use of Inverse in Data Science

5.2 Use of Eigenvectors in Data Science: PCA algorithm

Singular Value Decomposition of a Matrix

End Notes

1. Motivation – Why learn Linear Algebra?

I would like to present 4 scenarios to showcase why learning Linear Algebra is important, if you are learning Data Science and Machine Learning.

Scenario 1:

What do you see when you look at the image above? You most likely said flower and leaves, which is not too difficult. But if I ask you to write that logic so that a computer can do the same for you, it will be a very difficult task (to say the least).

You were able to identify the flower because the human brain has gone through millions of years of evolution. We do not understand what goes on in the background to be able to tell whether the colour in the picture is red or black. We have somehow trained our brains to perform this task automatically.

But making a computer do the same is not easy, and it is an active area of research in Machine Learning and Computer Science in general. Before we work on identifying attributes in an image, let us ponder over a particular question: how does a machine store this image?

You probably know that computers of today are designed to process only 0 and 1. So how can an image such as the one above, with multiple attributes like colour, be stored in a computer? This is achieved by storing the pixel intensities in a construct called a Matrix. This matrix can then be processed to identify colours etc.

So any operation which you want to perform on this image would likely use Linear Algebra and matrices at the back end.

Scenario 2:

If you are somewhat familiar with the Data Science domain, you might have heard the word “XGBOOST”, an algorithm employed most frequently by winners of Data Science competitions. It stores the numeric data in the form of a Matrix to give predictions, which enables XGBOOST to process data faster. Moreover, not just XGBOOST but various other algorithms use Matrices to store and process data.

Scenario 3:

Deep Learning, the new buzzword in town, employs Matrices to store inputs such as image, speech or text to give state-of-the-art solutions to these problems. Weights learned by a Neural Network are also stored in Matrices. Below is a graphical representation of weights stored in a Matrix.

Scenario 4:

Another active area of research in Machine Learning is dealing with text, and the most common techniques employed are Bag of Words, Term Document Matrix etc. All these techniques, in a very similar manner, store counts (or something similar) of words in documents in Matrix form to perform tasks like semantic analysis, language translation, language generation etc.

So, now you would understand the importance of Linear Algebra in machine learning. We have seen image, text or any data, in general, employing matrices to store and process data. This should be motivation enough to go through the material below to get you started on Linear Algebra. This is a relatively long guide, but it builds Linear Algebra from the ground up.

2. Representation of problems in Linear Algebra

Let’s start with a simple problem. Suppose that the price of 2 bats and 1 ball, or of 1 bat and 2 balls, is 100 units. We need to find the price of a bat and a ball.

Suppose the price of a bat is Rs ‘x’ and the price of a ball is Rs ‘y’. Values of ‘x’ and ‘y’ can be anything depending on the situation i.e. ‘x’ and ‘y’ are variables.

Let’s translate this in mathematical form –

2x + y = 100 ...........(1)

Similarly, for the second condition-

x + 2y  =  100 ..............(2)

Now, to find the prices of bat and ball, we need the values of ‘x’ and ‘y’ such that it satisfies both the equations. The basic problem of linear algebra is to find these values of ‘x’ and ‘y’ i.e. the solution of a set of linear equations.

Broadly speaking, in linear algebra data is represented in the form of linear equations. These linear equations are in turn represented in the form of matrices and vectors.

The number of variables as well as the number of equations may vary depending upon the condition, but the representation is in form of matrices and vectors.

2.1 Visualise the problem

It is usually helpful to visualize data problems. Let us see if that helps in this case.

Linear equations represent flat objects. We will start with the simplest one to understand i.e. line. A line corresponding to an equation is the set of all the points which satisfy the given equation. For example,

Points (50,0) , (0,100), (100/3,100/3) and (30,40) satisfy our  equation (1) . So these points should lie on the line corresponding to our equation (1). Similarly, (0,50),(100,0),(100/3,100/3) are some of the points that satisfy equation (2).

Now in this situation, we want both of the conditions to be satisfied i.e. the point which lies on both the lines.  Intuitively, we want to find the intersection point of both the lines as shown in the figure below.

Let’s solve the problem by elementary algebraic operations like addition, subtraction and substitution.

2x + y = 100 .............(1)

x + 2y = 100 ..........(2)

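From equation (1)-

y = 100 - 2x

Put the value of y in equation (2)-

x + 2*(100 - 2x) = 100......(3)

Now, since equation (3) is an equation in the single variable x, it can be solved for x and subsequently y. Here it simplifies to 200 - 3x = 100, so x = 100/3, and then y = 100 - 2x = 100/3. This is the point (100/3, 100/3) that we found on both lines earlier.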
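As a quick check, the same two equations can be handed to a linear solver. This is only a minimal sketch assuming NumPy is available; the names coeffs, rhs and solution are illustrative and not part of the original problem statement.

```python
import numpy as np

# Coefficients of x (bat) and y (ball) in equations (1) and (2)
coeffs = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
rhs = np.array([100.0, 100.0])

# Solve both equations simultaneously
solution = np.linalg.solve(coeffs, rhs)
x, y = solution
print(x, y)  # both are 100/3, roughly 33.33
```

The solver returns the intersection point of the two lines, which is the same answer substitution gives.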

That looks simple – let’s go one step further and explore.

2.2 Let’s complicate the problem

Now, suppose you are given a set of three conditions with three variables each as given below and asked to find the values of all the variables. Let’s solve the problem and see what happens.




From equation (4) we get,


Substituting value of z in equation (6), we get –



Now, we can solve equations (8) and (5) as a case of two variables to find the values of ‘x’ and ‘y’, just as in the bat and ball problem above. Once we know ‘x’ and ‘y’, we can use (7) to find the value of ‘z’.

As you might see, adding an extra variable has tremendously increased our effort in finding the solution of the problem. Now imagine having 10 variables and 10 equations. Solving 10 equations simultaneously can prove to be tedious and time consuming. Now dive into data science.

We have millions of data points in a real data set. It is going to be a nightmare to reach solutions using the approach mentioned above. And imagine if we have to do it again and again. It’s going to take ages before we can solve this problem. And now if I tell you that it’s just one part of the battle, what would you think? So, what should we do? Should we quit and let it go? Definitely NO. Then?

Matrices are used to solve large sets of linear equations. But before we go further and take a look at matrices, let’s visualise the physical meaning of our problem. Give a little bit of thought to the next topic. It directly relates to the usage of Matrices.

2.3 Planes

A linear equation in 3 variables represents the set of all points whose coordinates satisfy the equation. Can you figure out the physical object represented by such an equation? Try to think of 2 variables at a time in any equation and then add the third one. You should figure out that it represents a three-dimensional analogue of a line.

Basically, a linear equation in three variables represents a plane. More technically, a plane is a flat geometric object which extends up to infinity.

As in the case of a line, finding solutions to 3 variables linear equation means we want to find the intersection of those planes. Now can you imagine, in how many ways a set of three planes can intersect? Let me help you out. There are 4 possible cases –

No intersection at all.

Planes intersect in a line.

They can intersect in a plane.

All the three planes intersect at a point.

Can you imagine the number of solutions in each case? Try doing this. Here is an aid picked from Wikipedia to help you visualise.

So, what was the point of having you to visualise all graphs above?

Normal humans like us and most of the super mathematicians can only visualise things in 3 dimensions, and having to visualise things in 4 (or 10000) dimensions is difficult, if not impossible, for mortals. So, how do mathematicians deal with higher dimensional data so efficiently? They have tricks up their sleeves, and Matrices are one such trick for dealing with higher dimensional data.

Now let’s proceed with our main focus i.e. Matrix.

3. Matrix

Matrix is a way of writing similar things together to handle and manipulate them as per our requirements easily. In Data Science, it is generally used to store information like weights in an Artificial Neural Network while training various algorithms. You will be able to understand my point by the end of this article.

Technically, a matrix is a 2-D array of numbers (as far as Data Science is concerned). For example look at the matrix A below.

1 2 3

4 5 6

7 8 9

Generally, rows are denoted by ‘i’ and columns are denoted by ‘j’. The elements are indexed by the ‘i’th row and ‘j’th column. We denote the matrix by a letter, e.g. A, and its elements by A(ij).

In the above matrix

A12 =  2

To reach the result, go along the first row to the second column.
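For readers following along in code, here is a small sketch of the same lookup, assuming NumPy. Note that NumPy counts rows and columns from 0, so the element written A12 in the text is A[0, 1] in code; the variable name a12 is just an illustrative choice.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row 1, column 2 in the text's 1-based notation is [0, 1] with 0-based indexing
a12 = A[0, 1]
print(a12)  # 2
```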

3.1 Terms related to Matrix

Order of matrix – If a matrix has 3 rows and 4 columns, order of the matrix is 3*4 i.e. row*column.

Square matrix – The matrix in which the number of rows is equal to the number of columns.

Diagonal matrix – A matrix with all the non-diagonal elements equal to 0 is called a diagonal matrix.

Upper triangular matrix – Square matrix with all the elements below diagonal equal to 0.

Lower triangular matrix – Square matrix with all the elements above the diagonal equal to 0.

Scalar matrix – Square matrix with all the diagonal elements equal to some constant k and all the non-diagonal elements equal to 0.

Identity matrix – Square matrix with all the diagonal elements equal to 1 and all the non-diagonal elements equal to 0.

Column matrix –  A matrix which consists of only 1 column. Sometimes, it is used to represent a vector.

Row matrix –  A matrix consisting of only 1 row.

Trace – It is the sum of all the diagonal elements of a square matrix.
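Several of the terms above can be illustrated with standard NumPy helpers. This is a sketch under the assumption that NumPy is used; the matrix M and the constant 5 are hypothetical values chosen only for illustration.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

order = M.shape                        # (rows, columns), here (3, 3)
is_square = M.shape[0] == M.shape[1]   # True for a square matrix
upper = np.triu(M)                     # upper triangular: zeros below the diagonal
lower = np.tril(M)                     # lower triangular: zeros above the diagonal
diagonal = np.diag(np.diag(M))         # diagonal matrix built from M's diagonal
identity = np.eye(3, dtype=int)        # 1 on the diagonal, 0 elsewhere
scalar = 5 * identity                  # scalar matrix with k = 5
trace = np.trace(M)                    # 1 + 5 + 9 = 15
```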

3.2 Basic operations on matrix

Let’s play with matrices and realise the capabilities of matrix operations.

Addition – Addition of matrices is very similar to basic arithmetic addition. All you need is that the order of all the matrices being added is the same. This point will become obvious once you do matrix addition yourself.

Suppose we have 2 matrices ‘A’ and ‘B’ and the resultant matrix after the addition is ‘C’. Then

Cij  =   Aij + Bij

For example, let’s take two matrices and add them.

A      =

1 0

2 3

B    =

4 -1

0 5


C        =

5 -1

2 8

Observe that to get the elements of C matrix, I have added A and B element-wise i.e. 1 to 4, 3 to 5 and so on.

Scalar Multiplication –  Multiplication of a matrix with a scalar constant is called scalar multiplication. All we have to do in a scalar multiplication is to multiply each element of the matrix with the given constant.  Suppose we have a constant scalar ‘c’ and a matrix ‘A’.  Then multiplying ‘c’ with ‘A’  gives-

c[Aij] =  [c*Aij]

Transposition – Transposition simply means interchanging the row and column index. For example-

AijT= Aji

Transpose is used in vectorized implementation of linear and logistic regression.

Code in python
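A minimal NumPy sketch of addition, scalar multiplication and transposition, using the matrices A and B from the addition example above. The constant 3 and the variable names scaled and A_t are illustrative choices, not taken from the original code.

```python
import numpy as np

A = np.array([[1, 0],
              [2, 3]])
B = np.array([[4, -1],
              [0, 5]])

C = A + B        # element-wise addition: [[5, -1], [2, 8]]
scaled = 3 * A   # scalar multiplication with c = 3: [[3, 0], [6, 9]]
A_t = A.T        # transposition swaps row and column indices: [[1, 2], [0, 3]]

print(C)
print(scaled)
print(A_t)
```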

Code in R

A

     [,1] [,2] [,3]
[1,]   11   12   13
[2,]   14   15   16
[3,]   17   18   19

t(A)

     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19

Matrix multiplication

Matrix multiplication is one of the most frequently used operations in linear algebra. We will learn to multiply two matrices as well as go through its important properties.

Before landing to algorithms, there are a few points to be kept in mind.

The multiplication of two matrices of orders i*j and j*k results into a matrix of order i*k.  Just keep the outer indices in order to get the indices of the final matrix.

Two matrices will be compatible for multiplication only if the number of columns of the first matrix and the number of rows of the second one are same.

The third point is that order of multiplication matters.

Don’t worry if you can’t get these points. You will be able to understand by the end of this section.

Suppose, we are given two matrices A and B to multiply. I will write the final expression first and then will explain the steps.

I have picked this image from Wikipedia for your better understanding.

In the first illustration, we know that the order of the resulting matrix should be 3*3. So first of all, create a matrix of order 3*3. To determine (AB)ij, multiply each element of the ‘i’th row of A with the corresponding element of the ‘j’th column of B, one at a time, and add all the terms. To help you understand this row-by-column multiplication, take a look at the code below.

import numpy as np

# A and B are reconstructed from the outputs below: consecutive integers starting at 21 and at 31

A = np.arange(21, 30).reshape(3, 3)

B = np.arange(31, 40).reshape(3, 3)

A.dot(B)

B.dot(A)

AB= array([[2250, 2316, 2382], [2556, 2631, 2706], [2862, 2946, 3030]])

BA= array([[2310, 2406, 2502], [2526, 2631, 2736], [2742, 2856, 2970]])

So, how did we get 2250 as first element of AB matrix?  2250=21*31+22*34+23*37. Similarly, for other elements.
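To make the row-by-column rule explicit, here is a sketch that multiplies two matrices with plain loops. It assumes A and B are the 3*3 matrices of consecutive integers starting at 21 and 31, which is what the identity 2250 = 21*31 + 22*34 + 23*37 implies; the function name matmul_by_definition is hypothetical.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Multiply A (i x j) by B (j x k) using the row-by-column rule."""
    rows, inner = A.shape
    inner_b, cols = B.shape
    if inner != inner_b:
        raise ValueError("columns of A must equal rows of B")
    result = np.zeros((rows, cols), dtype=A.dtype)
    for i in range(rows):
        for j in range(cols):
            # Sum of products of row i of A with column j of B
            for k in range(inner):
                result[i, j] += A[i, k] * B[k, j]
    return result

A = np.arange(21, 30).reshape(3, 3)
B = np.arange(31, 40).reshape(3, 3)
AB = matmul_by_definition(A, B)
print(AB[0, 0])  # 2250 = 21*31 + 22*34 + 23*37
```

In practice you would call A.dot(B) or A @ B; the loops are only there to show what those calls compute.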

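Code in R

A*B

     [,1] [,2] [,3]
[1,]  220  252  286
[2,]  322  360  400
[3,]  442  486  532

Note that in R the * operator multiplies matrices element by element, which is what the output above shows (for example, 220 = 11*20). The matrix product described in this section is computed in R with the %*% operator instead.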

Notice the difference between AB and BA.

Properties of matrix multiplication

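1. Matrix multiplication is associative provided the given matrices are compatible for multiplication i.e.

ABC =  (AB)C = A(BC)

# A and B as before; C is reconstructed from the outputs below

A = np.arange(21, 30).reshape(3, 3)

B = np.arange(31, 40).reshape(3, 3)

C = np.arange(41, 50).reshape(3, 3)

(A.dot(B)).dot(C)

array([[306108, 313056, 320004], [347742, 355635, 363528], [389376, 398214, 407052]])

A.dot(B.dot(C))

array([[306108, 313056, 320004], [347742, 355635, 363528], [389376, 398214, 407052]])

2. Matrix multiplication is not commutative i.e. in general AB and BA are not equal. We have verified this result above.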

Matrix multiplication is used in linear and logistic regression when we calculate the value of output variable by parameterized vector method. As we have learned the basics of matrices, it’s time to apply them.

3.3 Representing equations in matrix form

Let me do something exciting for you.  Take help of pen and paper and try to find the value of the matrix multiplication shown below

It can be verified very easily that the expression contains our three equations. We will name our matrices as ‘A’, ‘X’ and ‘Z’.

It explicitly verifies that we can write our equations together in one place as

AX   = Z
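Since the matrices themselves are not reproduced in the text here, the sketch below uses a hypothetical system of three equations to show how AX = Z packs them together. The coefficients, the right-hand side and the candidate solution are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical system, used only for illustration:
#    x +  y +  z = 1
#   2x +  y      = 1
#   5x + 3y + 2z = 4
A = np.array([[1, 1, 1],
              [2, 1, 0],
              [5, 3, 2]])
Z = np.array([1, 1, 4])

# A candidate solution vector X = (x, y, z)
X = np.array([1, -1, 1])

# Each entry of A @ X is the left-hand side of one equation
print(A @ X)  # [1 1 4], which equals Z
```

Each row of A holds the coefficients of one equation, so checking AX = Z checks all three equations at once.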

The next step has to be finding the solution. We will go through two methods to find it.

4. Solving the Problem

Now, we will look in detail the two methods to solve matrix equations.

Row Echelon Form

Inverse of a Matrix

4.1 Row Echelon form

Now you have visualised what an equation in 3 variables represents and had a warm-up on matrix operations. Let’s find the solution of the set of equations given to us to understand our first method of interest, and explore it in detail later.

I have already illustrated that solving the equations by the substitution method can prove to be tedious and time-consuming. Our first method is neater and more systematic: we manipulate our original equations systematically to find the solution. But what are those valid manipulations? Are there any qualifying criteria they have to fulfil? Well, yes. There are two conditions which have to be fulfilled by any manipulation to be valid.

Manipulation should preserve the solution i.e. solution should not be altered on imposing the manipulation.

Manipulation should be reversible.

So, what are those manipulations?

We can swap the order of equations.

We can multiply both sides of equations by any non-zero constant ‘c’.

We can multiply an equation by any non-zero constant and then add to other equation.

These points will become more clear once you go through the algorithm and practice it. The basic idea is to clear variables in successive equations and form an upper triangular matrix. Equipped with prerequisites, let’s get started. But before that, it is strongly recommended to go through this link for better understanding.

I will solve our original problem as an illustration. Let’s do it in steps.

Make an augmented matrix from the matrix ‘A’ and ‘Z’.

What I have done is I have just concatenated the two matrices. The augmented matrix simply tells that the elements in a row are coefficients of ‘x’, ‘y’ and ‘z’ and last element in the row is right-hand side of the equation.

    Multiply row (1) by 2 and subtract it from row (2). Similarly, multiply row (1) by 5 and subtract it from row (3).

      In order to make an upper triangular matrix, multiply row (2) by 2 and then subtract from row (3).

        Now we have simplified our job, let’s retrieve the modified equations. We will start from the simplest i.e. the one with the minimum number of remaining variables. If you follow the illustrated procedure, you will find that last equation comes to be the simplest one.


        Now retrieve equation (2) and put the value of ‘z’ in it to find ‘y’. Do the same for equation (1).
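The steps above can be sketched in code as follows. The augmented matrix here is a hypothetical example chosen so that the three row operations named in the text (subtract 2 times row (1), subtract 5 times row (1), subtract 2 times row (2)) produce an upper triangular matrix; it is not taken from the article's figures.

```python
import numpy as np

# Hypothetical augmented matrix [A | Z]; the last column is the right-hand side
aug = np.array([[1.0, 1.0, 1.0, 1.0],
                [2.0, 1.0, 0.0, 1.0],
                [5.0, 3.0, 2.0, 4.0]])

aug[1] -= 2 * aug[0]   # row (2) minus 2 * row (1)
aug[2] -= 5 * aug[0]   # row (3) minus 5 * row (1)
aug[2] -= 2 * aug[1]   # row (3) minus 2 * row (2): now upper triangular

# Back substitution, starting from the last (simplest) equation
z = aug[2, 3] / aug[2, 2]
y = (aug[1, 3] - aug[1, 2] * z) / aug[1, 1]
x = (aug[0, 3] - aug[0, 1] * y - aug[0, 2] * z) / aug[0, 0]
print(x, y, z)  # 1.0 -1.0 1.0
```

A general implementation would also swap rows when a zero appears on the diagonal, which this short sketch does not handle.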

        Isn’t it pretty simple and clean?

        Let’s ponder over another point. Will we always be able to make an upper triangular matrix which gives a unique solution? Are there different cases possible? Recall that planes can intersect in multiple ways. Take your time to figure it out and then proceed further.

        Different possible cases-

        It’s possible that we get a unique solution as illustrated in above example. It indicates that all the three planes intersect in a point.

        We can get a case like shown below

        Note that in last equation, 0=0 which is always true but it seems like we have got only 2 equations. One of the equations is redundant. In many cases, it’s also possible that the number of redundant equations is more than one. In this case, the number of solutions is infinite.

          There is another case where Echelon matrix looks as shown below

          Let’s retrieve the last equation.



          Is it possible? Very clear cut intuition is NO. But, does this signify something? It’s analogous to saying that it is impossible to find a solution and indeed, it is true. We can’t find a solution for such a set of equations. Can you think what is happening actually in terms of planes? Go back to the section where we saw planes intersecting and find it out.

          Note that this method is practical by hand for a set of 5-6 equations. Although the method is quite simple, as the equation set gets larger the number of manipulations you have to perform grows quickly, and doing them by hand becomes impractical.

          Rank of a matrix – Rank of a matrix is equal to the maximum number of linearly independent row vectors in a matrix.

          A set of vectors is linearly dependent if we can express at least one of the vectors as a linear combination of remaining vectors in the set.
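Rank also ties back to the three cases discussed above. A standard result (the Rouché-Capelli theorem) says that comparing the rank of the coefficient matrix with the rank of the augmented matrix tells us which case we are in. The sketch below assumes NumPy; the function name classify_system and the example matrices are hypothetical.

```python
import numpy as np

def classify_system(A, Z):
    """Classify AX = Z by comparing the ranks of A and of [A | Z]."""
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, Z]))
    if rank_A < rank_aug:
        return "no solution"
    if rank_A == A.shape[1]:
        return "unique solution"
    return "infinitely many solutions"

# Hypothetical examples
A_unique = np.array([[1, 1, 1], [2, 1, 0], [5, 3, 2]])
A_dependent = np.array([[1, 1, 1], [2, 2, 2], [1, 0, 1]])  # row 2 = 2 * row 1

print(classify_system(A_unique, np.array([1, 1, 4])))      # unique solution
print(classify_system(A_dependent, np.array([1, 2, 3])))   # infinitely many solutions
print(classify_system(A_dependent, np.array([1, 3, 3])))   # no solution
```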

          4.2 Inverse of a Matrix

          For solving a large number of equations in one go, the inverse is used. Don’t panic if you are not familiar with the inverse. We will do a good amount of work on all the required concepts. Let’s start with a few terms and operations.

          Determinant of a Matrix – The concept of determinant is applicable to square matrices only. I will lead you to the generalised expression of determinant in steps. To start with, let’s take a 2*2 matrix  A.

          For now, just focus on a 2*2 matrix whose first row is (a, b) and whose second row is (c, d). The expression for the determinant of the matrix A will be:

          det(A) =a*d-b*c

          Note that det(A) is a standard notation for determinant. Notice that all you have to do to find the determinant in this case is to multiply the diagonal elements together and put a positive or negative sign before each product. For determining the sign, sum the row and column indices of the first-row element that leads the term. If the sum is an even number, put a positive sign before the multiplication, and if the sum is odd, put a negative sign. For example, the first term a*d is led by ‘a’, whose indices (1,1) sum to 2, so we put a positive sign before it. The second term b*c is led by ‘b’, whose indices (1,2) sum to 3, so it gets a negative sign.

          Now take a 3*3 matrix ‘B’ and find its determinant.

          I am writing the expression first and then will explain the procedure step by step.

          Each term consists of two parts basically i.e. a submatrix and a coefficient. First of all, pick a constant. Observe that coefficients are picked from the first row only. To start with, I have picked the first element of the first row. You can start wherever you want. Once you have picked the coefficient, just delete all the elements in the row and column corresponding to the chosen coefficient. Next, make a matrix of the remaining elements; each one in its original position after deleting the row and column and find the determinant of this submatrix . Repeat the same procedure for each element in the first row. Now, for determining the sign of the terms, just add the indices of the coefficient element. If it is even, put a positive sign and if odd, put a negative sign. Finally, add all the terms to find the determinant. Now, let’s take a higher order matrix ‘C’ and generalise the concept.
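The procedure described above, expanding along the first row, can be sketched as a short recursive function. This is an illustration rather than an efficient implementation; the function name det_by_expansion and the 3*3 matrix B below are hypothetical.

```python
import numpy as np

def det_by_expansion(M):
    """Determinant via expansion along the first row."""
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    total = 0
    for j in range(n):
        # Delete the first row and column j to get the submatrix
        sub = np.delete(np.delete(M, 0, axis=0), j, axis=1)
        # With 0-based j, an even j corresponds to an even 1-based index sum
        sign = (-1) ** j
        total += sign * M[0, j] * det_by_expansion(sub)
    return total

B = np.array([[2, 0, 1],
              [1, 3, 2],
              [1, 1, 4]])   # hypothetical 3*3 matrix
print(det_by_expansion(B))  # 18
```

For real work np.linalg.det is the better choice, since the recursive expansion becomes very slow as the matrix grows.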

          Try to relate the expression to what we have done already and figure out the final expression.

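          Code in python

          import numpy as np

          arr = np.arange(100,116).reshape(4,4)

          array([[100, 101, 102, 103], [104, 105, 106, 107], [108, 109, 110, 111], [112, 113, 114, 115]])

          # The rows of arr are linearly dependent, so its determinant is 0 up to floating-point rounding

          np.linalg.det(arr)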



          Code in R

                      [,1]    [,2]        [,3]
          [1,] -0.16208333 -0.1125  0.17458333
          [2,] -0.07916667  0.1250 -0.04583333
          [3,]  0.20791667 -0.0125 -0.09541667

          #Determinant

          -0.0004166667

          Minor of a matrix

          Let’s take a square matrix A. Then the minor corresponding to an element A(ij) is the determinant of the submatrix formed by deleting the ‘i’th row and ‘j’th column of the matrix. Hopefully you can relate this to what I have explained already in the determinant section. Let’s take an example.

          To find the minor corresponding to element A11, delete first row and first column to find the submatrix.

          Now find the determinant of this matrix as explained already. If you calculate the determinant of this matrix, you should get 4. If we denote minor by M11, then

          M11 = 4

          Similarly, you can do for other elements.

          Cofactor of a matrix

          In the above discussion of minors, if we also take the sign of each minor into account, the result is called the cofactor. To assign the sign, just sum the indices of the corresponding element. If the sum turns out to be even, assign a positive sign; otherwise assign a negative sign. Let’s take the above illustration as an example. Adding the indices gives 1+1=2, so we should put a positive sign. Let’s call it C11. Then

          C11 = 4

          You should find cofactors corresponding to other elements by yourself for a good amount of practice.

          Cofactor matrix

          Find the cofactor corresponding to each element. Now in the original matrix, replace the original element by the corresponding cofactor. The matrix thus found is called the cofactor matrix corresponding to the original matrix.

          For example, let’s take our matrix A. If you have found the cofactors corresponding to each element, just put them in a matrix according to the rule stated above. If you have done it right, you should get the cofactor matrix

          Adjoint of a matrix – In our journey to find the inverse, we are almost at the end. Just stay with the article for a couple of minutes and we will be there. So, next we will find the adjoint of a matrix.

          Suppose we have to find the adjoint of a matrix A. We will do it in two steps.

          In step 1, find the cofactor matrix of A.

          In step 2, just transpose the cofactor matrix.

          The resulting matrix is the adjoint of the original matrix. For illustration, let’s find the adjoint of our matrix A. We already have the cofactor matrix C. The transpose of the cofactor matrix should be
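Minors, cofactors and the adjoint can be put together in a few lines. This is a sketch assuming NumPy; the helper names minor, cofactor_matrix and adjoint are hypothetical, and the matrix M is an illustrative example rather than the matrix A from the figures.

```python
import numpy as np

def minor(M, i, j):
    """Determinant of M with row i and column j deleted (0-based indices)."""
    sub = np.delete(np.delete(M, i, axis=0), j, axis=1)
    return np.linalg.det(sub)

def cofactor_matrix(M):
    """Replace every element by its signed minor."""
    n = M.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # i + j has the same parity with 0-based or 1-based indices
            C[i, j] = (-1) ** (i + j) * minor(M, i, j)
    return C

def adjoint(M):
    """Transpose of the cofactor matrix."""
    return cofactor_matrix(M).T

M = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [5.0, 3.0, 2.0]])   # hypothetical matrix
print(np.round(adjoint(M)))
```

A useful sanity check is that M multiplied by its adjoint equals det(M) times the identity matrix.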

          Finally, in the next section, we will find the inverse.

          4.2.1 Finding Inverse of a matrix

          Do you remember the concept of the inverse of a number in elementary algebra? If two numbers multiply to give 1, those two numbers are called inverses of each other. Similarly in linear algebra, if two matrices multiply to give an identity matrix, the matrices are called inverses of each other. If this is not clear yet, just carry on with the article and it will come to you intuitively. The best way of learning is learning by doing. So, let’s jump straight to the algorithm for finding the inverse of a matrix A. Again, we will do it in two steps.

          Step 1: Find out the adjoint of the matrix A by the procedure explained in previous sections.

          Step 2: Multiply the adjoint matrix by the inverse of the determinant of the matrix A. The resulting matrix is the inverse of A.

          For example, let’s take our matrix A and find its inverse. We already have the adjoint matrix. The determinant of matrix A comes to be -2. So, its inverse will be

          Now suppose that the determinant comes out to be 0. What happens when we invert the determinant i.e. 0?  Does it make any sense?  It indicates clearly that we can’t find the inverse of such a matrix. Hence, this matrix is non-invertible. More technically, this type of matrix is called a singular matrix.

          Keep in mind that the resultant of multiplication of a matrix and its inverse is an identity matrix. This property is going to be used extensively in equation solving.
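The two-step recipe (adjoint, then divide by the determinant) can be sketched as below, including the singular case. The function name inverse_via_adjoint and the matrix M are hypothetical; in practice np.linalg.inv does the same job more efficiently.

```python
import numpy as np

def inverse_via_adjoint(M):
    """Inverse as adjoint / determinant; raises if M is singular."""
    det = np.linalg.det(M)
    if np.isclose(det, 0.0):
        raise ValueError("matrix is singular, so it has no inverse")
    n = M.shape[0]
    cof = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sub = np.delete(np.delete(M, i, axis=0), j, axis=1)
            cof[i, j] = (-1) ** (i + j) * np.linalg.det(sub)
    # The adjoint is the transpose of the cofactor matrix
    return cof.T / det

M = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [5.0, 3.0, 2.0]])   # hypothetical matrix
M_inv = inverse_via_adjoint(M)
print(np.allclose(M @ M_inv, np.eye(3)))  # True
```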

          Inverse is used in finding parameter vector corresponding to minimum cost function in linear regression.

          4.2.2 Power of matrices

          What happens when we multiply a number by 1? Obviously it remains the same. The same is applicable for an identity matrix i.e. if we multiply a matrix with an identity matrix of the same order, it remains same.

          Lets solve our original problem with the help of matrices. Our original problem represented in matrix was as shown below

          AX = Z i.e.

          What happens when we pre-multiply both sides by the inverse of the coefficient matrix A, i.e. A⁻¹? Let’s find out by doing.

          A⁻¹ A X = A⁻¹ Z

          We can manipulate it as,

          (A⁻¹ A) X = A⁻¹ Z

          But we know that multiplying a matrix by its inverse gives an identity matrix. So,

          IX = A⁻¹ Z

          where I is the identity matrix of the corresponding order.

          If you observe keenly, we have already reached the solution. Multiplying X by the identity matrix does not change it. So the equation becomes

          X = A⁻¹ Z

          For solving the equation, we have to just find the inverse. It can be very easily done by executing a few lines of codes. Isn’t it a really powerful method?
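Here is a minimal sketch of X = A⁻¹ Z in NumPy. The coefficient matrix and right-hand side are hypothetical values used for illustration. The sketch also calls np.linalg.solve, which gives the same answer without explicitly forming the inverse and is usually preferred numerically.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [5.0, 3.0, 2.0]])   # hypothetical coefficient matrix
Z = np.array([1.0, 1.0, 4.0])     # hypothetical right-hand side

X = np.linalg.inv(A) @ Z            # X = A^-1 Z
X_direct = np.linalg.solve(A, Z)    # same answer without forming the inverse
print(np.round(X, 6))
```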

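          Code for inverse in python

          import numpy as np

          arr1 = np.arange(5,21).reshape(4,4)

          # Caution: the rows of arr1 are linearly dependent, so arr1 is singular. NumPy may raise LinAlgError here or return numerically meaningless values.

          np.linalg.inv(arr1)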


          4.2.3 Application of inverse in Data Science

          Inverse is used to calculate the parameter vector by the normal equation in linear regression. Here is an illustration. Suppose we are given a data set as shown below-

          Team League Year RS RA W OBP SLG BA G OOBP OSLG

          ARI NL 2012 734 688 81 0.328 0.418 0.259 162 0.317 0.415

          ATL NL 2012 700 600 94 0.32 0.389 0.247 162 0.306 0.378

          BAL AL 2012 712 705 93 0.311 0.417 0.247 162 0.315 0.403

          BOS AL 2012 734 806 69 0.315 0.415 0.26 162 0.331 0.428

          CHC NL 2012 613 759 61 0.302 0.378 0.24 162 0.335 0.424

          CHW AL 2012 748 676 85 0.318 0.422 0.255 162 0.319 0.405

          CIN NL 2012 669 588 97 0.315 0.411 0.251 162 0.305 0.39

          CLE AL 2012 667 845 68 0.324 0.381 0.251 162 0.336 0.43

          COL NL 2012 758 890 64 0.33 0.436 0.274 162 0.357 0.47

          DET AL 2012 726 670 88 0.335 0.422 0.268 162 0.314 0.402

          HOU NL 2012 583 794 55 0.302 0.371 0.236 162 0.337 0.427

          KCR AL 2012 676 746 72 0.317 0.4 0.265 162 0.339 0.423

          LAA AL 2012 767 699 89 0.332 0.433 0.274 162 0.31 0.403

          LAD NL 2012 637 597 86 0.317 0.374 0.252 162 0.31 0.364

          It describes different variables for different baseball teams, used to predict whether a team makes it to the playoffs or not. But for now, to make it a regression problem, suppose we are interested in predicting OOBP from the rest of the variables. So, ‘OOBP’ is our target variable. To solve this problem using linear regression, we have to find the parameter vector. If you are familiar with the Normal equation method, you will know that to do this we need to make use of Matrices. Let’s proceed further and denote our independent variables below as matrix ‘X’. This data is a part of a data set taken from Analytics Edge.

          so,  X=

          734 688 81 0.328 0.418 0.259

          700 600 94 0.32 0.389 0.247

          712 705 93 0.311 0.417 0.247

          734 806 69 0.315 0.415 0.26

          613 759 61 0.302 0.378 0.24

          748 676 85 0.318 0.422 0.255

          669 588 97 0.315 0.411 0.251

          667 845 68 0.324 0.381 0.251

          758 890 64 0.33 0.436 0.274

          726 670 88 0.335 0.422 0.268

          583 794 55 0.302 0.371 0.236

          676 746 72 0.317 0.4 0.265

          767 699 89 0.332 0.433 0.274

          637 597 86 0.317 0.374 0.252

          To find the final parameter vector (θ), assuming our initial function is parameterised by θ and X, all you have to do is find the inverse of (XᵀX) and apply the normal equation θ = (XᵀX)⁻¹ Xᵀ Y, where Y is the column of target values. This can be accomplished very easily by using code as shown below.

          First of all, let me make the Linear Regression formulation easier for you to comprehend.

          fθ(X) = θᵀX, where θ is the parameter vector we wish to calculate and X is the column vector of features or independent variables.

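          import numpy as np

          # You don't need to worry about the following lines. They just
          # transform the data from the original source into a matrix.

          # df is assumed to be the data set, already loaded as a pandas DataFrame
          Df1 = df.head(14)

          # X is assumed to hold the independent-variable columns of Df1
          X = np.asmatrix(X)

          # x is the transpose of X
          x = np.transpose(X)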




          Imagine if you had to solve this set of equations without using linear algebra. Let me remind you that this data set is less than even 1% of the original data set. Finding the parameter vector without linear algebra would take a lot of time and effort, and could sometimes even be impossible.

          One major drawback of the normal equation method is that it is computationally very costly when the number of features is large. The reason is that if there are ‘n’ features, the matrix (X^T X) is of order n*n, and inverting it costs time of order O(n*n*n). Generally, the normal equation method is applied when the number of features is up to the order of 1,000 or 10,000. Data sets with a larger number of features are handled with the help of another method called Gradient Descent.
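As a minimal sketch of the normal equation θ = (X^T X)^-1 X^T y, the snippet below uses the last seven rows of the table above, keeps only the first three columns of X, and takes the matching OOBP values as the target. Restricting the example to three columns and seven rows is an assumption made to keep it small; it is not the author's original code.

```python
import numpy as np

# Last seven rows of the table above, first three columns of X only
# (an illustrative subset, chosen to keep the example small).
X = np.array([
    [667, 845, 68],
    [758, 890, 64],
    [726, 670, 88],
    [583, 794, 55],
    [676, 746, 72],
    [767, 699, 89],
    [637, 597, 86],
], dtype=float)

# Target variable (OOBP) for the same seven rows.
y = np.array([0.336, 0.357, 0.314, 0.337, 0.339, 0.310, 0.310])

# Normal equation: solve (X^T X) theta = X^T y.
# np.linalg.solve avoids forming the inverse explicitly.
XtX = X.T @ X
Xty = X.T @ y
theta = np.linalg.solve(XtX, Xty)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
print(theta_lstsq)
```

Solving the linear system instead of calling an explicit inverse gives the same parameter vector and is usually the numerically safer choice.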

          5. Eigenvalues and Eigenvectors

          Eigenvectors find a lot of applications in different domains like computer vision, physics and machine learning. If you have studied machine learning and are familiar with the Principal Component Analysis algorithm, you must know how important it is when handling a large data set. Have you ever wondered what is going on behind that algorithm? The concept of Eigenvectors is its backbone. Let us explore Eigenvectors and Eigenvalues to understand it better.

          Let’s multiply a 2-dimensional vector with a 2*2 matrix and see what happens.

          This operation on a vector is called a linear transformation. Notice that the directions of the input and output vectors are different. Note that the column matrix denotes a vector here.

          I will illustrate my point with the help of a picture as shown below.

          In the above picture, there are two types of vectors coloured in red and yellow and the picture is showing the change in vectors after a linear transformation. Note that on applying a linear transformation to yellow coloured vector, its direction changes but the direction of the red coloured vector doesn’t change even after applying the linear transformation. The vector coloured in red is an example of Eigenvector.

          Precisely, for a particular matrix, the vectors whose direction remains unchanged even after applying the linear transformation with that matrix are called the Eigenvectors of that matrix. Remember that the concept of Eigenvalues and Eigenvectors is applicable to square matrices only. Also note that I have taken the case of two-dimensional vectors, but the concept of Eigenvectors applies to a space of any number of dimensions.
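The idea can be checked numerically with a small sketch. The matrix A and the two vectors below are made-up values chosen for illustration, and same_direction is a hypothetical helper, not something from the article.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])   # example matrix, chosen for illustration

v = np.array([1.0, 0.0])     # an eigenvector of A
w = np.array([0.0, 1.0])     # not an eigenvector of A

def same_direction(a, b):
    """True if 2-D vectors a and b lie on the same line through the origin."""
    return bool(np.isclose(a[0] * b[1] - a[1] * b[0], 0.0))

Av = A @ v   # v is only scaled by the transformation
Aw = A @ w   # w is rotated away from its original direction

print(same_direction(v, Av))
print(same_direction(w, Aw))
```

Here v plays the role of the red vector in the picture and w the role of the yellow one.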

          5.1 How to find Eigenvectors of a matrix?

          Suppose we have a matrix A and an Eigenvector ‘x’ corresponding to the matrix. As explained already, after multiplication with the matrix the direction of ‘x’ doesn’t change; only a change in magnitude, by some scalar ‘c’, is permitted. Let us write it as an equation-

          Ax = cx

          (A-c)x = 0  …….(1)

          Please note that in the term (A-c), ‘c’ denotes an identity matrix of the order equal to ‘A’ multiplied by a scalar ‘c’

          We have two unknowns ‘c’ and ‘x’ and only one equation. Can you think of a trick to solve this equation?

          In equation (1), taking ‘x’ to be the zero vector gives a trivial solution that tells us nothing. Hence, the only choice is that (A-c) is a singular matrix. A singular matrix has the property that its determinant equals 0. We will use this property to find the value of ‘c’.

          Det(A-c) = 0

          Once you find the determinant of the matrix (A-c) and equate it to 0, you will get an equation in ‘c’ whose order depends on the given matrix A. All you have to do is find the solutions of this equation. Suppose that we find the solutions ‘c1’, ‘c2’ and so on. Put ‘c1’ in equation (1) and find the vector ‘x1’ corresponding to ‘c1’. The vector ‘x1’ that you just found is an Eigenvector of A. Now, repeat the same procedure with ‘c2’, ‘c3’ and so on.
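The procedure above can be sketched for a 2*2 matrix, where Det(A-c) = 0 reduces to a quadratic in ‘c’. The matrix below is a made-up example, and the shortcut used to read off each Eigenvector only works because the top-right entry of A is non-zero; treat it as an illustration rather than a general method.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # example matrix, chosen for illustration

# For a 2x2 matrix, det(A - cI) = c^2 - trace(A) * c + det(A).
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
eigenvalues = np.sort(np.roots(coeffs).real)   # real here because A is symmetric

eigenvectors = []
for c in eigenvalues:
    # First row of (A - cI) x = 0 reads (a11 - c) * x1 + a12 * x2 = 0,
    # which x = (a12, c - a11) satisfies (valid because a12 != 0 here).
    x = np.array([A[0, 1], c - A[0, 0]])
    eigenvectors.append(x)
    print(c, x, np.allclose(A @ x, c * x))
```

For larger matrices, a library routine such as np.linalg.eig does the same job without solving the polynomial by hand.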

          Code for finding EigenVectors in python

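          import numpy as np

          arr = np.arange(1,10).reshape(3,3)

          # eigenvalues, and eigenvectors as the columns of the second array
          eigenvalues, eigenvectors = np.linalg.eig(arr)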


          Code in R for finding Eigenvalues and Eigenvectors:

          View the code on Gist.


          147.737576 5.317459 -3.055035

                     [,1]       [,2]        [,3]
          [1,] -0.3948374  0.4437557 -0.74478185
          [2,] -0.5497457 -0.8199420 -0.06303763
          [3,] -0.7361271  0.3616296  0.66432391

          5.2 Use of Eigenvectors in Data Science

          The concept of Eigenvectors is applied in the machine learning algorithm Principal Component Analysis (PCA). Suppose you have data with a large number of features, i.e. it has very high dimensionality. It is possible that there are redundant features in that data. Apart from this, a large number of features reduces efficiency and takes more disk space. What PCA does is scrap some of the less important features. But how do we determine those features? Here, Eigenvectors come to our rescue. Let’s go through the algorithm of PCA. Suppose we have ‘n’-dimensional data and we want to reduce it to ‘k’ dimensions. We will do it in steps.

          Step 1: Data is mean normalised and feature scaled.

          Step 2: We find out the covariance matrix of our data set.

          Now we want to reduce the number of features, i.e. dimensions. But cutting off features means loss of information. We want to minimise the loss of information, i.e. we want to keep the maximum variance. So, we want to find the directions in which the variance is maximum. We will find these directions in the next step.

          Step 3: We find the Eigenvectors of the covariance matrix, along with their Eigenvalues.

          Step 4: We will select the ‘k’ Eigenvectors corresponding to the ‘k’ largest Eigenvalues and form a matrix in which each Eigenvector constitutes a column. We will call this matrix U.

          Now it’s time to find the reduced data points. Suppose you want to reduce a data point ‘a’ in the data set to ‘k’ dimensions. To do so, you just have to transpose the matrix U and multiply it with the vector ‘a’. You will get the required vector in ‘k’ dimensions.
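The PCA steps described above can be sketched with NumPy as follows. The data is synthetic (three features, the third almost a copy of the first), and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 points, n = 3 features, the third nearly redundant.
base = rng.normal(size=(200, 2))
data = np.column_stack([base[:, 0], base[:, 1],
                        base[:, 0] + 0.01 * rng.normal(size=200)])

# Mean normalisation and feature scaling.
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Covariance matrix (features are columns, hence rowvar=False).
cov = np.cov(Z, rowvar=False)

# Eigenvalues and eigenvectors of the covariance matrix
# (eigh is used because a covariance matrix is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the k eigenvectors with the largest eigenvalues as the columns of U.
k = 2
order = np.argsort(eigvals)[::-1][:k]
U = eigvecs[:, order]

# Reduce one data point 'a' to k dimensions: U^T a.
a = Z[0]
a_reduced = U.T @ a

# Reduce the whole data set at once.
Z_reduced = Z @ U
print(a_reduced)
print(eigvals[order].sum() / eigvals.sum())   # share of variance kept
```

Because the third feature is almost redundant, two directions are enough to keep nearly all of the variance in this example.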

          6. Singular Value Decomposition

          Suppose you are given a feature matrix A. As the name suggests, we decompose our matrix A into three constituent matrices for a special purpose. It is sometimes said that SVD is a sort of generalisation of Eigenvalue decomposition. I will not go into its mathematics for the reason already explained and will stick to our plan, i.e. the use of SVD in data science.

          SVD is used to remove the redundant features in a data set. Suppose you have a data set which comprises 1000 features. Any real data set with such a large number of features is bound to contain redundant ones. If you have run ML algorithms, you should be familiar with the fact that redundant features cause a lot of problems. Also, running an algorithm on the original data set will be time inefficient and will require a lot of memory. So, what should you do to handle such a problem? Do we have a choice? Can we omit some features? Will it lead to a significant amount of information loss? Will we be able to get an efficient enough algorithm even after omitting them? I will answer these questions with the help of an illustration.

          Look at the pictures shown below taken from this link

          We can convert this tiger into black and white and think of it as a matrix whose elements represent the pixel intensity at the relevant location. In simpler words, the matrix contains information about the intensity of the pixels of the image in the form of rows and columns. But is it necessary to have all the columns in the intensity matrix? Will we be able to represent the tiger with a smaller amount of information? The next picture will clarify my point. In this picture, different images are shown corresponding to different ranks with different resolution. For now, just assume that a higher rank implies a larger amount of information about pixel intensity. The image is taken from this link

          It is clear that we can reach a pretty good image with 20 or 30 ranks instead of 100 or 200 ranks, and that’s what we want to do in the case of highly redundant data. What I want to convey is that to get a reasonable hypothesis, we don’t have to retain all the information present in the original dataset. Some of the features even cause problems in reaching the best solution. For example, the presence of redundant features causes multicollinearity in linear regression. Also, some features are not significant for our model. Omitting these features helps to find a better fit, along with better time efficiency and less disk space. Singular value decomposition is used to get rid of the redundant features present in our data.
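The rank idea from the tiger example can be sketched with NumPy. A synthetic matrix stands in for the image here (no real image is loaded), and rank_k_approximation is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a grayscale image: a 50x40 matrix that is close to rank 5.
A = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 40))
A += 0.01 * rng.normal(size=A.shape)

# Decompose A into three matrices: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(k):
    """Rebuild A from its k largest singular values only."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

errors = {}
for k in (1, 5, 20, 40):
    errors[k] = np.linalg.norm(A - rank_k_approximation(k)) / np.linalg.norm(A)
    print(k, errors[k])
```

The relative error drops sharply once k reaches the number of directions that actually carry information, which is the same effect as the tiger looking acceptable at rank 20 or 30.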

          7. End notes



          A Comprehensive Guide To Reinforcement Learning

          Everyone heard when DeepMind announced its milestone project AlphaGo –

          AlphaGo is the first computer program to defeat a professional human Go player, the first to defeat a Go world champion, and is arguably the strongest Go player in history.

          This alone says a lot about how powerful the program itself is but how did they achieve it? They did it through novel approaches in Reinforcement learning!

          And it’s not just fixated on games, the applications range from –

          In this guide, I’ll walk you through the theory behind reinforcement learning, ideas based on theory, various algorithms with basic concepts, and implementation in Python!

          Table of Contents

          Fundamentals of Reinforcement learning

          Creating an environment using OpenAI Gym

          Algorithms (Concepts and Implementation)

          RL – Libraries in Python

          Challenges in Reinforcement Learning


          Fundamentals of Reinforcement Learning

          Let’s dig into the fundamentals of RL and review them step by step.

          Key elements fundamental to RL

          There are basically 4 elements – Agent, Environment, State-Action, Reward


          Agent

          An agent is a program that learns to make decisions. We can say that an agent is the learner in the RL setting. For instance, a badminton player can be considered an agent, since the player learns to make the finest shots with the right timing to win the game. Similarly, a player in an FPS game is an agent, as he takes the best actions to improve his score on the leaderboard.


          Environment

          For instance, in the badminton example, the court is the environment in which the player moves and takes appropriate shots. Similarly, in the FPS game, we have a map with all the essentials (guns, other players, ground, buildings), which is the environment in which the agent acts.

          State – Action

          A state is a moment or instance of the environment at any point. Let’s understand it with the help of chess. There are 64 squares, 2 sides and different pieces to move. This chessboard will be our environment and the player our agent. At some point after the start of the game, pieces will occupy different places on the board, and with every move the board will differ from its previous situation. This instance of the board is called a state (denoted by s). Any move will change the state to a different one, and the act of moving pieces is called an action (denoted by a).


          Reward

          We have seen how taking actions changes the state of the environment. For each action ‘a’ the agent takes, it receives a reward (feedback). The reward is simply a numerical value, which can be negative or positive and of different magnitudes.

          Let’s take the badminton example: if the agent takes a shot which results in a positive score, we can assign a reward of +10. But if the shuttle lands inside its own court, it gets a negative reward of -10. We can further break down rewards by giving small positive rewards (+2) for increasing the chances of a positive score, and vice versa.

          Rough Idea to relate Reinforcement Learning problems

          Before we move on to the Math essentials, I’d like to give a bird’s-eye view of the reinforcement learning problem. Let’s take the analogy of training a pet to do a few tricks. For every successful completion of a trick, we give our pet a treat. If the pet fails to do the trick, we don’t give it a treat. So, our pet will figure out what action caused it to receive a treat and repeat that action. In this way, our pet will learn a trick successfully while aiming to maximize the treats it can receive.

          Here the pet was the agent and the ground floor was our environment, which includes our pet. The treats given were rewards, and every action the pet took landed it in a different state than the previous one.

          Markov Decision Process (MDP)

          The Markov Decision Process (MDP) provides a mathematical framework for solving RL problems. Almost all RL problems can be modeled as an MDP. MDPs are widely used for solving various optimization problems. But to understand what MDP is, we’d have to understand Markov property and Markov Chain.

          The Markov property and Markov chain

          The Markov property, simply put, says that future states do not depend on the past and depend solely on the present state. A sequence of states that obeys the Markov property is called a Markov Chain.

          The change from one state to another is called a transition, and its probability is the transition probability. In simpler words, in every state we can have different choices (actions) to choose from. Each choice (action) will result in a different state, and the probability of reaching the next state (s’) will be stored in our sequence.

          Now, if we add rewards to Markov Chains, we get a sequence with the state, transition probability, and rewards (the Markov Reward Process). If we further extend this to include actions, it becomes the Markov Decision Process. So, an MDP is just a sequence of (state, action, transition probability, reward). We will learn more concepts as we move further.
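A Markov Chain with transition probabilities can be sketched with a small matrix. The three state names and all probabilities below are made-up values for illustration, and sample_chain is a hypothetical helper.

```python
import numpy as np

states = ["A", "B", "C"]          # hypothetical state names
# P[i, j] = probability of moving from state i to state j (made-up values).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)

def sample_chain(start, n_steps):
    """Sample a state sequence; each step depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        path.append(rng.choice(len(states), p=P[path[-1]]))
    return [states[i] for i in path]

print(sample_chain(0, 10))

# Distribution over states after 3 steps, starting from state A.
start = np.array([1.0, 0.0, 0.0])
after_three = start @ np.linalg.matrix_power(P, 3)
print(after_three)
```

Note how the next state is drawn using only the row of P for the current state, which is exactly the Markov property. The same matrix also ties back to linear algebra: the distribution after several steps is a vector times a matrix power.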

          OpenAI Gym for Training Reinforcement Learning Agents

          OpenAI is an AI research and deployment company whose goal is to ensure that artificial general intelligence benefits all of humanity. OpenAI provides a toolkit for training RL agents called Gym.

          As we have learned, to create an RL model we need to create an environment first. This is where Gym comes into play: it helps us create abstract environments to train our agents on.

          Installing Gym

          Overview of Gym

          Creating an episode in the Gym environment

          Cart-Pole balancing with a random agent

          Installing Gym

          Its installation is simple using pip. Though the latest version of Gym was updated only a few days ago after years, we can still use the 0.17 version.

          pip install gym

          You can also clone it from the repository.

          Creating our first environment using Gym

          We will use the pre-built examples in Gym. One can explore all the environments in the OpenAI Gym documentation. Let’s start with CartPole.

          First, we import Gym

          import gym

          To create an environment we use the ‘make’ function, which requires one parameter, the environment ID (pre-built ones can be found in the documentation).

          env = gym.make('CartPole-v0')

          We can see what our environment actually looks like using the render function.


          The goal here is to balance the pole as long as possible by moving the cart left or right.

          To close rendered environment, simply use

          env.close()

          Cartpole-Balancing using Random Agent

          import gym

          env = gym.make('CartPole-v0')
          env.reset()
          for _ in range(1000):
              env.render()
              env.step(env.action_space.sample())  # take a random action
          env.close()

          We created an environment; the first thing we do is reset it to its default values. Then we ran it for 1000 timesteps, taking random actions. The ‘step’ function transitions our current state to the next state by taking the action our agent gives (in this case a random one).


          If we want to do better than just taking random actions, we’d have to understand what our actions are doing to the environment.

          The environment’s step function returns what we need in the form of 4 values :

          observation (object): an environment-specific object representing the observation of our environment. For example, state of the board in a chess game, pixels as data from cameras or joints torque in robotic arms.

          reward (float): the amount of reward achieved by each action taken. It varies from env to env but the end goal is always to maximize our total reward.

          done (boolean): if it’s time to reset our environment again. Most of the tasks are divided into a defined episode (completion) and if done is true it means the env has completed the episode. For example, a player wins in chess or we lose all lives in the Mario game.

          info (dict): It is simply diagnostic information that is useful for debugging. The agent does not use this for learning, although it can be used for other purposes. If we want to extract some info from each timestep or episode it can be done through this.

          This is an implementation of the classic “agent-environment loop”. With each timestep, the agent chooses an action, and the environment returns an observation and a reward with info(not used for training).

          The whole process starts by calling the reset() function, which returns an initial observation.

          import gym

          env = gym.make('CartPole-v0')
          for i_episode in range(20):
              observation = env.reset()
              for t in range(100):
                  env.render()  # renders our cartpole env
                  print(observation)
                  action = env.action_space.sample()  # takes random action from action space
                  observation, reward, done, info = env.step(action)
                  if done:
                      # prints number of timesteps it took to finish the episode
                      print("Episode finished after {} timesteps".format(t+1))
                      break
          env.close()

          What we see here is the observation at each timestep. In the CartPole env, the observation is a list of 4 continuous values, while our actions are just 0 or 1. To check the action and observation spaces, we can simply print them –

          import gym

          env = gym.make('CartPole-v0')
          print(env.action_space)  # type and size of action space
          print(env.observation_space)  # type and size of observation space

          Discrete and Box are the most common types of spaces in Gym envs. Discrete, as the name suggests, has a fixed set of defined values, while Box consists of continuous values. Action values are as follows –

          Value   Action
          0       Push cart towards the left
          1       Push cart towards the right

          Meanwhile, the observation space is a Box(4,) with 4 continuous values denoting –

          Example value   Meaning
          0.02002610      Position of Cart
          -0.0227738      Velocity of Cart
          0.01257453      Angle of Pole
          0.04411007      Velocity of Pole at the tip

          Gym environments are not restricted to text or cart poles; the wide range includes –

          Atari games

          Box2D

          MuJoCo

          And many more… We can also create our own custom environment in the gym suiting to our needs.

          Popular Algorithms in Reinforcement Learning

          In this section, I will cover popular algorithms commonly used in Reinforcement Learning. Each starts with some basic concepts, followed by an implementation in Python.

          Deep Q Network

          The objective of reinforcement learning is to find the optimal policy, that is, the policy that gives us the maximum return (the sum of total rewards of the episode). To compute the policy we first need to compute the Q function. Once we have the Q function, we can create a policy that selects the best action based on the maximum Q value. For instance, let’s assume we have two states A and B. We are in state A, which has 4 choices, and corresponding to each choice (action) we have a Q value. In order to maximize returns, we follow the policy that picks argmax(Q) for that state.

          State   Action   Value
          A       left     25
          A       Right    35
          A       up       12
          A       down     6
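Picking the action with argmax(Q) for state A can be sketched in a couple of lines. The dictionary simply restates the Q values from the table above, with action names written in lower case.

```python
# Q values for state A, taken from the table above.
q_values = {"left": 25, "right": 35, "up": 12, "down": 6}

# Greedy policy: choose the action with the largest Q value.
best_action = max(q_values, key=q_values.get)
print(best_action)  # right
```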

          When we use a neural network to approximate the Q value, that network is called the Q network, and if it is a deep neural network, it is called a deep Q network (DQN).

          The basic elements we need for understanding DQN are –

          Replay Buffer

          Loss Function

          Target Network

          Replay Buffer –

          We know that the agent makes a transition from a state s to the next state s′ by performing some action a, and then receives a reward r. We can save this transition information in a buffer called a replay buffer or experience replay. Later, we sample random batches from the buffer to train our agent.
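A replay buffer along these lines can be sketched with a deque. The class name SimpleReplayBuffer and its method names are hypothetical, chosen for this illustration; they are not the buffer used elsewhere in the article.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, max_size):
        # Once max_size is reached, the oldest transitions drop out first.
        self.buffer = deque(maxlen=max_size)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random batch, without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = SimpleReplayBuffer(max_size=100)
for t in range(150):
    buffer.store(t, 0, 1.0, t + 1, False)   # dummy transitions
batch = buffer.sample(8)
print(len(buffer), len(batch))
```

Sampling at random, rather than replaying transitions in order, breaks the correlation between consecutive steps of the same episode.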

          Loss Function –

          We learned that in DQN, our goal is to predict the Q value, which is just a continuous value. Thus, in DQN we basically perform a regression task. We generally use the mean squared error (MSE) as the loss function for a regression task, although other functions can also be used to compute the error.
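Written out, the MSE loss over a sampled batch of K transitions takes the following form. The symbols K, θ and θ′ are notation assumed here: θ denotes the parameters of the main network and θ′ those of the network used to compute the target value y_i.

```latex
L(\theta) = \frac{1}{K} \sum_{i=1}^{K} \bigl( y_i - Q_\theta(s_i, a_i) \bigr)^2,
\qquad
y_i =
\begin{cases}
r_i & \text{if } s'_i \text{ is terminal} \\
r_i + \gamma \max_{a'} Q_{\theta'}(s'_i, a') & \text{otherwise}
\end{cases}
```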

          Target Network –

          There is one issue with our loss function: we need a target value to compute the loss, but when the target itself keeps moving, we can no longer get stable values of y_i. So we create another network, the target network, which is updated slowly compared to our original network. Since its values of y_i stay frozen between updates, we use it to compute the loss. It will be better understood with the code below.

          Let’s start coding our DQN algorithm!

          import random
          import gym
          import numpy as np
          from collections import deque
          from tensorflow.keras.models import Sequential
          from tensorflow.keras.layers import Flatten, Conv2D, MaxPooling2D, Dense, Activation
          from tensorflow.keras.optimizers import Adam

          env = gym.make("MsPacman-v0")
          state_size = (88, 80, 1)  # defining state size as image input pixels
          action_size = env.action_space.n  # number of actions to be taken

          Pre-processing to feed image in our CNN

          color = np.array([210, 164, 74]).mean()

          def preprocess_state(state):
              # creating a function to pre-process raw image from game
              # cropping the image and resizing it
              image = state[1:176:2, ::2]
              # converting the image to greyscale
              image = image.mean(axis=2)
              # improving contrast
              image[image == color] = 0
              # normalize
              image = (image - 128) / 128 - 1
              # reshape and returning the image in format of state space
              image = np.expand_dims(image.reshape(88, 80, 1), axis=0)
              return image

          We need to pre-process the raw image from the game: removing color, cropping to the desired area, and resizing it to the state space we defined previously.

          Building DQN class

          class DQN:
              def __init__(self, state_size, action_size):
                  self.state_size = state_size
                  self.action_size = action_size
                  # replay buffer; the maximum length of 5000 is an assumed value
                  self.replay_buffer = deque(maxlen=5000)
                  # discount factor; 0.9 is an assumed value
                  self.gamma = 0.9
                  # with epsilon = 0.8 the agent takes a random action 80% of the time
                  self.epsilon = 0.8
                  # define the update rate at which we update the target network
                  self.update_rate = 1000
                  # building our main Neural Network
                  self.main_network = self.build_network()
                  # building our target network (same as our main network)
                  self.target_network = self.build_network()
                  # copying weights to target network
                  self.target_network.set_weights(self.main_network.get_weights())

              def build_network(self):
                  # creating a neural net
                  model = Sequential()
                  model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))
                  model.add(Activation('relu'))
                  # adding hidden layer 1
                  model.add(Conv2D(64, (4, 4), strides=2, padding='same'))
                  model.add(Activation('relu'))
                  # adding hidden layer 2
                  model.add(Conv2D(64, (3, 3), strides=1, padding='same'))
                  model.add(Activation('relu'))
                  model.add(Flatten())
                  # feeding flattened map into our fully connected layer
                  model.add(Dense(512, activation='relu'))
                  model.add(Dense(self.action_size, activation='linear'))
                  # compiling model using MSE loss with adam optimizer
                  model.compile(loss='mse', optimizer=Adam())
                  return model

              # we store the whole transition in the buffer so that random batches can be sampled later
              def store_transistion(self, state, action, reward, next_state, done):
                  self.replay_buffer.append((state, action, reward, next_state, done))

              # defining epsilon greedy function so our agent can tackle exploration vs exploitation issue
              def epsilon_greedy(self, state):
                  # whenever a random value < epsilon we take random action
                  if random.uniform(0, 1) < self.epsilon:
                      return np.random.randint(self.action_size)
                  # otherwise we calculate the Q values and take the best action
                  Q_values = self.main_network.predict(state)
                  return np.argmax(Q_values[0])

              # this is our main training function
              def train(self, batch_size):
                  # we sample a random batch from our replay buffer to train the agent on past actions
                  minibatch = random.sample(self.replay_buffer, batch_size)
                  # compute Q value using target network
                  for state, action, reward, next_state, done in minibatch:
                      # we calculate total expected rewards from this policy if episode is not terminated
                      if not done:
                          target_Q = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))
                      else:
                          target_Q = reward
                      # we compute the values from our main network and store it in Q_value
                      Q_values = self.main_network.predict(state)
                      # update the target Q value for losses
                      Q_values[0][action] = target_Q
                      # training main network
                      self.main_network.fit(state, Q_values, epochs=1, verbose=0)

              # update the target network weights by copying from the main network
              def update_target_network(self):
                  self.target_network.set_weights(self.main_network.get_weights())

          Now we train our network after defining the values of hyper-params

          num_episodes = 500  # number of episodes to train agent on
          num_timesteps = 20000  # number of timesteps to be taken in each episode (until done)
          batch_size = 8  # taking batch size as 8
          num_screens = 4  # number of past game screens we want to use
          dqn = DQN(state_size, action_size)  # initiating the DQN class
          done = False  # setting done to false (start of episode)
          time_step = 0  # beginning of timestep

          for i in range(num_episodes):
              # reset total returns to 0 before starting each episode
              Return = 0
              # preprocess the raw image from game
              state = preprocess_state(env.reset())
              for t in range(num_timesteps):
                  env.render()  # render the env
                  time_step += 1  # increase timestep with each loop
                  # updating target network
                  if time_step % dqn.update_rate == 0:
                      dqn.update_target_network()
                  # selection of action based on epsilon-greedy strategy
                  action = dqn.epsilon_greedy(state)
                  # saving the output of env after taking 'action'
                  next_state, reward, done, _ = env.step(action)
                  # pre-process next state
                  next_state = preprocess_state(next_state)
                  # storing transition to be used later via replay buffer
                  dqn.store_transistion(state, action, reward, next_state, done)
                  # updating current state to next state
                  state = next_state
                  # calculating total reward
                  Return += reward
                  if done:
                      print('Episode: ', i, ',' 'Return', Return)
                      # if episode is completed terminate the loop
                      break
                  # we train only once the replay buffer holds more than batch_size transitions;
                  # until then the agent just collects transitions
                  if len(dqn.replay_buffer) > batch_size:
                      dqn.train(batch_size)

          Results – Agent learned to play the game successfully.

          DDPG (Deep Deterministic Policy Gradient)

          DQN works only for discrete action spaces, but we do not always need discrete values. What if we want continuous action output? To handle this, we turn to DDPG (Timothy P. Lillicrap et al., 2015), which deals with the case where both the state and action spaces are continuous. The ideas of the replay buffer, target networks and loss functions are taken from DQN, but with novel techniques which I will explain in this section.

          Now, we move on to the core Actor-critic method. The original paper explains this concept quite well, but here is a rough idea. The actor takes a decision based on a policy; the critic evaluates the state-action pair and assigns it a Q value. If the state-action pair is good enough according to the critic, it will have a higher Q value (more preferable), and vice versa.

          Critic Network

          import torch as T
          import torch.nn as nn
          import torch.nn.functional as F
          import torch.optim as optim

          # creating class for critic network
          class CriticNetwork(nn.Module):
              def __init__(self, beta):
                  super(CriticNetwork, self).__init__()
                  # fb, insta as state of 2 dim
                  self.input_dims = 2
                  # hidden layers with 256 neurons
                  self.fc1_dims = 256
                  # hidden layers with 256 neurons
                  self.fc2_dims = 256
                  # fb, insta spends as 2 actions to be taken
                  self.n_actions = 2
                  # state + action as fully connected layer
                  self.fc1 = nn.Linear(2 + 2, self.fc1_dims)
                  # adding hidden layers
                  self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
                  # final Q value from network
                  self.q1 = nn.Linear(self.fc2_dims, 1)
                  # using adam optimizer with beta as learning rate
                  self.optimizer = optim.Adam(self.parameters(), lr=beta)
                  # device available to train on CPU/GPU
                  self.device = T.device('cuda' if T.cuda.is_available() else 'cpu')
                  # assigning device
                  self.to(self.device)

              # forward pass of the critic network with state and action as input
              def forward(self, state, action):
                  # concatenating state and action before feeding to Neural Net
                  q1_action_value = self.fc1(T.cat([state, action], dim=1))
                  q1_action_value = F.relu(q1_action_value)
                  # adding hidden layer
                  q1_action_value = self.fc2(q1_action_value)
                  q1_action_value = F.relu(q1_action_value)
                  # getting final Q value
                  q1 = self.q1(q1_action_value)
                  return q1

          Now we move to the actor network. We create a similar network, but here are some key points which you must remember while making the actor.

          Weight initialization is not necessary, but if we provide an initialization the network generally tends to learn faster.

          Choosing an optimizer is very important, and results can vary from optimizer to optimizer.

          How to choose the last activation function depends solely on the kind of action space you are using. For example, if it is small and all values lie between [-1,-2,-3] and [1,2,3], you can go ahead and use the tanh (squashing) function, but if the values range from [-2,-40,-230] to [2,60,560], you might want to change the activation function or create a wrapper.
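One way to build such a wrapper is to rescale a squashed output to the bounds of each action dimension. The sketch below assumes a tanh-style output in [-1, 1]; scale_action is a hypothetical helper, and the bounds are the example ranges mentioned above.

```python
import numpy as np

def scale_action(squashed, low, high):
    """Map values in [-1, 1] (e.g. tanh outputs) to the range [low, high]."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    return low + (np.asarray(squashed) + 1.0) * 0.5 * (high - low)

low = [-2, -40, -230]     # lower bounds from the example above
high = [2, 60, 560]       # upper bounds from the example above
raw = np.tanh(np.array([0.3, -1.2, 2.5]))   # stand-in for a network output
print(scale_action(raw, low, high))
```

With this kind of wrapper the network itself can keep a bounded activation, while the environment still receives actions in its own units.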


          class ActorNetwork(nn.Module):
              # creating actor Network
              # (the defaults of 256 for fc1_dims and fc2_dims are assumed, mirroring the critic)
              def __init__(self, alpha, fc1_dims=256, fc2_dims=256):
                  super(ActorNetwork, self).__init__()
                  # fb and insta as 2 input state dim
                  self.input_dims = 2
                  # first hidden layer dimension
                  self.fc1_dims = fc1_dims
                  # second fully connected layer dimension
                  self.fc2_dims = fc2_dims
                  # total number of actions
                  self.n_actions = 2
                  # connecting fully connected layers
                  self.fc1 = nn.Linear(self.input_dims, self.fc1_dims)
                  self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
                  # final output as number of action values we need (2)
                  self.mu = nn.Linear(self.fc2_dims, self.n_actions)
                  # using adam as optimizer
                  self.optimizer = optim.Adam(self.parameters(), lr=alpha)
                  # setting up device (CPU or GPU) to be used for computation
                  self.device = T.device('cuda' if T.cuda.is_available() else 'cpu')
                  # connecting the device
                  self.to(self.device)

              def forward(self, state):
                  # taking state as input to our fully connected layer
                  prob = self.fc1(state)
                  # adding activation layer
                  prob = F.relu(prob)
                  # adding second layer
                  prob = self.fc2(prob)
                  prob = F.relu(prob)
                  # fixing each output between 0 and 1
                  mu = T.sigmoid(self.mu(prob))
                  return mu

          Note: We used 2 hidden layers since our action space was small and our environment was not very complex. Authors of DDPG used 400 and 300 neurons for 2 hidden layers but we can increase at the cost of computation power.

          Just like the gym env, the agent has some conditions too. We initialize our target networks with the same weights as our original (actor-critic) networks. Since we are chasing a moving target, the target networks create stability and help the original networks to train.

          We initialize all the basic requirements; as you might have noticed, we have a loss function parameter too. We can try different loss functions and choose whichever works best (for example, smooth L1 loss). The paper used MSE loss, so we will go ahead and use it as the default.

          Here we include the 'choose action' function. You can also create an evaluation function that outputs actions without noise, to cross-check values.

          The 'update parameter' function is where we do soft updates (target networks) and hard updates (a complete copy of the original networks). It takes only one parameter, tau, which plays a role similar to a learning rate.

          It is used to soft update our target networks. In the paper, the authors found the best tau to be 0.001, and this value is commonly used across different papers.
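To make the role of tau concrete, here is a minimal sketch of the blending rule on plain Python lists. The lists stand in for network weights and `soft_update` is a hypothetical helper, not the article's method.

```python
def soft_update(online, target, tau):
    """Blend online weights into target weights: tau*online + (1 - tau)*target."""
    return [tau * w + (1.0 - tau) * t for w, t in zip(online, target)]

online = [1.0, 2.0, 3.0]  # stand-in for the original network's weights
target = [0.0, 0.0, 0.0]  # stand-in for the target network's weights

hard = soft_update(online, target, tau=1.0)    # tau = 1 copies the weights outright
soft = soft_update(online, target, tau=0.001)  # tau = 0.001 nudges the target slightly
print(hard)
print(soft)
```

With tau = 1 the target becomes an exact copy (a hard update), while a small tau lets the target trail slowly behind the original network.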

          class Agent(object):
              # binding everything we did till now
              # parameters without defaults (tau, env) must come before those with defaults
              def __init__(self, alpha, beta, tau, env, input_dims=2, gamma=0.99,
                           n_actions=2, max_size=1000000, batch_size=64):
                  # fixing discount rate gamma
                  self.gamma = gamma
                  # for soft updating target network, fix tau
                  self.tau = tau
                  # replay buffer with max number of transitions to store
                  self.memory = ReplayBuffer(max_size)
                  # batch size to take from replay buffer
                  self.batch_size = batch_size
                  # creating actor network using learning rate alpha
                  self.actor = ActorNetwork(alpha)
                  # creating target actor network with same learning rate
                  self.target_actor = ActorNetwork(alpha)
                  # creating critic network and its target with beta as learning rate
                  self.critic = CriticNetwork(beta)
                  self.target_critic = CriticNetwork(beta)
                  # adjusting scale as std for adding noise
                  self.scale = 1.0
                  self.n_actions = n_actions
                  # hard updating target network weights to be same
                  self.update_network_parameters(tau=1)

              # this function helps to retrieve actions by adding noise to output network
              def choose_action(self, observation):
                  # get actor in eval mode
                  self.actor.eval()
                  # convert observation state to tensor for calculation
                  observation = T.tensor(observation, dtype=T.float).to(self.actor.device)
                  # get the output from actor network
                  mu = self.actor.forward(observation)
                  # add freshly sampled Gaussian noise to our output from actor network
                  # (the original stored one noise array and then called it like a function)
                  noise = np.random.normal(scale=self.scale, size=self.n_actions)
                  mu_prime = mu + T.tensor(noise, dtype=T.float).to(self.actor.device)
                  # set back to training mode
                  self.actor.train()
                  # get the final results as array
                  return mu_prime.cpu().detach().numpy()

              # training our actor and critic network from memory (replay buffer)
              def learn(self):
                  # if batch size is not filled then do not train
                  if self.memory.mem_cntr < self.batch_size:
                      return
                  # otherwise take a batch from replay buffer
                  state, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size)
                  # convert all values to tensors
                  reward = T.tensor(reward, dtype=T.float).to(self.critic.device)
                  done = T.tensor(done).to(self.critic.device)
                  new_state = T.tensor(new_state, dtype=T.float).to(self.critic.device)
                  action = T.tensor(action, dtype=T.float).to(self.critic.device)
                  state = T.tensor(state, dtype=T.float).to(self.critic.device)
                  # set networks to eval mode
                  self.target_actor.eval()
                  self.target_critic.eval()
                  self.critic.eval()
                  # fetch the output from the target network
                  target_actions = self.target_actor.forward(new_state)
                  # get the critic value from both networks
                  critic_value_ = self.target_critic.forward(new_state, target_actions)
                  critic_value = self.critic.forward(state, action)
                  # now we will calculate total expected reward from this policy
                  target = []
                  for j in range(self.batch_size):
                      target.append(reward[j] + self.gamma*critic_value_[j]*done[j])
                  # convert it to tensor on respective device (cpu or gpu)
                  target = T.tensor(target).to(self.critic.device)
                  target = target.view(self.batch_size, 1)
                  # to train critic value set it to train mode back
                  self.critic.train()
                  self.critic.optimizer.zero_grad()
                  # calculate losses from expected value vs critic value
                  critic_loss = F.mse_loss(target, critic_value)
                  # backpropagate the values
                  critic_loss.backward()
                  # update the weights
                  self.critic.optimizer.step()
                  self.critic.eval()
                  # clear old actor gradients and fetch the output of actor network
                  self.actor.optimizer.zero_grad()
                  mu = self.actor.forward(state)
                  self.actor.train()
                  # using formula from DDPG network to calculate actor loss
                  actor_loss = -self.critic.forward(state, mu)
                  # calculating losses
                  actor_loss = T.mean(actor_loss)
                  # backpropagation
                  actor_loss.backward()
                  # update the weights
                  self.actor.optimizer.step()
                  # soft update the target network
                  self.update_network_parameters()

              # since our target is continuously moving we need to soft update target network
              def update_network_parameters(self, tau=None):
                  # if tau is not given then use default from class
                  if tau is None:
                      tau = self.tau
                  # fetch the parameters
                  actor_params = self.actor.named_parameters()
                  critic_params = self.critic.named_parameters()
                  # fetch target parameters
                  target_actor_params = self.target_actor.named_parameters()
                  target_critic_params = self.target_critic.named_parameters()
                  # create dictionary of params
                  critic_state_dict = dict(critic_params)
                  actor_state_dict = dict(actor_params)
                  target_critic_dict = dict(target_critic_params)
                  target_actor_dict = dict(target_actor_params)
                  # update critic network with tau as learning rate (tau = 1 means hard update)
                  for name in critic_state_dict:
                      critic_state_dict[name] = tau*critic_state_dict[name].clone() + (1-tau)*target_critic_dict[name].clone()
                  self.target_critic.load_state_dict(critic_state_dict)
                  # updating actor network with tau as learning rate
                  for name in actor_state_dict:
                      actor_state_dict[name] = tau*actor_state_dict[name].clone() + (1-tau)*target_actor_dict[name].clone()
                  self.target_actor.load_state_dict(actor_state_dict)

          The most crucial part is the learning function. First, we fill the replay buffer with transitions until it holds at least one batch, and then we start sampling batches to update our networks. We calculate the critic and actor losses and then soft update the target network parameters.
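The critic target computed inside the learn function can be sketched with plain numbers. This is a simplified illustration, not the article's implementation: `td_targets` and `mse` are hypothetical helpers, and the mask is assumed to be 0 for terminal transitions and 1 otherwise.

```python
def td_targets(rewards, next_q_values, masks, gamma=0.99):
    """Target for each transition: r + gamma * Q'(s', a') * mask."""
    return [r + gamma * q * m for r, q, m in zip(rewards, next_q_values, masks)]

def mse(targets, predictions):
    """Mean squared error between the targets and the critic's current estimates."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

rewards = [1.0, 0.5]
next_q = [10.0, 4.0]  # target critic's value for the next state
masks = [1.0, 0.0]    # assumed: 0 marks a terminal transition
targets = td_targets(rewards, next_q, masks)

print(targets)
print(mse(targets, [10.0, 1.0]))  # critic loss for two made-up critic estimates
```

The second transition is terminal, so its target is just the reward; the first one also counts the discounted value of the next state.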

          env = OurCustomEnv(sales_function, obs_range, act_range)
          agent = Agent(alpha=0.000025, beta=0.00025, tau=0.001, env=env, batch_size=64, n_actions=2)
          score_history = []
          for i in range(10000):
              obs = env.reset()
              done = False
              score = 0
              while not done:
                  act = agent.choose_action(obs)
                  new_state, reward, done, info = env.step(act)
                  agent.remember(obs, act, reward, new_state, int(done))
                  agent.learn()
                  score += reward
                  obs = new_state
              score_history.append(score)

          After just a little training, our agent performs very well and uses up almost the complete budget.

          Reinforcement Learning Libraries in Python

          There are plenty of libraries offering implemented RL algorithms like –

          Stable Baselines

          TF Agents




          We will explore Stable Baselines a bit and see how to use it through an example.


          pip install stable-baselines[mpi]

          import gym
          from stable_baselines import DQN

          env = gym.make('MountainCar-v0')
          agent = DQN('MlpPolicy', env, learning_rate=1e-3)
          agent.learn(total_timesteps=25000)

          Now we need to evaluate the policy

          from stable_baselines.common.evaluation import evaluate_policy

          mean_reward, n_steps = evaluate_policy(agent, agent.get_env(), n_eval_episodes=10)
          agent.save("DQN_mountain_car_agent")  # we can save our agent to disk
          agent = DQN.load("DQN_mountain_car_agent")  # or load it

          Training the Agent

          state = env.reset()
          for t in range(5000):
              action, _ = agent.predict(state)
              next_state, reward, done, info = env.step(action)
              state = next_state
              env.render()

          This gives us a rough idea of how to create agents and train them in our environment. Since RL is still a heavily research-oriented field, libraries are updated quickly. Stable Baselines has a large collection of implemented algorithms with additional features. It is a good idea to start with Stable Baselines before moving to other libraries.

          Challenges in Reinforcement Learning

          Reinforcement learning is prone to errors and to getting stuck in local maxima or minima, and it is harder to debug than other machine learning paradigms, because RL works on feedback loops and small errors propagate through the whole model. But that's not all: the most crucial part is designing the reward function. The agent depends heavily on the reward, as it is the only feedback it gets. One of the classical problems in RL is exploration vs. exploitation, and various novel methods are used to address it. DDPG, for example, is prone to such issues, so the authors of TD3 and SAC (both improvements over DDPG) introduced two additional networks (TD3, mainly to reduce value overestimation) and a temperature parameter (SAC, to balance exploration against exploitation), and many more novel approaches are being worked on. Despite all these challenges, deep RL has many applications in real life.


          We learned what reinforcement learning is and how to model problems as RL problems. We created environments using OpenAI Gym, wrote agents from scratch, and learned how to use prebuilt RL libraries like Stable Baselines. Although RL has some challenges, it still helps in major fields like robotics, healthcare, etc. I hope you gained some knowledge or refreshed some concepts from this guide. Thanks to Phil, Andrej Karpathy, and Sudarshan for their marvelous work through books and blogs.

          Reach out to me via LinkedIn (Nihal Singh)

          The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

          Linear Regression For Absolute Beginners With Implementation In Python!

          This article was published as a part of the Data Science Blogathon.

          Warning: This article is for absolute beginners. I assume you have just entered the field of machine learning with some knowledge of high school mathematics and some basic coding, but even that is not mandatory.


          Linear Regression is the most basic supervised machine learning algorithm. It is supervised in the sense that the algorithm answers your question based on labeled data that you feed to it. The answer could be, for example, a predicted housing price or a classification of dogs vs. cats. Here we are going to talk about a regression task using Linear Regression. In the end, we are going to predict housing prices based on the area of the house.

          I don't want to bore you by throwing all the machine learning jargon at you at the beginning, so let me start with the most basic linear equation (y=mx+b), which we have all been familiar with since our school days.

          The figure above shows the relationship between the quantity of apples and the cost price. How much do you need to pay for 7kg of apples? I know it's easy. If 1kg costs $5, then 7kg costs 7*5 = $35. Alternatively, you can draw a vertical line from 7 on the x-axis until it touches the line of the linear equation; the corresponding value on the y-axis is the answer, as shown by the green dotted line on the graph. But we are going to solve it using the formula of a linear equation.

          Now, if I have to find the price of 9.5 kg of apples, then according to our model mx+b = 5 * 9.5 + 0 = $47.5 is the answer. By now you might have understood that m and b are the main ingredients of the linear equation; in other words, m and b are called parameters.
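The apple example can be written as a tiny function. This is just an illustrative sketch; the names `predict`, `price_per_kg`, and `fixed_cost` are made up for this example.

```python
def predict(x, m, b):
    """Evaluate the line y = m*x + b."""
    return m * x + b

price_per_kg = 5.0  # slope m: dollars per kilogram
fixed_cost = 0.0    # intercept b: nothing to pay for 0 kg

print(predict(7, price_per_kg, fixed_cost))    # 35.0
print(predict(9.5, price_per_kg, fixed_cost))  # 47.5
```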

          Unfortunately, this is not a machine learning problem, nor is a linear equation a prediction algorithm. Luckily, though, linear regression outputs its result the same way the linear equation does. The main purpose of the linear regression algorithm is to find the values of m and b that best fit the data; the same m and b are then used to predict the result for new input data.

          Predict housing prices

          Now we are going to dive a little deeper into solving the regression problem. Look at the data samples, also called training examples, given in the figure below.

          A company named ABC provides you with data on houses' sizes and prices. The company requires you to provide a machine learning model that can predict a house's price for any given size. Let's say, what would be the best-estimated price for an area of 3000 square feet? If you are thinking of fitting a line through the dataset and drawing a vertical line from 3000 on the x-axis until it touches the line, so that the corresponding value on the y-axis, i.e. 470, is the answer, then you are on the right track. It is represented by the green dotted line in the figure below.

          Let's do it another way. If we could find the equation of the line y = mx+b that fits the data, represented by the blue inclined line, then we would have a model that can predict the housing price for any given area. In machine learning lingo, the function y = mx+b is also called a hypothesis function, where b and m are represented by theta0 and theta1 respectively. theta0 is also called a bias term and theta1, theta2, ... are called weights.

          See the blue line in the picture above. By taking any two samples that touch or lie very close to the line, we can find theta1 (slope) = 0.132 and theta0 = 80, as shown in the figure. Now we can use our hypothesis function to predict the housing price for a size of 3000 square feet, i.e. 80+3000*0.132 = 476. $476,000 could be the best-estimated price for a house of 3000 square feet, and this could be a reasonable way to prepare a machine learning model when you have just 50 samples and only one feature (size).
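The "two points on the line" trick can be sketched as follows. The two sample points here are hypothetical, chosen only so that they lie on a line with theta0 = 80 and theta1 = 0.132; they are not taken from the article's dataset.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope

def hypothesis(x, theta0, theta1):
    """h(x) = theta0 + theta1 * x, with theta0 the bias and theta1 the weight."""
    return theta0 + theta1 * x

# Hypothetical points (size in sq ft, price in $1000s).
theta0, theta1 = line_through((1000, 212), (2500, 410))
print(round(theta0, 3), round(theta1, 3))          # 80.0 0.132
print(round(hypothesis(3000, theta0, theta1), 2))  # 476.0
```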

          But a real-world dataset could contain thousands or even millions of samples, and the number of features could range from 5–100 or even into the thousands. At that scale, our intuition is of no use for finding thousands of parameters just by looking at a dataset; that's why we need a machine learning algorithm to carry out such a complex calculation. Grab a cup of coffee, refresh yourself, and come back, because from now on you are going to understand the way the algorithm works and you will be introduced to a lot of new terminology. Get ready!!

          Note: (i) in the equation represents the ith training example, not the power.

          If the error is too high, the algorithm updates the parameters with new values; if the error is still high, it updates the parameters again. The algorithm continues this process until the error is minimized. To minimize the error we have a special procedure called Gradient Descent, but before that, we are going to understand what the Cost Function is and how it works.

          Here in the cost function, we take the square of the difference between the predicted value and the actual value of each training example and then sum up all the squared differences; in other words, we find the squared error of each training example and then sum up all the errors. The output we get is simply the mean squared error for a particular set of parameters. OK, no more words, let's do the calculation. For simplicity, we are going to use just one parameter, theta1, and a very simple dataset.

          We have three training examples (X1=1, y1=1), (X2=2, y2=2), and (X3=3, y3=3). The figure on the left shows the hypothesis function and the one on the right shows the cost function plotted for different values of the parameter.

          Try other values of theta1 yourself and calculate the cost for each theta1 value. Once you plot these all dots, the cost function will look like a bowl-shaped curve as shown in the figure below.

          From the figure and calculation, it is clear that the cost function is at its minimum at theta1=1, the bottom of the bowl-shaped curve. The purpose of all this hard work is not to calculate the minimum value of the cost function (we have a better way to do that) but to understand the relationship between the parameters, the hypothesis function, and the cost function. Please make sure you understand all these concepts before moving ahead.
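Here is a minimal sketch of this calculation on the three training examples. It assumes the common 1/(2m) scaling of the squared-error cost (as used in Andrew Ng's course); the article's exact formula was shown in a figure, so treat the constant factor as an assumption.

```python
def cost(theta1, xs, ys):
    """Squared-error cost J(theta1) = 1/(2m) * sum((theta1*x - y)^2)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3]
ys = [1, 2, 3]

# Evaluate the cost for a few candidate values of theta1.
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, round(cost(theta1, xs, ys), 3))
```

The cost is zero at theta1 = 1 and rises symmetrically on either side, which gives the bowl shape described above.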

          Coding Cost Function:

          Gradient Descent:

          Why do we need a Gradient Descent?

          In short, to minimize the cost function. But how? Let's see.

          The cost function only works when it knows the parameters' values. In the example above, we manually chose the parameter value each time, but during the algorithmic calculation, once the parameters' values are randomly initialized, it is gradient descent that has to decide which parameter values to choose in the next iteration in order to minimize the error. It is gradient descent that decides by how much to increase or decrease the parameter values.

          Analogy: How Gradient Descent works?

          What did you learn from the game? In the beginning, you try with learning rate (alpha)=1 but you fail to reach the minimum, because the larger steps overshoot it. In the next game, you try with alpha=0.1, and this time you manage to reach the bottom very safely. What if you had tried with alpha=0.01? Well, in that case, you would come down gradually but would not make it to the bottom; 20 jumps are not enough to reach the bottom with alpha=0.01, though 100 jumps might be sufficient. While solving a real-world problem, alpha between 0.01–0.1 normally works fine, but it varies with the number of iterations that the algorithm takes; some problems might take 100 iterations and some might even take 1000.

          Based on these factors, you can try different values of alpha. Although tuning the alpha value is one of the important tasks in understanding the algorithm, I would suggest you also look at the other parts of the algorithm, such as the derivative part, the minus sign, and the parameter update, and understand what their individual roles are.

          Coding Gradient Descent
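As a rough sketch (not the article's original code, which was shown as an image), single-parameter gradient descent on the three-example dataset could look like this. It assumes the cost J(theta1) = 1/(2m) * sum((theta1*x - y)^2) and a starting value of theta1 = 0.

```python
def gradient(theta1, xs, ys):
    """Derivative of J(theta1) = 1/(2m) * sum((theta1*x - y)^2) w.r.t. theta1."""
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m

def gradient_descent(xs, ys, alpha, steps, theta1=0.0):
    """Repeat the update theta1 := theta1 - alpha * dJ/dtheta1."""
    for _ in range(steps):
        theta1 = theta1 - alpha * gradient(theta1, xs, ys)
    return theta1

xs = [1, 2, 3]
ys = [1, 2, 3]

# Twenty "jumps" with three different learning rates.
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, gradient_descent(xs, ys, alpha, steps=20))
```

On this dataset, alpha = 1 overshoots and blows up, alpha = 0.1 lands very close to theta1 = 1, and alpha = 0.01 is still only about 0.6 after 20 steps, which mirrors the game described above.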

          Until now, we have used just a single parameter to calculate the cost function and run the algorithm. What does the cost function look like, and how does the algorithm work, when we have two or more parameters? See the figure below for an intuitive understanding. Imagine yourself somewhere near the top of a mountain, struggling to get down to the bottom blindfolded.

          The working principle of the algorithm is the same for any number of parameters; the more parameters there are, the more directions of slope there are to consider. In the previous example of the bowl-shaped curve, we just needed to look at the slope along theta1, but now the algorithm needs to look in both directions in order to minimize the cost function. Let's code and understand the algorithm. See the figure below for reference:

          Here we go. Our model predicts 475.88*1000 = $475,880 for a house of size 3*1000 square feet. That is very close to the prediction we made at the beginning using our intuition.
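A two-parameter version can be sketched in the same style. The data below is synthetic and purely illustrative: it lies exactly on price = 80 + 132 * size (size in 1000s of square feet, price in $1000s), so it is not the article's dataset and will not reproduce the 475.88 figure exactly. The function name `fit_line` and the values of alpha and steps are assumptions too.

```python
def fit_line(xs, ys, alpha=0.1, steps=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # slope along theta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # slope along theta1
        # Update both parameters together, using gradients from the same step.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Synthetic data: size in 1000s of sq ft, price in $1000s.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
prices = [80 + 132 * s for s in sizes]

theta0, theta1 = fit_line(sizes, prices)
print(round(theta0, 2), round(theta1, 2))
print(round(theta0 + theta1 * 3.0, 2))  # predicted price for 3000 sq ft
```

On this synthetic data the fit recovers roughly theta0 = 80 and theta1 = 132, giving a prediction of about 476 (that is, $476,000) for a 3000 square feet house.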


          As a beginner, it might be a little difficult to grasp all the concepts of linear regression in such a short reading time. I wouldn’t say you know all things about linear regression from this article. The purpose of this article is to make algorithms understandable in the simplest way possible. Please follow the resources’ link below for a better understanding. I hope you enjoyed reading the article. Thanks for reading.


          code link

          Gradient descent mathematics

          Linear Regression Andrew Ng


          A Complete Guide On Docker For Beginners

          This article was published as a part of the Data Science Blogathon


          It is not difficult to create a machine learning model that operates on our computers. It is more difficult when you are working with a customer who wants to use the model at scale, that is, a model that can scale and perform on all types of servers all over the world. After you have finished designing your model, it may function smoothly on your laptop or server, but not so well on other platforms, such as when you move it to the production stage or a different server. Many things can go wrong, such as performance issues, the application crashing, or the application not being effectively optimized.

          Sometimes it is not the model that is the issue but the requirement to recreate the entire stack. Docker enables you to easily replicate the training and running environment for the machine learning model from any location. Docker allows you to package your code and dependencies into containers that can be transferred to different hosts, regardless of hardware or operating system.

          Developers can use Docker to keep track of different versions of a container image, see who produced it and with what, and roll back to prior versions. Finally, even if one of your machine learning application's services is being upgraded, repaired, or is down, the rest of the application can continue to run. To update an output message integrated throughout the application, you do not have to update the whole application and disrupt other services.

          Image 1

          Let’s dig in and start investigating Docker.

          What is Docker?

          It is a software platform that makes developing, executing, managing, and distributing applications easier. It accomplishes this by virtualizing the operating system of the computer on which it is installed.

          Docker's first edition was launched in 2013.

          Docker is written in the Go programming language.

          Thanks to its rich set of functionality, Docker has been widely adopted by some of the world's leading organizations and universities, such as Visa, PayPal, Cornell University, and Indiana University (just to name a few), to run and manage their applications.

          Now let's try to understand the problem and the solution offered by Docker.


          Let us imagine you want to host three separate Python-based applications on a single server (which could be either a physical or a virtual machine). Each of these programs uses a different version of Python, and the libraries and dependencies vary from application to application.

          We are unable to host all three applications on the same workstation, since the different versions of Python and their dependencies would conflict on the same machine.


          Let’s see what we could do if we didn’t use Docker to tackle this problem. In this case, we might solve the problem with the help of three physical machines or by using a single physical computer that is powerful enough to host and run three virtual machines.

          Both approaches would help us install various versions of Python, and their associated dependencies, on each of these machines.

          Regardless of which solution we chose, the costs of purchasing and maintaining the hardware are substantial.

          Let’s look at how Docker might be a viable and cost-effective solution to this issue.

          To comprehend this, we must first examine its functionality.

          Image 2

          In simple terms, the system with Docker installed and running is referred to as a Docker Host or Host.

          Anytime you want to deploy an application on the host, Docker builds a logical entity on it to host that application. In Docker nomenclature, this logical entity is known as a Container or a Docker Container.

          There is no operating system installed or running on a Docker Container. However, it does include a virtual replica of the process table, network interface(s), and file system mount point(s).

          These are passed on from the operating system of the host on which the container is hosted and executing. The kernel of the host's operating system, on the other hand, is shared by all the containers executing on it.

          This allows each container on the same host to be isolated from the others. As a result, numerous containers with varied application requirements and dependencies can run on the same host, as long as their operating system requirements are the same.

          In other words, rather than virtualizing hardware components, Docker virtualizes the operating system of the host on which it is installed and running.

          Pros and Cons of using Docker

          Docker allows numerous programs with varied requirements and dependencies to be hosted on the same host as long as they use the same operating system.

          Containers are typically a few megabytes in size and occupy relatively little disk space, allowing many applications to be hosted on the same host.

          Robustness: there is no operating system installed on a container. As a result, it uses extremely little memory compared to a virtual machine (which has a complete operating system installed and running on it). This also cuts the boot-up time to only a few seconds, whereas starting a virtual machine takes several minutes.

          Cost: the hardware necessary to run Docker is less demanding, so it costs less.

          On the same Docker Host, we cannot host applications that have different operating system needs. Let's pretend we have four separate programs, three of which require a Linux-based operating system and one of which requires a Windows-based operating system. The three apps that require a Linux-based OS can be on a single Docker Host. The application that requires a Windows-based OS must be on a separate Docker Host.

          Docker Core Components

          Docker Engine is one of the core components and is responsible for overall functioning.

          It is a client-server based application with three main components.


          Server

          REST API

          Client


          Image 3

          The Server runs the Docker daemon (dockerd), which is simply a process. On the Docker platform, it is in charge of creating and managing Docker Images, Containers, Networks, and Volumes.

          The REST API defines how applications can interface with the server and tell it how to complete their tasks.

          The Client is a command-line interface that allows users to communicate with Docker by issuing commands.

          Docker Terminologies

          Let’s have a look at some of the terms used in the Docker world.

          Docker Images and Docker Containers are the two most important items you'll encounter while working with Docker regularly.

          In simple terms, a Docker Image is a template that includes the program and the dependencies needed to run it on Docker.

          A Docker Container, on the other hand, is a logical entity, as previously indicated. In more technical terms, it is a running instance of a Docker Image.

          Docker Hub

          Docker Hub is the official online repository where we can find all of the Docker Images that we can use.

          If we like, we can also use Docker Hub to store and distribute our custom images. We could also make them public or private, depending on our needs.

          Note: Free users can keep one Docker Image private. More than one requires a paid subscription.


          Before we get our hands dirty with Docker, one last thing we need to know is that we need to have it installed.

          The official Docker CE installation directions are linked below. These instructions for installing Docker on your PC are straightforward.

          Do you wish to skip installation and start practicing Docker? 

          If you would rather not install Docker or don't have enough resources on your PC, don't panic. There's a solution to your problem.

          Play with Docker, an online playground for Docker, is the best place to start. It enables users to immediately practice Docker commands without the need to install anything on their PC. The best part is that it’s easy to use and completely free.

          Docker Commands

          It's finally time to get our hands dirty with Docker commands, as we've all been waiting for.

          docker create

          The docker create command is the first command we'll look at.

          We can use this command to build a new container.

          The following is the syntax for this command:

          docker create [options] IMAGE [commands] [arguments]

          Please keep in mind that everything placed in square brackets is optional. This holds for all of the commands presented in this guide.

          The following are some examples of how to use this command:

          $ docker create fedora
          02576e880a2ccbb4ce5c51032ea3b3bb8316e5b626861fc87d28627c810af03

          The docker create command in the preceding example would create a new container using the most recent Fedora image.

          It will verify whether the latest official Fedora image is available on the Docker Host before building the container. If the most recent image isn't available on the Docker Host, the container is created using the Fedora image downloaded from the Docker Hub. If the Fedora image is already present on the Docker Host, the container uses that image for creation.

          Docker returns the container ID on successful creation of the container. The string printed in the above example is the container ID returned by Docker.

          A container ID is assigned to each container. When executing various activities on the container, such as starting, stopping, resuming, and so on, we refer to it by its container ID.

          Let’s look at another example of the docker create command, this time with parameters and command supplied to it.

          $ docker create -t -i ubuntu bash
          30986b73dc0022dbba81648d9e35e6e866b4356f026e75660460c3474f1ca005

          The docker create command in the preceding example builds a container using the Ubuntu image (if the image isn't available on the Docker Host, it will download the most recent image from the Docker Hub before building the container).

          The -t and -i options tell Docker to assign a terminal to the container so that the user can interact with it. The trailing bash argument tells Docker to run the bash command every time the container starts.

          docker ps

          The docker ps command is the next we’ll look at

          We can use the docker ps command to see all the containers currently executing on the Docker Host.

          $ docker ps
          CONTAINER ID   IMAGE    COMMAND   CREATED          STATUS              PORTS   NAMES
          30986b73dc00   ubuntu   "bash"    45 minutes ago   Up About a minute           elated_franklin

          It only shows the containers that are running on the Docker Host right now.

          To view the containers created on this Docker host, regardless of their current condition, whether it is running or not, you must use the -a option, which lists all containers created on this Docker Host.

          $ docker ps -a
          CONTAINER ID   IMAGE    COMMAND       CREATED             STATUS          PORTS   NAMES
          30986b73dc00   ubuntu   "bash"        About an hour ago   Up 29 minutes           elated_franklin
          02576e880a2c   fedora   "/bin/bash"   About an hour ago   Created                 hungry_sinoussi

          Let us understand the above output of the docker ps command.

          CONTAINER ID: consists of a unique string with alphanumeric characters connected with each container.

          IMAGE: Docker Image used to create the container.

          COMMAND: The application-specific command that runs when the container starts.

          CREATED: It provides the elapsed time since the creation of the container.

          STATUS: It provides the current status of the container.

          If the container is running, it will display Up along with the time elapsed (for example, Up About an hour or Up 5 minutes).

          If the container is not running, the status will be Exited, with the exit status code enclosed in round brackets and the time elapsed since it exited (for example, Exited (0) 2 weeks ago or Exited (137) 10 seconds ago).

          PORTS: It provides port mappings described for the container.

          NAMES: In addition to the CONTAINER ID, each container is given a unique name. A container can be identified by its container ID or by its unique name. By default, Docker generates and assigns a unique name to each container. If you wish to give the container a name of your own, use the --name option with the docker create or docker run commands.

          I hope this helps you better grasp what the docker ps command returns.

          docker start

          The command helps to start any stopped containers.

          docker start [options] CONTAINER ID/NAME [CONTAINER ID/NAME…]

          To start the container, you can specify the first unique characters of the container ID or its name.

          Below you can look at the example.

          $ docker start 30986
          $ docker start elated_franklin

          docker restart

          The command helps to restart any running containers.

          docker restart [options] CONTAINER ID/NAME [CONTAINER ID/NAME…]

          Similarly, we can restart by specifying the first unique characters of the container ID or its name.

          Look at the examples using this command

          $ docker restart 30986
          $ docker restart elated_franklin

          docker stop

          The command helps to stop any running containers.

          docker stop [options] CONTAINER ID/NAME [CONTAINER ID/NAME…]

          Its syntax mirrors that of the start command.

          You can specify the first unique characters of the container ID or its name to stop the container.

          Have a look at the below examples

          $ docker stop 30986
          $ docker stop elated_franklin

          docker run

          It first creates the container and then starts it. In summary, it is a combination of the docker create and start commands.

          It has a similar syntax to docker create.

          docker run [options] IMAGE [commands] [arguments]

          $ docker run ubuntu
          30fa018c72682d78cf168626b5e6138bb3b3ae23015c5ec4bbcc2a088e67520

          In the above example, Docker creates a container from the latest Ubuntu image and starts it, and the container immediately stops, so we get no chance to interact with it.

          To interact with the container, pass the -it options to the docker run command.

          $ docker run -it ubuntu

          Type exit in the terminal to come out of the container.

          docker rm

          We use this command to delete a container.

          docker rm [options] CONTAINER ID/NAME [CONTAINER ID/NAME...]

          $ docker rm 30fa elated_franklin

          In the above example, we are instructing docker to delete two containers in a single command. We specify the ID for the first and the name for the second container for deletion.

          A container must be in a stopped state before it can be deleted, unless you force removal with the -f option.

          docker images

          The command lists out all docker images present on the docker host.

          $ docker images

          REPOSITORY: It describes the unique name of the docker image.

          TAG: Each image is associated with a tag that represents a version of the image.

          A tag is a word, a set of numbers, or an alphanumeric string.

          IMAGE ID: A string of alphanumeric characters that identifies each image.

          CREATED: The elapsed time since the image was created.

          SIZE: It provides the size of the image.

          docker rmi

          This command allows us to remove images from the docker host.

          docker rmi [options] IMAGE NAME/ID [IMAGE NAME/ID...]

          docker rmi mysql

          The command removes image mysql from the docker host.

          The below command removes the image with ID 94e81 from the docker host.

          docker rmi 94e81

          The below command removes image ubuntu with tag trusty.

          docker rmi ubuntu:trusty

          These are some of the basic commands you come across. There are numerous other instructions to explore.

          Wind Up

          Although containerization has been around for a long time, it has only recently received the attention it deserves. Google, Amazon Web Services (AWS), Intel, Tesla are just a few leading tech businesses with their specialized container engines. They rely significantly on them to develop, run, administer, and distribute their software.

          Docker is an extremely powerful containerization engine, and it has a lot to offer when it comes to building, running, managing and distributing your applications efficiently.

          You have now seen Docker at a high level. There is much more to study, such as:

          Commands (More powerful commands)

          Docker Images (Build your custom images)

          Docker Networking (Set up and configure networking)

          Docker Stack (Grouping services required by an application)

          Docker Compose (Tool for managing and running multiple containers)

          Docker Swarm (Grouping and managing one or more machines on which Docker is running)


          Investing your time and money into studying Docker is not something you will regret.

          End Notes

          I hope you find this article helpful. Please feel free to share it. Thank you, have a great day.


          The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


          Comprehensive Guide To DevOps Principles

          Introduction to DevOps Principles


          DevOps rests on three core principles, which can be adopted incrementally:

          Flow: Work should flow from left to right and be easy to understand.

          Feedback: Continuous improvement should occur with every release or DevOps lifecycle. This can be achieved using feedback loops.

          Foster: Develop an environment that encourages experimentation and risk-taking, and adapt to it. Repeating the same activities and practicing them is how the goal is attained with grace.

          Let’s walk through some in-depth DevOps principles and practices with real-life examples and scenarios. DevOps is not only a framework or methodology. It also draws on other practices, such as Agile, Lean, and ITSM.

          Building on Agile, DevOps has brought a tremendous change: it has reduced the chaos between IT and development teams through small teams, more frequent software releases and deployments, and continuous incremental improvements. DevOps also includes Lean principles such as increasing flow and reducing waste in the IT value stream. It also requires an Agile method for all service and project management processes to help remove bottlenecks and achieve faster lead and cycle times.

          Principles of DevOps

          How Do the First Principle and Practice Work in Real Life?

          Continuous Integration – Developers commit code to a shared repository every day, which is a good development practice.

          Continuous Delivery – Any software should be releasable throughout its lifecycle.

          Continuous Deployment – Every change that passes all automated tests is deployed to production.

          Value Stream Mapping – A lean tool that helps depict the entire flow of information, material, and works across functional silos, including quality and time.

          Theory of Constraints – A methodology for identifying the most limiting factor to achieve a milestone and then systematically improving the constraint until it is no longer the limiting factor.

          How Feedback as a Second Principle and Practice Works?

          Production Logs: Logs are the first place to turn when tracking down everyday errors.

          Automated Testing: Manual testing often fails to deliver what we expect by the end phase, so tests should be automated.

          Dashboards: Dashboards such as Jira or Kanban boards for managing the entire project or tracking each developer’s work.

          Monitoring or Event Management: Tools such as Ansible to monitor the overall system configuration and check the health of builds.

          Process Measurements: How to measure the flow of the entire process from development to deployment.

          How does Foster help in Attaining DevOps Principles and Practices?

          Practices and self-feedback include continuous learning and experimentation.

          Experimentation and learning

          The Deming Cycle (feedback loop)

          Using failure to improve resiliency

          A collaborative effort for learning

          Adapting to the environment is the most important thing to foster with DevOps, as the process never stops.

          DevOps Tools Capability

          DevOps tools deliver the following capabilities:

          Self-service projects via project configuration portals.

          Dependency analysis and impact analysis.

          Automated builds, testing, and deployment.

          Quality code and its enhancement across environments and servers.

          Optimization of Resources

          Another essential aspect and principle of DevOps is the Optimization of Resources. How can it be done?

          By properly scaling the entire infrastructure.

          By redesigning global services to draw on shared, stacked resources instead of consuming and wasting new ones.

          Also, transforming a solution requires applying common policies across vendors to manage the overall cost of the application per user or transaction. A solid foundation is another critical aspect of DevOps: it lets us put time and effort into creating a new application environment, redeploying the application, and promoting the application to a new lifecycle phase.

          Doing this well involves some aspects that are difficult to follow, such as:

          Get the right people together.

          Get everyone on the same page and in sync.

          Build capabilities that lead to lasting change.

          Focus on critical behaviors.

          Experiment and Learn.

          Ultimately, DevOps enables companies to deliver better software faster by improving flow, shortening and amplifying feedback loops, and fostering a culture of continuous improvement and development.

          Conclusion – DevOps Principles

          In conclusion, the focus should stay on DevOps itself. Building even a complex application this way helps transform an organization, balancing the time-space trade-offs involved in integrating business, process, and event processors.


          30 kNN Interview Questions For Data Scientists

          K-Nearest Neighbours (kNN) and tree-based algorithms are two of the most intuitive and easy-to-understand machine learning algorithms. Both are simple to explain and demonstrate, making them perfect for those who are new to the field. For beginners, it is crucial to test their knowledge of these algorithms, as they are simple yet immensely powerful. They are commonly asked about in interviews as well. Searching for kNN interview questions and practicing them can help one gain a deeper understanding of the algorithm and its practical applications. In this article, we explain the top 30 kNN interview questions.

          Top 30 kNN Interview Questions

          Solution: A

          The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning the label that is most frequent among the k training samples nearest to that query point, hence the higher computation at test time.
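The two phases described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name knn_predict and the toy data are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is only keeping train_X and train_y; all the work happens here,
    # where the distance to every stored sample is computed.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["-", "-", "-", "+", "+", "+"]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # -
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))  # +
```

Because every prediction scans the whole training set, the cost sits at test time rather than at training time.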

          2) In the image below, which would be the best value for k assuming that the algorithm you are using is k-Nearest Neighbor.

          Solution: B

          Validation error is the least when the value of k is 10, so it is best to use this value of k.

          Solution: F

          All of these distance metrics can be used with k-NN.

          Solution: C

          We can also use k-NN for regression problems. In this case the prediction can be based on the mean or the median of the k-most similar instances.
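As a rough sketch of k-NN regression, the prediction below is the mean (or, optionally, the median) of the target values of the k nearest training points. The function name knn_regress and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math
import statistics

def knn_regress(train_X, train_y, query, k, aggregate=statistics.mean):
    """Predict a number by aggregating the targets of the k nearest points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    neighbour_targets = [y for _, y in ranked[:k]]
    return aggregate(neighbour_targets)

# Hypothetical one-feature data with one distant point.
train_X = [(1,), (2,), (3,), (10,)]
train_y = [10.0, 20.0, 60.0, 500.0]

print(knn_regress(train_X, train_y, (2,), k=3))                               # 30.0
print(knn_regress(train_X, train_y, (2,), k=3, aggregate=statistics.median))  # 20.0
```

The median is less affected by a single unusual neighbour than the mean, which is one reason to prefer it on noisy targets.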

          5) Which of the following statements is true about the k-NN algorithm?

          k-NN performs much better if all of the data have the same scale

          k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large

          k-NN makes no assumptions about the functional form of the problem being solved

          Solution: D

          All of the above statements are true of the kNN algorithm.
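The first statement, about scale, can be seen in a small Python sketch. The data (age in years, income in dollars) and the helper names min_max_scale and nearest_index are hypothetical. Min-max scaling is only one possible choice of rescaling, and the helper assumes every column has at least two distinct values.

```python
import math

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest_index(points, query_index):
    """Index of the point closest (Euclidean) to points[query_index]."""
    query = points[query_index]
    candidates = [i for i in range(len(points)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(points[i], query))

# Hypothetical (age, income) rows; income has a far larger numeric range.
people = [(25, 50_000), (60, 50_500), (26, 58_000), (40, 90_000)]

print(nearest_index(people, 0))                 # 1: income dominates the distance
print(nearest_index(min_max_scale(people), 0))  # 2: age now counts as well
```

On the raw data the 35-year age gap is invisible next to the income differences, so the nearest neighbour changes once both features share a scale.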

          6) Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

          Solution: A

          The k-NN algorithm can be used for imputing missing values of both categorical and continuous variables.

          Solution: A

          Manhattan Distance is designed for calculating the distance between real valued features.

          8) Which of the following distance measures do we use in the case of categorical variables in k-NN?

          Hamming Distance

          Euclidean Distance

          Manhattan Distance

          Solution: A

          Both Euclidean and Manhattan distances are used in the case of continuous variables, whereas Hamming distance is used in the case of categorical variables.
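For categorical features, Hamming distance simply counts the positions at which two records differ. A minimal Python sketch, with a hypothetical function name and made-up records:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical categorical records: (colour, size, shape)
print(hamming_distance(("red", "small", "round"), ("red", "large", "round")))   # 1
print(hamming_distance(("red", "small", "round"), ("blue", "large", "square"))) # 3
```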

          9) Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)?

          A) 1

          B) 2

          C) 4

          D) 8

          Solution: A

          sqrt( (1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1

          10) Which of the following will be the Manhattan distance between the two data points A(1,3) and B(2,3)?

          A) 1

          B) 2

          C) 4

          D) 8

          Solution: A

          mod(1-2) + mod(3-3) = 1 + 0 = 1
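Both calculations for A(1,3) and B(2,3) can be checked with a short Python sketch. The function names and the extra point C are hypothetical. Note that Manhattan distance sums absolute differences and takes no square root.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences (no square root)."""
    return sum(abs(x - y) for x, y in zip(a, b))

A, B = (1, 3), (2, 3)
print(euclidean(A, B))  # 1.0
print(manhattan(A, B))  # 1

# The two metrics only coincide when the points differ along a single axis.
C = (4, 7)  # hypothetical extra point
print(euclidean(A, C))  # 5.0
print(manhattan(A, C))  # 7
```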

          Context: 11-12

          Suppose you are given the following data, where x and y are the 2 input variables and Class is the dependent variable.

          Below is a scatter plot which shows the above data in 2D space.


          11) Suppose you want to predict the class of a new data point x=1 and y=1 using Euclidean distance in 3-NN. To which class does this data point belong?

          A) + Class

          B) – Class

          C) Can’t say

          D) None of these

          Solution: A

          All three nearest points are of the + class, so this point will be classified as + class.

          12) In the previous question, suppose you now want to use 7-NN instead of 3-NN. To which class will the point x=1 and y=1 belong?

          A) + Class

          B) – Class

          C) Can’t say

          Solution: B

          Now this point will be classified as – class because there are 4 – class points and 3 + class points in the nearest circle.

          Context 13-14:

          Suppose you are given the following 2-class data, where “+” represents the positive class and “” represents the negative class.

          13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation error?

          A) 3

          B) 5

          C) Both have same

          D) None of these

          Solution: B

          5-NN will have the least leave-one-out cross-validation error.

          14) Which of the following would be the leave-one-out cross-validation accuracy for k=5?

          A) 2/14

          B) 4/14

          C) 6/14

          D) 8/14

          E) None of the above

          Solution: E

          In 5-NN we will have 10/14 leave-one-out cross-validation accuracy.
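Leave-one-out cross-validation can be sketched in Python as follows: each point is held out in turn, predicted from all the others, and the fraction of correct predictions is reported. The 14-point data set of the question is not reproduced here, so the sketch uses a hypothetical 6-point data set, and the function names are also hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

def loocv_accuracy(X, y, k):
    """Hold out each point once and predict it from the remaining points."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical data: two clusters of three points each.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["-", "-", "-", "+", "+", "+"]

for k in (1, 3, 5):
    print(k, loocv_accuracy(X, y, k))
# 1 1.0
# 3 1.0
# 5 0.0
```

With k=5 every held-out point is outvoted by the three points of the other cluster, which shows how a value of k that is too large for the data set can ruin the accuracy.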

          15) Which of the following will be true about k in k-NN in terms of bias?

          A) When you increase k, the bias increases

          B) When you decrease k, the bias increases

          C) Can’t say

          D) None of these

          Solution: A

          A large k means a simpler model, and a simpler model is considered to have high bias.

          16) Which of the following will be true about k in k-NN in terms of variance?

          A) When you increase k, the variance increases

          B) When you decrease k, the variance increases

          C) Can’t say

          D) None of these

          Solution: B

          A simpler model is considered to have less variance, so decreasing k, which makes the model more complex, increases the variance.

          17) You are given the following two distances (Euclidean distance and Manhattan distance), which are commonly used in the k-NN algorithm. These distances are between two points A(x1,y1) and B(x2,y2). Your task is to tag both distances by looking at the following two graphs. Which of the following options is true about the graphs below?

          A) Left is Manhattan distance and right is Euclidean distance

          B) Left is Euclidean distance and right is Manhattan distance

          C) Neither left nor right is a Manhattan distance

          D) Neither left nor right is a Euclidean distance

          Solution: B

          The left graph depicts how Euclidean distance works, whereas the right one depicts Manhattan distance.

          18) When you find noise in the data, which of the following options would you consider in k-NN?

          A) I will increase the value of k

          B) I will decrease the value of k

          C) Noise can not be dependent on value of k

          D) None of these

          Solution: A

          To be more sure of the classifications you make, you can try increasing the value of k.
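The effect of a single noisy label can be sketched in Python. In the hypothetical data below, one point inside the "+" cluster is mislabeled "-". With k=1 a nearby query copies the noisy label, while a larger k outvotes it. The function name and the data are made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

# Hypothetical data: a "+" cluster containing one mislabeled (noisy) point.
train_X = [(5, 5), (5, 6), (6, 5), (6, 6), (4, 5), (5.2, 5.1), (0, 0), (0, 1), (1, 0)]
train_y = ["+", "+", "+", "+", "+", "-", "-", "-", "-"]

query = (5.3, 5.2)
print(knn_predict(train_X, train_y, query, k=1))  # -  (follows the noisy point)
print(knn_predict(train_X, train_y, query, k=5))  # +  (noise is outvoted)
```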

          19) In k-NN, it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle this problem?

          Dimensionality Reduction

          Feature selection

          Solution: C

          In such a case, you can use either a dimensionality reduction algorithm or a feature selection algorithm.

          20) Two statements are given below. Which of the following is true of the two statements?

          k-NN is a memory-based approach, so the classifier immediately adapts as we collect new training data.

          The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario.

          Solution: C

          Both are true and self explanatory

          21) Suppose you are given the following images (1 left, 2 middle, and 3 right). Your task is to find the value of k in k-NN in each image, where k1 is for the 1st, k2 is for the 2nd, and k3 is for the 3rd figure.

          A) 1

          B) 2

          C) 3

          D) 5

          Solution: B

          If you keep the value of k as 2, it gives the lowest cross validation accuracy. You can try this out yourself.

          23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed this model on the client side, it was found that the model is not at all accurate. Which of the following might have gone wrong?

          A) It is probably an overfitted model

          B) It is probably an underfitted model

          C) Can’t say

          D) None of these

          Solution: A

          An overfitted model seems to perform well on training data, but it is not generalized enough to give the same results on new data.

          24) You are given the following 2 statements. Which of these options is/are true in the case of k-NN?

          In the case of a very large value of k, we may include points from other classes in the neighborhood.

          In the case of too small a value of k, the algorithm is very sensitive to noise.

          Solution: C

          Both the options are true and are self explanatory.

          25) Which of the following statements is true for k-NN classifiers?

          A) The classification accuracy is better with larger values of k

          B) The decision boundary is smoother with smaller values of k

          C) The decision boundary is linear

          D) k-NN does not require an explicit training step

          Solution: D

          Option A: This is not always true. You have to ensure that the value of k is not too high or not too low.

          Option B: This statement is not true. The decision boundary can be a bit jagged

          Option C: Same as option B

          Option D: This statement is true

          26) True-False: Is it possible to construct a 2-NN classifier by using the 1-NN classifier?

          A) TRUE

          B) FALSE

          Solution: A

          You can implement a 2-NN classifier by ensembling 1-NN classifiers.

          27) In k-NN what will happen when you increase/decrease the value of k?

          28) The following two statements are given for the k-NN algorithm. Which of the statement(s) is/are true?

          We can choose optimal value of k with the help of cross validation

          Euclidean distance treats each feature as equally important

          Solution: C

          Both the statements are true

          Context 29-30:

          29) What would be the time taken by 1-NN if there are N (very large) observations in the test data?

          A) N*D

          B) N*D*2

          C) (N*D)/2

          D) None of these

          Solution: A

          The value of N is very large, so option A is correct

          30) What would be the relation between the time taken by 1-NN, 2-NN, and 3-NN?

          B) 1-NN < 2-NN < 3-NN

          C) 1-NN ~ 2-NN ~ 3-NN

          D) None of these

          Solution: C

          The training time for any value of k in the kNN algorithm is the same.


          kNN Interview Question Tips

          Understand the Basics: Before the interview, make sure you have a strong understanding of the basics of the kNN algorithm. Review the key concepts such as distance metrics, k-value selection, and the curse of dimensionality.

          Know the Applications: kNN has a variety of practical applications, including image recognition, recommender systems, and anomaly detection. Make sure you have a good understanding of these applications and how kNN is used in each of them.

          Prepare for Technical Questions: Be prepared to answer technical questions related to kNN, such as how to choose the optimal value of k, how to handle imbalanced data, and how to deal with missing data. Look up kNN interview questions online to get a sense of the types of questions that may be asked.

          Demonstrate your Problem-solving Skills: Be prepared to walk through a problem-solving exercise using kNN. This could include a real-world scenario or a hypothetical problem. Walk the interviewer through your thought process and explain how you would approach the problem using kNN.

          Practice, Practice, Practice: The best way to prepare for a kNN interview is to practice. Search for kNN interview questions and practice answering them. Consider working through example problems or participating in data science competitions to improve your kNN skills.

          End Notes

          Being prepared for kNN interview questions is crucial for anyone looking to enter the field of data science or machine learning. Understanding the basics of the kNN algorithm, its practical applications, and how to handle technical questions can help you demonstrate your knowledge and problem-solving skills. By practicing kNN interview questions and working through example problems, you can improve your understanding and feel more confident during the interview process. With these tips in mind, you can approach kNN interviews with confidence and set yourself up for success in your data science career.

