# How a machine learns

Before I start, let me just say how excited I am about this course I'm sitting in on at MIT this week. I'll try my best to sort of repeat the most interesting tidbits I learn in a future post. I'm starting to worry about how long these posts are taking me, though, so for now, let me just provide a link to this frickin' awesome toy model of a self-driving car. Click the "Run Training" button, and in 10 seconds a neural net (using reinforcement learning) teaches the toy car to "drive" (BTW, the code runs

I haven't personally explored the reading material on the Resources tab there, but it all looks useful. Also: the Deep Learning book (Goodfellow, Bengio, Courville) mentioned on the Resources tab was independently recommended by mathematical little bro' Jon Bloom. And two independent recommendations implies it must be good. That's like a theorem, right?

*in the browser--*I mean WOW.)I haven't personally explored the reading material on the Resources tab there, but it all looks useful. Also: the Deep Learning book (Goodfellow, Bengio, Courville) mentioned on the Resources tab was independently recommended by mathematical little bro' Jon Bloom. And two independent recommendations implies it must be good. That's like a theorem, right?

(Umm....no 'tisn't. But I do trust JB's recs about this stuff, so feel confident passing them on.)

OK, back to basics. Last time, I talked about a particular type of problem a machine learning algorithm might want to tackle:

One has a bunch of data points in some high-dimensional parameter spaceRn (where the n parameters are whatever some particular human has decided would be useful to determine the answer to some yes/no question like "Is a particular e-mail spam?"), and these data points have been "pre-classified." That is, each data point has already been assigned a "yes/no" label (e.g., a human has gone through the data somehow and labeled each data point manually as "Yes-spam" or "No-not spam"). Here's a picture of what this might look like in a 2-dimensional parameter space (Red = "Y", Pencil = "N"). Sorry this is so low-tech, but I'm just a mathematician:

Now suppose we (humans) guess that the data isa⃗ ∈Rn and shift b∈R . (A reasonable assumption in the silly example pic above.)

The canonical first example of a machine learning algorithm (which you may as well think of a computer program) is one that will use

a⃗ ∈Rn and b specifying the hyperplane.

OK, so what should the machine do?

Well...what would a human do?

and spits out a number that we refer to as thea⃗ ,b . The cost function should be defined in a reasonable enough way that we really feel it's measuring how well we're doing. (High cost = bad.) It should also treat all data points equally. That is, it should be an average over individual cost functions for each data point. (It also helps if it's easy to compute its partial derivatives with respect to the parameters, but we're getting ahead of ourselves.)

Rn that parameterizes our x⃗ in this space represent e-mails). But we also have our space Rn+1 that parameterizes the possible (b,a⃗ ) in this space represent shifted hyperplanes in the data-parameterizing space Rn ).

At this point, it is notationally convenient to imbed the data-parameterizing spaceD≅Rn into D′≅Rn+1 via the map
x⃗ :=(x1,…,xn)↦x⃗ ′:=(1,x1,…,xn).
Doing this imbedding allows us to treat the b associated to a hyperplane on equal footing with the components ai of the vector a⃗ =(a1,…,an) specifying its orthogonal direction.

Now suppose we have a pointa⃗ ′:=(a0,a1,…,an)∈Rn+1 . Then remembering that if x⃗ ∈D then x⃗ ′=(1,x⃗ ) denotes its imbedding as the 1 level in D′ , the hyperplane in D determined by a⃗ ′ is the set of points x⃗ ∈D for which a⃗ ′⋅x⃗ ′=0 (this is the hyperplane specified by orthogonal vector (a1,…,an) and shift −a0 ).

Relative ignorance disclosed, I proceed:

Remember that for each data pointx⃗ , we have a "Y/N" label. Let's turn these into "1/0" labels. That is, for each data point x⃗ , define
y(x⃗ )={10if the label on x⃗ is ``Y"if the label on x⃗ is ``N"

Remember also that for a particular choice of parametersa⃗ ′∈Rn+1 specifying the separating hyperplane (see notational digression above), the output guessed by the model on a point x⃗ ∈Rn is
σ(a⃗ ′⋅x⃗ ′)∈(0,1),
where σ is the sigmoid function
σ(t)=11+e−t.
We can (and should, I suppose) think of
ha⃗ ′(x⃗ ):=σ(a⃗ ′⋅x⃗ ′)
as the a⃗ ′=(a0,a1,…,an) ).

The cross-entropy "cost" or "loss" associated to hyperplane parametersa⃗ ′ and a particular data point x⃗ is then defined as:
Ja⃗ ′(x⃗ ):=−y(x⃗ )ln(ha⃗ ′(x⃗ ))−(1−y(x⃗ ))ln(1−ha⃗ ′(x⃗ )),
and if we have N pre-labeled data points x⃗ (1),…,x⃗ (N) , we define the cost to be the average of the individual costs:
Ja⃗ {x⃗ (1),…,x⃗ (N)}:=1N∑i=1NJa⃗ ′(x⃗ (i)).

Let's pause for a sec, because it's worth noting how much less complicated this function is than it appears at first glance. Remember thaty(x⃗ ) is only ever 0 or 1 , so for each data point x⃗ (i) one of the two terms in the sum drops out and the other is just the negative of the natural log of an expression involving the difference between the 0 or 1 ) and the 0 and 1 ). As far as I can tell, the main reason for the appearance of the natural log here is because:

a⃗ ′ , calculus tells us how we might do a bit better (at least nearby).

Remember we have our spaceD=Rn that parameterizes our Rn+1 that parameterizes the possible R and locate the point(s) in our hyperplane-parameterizing space that

Here is where calculus helps us. Remember that if we have a real-valued smooth function of several parameters and we are interested in finding an

This tells us that to find an absolute minimum of the cost function we should look for places where the partial derivatives of the cost function vanish (aka

Since it is not always easy (read: computationally advantageous) to find the critical points of a cost function directly, what is done in practice is to compute the
∇(J):=(∂J∂a0,∂J∂a1…,∂J∂an,)
and take small-ish steps in the negative gradient direction until your gradient gets close enough to 0⃗ that you're pretty darned sure you're close to a critical point. This is called

(WARNING: Keep your wits about you. The gradient of the cost function is computed with respect to the

Of course, one of the big problems with gradient descent is that unless you happen to know in advance that your function has only one critical point and that it is definitely a local minimum, then the result of performing gradient descent will not necessarily find you a global minimum. I'll probably try to talk about this at greater length later. Let's ignore it for now, compute the gradient of the cross-entropy cost function, and call it a day.

σ(t):=11+e−t :
dσdt=σ(t)(1−σ(t)).
This fact also uses the chain rule in the first step. The rest is just algebra:
dσdt===e−t(1+e−t)211+e−t⋅(1+e−t)−11+e−tσ(t)(1−σ(t))

Now it will turn out that
∂J∂aj=1N∑i=1N(ha⃗ ′(x⃗ (i))−y(i))x(i)j,
where we recall that ha⃗ ′(x⃗ ):=σ(a⃗ ′⋅x⃗ ′) . This derivation has some slightly non-obvious steps, so I include it here. Note that the partial derivative of a sum is the sum of partial derivatives, so we just need to verify that the partial derivative ∂J∂aj associated to a particular data point x⃗ (i) has the form we see inside the summation above. If y(x⃗ (i))=1 , the chain rule and the fact above about the derivative of the sigmoid function tells us this is:

===−y(x⃗ (i))⋅1σ(a⃗ ′⋅(x⃗ (i))′)⋅σ(a⃗ ′⋅(x⃗ (i))′)⋅(1−σ(a⃗ ′⋅(x⃗ (i))′)⋅x(i)jy(x⃗ (i))⋅(σ(a⃗ ′⋅(x⃗ (i))′)−1)⋅x(i)j(ha⃗ (x⃗ (i))−y(x⃗ (i)))⋅x(i)j

where in the last step we are using the assumption thaty(x⃗ (i))=1 . Similarly, if y(x⃗ (i))=0 , the summand reads:

===(1−y(x⃗ (i)))⋅11−σ(a⃗ ′⋅(x⃗ (i))′)⋅σ(a⃗ ′⋅(x⃗ (i))′)⋅(1−σ(a⃗ ′⋅(x⃗ (i))′)⋅x(i)j(1−y(x⃗ (i)))⋅(σ(a⃗ ′⋅(x⃗ (i))′)⋅x(i)j(ha⃗ (x⃗ (i))−y(x⃗ (i)))⋅x(i)j

where again in the last step, we are using the assumption thaty(x⃗ (i))=0 .

OK, I'm tired now and probably you are too. I'm also sure the equations here are riddled with sign errors and other less forgivable mistakes. Please tell me if you notice any (forgivable or not). I'll try to have fewer equations in later posts.

One has a bunch of data points in some high-dimensional parameter space

Now suppose we (humans) guess that the data is

*linearly separable*, that is we guess that the*decision boundary*between the "Y" region and the "N" region of the parameter space is just a hyperplane determined by some orthogonal vectorThe canonical first example of a machine learning algorithm (which you may as well think of a computer program) is one that will use

- our pre-labeled data and
- our assumption that the data is linearly separable

OK, so what should the machine do?

Well...what would a human do?

**Cost function****That's right! Start by guessing**

*something*, then keep trying to improve. Of course, this begs the question, "Improve what?" Without getting too philosophical about it, we need to define some*cost function*whose input is- our pre-labeled data and
- our current guess
a⃗ ∈Rn andb∈R

and spits out a number that we refer to as the

*cost*associated to the guess**Minor but Important notational digression****Remember we have our space**

*data*(in our spam example, points*separating hyperplanes*(pointsAt this point, it is notationally convenient to imbed the data-parameterizing space

*shift*(often called the*bias*in the literature)Now suppose we have a point

**Cross-entropy cost function****A commonly used cost function for a problem of this type is the**

*cross-entropy cost function.*Before describing it, let me come clean and admit that because of my lack of prior training in statistics I don't yet feel in my bones why this is the "right" choice. I will say that Chapter 2: Classification and Logistic Regression in Andrew Ng's CS229 lecture notes here helped me, as did the discussion of the cross-entropy cost function in Chapter 3 of Michael Nielsen's Deep Learning book (esp. the animations). I may try to write more about this later when I'm smarter.Relative ignorance disclosed, I proceed:

Remember that for each data point

Remember also that for a particular choice of parameters

*probability that the label on*x⃗ is "Y" (according to the model specified by the vector of parametersThe cross-entropy "cost" or "loss" associated to hyperplane parameters

Let's pause for a sec, because it's worth noting how much less complicated this function is than it appears at first glance. Remember that

*actual label*(which is either*predicted label*(which is some real number between- The more wrong you are, the more you are penalized: Think of the behavior of
−ln(t) for values of t between 0 and 1. Its main feature is that ast goes to0 ,−ln(t) goes to∞ , and ast goes to1 ,−ln(t) goes to0 . As a result, when the differencey(x⃗ )−ha⃗ ′(x⃗ ) between the actual label and the predicted label is large (close to1 ), then1 minus this difference is close to0 , so when we take−ln of this number, we get something*huge.*The hugeness goes to0 as the discrepancy between the actual label and the predicted label goes to0 . - Having
ln in the expression makes it easier to take partial derivatives of the cost function: Read the next section to understand why this is.

**Gradient descent****Now that we have a way of measuring how well we've done for a particular choice of parameters**

Remember we have our space

*data*, and we have our space*separating hyperplanes*. We now want to think of our cost function as a function from our*hyperplane-parameterizing*space to*minimize cost*.Here is where calculus helps us. Remember that if we have a real-valued smooth function of several parameters and we are interested in finding an

*absolute*minimum of the function (and our domain is boundary-less), we know for sure that this absolute minimum will occur at a*local*minimum of the function. And every local minimum of the function will in particular have the property that all of the partial derivatives of the function vanish there.This tells us that to find an absolute minimum of the cost function we should look for places where the partial derivatives of the cost function vanish (aka

*critical points*of the function). As a first pass, this is a great idea as long as we remember that*not every critical point is a local minimum*, and*not every local minimum is an absolute minimum.*If you like Morse theory, you don't need to be convinced of those two statements. If you don't yet like Morse theory, the following two pics may convince you (again, low tech--sorry):A portion of the graph of a real-valued function whose domain is 2-dimensional. The partial derivatives at the black dot vanish, but the black dot is not a local minimum. |

A portion of the graph of a real-valued function whose domain is 1-dimensional. The black dot is a local minimum but not an absolute minimum. |

Since it is not always easy (read: computationally advantageous) to find the critical points of a cost function directly, what is done in practice is to compute the

*gradient*of the cost as a function of the*hyperplane-parameterizing space*:*gradient descent*, and it's what puts the "learning" in "machine learning." To first order, it's really that simple.(WARNING: Keep your wits about you. The gradient of the cost function is computed with respect to the

*hyperplane-parameterizing space*NOT the*data-parameterizing space*. The machine is trying to learn the best hyperplane to fit the data, which it is treating as a constant.)Of course, one of the big problems with gradient descent is that unless you happen to know in advance that your function has only one critical point and that it is definitely a local minimum, then the result of performing gradient descent will not necessarily find you a global minimum. I'll probably try to talk about this at greater length later. Let's ignore it for now, compute the gradient of the cross-entropy cost function, and call it a day.

**Gradient of cross-entropy cost function (+ a useful property of the sigmoid function)****Now we just dust off the ole' chain rule, keeping in mind the following useful observation about the sigmoid function**

Now it will turn out that

where in the last step we are using the assumption that

where again in the last step, we are using the assumption that

OK, I'm tired now and probably you are too. I'm also sure the equations here are riddled with sign errors and other less forgivable mistakes. Please tell me if you notice any (forgivable or not). I'll try to have fewer equations in later posts.

sumber :http://iamalearningcomputer.blogspot.com/

## No comments for "How a machine learns"