What is the Problem?
Suppose you have data $x$ representing a set of observations. These observations could be face and non-face images, temperatures for two successive weeks, or head/tail results of a coin-flipping experiment. We can argue that the data is generated from a world that has a set of states $w$, where each state $w_i \in w$ is responsible for generating a certain subset of observations $o \subset x$.
The problem then becomes: how can we find the world state $w_i \in w$ for every new observation $x_c$ that may or may not belong to $x$? A naive solution arises from the fact that each world state $w_i \in w$ is associated with a certain subset of observations $o \subset x$. We can simply find the nearest-neighbor observation $x_n \in x$ to $x_c$ and give $x_c$ the same state as $x_n$. Unfortunately, this only works in a perfect world whose states are easily distinguishable from each other.
In the real world, a new observation $x_c$ may be compatible with more than one world state due to noise in the measurement process. The best way to handle this ambiguity is to calculate the tendency of the new observation $x_c$ toward each world state $w_i \in w$, which can be described by a posterior probability distribution $P(w|x)$ over the world states.
In order to form the posterior distribution, we need to define a model that relates the observations $x$ to the world states $w$. This model is controlled by a set of parameters $\Theta$ that describe the relationship between $x$ and $w$.
We also need a learning algorithm to fit the model parameters $\Theta$ to given training data of the form $\{(x_i, w_i) : x_i \in x, w_i \in w\}$. Finally, an inference algorithm is required to infer the world state of a new observation by calculating the posterior probability distribution $P(w|x)$ over all world states.
The most important component of this solution is the model. If we model the data correctly, we will obtain a good posterior probability distribution. Models can be classified into two general categories: generative models and discriminative models.
Discriminative models describe the probability of the world state given the data, $P(w|x)$. In order to model $P(w|x)$, we usually choose a distribution $P(w)$ over the world states and estimate the distribution parameters $\Theta$ using the given training data $(x_i, w_i)$. The posterior distribution is then written as $P(w|x,\Theta)$. For example, if the world state is continuous, we can model $P(w)$ using a normal distribution and make the mean $\mu$ and variance $\sigma^2$ functions of the data $x$. Then we can use a technique such as ML (Maximum Likelihood) to choose the best parameters $\hat{\Theta}$, where
$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} P(w|x,\Theta)$$

Generative models describe the probability of the data given the world state, $P(x|w)$. In order to model this probability, we choose a probability distribution $P(x)$ over the data and estimate the distribution parameters $\Theta$ as a function of the world state $w$. For example, we can choose a Bernoulli distribution $P(x) = \lambda^x (1-\lambda)^{1-x}$ for binary data and make the parameter $\lambda$ a function of the world state. Again, we need to fit the parameters $\Theta$ using a set of training examples $(x_i, w_i)$, so our distribution is now written as $P(x|w,\Theta)$ since it also depends on the fitted parameters. Our inference algorithm then works by calculating the posterior distribution $P(w|x)$, incorporating a prior $P(w)$ using Bayes' rule:
$$P(w|x) = \frac{P(x|w)P(w)}{\int P(x|w)P(w)\,dw}$$
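As a rough illustration of this generative recipe, here is a minimal Python sketch (the likelihood and prior values below are placeholders, not taken from the text) that turns per-state likelihoods and a prior into a posterior with Bayes' rule:

```python
import numpy as np

def posterior(likelihoods, prior):
    """Bayes' rule for a discrete set of world states.

    likelihoods[i] = P(x | w_i, Theta), prior[i] = P(w_i).
    Returns the normalized posterior P(w_i | x).
    """
    joint = np.asarray(likelihoods) * np.asarray(prior)
    return joint / joint.sum()

# Hypothetical example: two world states with made-up likelihood values.
print(posterior([0.02, 0.08], [0.5, 0.5]))  # -> [0.2, 0.8]
```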
Motivating Example
One of the simplest models for a binary random variable is the Bernoulli distribution. By random variable we mean a variable $y$ that can take any random value, and by binary we restrict $y$ to only two values, $0$ and $1$. The Bernoulli distribution is described by
$$P(x) = \lambda^x (1-\lambda)^{1-x}$$
which represents the probability of success $\lambda$ if $x=1$ and of failure $1-\lambda$ if $x=0$. Now, consider that we have two coins, $w_1$ and $w_2$, one a quarter and the other a dime. Moreover, these coins aren't fair; their material weight may be unbalanced along the coin surface. If we flip every coin 10 times and record the outcomes, we will have something like

Coin-1 outcome = {H,T,H,H,T,H,H,T,H,T}
Coin-2 outcome = {T,T,H,T,T,H,H,T,T,T}
where H means head and T means tail. Assume that these outcomes are our prior knowledge about how fair the coins are. For example, we can measure the bias of coin-1 toward H by dividing the number of times heads appeared by the total number of outcomes, which gives $6/10 = 0.6$; similarly, coin-2 has a head bias of $0.3$.

Now assume we have the outcome of 10 new flips, $x = \{T,H,T,H,T,T,H,T,H,T\}$, but we don't know which coin was used to produce it. Can we measure a degree of belief for each coin given the known prior knowledge? We can infer such a belief by forming a posterior distribution over the two coins as our world states. Let us take a generative model for our posterior distribution:
$$P(w|x,\Theta) = \frac{P(x|w,\Theta)\,P(w|\Theta)}{\sum_{w} P(x|w,\Theta)\,P(w|\Theta)}$$
Let $P(x|w,\Theta) = \mathrm{Bern}_x(\lambda_w)$, where we have chosen the Bernoulli distribution to model the likelihood, with $\lambda_w$ as the only unknown parameter that needs to be estimated. Given that we have two world states $w_1$ and $w_2$, we have
$$P(x|w_1,\Theta) = \mathrm{Bern}_x(\lambda_1) = \prod_{i=1}^{n} \mathrm{Bern}(x_i, \lambda_1)$$
$$P(x|w_2,\Theta) = \mathrm{Bern}_x(\lambda_2) = \prod_{i=1}^{n} \mathrm{Bern}(x_i, \lambda_2)$$
where $x$ is a set of independent binary ($H=1$, $T=0$) measurements. This describes our model, which can be plugged into Bayes' rule to infer the posterior distribution. But first we need to learn the model parameters $\lambda_1, \lambda_2$ from the prior knowledge about the two coins. We can do this using a maximum likelihood approach. Considering the prior outcome of the first coin, $\{H,T,H,H,T,H,H,T,H,T\}$, we have
$$\hat{\lambda}_1 = \operatorname*{argmax}_{\lambda_1} P(x|\lambda_1) = \operatorname*{argmax}_{\lambda_1} \prod_{i=1}^{n_1} \mathrm{Bern}(x_i, \lambda_1) = \operatorname*{argmax}_{\lambda_1} \lambda_1^{6}(1-\lambda_1)^{4} = \operatorname*{argmax}_{\lambda_1} \big(6\ln\lambda_1 + 4\ln(1-\lambda_1)\big)$$
where the last step uses the fact that the logarithm is monotonic. Taking the derivative of the log-likelihood with respect to $\lambda_1$ and setting it to zero gives
$$\frac{d}{d\lambda_1}\big(6\ln\lambda_1 + 4\ln(1-\lambda_1)\big) = \frac{6}{\lambda_1} - \frac{4}{1-\lambda_1} = 0 \;\Rightarrow\; 6(1-\lambda_1) - 4\lambda_1 = 0 \;\Rightarrow\; 6 - 10\lambda_1 = 0 \;\Rightarrow\; \lambda_1 = 0.6$$
This is exactly the head bias of the coin. So we can set $\lambda_1 = 0.6$ and $\lambda_2 = 0.3$. We will also assume a uniform prior over the world states, $P(w_1|\Theta) = P(w_2|\Theta) = 0.5$.
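As a sanity check on this maximum-likelihood estimate, here is a minimal Python sketch (variable names are ours, not from the text) that computes the head bias of each coin by counting:

```python
# Prior flip outcomes for the two coins (H = 1, T = 0).
coin1 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
coin2 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

def bernoulli_mle(flips):
    """Maximum-likelihood estimate of the head bias: #heads / #flips."""
    return sum(flips) / len(flips)

lambda1 = bernoulli_mle(coin1)  # 0.6
lambda2 = bernoulli_mle(coin2)  # 0.3
print(lambda1, lambda2)
```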
For the new $x = \{T,H,T,H,T,T,H,T,H,T\}$, we can measure the tendency of these measurements toward each world state as
$$P(x|w_1,\lambda_1) = \prod_{i=1}^{n} \lambda_1^{x_i}(1-\lambda_1)^{1-x_i} = \lambda_1^{4}(1-\lambda_1)^{6} \approx 0.00053$$
$$P(x|w_2,\lambda_2) = \prod_{i=1}^{n} \lambda_2^{x_i}(1-\lambda_2)^{1-x_i} = \lambda_2^{4}(1-\lambda_2)^{6} \approx 0.00095$$
and the posterior distribution becomes
$$P(w_1|x) = \frac{0.00053 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} \approx 0.36$$
$$P(w_2|x) = \frac{0.00095 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} \approx 0.64$$
For another new set $x = \{H,H,H,H,T,H,H,T,H,H\}$, we have
$$P(w_1|x) \approx 0.988 \qquad P(w_2|x) \approx 0.012$$
We can see that our posterior model measures the tendency of the head bias of the new measurements toward the prior bias of the two coins. For the first set of measurements, we are 0.64 certain that the new $x$ comes from the second coin, whereas for the second set, we are 0.988 certain that $x$ comes from the first coin.
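The numbers above can be reproduced with a short Python sketch (a minimal illustration under the same assumptions; the function and variable names are ours):

```python
import numpy as np

def bernoulli_likelihood(flips, lam):
    """P(x | lambda) for an independent sequence of flips (H = 1, T = 0)."""
    flips = np.asarray(flips)
    return np.prod(lam ** flips * (1 - lam) ** (1 - flips))

def coin_posterior(flips, lambdas, prior):
    """Posterior P(w_i | x) over the coins via Bayes' rule."""
    joint = np.array([bernoulli_likelihood(flips, lam) for lam in lambdas]) * prior
    return joint / joint.sum()

lambdas = [0.6, 0.3]
prior = np.array([0.5, 0.5])

x1 = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]   # {T,H,T,H,T,T,H,T,H,T}
x2 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # {H,H,H,H,T,H,H,T,H,H}
print(coin_posterior(x1, lambdas, prior))  # -> approximately [0.36, 0.64]
print(coin_posterior(x2, lambdas, prior))  # -> approximately [0.988, 0.012]
```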
Mixture of Bernoulli Distributions
Now, suppose that we have a collection of dime coins $C_1$ and another collection of quarter coins $C_2$. We also know that within each collection, some coins are fair and others are biased. Consider that we flip each coin 10 times, so we have a collection of sets $C_i = \{x_q \mid q = 1,\dots,n_i \text{ and } |x_q| = 10\}$, where $n_i$ is the number of coins in the $i$-th collection.
Although we can model each coin individually using a single Bernoulli distribution as we did before, we now need to fit a distribution for each collection of coins. If a single Bernoulli distribution describes the flipping process of one coin, we need a way to mix the distributions of all coins within a given collection. We may also have some coins in the collection that were never flipped, or others with several sets of flips, which calls for a mixture distribution that weights the individual coin distributions relative to one another.
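Although the fitting procedure is left for a later part, one common way to write such a mixture (our notation, not the original's) is $P(x) = \sum_{k=1}^{K} \pi_k\,\mathrm{Bern}_x(\lambda_k)$ with mixing weights $\pi_k \ge 0$ and $\sum_k \pi_k = 1$. A minimal Python sketch of evaluating this mixture, assuming the weights and biases are already known rather than learned:

```python
import numpy as np

def bernoulli_mixture_likelihood(flips, weights, lambdas):
    """P(x) = sum_k pi_k * prod_i Bern(x_i | lambda_k) for one flip sequence."""
    flips = np.asarray(flips)
    total = 0.0
    for pi_k, lam in zip(weights, lambdas):
        total += pi_k * np.prod(lam ** flips * (1 - lam) ** (1 - flips))
    return total

# Hypothetical collection: two kinds of coins, one fair and one biased toward heads.
weights = [0.5, 0.5]   # mixing weights (assumed, not fitted)
lambdas = [0.5, 0.8]   # per-coin head biases (assumed, not fitted)
print(bernoulli_mixture_likelihood([1, 0, 1, 1, 1, 1, 0, 1, 1, 1], weights, lambdas))
```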
To be continued ...