### What is the Problem?

Suppose you have data \(x\) representing a set of observations. These observations could be face and non-face images, temperatures for two successive weeks, or head/tail results of a coin-flipping experiment. We can argue that the data is generated from a world that has a set of states \(w\), and that each state \(w_i \in w\) is responsible for generating a certain subset of observations \(o \subset x\).

The problem then becomes: how can we find the world state \(w_i \in w\) for every new observation \(x_c\) that may or may not belong to \(x\)? A naive solution arises from the fact that each world state \(w_i \in w\) is responsible for a certain subset of observations \(o \subset x\). So, we can find the nearest-neighbor observation \(x_n \in x\) to \(x_c\) and give \(x_c\) the same state as \(x_n\). Unfortunately, this only works in a perfect world whose states are easily distinguishable from each other.
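The naive nearest-neighbor rule above can be sketched in a few lines. The 1-D observations, labels, and distance function below are illustrative assumptions, not part of the text:

```python
def nearest_neighbor_state(x_c, observations, states):
    """Assign x_c the state of its closest observation (1-NN)."""
    distances = [abs(x_c - x_i) for x_i in observations]
    return states[distances.index(min(distances))]

# Toy 1-D data: temperatures labelled with two world states.
observations = [10.0, 12.0, 30.0, 32.0]
states = ["cold", "cold", "hot", "hot"]

print(nearest_neighbor_state(11.0, observations, states))  # "cold"
print(nearest_neighbor_state(31.0, observations, states))  # "hot"
```

This works here only because the two states are cleanly separated; as the text notes, it breaks down once states overlap.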

In the real world, a new observation \(x_c\) may be compatible with more than one world state due to noise in the measurement process. The best solution to this ambiguity is to calculate the tendency of the new observation \(x_c\) toward each world state \(w_i \in w\), which can be described by a posterior probability distribution \(P(w|x)\) over the world states.

In order to form the posterior distribution, we need to define a model that relates the observations \(x\) to the world states \(w\). This model is controlled by a set of parameters \(\Theta\) that describe the relationship between \(x\) and \(w\).

We also need a learning algorithm to fit the model parameters \(\Theta\) on given training data of the form \(\{(x_i, w_i) : x_i \in x, w_i \in w\}\). Finally, an inference algorithm is required to infer the world state of a new observation \(x_n\) by calculating the posterior probability distribution \(P(w|x)\) over all world states.

The most important component of the solution is the model. If we model the data correctly, we will obtain a good posterior probability distribution. We can classify models into two general categories: generative models and discriminative models.

Discriminative models describe the probability of the world state given the data, \(P(w|x)\). In order to model \(P(w|x)\), we usually choose a distribution \(P(w)\) over the world states and estimate the distribution parameters \(\Theta\) using the given training data \(\{(x_i, w_i)\}\). So, the posterior distribution is now written as \(P(w|x,\Theta)\). For example, if the world state is continuous, we can model \(P(w)\) using a normal distribution and make the mean \(\mu\) and variance \(\sigma^2\) functions of the data \(x\). Then, we can use a technique such as ML (Maximum Likelihood) to choose the best parameters \(\widehat{\Theta}\), where $$\widehat{\Theta} =\operatorname*{arg\,max}_{\Theta} P(w|x,\Theta)$$

Generative models describe the probability of the data given the world state, \(P(x|w)\). In order to model this probability, we choose a probability distribution \(P(x)\) over the data and estimate the distribution parameters \(\Theta\) as a function of the world state \(w\). For example, we can choose a Bernoulli distribution \(P(x) = \lambda ^x \left( {1 - \lambda } \right)^{1 - x}\) for binary data and make the parameter \(\lambda\) a function of the world states. Again, we need to fit the parameters \(\Theta\) using a set of training examples \(\{(x_i, w_i)\}\), so our distribution is now written as \(P(x|w, \Theta)\) as it also depends on the fitted parameters. Our inference algorithm will work by calculating the posterior distribution \(P(w|x)\), incorporating a prior \(P(w)\) using Bayes' rule, $$P(w|x) = \frac{P(x|w) P(w)}{\int P(x|w) P(w)\, dw}$$
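For a discrete set of world states, the generative inference step is just "likelihood times prior, normalized over states". A minimal sketch, with the function name and toy numbers as illustrative assumptions:

```python
def posterior(likelihoods, priors):
    """P(w_i|x) = P(x|w_i) P(w_i) / sum_j P(x|w_j) P(w_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # the normalizing denominator of Bayes' rule
    return [j / evidence for j in joint]

# Two states with equal priors and unequal likelihoods.
print(posterior([0.2, 0.6], [0.5, 0.5]))  # [0.25, 0.75]
```

The integral in the denominator of Bayes' rule becomes a plain sum when the world states are discrete, which is the case in the coin example below.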


### Motivating Example

One of the simplest models for a binary random variable is the Bernoulli distribution. By random variable we mean a variable \(x\) that can take a random value, and by binary we restrict \(x\) to only two values, \(0\) and \(1\). The Bernoulli distribution is described by $$P(x) = \lambda ^x \left( {1 - \lambda } \right)^{1 - x}$$ which gives the probability of success, \(\lambda\), if \(x=1\) and of failure, \(1-\lambda\), if \(x=0\). Now, consider we have two coins, \(w_1\) and \(w_2\); one of them is a quarter and the other is a dime. Moreover, these coins aren't fair: their material weight may be unbalanced along the coin surface. If we flip every coin 10 times and record the outcomes, we will have something like

Coin-1 outcome = \(\{H,T,H,H,T,H,H,T,H,T\}\)

Coin-2 outcome = \(\{T,T,H,T,T,H,H,T,T,T\}\)

where \(H\) means head and \(T\) means tail. Assume that these outcomes are our prior knowledge about how fair the coins are. For example, we can measure the bias of coin-1 toward \(H\) by dividing the number of times heads appeared by the total number of outcomes, $$\lambda_1 = \frac{6}{10} = 0.6$$ Similarly for coin-2, the bias equals \(\lambda_2=0.3\). Now assume we have the outcome of 10 new flips, \(x=\{T,H,T,H,T,T,H,T,H,T\}\), but we don't know which coin was used to produce it. Can we measure a degree of belief for each coin given the known prior? We can infer such a belief by forming a posterior distribution over the two coins as our world states. Let us take a generative model for our posterior distribution, $$P(w|x,\Theta)= \frac{P(x|w,\Theta)P(w|\Theta)}{\sum\limits_{w}P(x|w,\Theta)P(w|\Theta)}$$ Let \(P(x|w,\Theta)=Bern_x(\lambda_w)\), where we have chosen the Bernoulli distribution to model the likelihood with \(\lambda_w\) as the only unknown parameter that needs to be estimated. Given that we have two world states \(w_1\) and \(w_2\), we have $$P(x|w_1,\Theta)=Bern_{x}(\lambda_1)=\prod\limits_{i = 1}^n {Bern \left( {x_i, \lambda_1 } \right)}\\P(x|w_2,\Theta)=Bern_{x}(\lambda_2)=\prod\limits_{i = 1}^n {Bern \left( {x_i, \lambda_2 } \right)}$$ where \(x\) is a set of independent binary \((H=1,T=0)\) measurements. This describes our model, which can be plugged into Bayes' rule to infer the posterior distribution. But first we need to learn the model parameters \(\lambda_1, \lambda_2\) from the prior knowledge of the two coins. We can do this using a maximum likelihood approach. So, considering the prior outcome of the first coin, \(\{H,T,H,H,T,H,H,T,H,T\}\), we have $$\begin{aligned}\widehat{\lambda}_1 &= \operatorname*{arg\,max}_{\lambda_1} P(x|\lambda_1) = \operatorname*{arg\,max}_{\lambda_1} Bern_x(\lambda_1)\\&= \operatorname*{arg\,max}_{\lambda_1} \prod\limits_{i = 1}^{n_1} {Bern \left( {x_i, \lambda_1 } \right)} \\ &= \operatorname*{arg\,max}_{\lambda_1} \lambda_1 ^{6} \left( {1 - \lambda_1 } \right)^{4} = \operatorname*{arg\,max}_{\lambda_1} \ln\left( \lambda_1 ^{6} \left( {1 - \lambda_1 } \right)^{4}\right) \\&= \operatorname*{arg\,max}_{\lambda_1} 6 \ln(\lambda_1) + 4 \ln\left( {1 - \lambda_1 } \right) \end{aligned}$$ and by taking the derivative of \(\ln(P(x|\lambda_1))\) with respect to \(\lambda_1\) and setting it to zero, we have $$\begin{aligned}\frac{d\ln(P(x|\lambda_1))}{d\lambda_1} &= \frac{6}{\lambda_1} - \frac{4}{1-\lambda_1}=0\\ &\Rightarrow 6(1-\lambda_1) - 4 \lambda_1 = 0\\ &\Rightarrow 6-10\lambda_1 = 0\\ &\Rightarrow \lambda_1 = 0.6 \end{aligned}$$ This is exactly the coin's head bias. So, we can set \(\lambda_1 = 0.6\) and \(\lambda_2 = 0.3\).
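The derivation above says the maximum-likelihood estimate of a Bernoulli bias is simply the fraction of heads. A minimal sketch that checks this numerically (the helper names and the coarse grid search are illustrative assumptions):

```python
import math

def bernoulli_log_likelihood(lam, heads, tails):
    # log of lam^heads * (1 - lam)^tails
    return heads * math.log(lam) + tails * math.log(1 - lam)

def mle_lambda(outcome):
    # Closed-form ML estimate: fraction of heads in the outcome.
    return outcome.count("H") / len(outcome)

coin1 = list("HTHHTHHTHT")   # prior outcome of coin-1
coin2 = list("TTHTTHHTTT")   # prior outcome of coin-2

print(mle_lambda(coin1))  # 0.6
print(mle_lambda(coin2))  # 0.3

# A coarse grid search over lambda agrees with the closed form for coin-1.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda l: bernoulli_log_likelihood(l, 6, 4))
print(best)  # 0.6
```

The grid search is only a sanity check; the derivative argument in the text gives the maximizer exactly.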
We will also assume a uniform prior over the two world states, \(P(w_1|\Theta) = P(w_2|\Theta) = 0.5\).

For a new \(x=\{T,H,T,H,T,T,H,T,H,T\}\), we can measure the tendency of these measurements toward each world state as $$P(x|w_1,\lambda_1)=\prod\limits_{i = 1}^n {\lambda_1 ^{x_i} \left( {1 - \lambda_1 } \right)^{1 - x_i}}=\lambda_1 ^{4} \left( {1 - \lambda_1 } \right)^{6}=0.00053\\P(x|w_2,\lambda_2)=\prod\limits_{i = 1}^n {\lambda_2 ^{x_i} \left( {1 - \lambda_2 } \right)^{1 - x_i}} = \lambda_2 ^{4} \left( {1 - \lambda_2 } \right)^{6}=0.00095$$ and the posterior distribution becomes $$P(w_1|x) = \frac{0.00053 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} = {0.36}\\P(w_2|x) = \frac{0.00095 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} = {0.64}$$ For another new set \(x=\{H,H,H,H,T,H,H,T,H,H\}\), we have $$P(w_1|x) = {0.988}\\P(w_2|x) = {0.012}$$ We can see that our posterior model measures the tendency of the head bias of the new measurements toward the prior bias of the two coins. For the first set of measurements, we are 0.64 certain that the new \(x\) comes from the second coin, while for the second set, we are 0.988 certain that \(x\) comes from the first coin.
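The whole worked example can be reproduced in a short script. A minimal sketch, with helper names as illustrative assumptions; the biases and uniform prior come from the text:

```python
def bernoulli_likelihood(outcome, lam):
    """Product of per-flip Bernoulli probabilities (H=1, T=0)."""
    p = 1.0
    for flip in outcome:
        p *= lam if flip == "H" else (1 - lam)
    return p

def coin_posterior(outcome, lambdas, priors):
    # Bayes' rule over a discrete set of world states (the two coins).
    joint = [bernoulli_likelihood(outcome, l) * p
             for l, p in zip(lambdas, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

lambdas = [0.6, 0.3]   # fitted head biases of coin-1 and coin-2
priors = [0.5, 0.5]    # uniform prior over the two coins

print(coin_posterior(list("THTHTTHTHT"), lambdas, priors))  # ~ [0.36, 0.64]
print(coin_posterior(list("HHHHTHHTHH"), lambdas, priors))  # ~ [0.988, 0.012]
```

Note that with a uniform prior the factor 0.5 cancels in the normalization, so the posterior is driven entirely by the likelihood ratio.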

### Mixture of Bernoulli Distributions

Now, suppose that we have a collection of dime coins \(\mathcal{C}_1\) and another collection of quarter coins \(\mathcal{C}_2\). We also know that within each collection some coins are fair and others are biased. Consider that we flip each coin 10 times, so we have a collection of sets \(\mathcal{C}_i = \{x_q \mid q=1:n_i \text{ and } |x_q|=10\}\), where \(n_i\) is the number of coins in the \(i\)th collection.

Although we can model each coin individually with a single Bernoulli distribution as we did before, we now need to fit a distribution for each collection of coins. If we keep a single Bernoulli distribution for the flipping process of each coin, we need a way to mix the distributions of all coins within a collection. We may also have some coins in a collection that were never flipped, or others with several flipping sets, which calls for a mixture distribution that weights the individual coin distributions against each other.
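A minimal sketch of what such a mixture looks like for one collection: \(P(x) = \sum_k \pi_k \, Bern_x(\lambda_k)\), where the mixing weights \(\pi_k\) sum to 1. The weights and biases below are illustrative assumptions; fitting them (e.g. with expectation-maximization) is not covered in the text so far:

```python
def bernoulli_set_prob(outcome, lam):
    # Probability of a whole flipping set under one Bernoulli component.
    p = 1.0
    for flip in outcome:
        p *= lam if flip == "H" else (1 - lam)
    return p

def mixture_prob(outcome, weights, lambdas):
    # P(x) = sum_k pi_k * Bern(x | lambda_k)
    return sum(w * bernoulli_set_prob(outcome, l)
               for w, l in zip(weights, lambdas))

weights = [0.7, 0.3]   # assumed mixing weights: mostly fair coins
lambdas = [0.5, 0.8]   # assumed biases: a fair coin and a head-biased coin

print(mixture_prob(list("HHHHHHHHHT"), weights, lambdas))
```

A head-heavy set like this one gets most of its probability mass from the biased component, which is exactly the weighting behavior the text asks for.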


To be continued ...