Tuesday, September 17, 2013

Fitting Mixture Models Using Expectation Maximization (A Practical Example)

What is the Problem?

Suppose you have data \(x\) which represents a set of observations. These observations can be face and non face images, temperatures for two successive weeks or tail/head results of a coin-flipping experiment. We can argue that the data is generated from a world that has a set of states \(w\) and each state \(w_i \in w\) is responsible for generating a certain number of observations \(o \subset x\).

The problem becomes how can we find the world state \(w_i \in w\) for every new observation \(x_c\) that may or may not belong to \(x\). A naive solution to this problem rises from the fact that each world state \(w_i \in w\) has a certain number of observations \(o \subset x\). So, we can find the nearest neighbor observation \(x_n \in x\) to \(x_c\) and give \(x_c\) the same state as \(x_n\). Unfortunately, this only works in a perfect world that has its states easily distinguishable from each other.

In a real world a new observation \(x_c\) may be compatible with more than one world state due to a noise in the measurement process. The best solution to this ambiguity is to calculate the tendency of the new observation \(x_c\) to each world state \(w_i \in w\) which can be described by a posterior probability distribution \(P(w|x)\) over the world states.

In order to form the posterior distribution, we need to define a model that relates the observations \(x\) with the world states \(w\). This model is controlled by a set of parameters \(\Theta\) that describes the relationship between \(x\) and \(w\).

We also need a learning algorithm to fit the model parameters \(\Theta\) on a given training data in the form of \(\{(x_i, w_i) : x_i \in x, w_i \in w\}\). Finally, an inference algorithm is required to infer the world state of a new observation \(x_n\) by calculating the posterior probability distribution \(P(w|x)\) over all world states.

The important component of the solution is the model. If we modeled the data correctly, we will have a good posterior probability distribution. We can classify the models into two general categories, the generative models and the discriminative models.

The discriminative models describes the probability of the world state \(P(w|x)\) given the data. In order to model \(P(w|x)\), we usually choose a distribution \(P(w)\) over the world states and estimate the distribution parameters \(\Theta\) using the given training data \({(x_i, w_i)}\). So, the posterior distribution is now written as \(P(w|x,\Theta)\). For example, if the world state is continuous, we can model \(P(w)\) using a normal distribution and make the mean \(\mu\) and variance \(\sigma^2\) functions of the data \(x\). Then, we can use a technique such as ML (Maximum Likelihood) to choose the best parameters \(\widehat{\Theta}\) where, $$\widehat{\Theta} =\operatorname*{arg\,max}_{\Theta} P(\Theta|x,w)$$ The generative models describes the probability of the data \(P(x|w)\) given the world state. In order to model this probability, we choose a probability distribution over the data \(P(x)\) and estimate the distribution parameters \(\Theta\) as a function of the world state \(w\). For example, we can choose a Bernoulli distribution \(P(x) = \lambda ^x \left( {1 - \lambda } \right)^{1 - x}\) for a binary data and make the parameter \(\lambda\) a function of the world states. Again, we need to fit the parameters \(\Theta\) using a set of training examples \({(x_i, w_i)}\), so our distribution now is written as \(P(x|w, \Theta)\) as it also depend on the fitted parameters. Our inference algorithm will work by calculating the posterior distribution \(P(w|x)\) and incorporating a prior P(w) using Bayes' rule, $$P(w|x) = \frac{P(x|w) P(w)}{\int\limits P(x|w) P(w) dw}$$

Motivating Example

One of the simplest models for modeling the distribution of a binary random variable is the Bernoulli distribution. By random variable we mean a variable \(y\) that can take any random value, and by binary we restrict \(y\) to have only two values \(0, 1\). The Bernoulli distribution is described by, $$P(x) = \lambda ^x \left( {1 - \lambda } \right)^{1 - x}$$ which represents the probability of success \(\lambda\) if \(x=1\) and failure \(1-\lambda\) if \(x=0\). Now, consider we have two coins, \(w_1\) and \(w_2\), one of them is a quarter and the other is a dime. Moreover, these coins aren't fair. They may have the material weight unbalanced along the coin surface. If we perform an experiment of flipping every coin 10 times and recorded the outcome, we will have something like,

Coin-1 outcome = \(\{H,T,H,H,T,H,H,T,H,T\}\)
Coint-2 outcome = \(\{T,T,H,T,T,H,H,T,T,T\}\)

where \(H\) means head and \(T\) means tail. Assume that these outcomes are our prior knowledge about how fair are the coins. For example, we can measure the bias of coin-1 toward \(H\) by dividing the number of times heads appeared to the total number of outcomes, $$P(w_1) = \frac{6}{10} = 0.6$$ Similar for coin-2, we have the bias equals \(P(w_2)=0.3\). Assume we have the following outcome of a new 10 flipping experiments \(x=\{T,H,T,H,T,T,H,T,H,T\}\) but we don't know which coin was used for this outcome. Can we measure a degree of belief for each coin given the known prior?. We can infer such belief by forming a posterior distribution for the two coins as our world states. Let us take a generative model for our posterior distribution, $$P(w|x,\Theta)= \frac{P(x|w,\Theta)P(w|\Theta)
}{\sum\limits_{w}P(x|w,\Theta)P(w|\Theta)}$$Let \(P(x|w,\Theta)=Bern_x(\lambda_w)\) where we have chosen the Bernoulli distribution to model the posterior with \(\lambda_w\) as the only unknown parameter that needs to be estimated. Given that we have two world states \(w_1\) and \(w_2\), we have,$$P(x|w_1,\Theta)=Bern_{x}(\lambda_1)=\prod\limits_{i = 1}^n {Bern \left( {x_i, \lambda_1 } \right)}\\P(x|w_2,\Theta)=Bern_{x}(\lambda_2)=\prod\limits_{i = 1}^n {Bern \left( {x_i, \lambda_2 } \right)}$$ where \(x\) is a set of linearly independent binary \((H=1,T=0)\) measurements. This describes our model which can be plugged into the Bayes rule to infer the posterior distribution. But, we need to learn the model parameters \(\lambda_1, \lambda_2\) using the prior knowledge of the two coins. We can do this using a maximum likelihood approach. So, consider the prior outcome of the first coin \(\{H,T,H,H,T,H,H,T,H,T\}\), we have, $$\begin{aligned}\widehat{\lambda}_1 &= \operatorname*{arg\,max}_{\lambda_1} P(\lambda_1|x) = \operatorname*{arg\,max}_{\lambda_1} Bern_x(\lambda_1)\\&= \operatorname*{arg\,max}_{\lambda_1} \prod\limits_{i = 1}^{n_1} {Bern \left( {x_i, \lambda_1 } \right)} \\ &= \operatorname*{arg\,max}_{\lambda_1}  \lambda_1 ^{6} \left( {1 - \lambda_1 } \right)^{4} = \operatorname*{arg\,max}_{\lambda_1} ln( \lambda_1 ^{6} \left( {1 - \lambda_1 } \right)^{4}) \\&= \operatorname*{arg\,max}_{\lambda_1}  6 ln(\lambda_1) + 4 ln\left( {1 - \lambda_1 } \right)^{4} \end{aligned}$$ and by taking the derivative of \(ln(P(\lambda_1|x))\) with respect to \(\lambda_1\), we have,$$\begin{aligned}\frac{dln(P(\lambda_1|x))}{d\lambda_1} &= \frac{6}{\lambda_1} - \frac{4}{1-\lambda_1}=0\\ &= 6(1-\lambda_1) - 4 \lambda_1 = 0\\ &= 6-10\lambda_1 = 0\\ &\lambda_1 = 0.6 \end{aligned}$$ This is actually the coin head bias. So, we can set \(\lambda_1 = 0.6\) and \(\lambda_2 = 0.3\). We will also assume a uniform prior \(P(w|\Theta) = P(\lambda_w)=0.5\), and \(P(\lambda_1) =  P(\lambda_2) =0.5\).

For a new \(x=\{T,H,T,H,T,T,H,T,H,T\}\), we can measure the tendency of these measurements to each world state as,$$P(x|w_1,\lambda_1)=\prod\limits_{i = 1}^n {\lambda_1 ^{x_i} \left( {1 - \lambda_1 } \right)^{1 - x_i}}=\lambda_1 ^{4} \left( {1 - \lambda_1 } \right)^{6}=0.00053\\P(x|w_2,\lambda_2)=\prod\limits_{i = 1}^n {\lambda_2 ^{x_i} \left( {1 - \lambda_2 } \right)^{1 - x_i}} = \lambda_2 ^{4} \left( {1 - \lambda_2 } \right)^{6}=0.00095$$and the posterior distribution becomes, $$P(w_1|x) = \frac{0.00053 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} = {0.36}\\P(w_2|x) = \frac{0.00095 \times 0.5}{0.00053 \times 0.5 + 0.00095 \times 0.5} = {0.64}$$and for another new set \(x=\{H,H,H,H,T,H,H,T,H,H\}\), we have, $$P(w_1|x) = {0.988}\\P(w_2|x) = {0.011}$$ We can see that our posterior model measures the tendency of the head bias of new measurements toward the prior bias of the two coins. In the first measurements, we are 0.64 certain that the new \(x\) comes from the second coin, where in the second measurements, we are 0.988 certain that \(x\) comes from the first coin.

Mixture of Bernoulli Distributions

Now, suppose that we have a collection of dime coins \(\mathcal{C}_1\) and another collection of quarter coins \(\mathcal{C}_2\). We also know that within each collection, some coins are fair and others are biased. Consider that we flip each coin 10 times, so we have a collection of sets \(\mathcal{C}_i = \{x_q|q=1:n_i \text{ and } |x_q|=10\}\), \(n_i\) is the number of coins of the \(i\)th collection.

Although we can model each coin individually using a single Bernoulli distribution as we did before, we need to fit a distribution for each collection of coins. If we have a single Bernoulli distribution to describe the flipping process of each coin, we need a way to mix the distributions of all coins within a certain collection. We may also have some coins not flipped in the collection or others with several flipping sets, which requires a mixture distribution that weights the individual coin distributions toward each other.

To be continued ...

Wednesday, July 10, 2013

Network Mounting of Directories For Ubuntu Machines

In this post, I'm going to discuss how to mount a directory from a master machine in a network of machines running Ubuntu. 

1. In the master station, edit the hosts file (/etc/hosts) to include IP addresses of all hosts. Here is an example for 4 machines:     localhost  m0  m1  m2  m3
I will assume that the master machine is m0. So, in each of the slave machines you also need to include the master IP address,     localhost  m0
2. Install NFS (Network File System) which allows the sharing and sync of folders between a set of network connected hosts. You can do this by installing the nfs-server on the master machine,
sudo apt-get install nfs-server
And install the nfs-client on each slave machine,
sudo apt-get install nfs-client
3. Now, we will start the mounting process. You first should locate the folder you need to mount. Let's say that the folder is /helala. We first will ask the master machine to share this folder by editing the /etc/exports file and adding the following line
/helala *(rw,sync)
You can do this in one statement by issuing,
sudo echo  /mirror *(rw,sync) >> /etc/exports
4. We will go now to the other hosts and finalize the mounting by first creating a folder to host the remote shared directory (in my case /helala). Then, we can issue the following command on each slave host,
sudo mount m0:/helala /helala
Unfortunately, we need to run this command on every boot. A better approach is to edit the /etc/fstab file and add the following line,
m0:/helala     /helala    nfs
In this way, the mount will be performed on every boot. Now, you can issue the following command to remount all shared directories,
sudo mount -a

Friday, June 7, 2013

Frame Capture Issue with OpenCV under Windows 7

A serious issue that arises when using OpenCV under windows 7 is the use of opencv_highgui or openCVFrameGrabber for capturing frames from WebCam or a video file. In this case, OpenCv uses video for windows codec which isn’t always working well under windows 7.

A solution to this problem is to use VideoInputFrameGrabber for WebCam which uses DirectShow and FFmpegFrameGrabber with a version of FFmpeg codec to read video files.

Javacv code sample:

FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(new File(video_file_path));
try {
    for (int i = 0; i < 10; i++) {
        opencv_core.IplImage currentFrame = grabber.grab();
} catch (Exception ex) {

Building a DLL With JNI support under Windows (32bit versus 64bit)

Java provides us with JNI to link java projects with C++ code. I’m using Netbeans IDE which has a good C++ support With Cygwin tools and I needed to use a C++ implementation for multilabel graphcut optimization in one of my java projects.
The problem I encountered is the different settings needed to build a DLL library under 32bit and 64bit environments. Here is the settings I used to build under the different environments,

Include: $jdk_home$/include ; $jdk_home$/include/win32

I used the g++ compiler with the following options
-mno-cygwin -Wl,—add-stdcall-alias -shared -m32

I used the x86_64-w64-mingw32-g++ compiler with the following options
-w -mno-cygwin -static -shared