Topics in AI: The EM Algorithm

I’ve decided to do a series on topics associated with artificial intelligence, mainly because there are two problems with the field. The first is the inability to distinguish what I call “high AI” from “low AI.” High AI concerns itself with the mimicking of human behavior: emotional interpretation, creativity, adaptive interaction, logical deduction. The goal here is to eventually produce something akin to Clarke’s HAL. Low AI is more pragmatic, with goals of reducing effort in solving complex problems by providing heuristics that guide the algorithm to a solution. Low AI is concerned with search and discovery.

The other reason is the opaqueness and unwarranted reverence some people have towards the algorithms of AI. I’ve already discussed Bayesian filtering which is becoming as ubiquitous as the spam it tries to defeat. The so-called “dictionary attacks” spammers are currently using suddenly seem fruitless when one understands the essence of the Bayesian approach.

Today’s discussion is in three parts. The first part on the normal distribution can be skipped; it exists only to provide some background information on common probability distributions in Euclidean space so that the interactive example provided in the second section can be better understood. The third section brings us back once more to spam filtering.

Continuous Variables & the Normal Distribution

In describing the probabilities used in spam filtering, P(nigeria | ¬spam) denoted the probability that the word “nigeria” from our vocabulary set appears when the message has been categorized as not being spam (from the two-membered boolean set { spam, ¬spam }). The implication here was that this notation was shorthand for the more verbose P(word = nigeria | class = ¬spam).

How do we denote the probability that a dart thrown by a bar patron lands exactly one inch away from the center of the dartboard? P(distance = 1in), right?

Wrong!

The problem lies with the nature of the variable in question, and in some ways with the question itself. First off, it’s almost impossible to be sure that the measurement can be exact; all scientists hedge their bets by stating that measurements are within a given range, because no ruler has infinite precision. But let us assume for the sake of argument that we have such a “magic ruler” with said infinite precision.

Since there are an uncountable number of distances from the center where the dart may land, every probability for a specific distance is effectively zero, making the question ridiculous to ask. That holds true for most situations where the variable is continuous; that is, takes on real-numbered values.

The physical analogue of this is the concept of mass versus density. Suppose you are given a solid metal block, one centimeter on each side and weighing 10 grams, and asked to find the mass at the center. There is no mass at the center; the center is a position and mass is associated with objects. There is, however, a closely related measurable quantity that is applicable to position: density.

Density is defined by lim_∆V→0 mass(∆V) / ∆V, where ∆V is the volume of substance surrounding a location. Equivalently, we can define a probability density for a given value by taking the limit of a range as the size of the range goes to zero: p(x) = lim_∆x→0 P(y ∈ (x, x + ∆x)) / ∆x.

The most famous probability density function of all is the Normal (or Gaussian) Distribution. This is the infamous “bell-shaped” curve much beloved by statisticians and college students of mediocre capability. It is generated by a formidable calculation involving exponentials and, in the case of vector variables, matrix algebra. What is important is that its shape is governed by two (possibly vector-valued) parameters: the mean, which specifies the location of the center of the distribution; and the covariance, which specifies how the distribution is shaped. From any given set of data, if the underlying distribution is believed to be Gaussian, then these parameters may be readily calculated and a model constructed. (The parameters may be calculated even if the underlying distribution is not Gaussian; the generated model will simply not be accurate.)

There are a few points of interest about the normal distribution. First, the probability density is greatest at the mean. Second, the density decreases with distance from the mean. Third, the density never goes to zero. Like the electron cloud of quantum mechanics, its range is disturbingly infinite although it maintains most of its density near the center.

The Expectation-Maximization Algorithm

The applet demonstration (go here if it fails to load; open in a new window if possible) generates one or more sets of data according to the normal distribution when the “Generate Data” button is clicked. Try it now. Note that the center of the “cloud” and its shape can vary widely (ignore the circle for now). Try clicking a few more times to generate different distributions.

Unfortunately, this applet doesn’t display the underlying model used to generate the data (I was lazy). But it was mentioned above that it is easy to estimate the model (specifically, the parameters of the model). Make sure that only one class is to be generated by setting the pop-up menu to “1.” Click “Generate Data.” Then click “Run EM Algorithm.”

The circle moves so that its center is the same as the center of the cluster and changes shape to reflect the shape of the cluster. (If the cluster is tightly packed, it may be difficult to see the shape.) This ellipse reflects the algorithm’s estimation of the generating model.

The problem arises when more than one generating model is involved. Suppose we have two unknown models, A and B. For any given data point, it was generated by model A or model B, but we don’t know which.

If we knew the source of each point, then the solution would be simple: separate the points into subsets and only use the subset associated with that model to calculate its parameters.

If we knew the models, we could guess the providence of the point by assigning it to the most likely generator. How? By using Bayes’ Theorem of course! Our decision rule is that we assign the point x to A if and only if P(A|x) > P(B|x). Bayes’ Theorem says that P(A|x) = p(x|A) ⋅ P(A) / p(x). The evidence p(x), being a non-zero term common to both sides, can be eliminated from the inequality. If both generators are equally likely, then the prior probabilities P(A) and P(B) can be removed from the decision rule as well. Now the decision rule is to choose A if and only if p(x|A) > p(x|B): in English, we choose the model that has the greater probability density for that point.

But when both the source of the data points and the parameters of the underlying models are unknown, we find ourselves in a Catch-22: we need the models to figure out the labels for the data, and we need the data correctly labeled to estimate the models.

Expectation-Maximization solves this problem using an iterative approach, designed in such a way that the likelihood of the parameters increases with each iteration until some local maximum is reached. Each iteration consists of two steps: expectation, in which the unknowns are estimated, and maximization, in which the parameters of the models are updated to maximize the likelihood of the data.

Use the pop-up menu to generate more than one cluster. Then click on “Run EM Algorithm” to see a single iteration of the algorithm. Repeated clicking will result in the estimated models (the ellipses) converging on the clusters. Here’s how it works:

The unknowns are the sources of the data points. Since any model can possibly be the source (the normal distribution has a non-zero density everywhere) we represent membership in a category as a fraction rather than a boolean. (It could be a boolean—the problem is that the algorithm takes longer to converge onto a solution.) In the expectation step, p(x|M_i) for each model M_i is calculated via Bayes’ Theorem above. The fraction of that point belonging to a model is proportional to the calculated likelihood for that model.

Now that the data points have been labeled, the models need to be updated. Note that no data point fully belongs to any one model; when calculating the parameters, the fractional membership is honored. Once the models have been updated, the next iteration may proceed. (In the applet, the color of the data point represents the category for which the point has the highest membership.)

The EM algorithm is powerful, but it has its limitations. It can easily get trapped in a local maxima which makes no sense to the high-altitude observer. Often times it will subdivide a single cluster into multiple smaller clusters while coalescing small clusters into a larger one. If this happens, the only way to proceed is to reset one or more of the models. (In the applet, using the pop-up menu to temporarily decrease and increase the number of models will reset them.)

Sometimes a model will become fixated on a point, decreasing its size until its data set consists only of that single point. Since it predicts that point with 100% accuracy, it is the most likely (in fact, infinitely likely) model for that point. This is a singular solution, and is difficult to avoid. Again, resetting these degenerate models is the only option.

Finally, the problem of model mismatch is paramount. If the model does not match the actual generating distribution, the estimated parameters will be woefully inaccurate. For instance, in the normal mixture the number of generating models must be supplied. If two normal models are used to generate the data, but six models are used for parameter estimation, the results will be inaccurate. Patterns will be located where there are none; single categories will be unnecessarily subdivided into separate categories. This problem can be illustrated in the applet by generating data using a small number of categories (1 or 2) then running EM with a large number of categories (5 or 6).

Combining the EM Algorithm with Bag-of-Words Analysis

The problem with describing the EM algorithm (and in this I too am guilty) is that everyone seems to apply it to a normal mixture model, leading to the common erroneous belief that it is only good for problems that reside in Euclidean space. This is not the case.

Take for example our naïve Bayes spam filter. The unspoken assumption was that the corpus had been pre-labeled with each message as “spam” or “not spam.” This is a reasonable assumption; for example, Apple’s Mail.app requires people to “train” the system before performing automatic spam filtering. (Less than reasonable is going through 300 back-issued bear tests and labeling each one. Oy.)

Under some situations, manual labeling may not be convenient or even possible: a popular forum may want to classify messages to see if they are on-topic, humorous, informative, et cetera; a massive hierarchical bookmarks list may want to file a URL according to its subject matter; a technology trends analyst may want to see if new topics are spreading virulently through the blogosphere. Unsupervised text classification is the goal here.

EM may not be the best solution, but it can be applied and that will illustrate the versatility of the algorithm. First, we need to figure out how to map the algorithm onto our problem space. The underlying model structure is naïve Bayes, the unknowns are the memberships in each class (fractional, like above), and the parameters are the likelihoods of each vocabulary word conditional on each class.

Whoa. That is a lot of parameters. Unfortunately, those are the tunable parameters of a bag-of-words model; nothing lesser will suffice.

The E and M steps are straightforward. In the E-step, we assign a category to each document (not word) using the current model. In the M-step, we then calculate the likelihoods of each word using the approaches described in the article on Bayesian filtering. The idea here is to discover groups of words that are likely to appear in a document together; manual inspection of prototypical documents could then assign a reasonable human interpretation to the clustering.