Wednesday, 24 March 2010

Hidden Markov Model-Based Sequence Classifiers in C#

Distinct Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of those models and specialize each model to recognize each of the separate classes, we can then use each model's ability to calculate the likelihood that a given sequence belongs to it in order to determine the most likely class of an unknown sequence.







This post is a continuation of the previous article, Hidden Markov Models in C#.


 



The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.


Introduction

Hidden Markov Models

An introduction to Hidden Markov Models (HMM) was presented in a previous article, entitled Hidden Markov Models in C#. HMMs are stochastic methods to model temporal and sequence data.



They allow, among other things, (1) inferring the most likely sequence of states that produced a given output sequence, (2) inferring which will be the most likely next state (and thus predicting the next output), and (3) calculating the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification). In the context of this article, we will be mostly interested in ability (3).



Summary for Hidden Markov Models



 



Hidden Markov Models can be seen as finite state machines where, for each sequence unit observation, there is a state transition and, for each state, there is an output symbol emission. The picture on the left summarizes the overall definition of an HMM given in the previous article.



 


 



Hidden Markov Model-Based Sequence Classifier


Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of models and specialize each model to recognize each of the separate classes, we can then exploit each HMM's ability to calculate the likelihood that a given sequence belongs to it in order to perform classification of discrete sequences.

After all models have been trained, the probability of the unknown-class sequence can be computed for each model. As each model is specialized in a given class, the one which outputs the highest probability can be used to determine the most likely class for the new sequence, as shown in the picture below.



Evaluation structure for a classifier based on Hidden Markov Models


Schematic representation of the sequence classification procedure.



 



Source Code

The implementation is very straightforward, as shown in the picture below. The Tag property of the HiddenMarkovModel class can be used to store additional information about which class the model will be representing.

Class diagram for the HMM-based sequence classifier.
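As a minimal sketch of how the Tag property could be used (the labels below are only illustrative and assume direct access to the classifier's internal array of models):

// Illustrative sketch only: label each model with the class it represents.
models[0].Tag = "Class 0 - sequences starting with zero";
models[1].Tag = "Class 1 - sequences starting with one";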

Computing the most likely class for a given sequence

Computing the most likely class for a given sequence is as simple as iterating through all models in the set and selecting the one which has produced the highest likelihood value.

/// <summary>
/// Computes the most likely class for a given sequence.
/// </summary>
public int Compute(int[] sequence, out double likelihood)
{
    int label = 0;
    likelihood = 0.0;

    // For every model in the set,
    for (int i = 0; i < models.Length; i++)
    {
        // Evaluate the probability for the given sequence
        double p = models[i].Evaluate(sequence);

        // And select the one which produces the highest probability
        if (p > likelihood)
        {
            label = i;
            likelihood = p;
        }
    }

    // Returns the index of the most likely model.
    return label;
}


 



Training each model to recognize each of the output classes


To train each model to recognize each of the output classes, we have to split the training data into subsets whose outputs correspond to each of the classes of the classification problem.



/// <summary>
/// Trains each model to recognize each of the output labels.
/// </summary>
/// <returns>The sum likelihood for all models after training.</returns>
public double Learn(int[][] inputs, int[] outputs, int iterations, double limit)
{
    double sum = 0;

    // For each model,
    for (int i = 0; i < models.Length; i++)
    {
        // Select the input/output set corresponding
        // to the model's specialization class
        int[] inx = outputs.Find(y => y == i);
        int[][] observations = inputs.Submatrix(inx);

        // Train the current model in the input/output subset
        sum += models[i].Learn(observations, iterations, limit);
    }

    // Returns the sum likelihood for all models.
    return sum;
}


 



Using the Code


Code usage is very simple. Once the input data is available in the form of a sequence array and the output data in the form of an integer array, create a new HiddenMarkovClassifier object with the appropriate parameters.



Then, just call Learn() passing the input and output data and the convergence limit for the learning algorithm. After that, call Compute() passing a new integer sequence to be classified by the system.




// Declare some testing data
int[][] inputs = new int[][]
{
    new int[] { 0,1,1,0 },   // Class 0
    new int[] { 0,0,1,0 },   // Class 0
    new int[] { 0,1,1,1,0 }, // Class 0
    new int[] { 0,1,0 },     // Class 0

    new int[] { 1,0,0,1 },   // Class 1
    new int[] { 1,1,0,1 },   // Class 1
    new int[] { 1,0,0,0,1 }, // Class 1
    new int[] { 1,0,1 },     // Class 1
};

int[] outputs = new int[]
{
    0,0,0,0, // First four sequences are of class 0
    1,1,1,1, // Last four sequences are of class 1
};


// We are trying to predict two different classes
int classes = 2;

// Each sequence may have up to two symbols (0 or 1)
int symbols = 2;

// Nested models will have two states each
int[] states = new int[] { 2, 2 };

// Creates a new Hidden Markov Model Classifier with the given parameters
HiddenMarkovClassifier hmc = new HiddenMarkovClassifier(classes, symbols, states);


// Will train until convergence of the average likelihood
double likelihood = hmc.Learn(inputs, outputs, 0.001);


// Will assert the models have learned the sequences correctly.
for (int i = 0; i < inputs.Length; i++)
{
    int expected = outputs[i];
    int actual = hmc.Compute(inputs[i], out likelihood);
    Assert.AreEqual(expected, actual);
}


 



Sample Application


The accompanying sample application can read Excel spreadsheets containing integer sequences and labels for those sequences. An example spreadsheet contained inside the Resources folder in the compressed archive file demonstrates the expected format for this data.










Left: Input sequences data. Right: Hidden Markov Models contained in the Sequence Classifier with initial configurations.









Left: Trained models. Right: Performance test for the Hidden Markov Model Sequence Classifier.



 




Tuesday, 23 March 2010

Hidden Markov Models in C#

Hidden Markov Models (HMM) are stochastic methods to model temporal and sequence data. They are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.





The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents



  1. Introduction

  2. Definition

    1. Notation

    2. Canonical problems

    3. Choosing the structure



  3. Algorithms

    1. Evaluation

    2. Decoding

    3. Learning



  4. Using the code

  5. Remarks

    1. Known issues



  6. Acknowledgements

  7. See also

  8. References



 



Introduction



Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. Indeed, one of the most comprehensive explanations on the topic was published in “A Tutorial On Hidden Markov Models And Selected Applications in Speech Recognition”, by Lawrence R. Rabiner in 1989. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of bioinformatics.

Dynamical systems of a discrete nature, assumed to be governed by a Markov chain, emit a sequence of observable outputs. Under the Markov assumption, it is also assumed that the latest output depends only on the current state of the system. Such states are often not known to the observer when only the output values are observable.



Example of a hidden Markov model

Hidden Markov Models attempt to model such systems and allow, among other things, (1) inferring the most likely sequence of states that produced a given output sequence, (2) inferring which will be the most likely next state (and thus predicting the next output), and (3) calculating the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification).

The “hidden” in Hidden Markov Models comes from the fact that the observer does not know which state the system is in, but only has a probabilistic insight into where it should be.

 



Definition


Hidden Markov Models can be seen as finite state machines where, for each sequence unit observation, there is a state transition and, for each state, there is an output symbol emission.



Notation


Traditionally, HMMs have been defined by the following quintuple:

\lambda = (N, M, A, B, \pi)

where

  • N is the number of states for the model
  • M is the number of distinct observations symbols per state, i.e. the discrete alphabet size.
  • A is the NxN state transition probability distribution given in the form of a matrix A = {aij}
  • B is the NxM observation symbol probability distribution given in the form of a matrix B = {bj(k)}
  • π is the initial state distribution vector π = {πi}

 

Note that if we omit the structure parameters M and N, we have the more often used compact notation

\lambda = (A, B, \pi)
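To make the notation concrete, the sketch below writes down a small two-state model over a binary alphabet as plain C# arrays. The numbers are arbitrary and only illustrate the shapes of A (N×N), B (N×M) and π (N):

// Illustrative values only: N = 2 states, M = 2 symbols.
double[,] A = new double[,]   // state transition matrix (N x N)
{
    { 0.7, 0.3 },
    { 0.4, 0.6 }
};

double[,] B = new double[,]   // observation symbol matrix (N x M)
{
    { 0.9, 0.1 },
    { 0.2, 0.8 }
};

double[] pi = new double[] { 0.6, 0.4 };  // initial state distribution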



Canonical problems


There are three canonical problems associated with hidden Markov models, which I'll quote from Wikipedia:




  1. Given the parameters of the model, compute the probability of a particular output sequence. This requires summation over all possible state sequences, but can be done efficiently using the Forward algorithm, which is a form of dynamic programming.

  2. Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, but can similarly be solved efficiently by the Viterbi algorithm.

  3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, derive the maximum likelihood estimate of the parameters of the HMM given a dataset of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-maximization algorithm.



The solutions to those problems are exactly what makes Hidden Markov Models useful. The ability to learn from the data (using the solution of problem 3) and then become able to make predictions (solution to problem 2) and to classify sequences (solution of problem 1) is nothing but applied machine learning. From this perspective, HMMs can be seen as supervised sequence classifiers and sequence predictors with some other useful and interesting properties.



Choosing the structure


Choosing the structure for a hidden Markov model is not always obvious. The number of states depends on the application and on what interpretation one is willing to give to the hidden states. Some domain knowledge is required to build a suitable model and also to choose the initial parameters that an HMM can take. There is also some trial and error involved, and there are sometimes complex tradeoffs to be made between model complexity and difficulty of learning, just as is the case with most machine learning techniques.

Additional information can be found on http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node12.html.



 



Algorithms


The solutions to the three canonical problems are the algorithms that make HMMs useful. Each of the three problems is described in the three subsections below.



Evaluation


The first canonical problem is the evaluation of the probability of a particular output sequence. It can be efficiently computed using either the Viterbi or the Forward algorithm, both of which are forms of dynamic programming.



The Viterbi algorithm originally computes the most likely sequence of states that originated a sequence of observations. In doing so, it is also able to return the probability of traversing this particular sequence of states. So, to obtain Viterbi probabilities, please refer to the Decoding problem described below.



The Forward algorithm, unlike the Viterbi algorithm, does not find a particular sequence of states; instead it computes the probability that any sequence of states has produced the sequence of observations. In both algorithms, a matrix is used to store computations about the possible state sequence paths that the model can assume. The forward algorithm also plays a key role in the Learning problem, and is thus implemented as a separate method.







/// <summary>
/// Calculates the probability that this model has generated the given sequence.
/// </summary>
/// <remarks>
/// Evaluation problem. Given the HMM M = (A, B, pi) and the observation
/// sequence O = {o1, o2, ..., oK}, calculate the probability that model
/// M has generated sequence O. This can be computed efficiently using
/// either the Viterbi or the Forward algorithms.
/// </remarks>
/// <param name="observations">
/// A sequence of observations.
/// </param>
/// <param name="logarithm">
/// True to return the log-likelihood, false to return
/// the likelihood. Default is false.
/// </param>
/// <returns>
/// The probability that the given sequence has been generated by this model.
/// </returns>
public double Evaluate(int[] observations, bool logarithm)
{
    if (observations == null)
        throw new ArgumentNullException("observations");

    if (observations.Length == 0)
        return 0.0;


    // Forward algorithm
    double likelihood = 0;
    double[] coefficients;

    // Compute forward probabilities
    forward(observations, out coefficients);

    for (int i = 0; i < coefficients.Length; i++)
        likelihood += Math.Log(coefficients[i]);

    // Return the sequence probability
    return logarithm ? likelihood : Math.Exp(likelihood);
}

/// <summary>
/// Baum-Welch forward pass (with scaling)
/// </summary>
/// <remarks>
/// Reference: http://courses.media.mit.edu/2010fall/mas622j/ProblemSets/ps4/tutorial.pdf
/// </remarks>
private double[,] forward(int[] observations, out double[] c)
{
    int T = observations.Length;
    double[] pi = Probabilities;
    double[,] A = Transitions;

    double[,] fwd = new double[T, States];
    c = new double[T];


    // 1. Initialization
    for (int i = 0; i < States; i++)
        c[0] += fwd[0, i] = pi[i] * B[i, observations[0]];

    if (c[0] != 0) // Scaling
    {
        for (int i = 0; i < States; i++)
            fwd[0, i] = fwd[0, i] / c[0];
    }


    // 2. Induction
    for (int t = 1; t < T; t++)
    {
        for (int i = 0; i < States; i++)
        {
            double p = B[i, observations[t]];

            double sum = 0.0;
            for (int j = 0; j < States; j++)
                sum += fwd[t - 1, j] * A[j, i];
            fwd[t, i] = sum * p;

            c[t] += fwd[t, i]; // scaling coefficient
        }

        if (c[t] != 0) // Scaling
        {
            for (int i = 0; i < States; i++)
                fwd[t, i] = fwd[t, i] / c[t];
        }
    }

    return fwd;
}
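As a short usage sketch (assuming hmm is an already trained HiddenMarkovModel instance), the evaluation method above could be called as follows:

// Usage sketch; hmm is assumed to be an already trained model.
int[] sequence = new int[] { 0, 1, 1, 1 };

double probability = hmm.Evaluate(sequence, false);   // plain likelihood
double logLikelihood = hmm.Evaluate(sequence, true);  // log-likelihood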



 



Decoding


The second canonical problem is the discovery of the most likely sequence of states that generated a given output sequence. This can be computed efficiently using the Viterbi algorithm. A traceback is used to detect the maximum probability path travelled by the algorithm. The probability of traversing such a sequence is also computed in the process.




/// <summary>
/// Calculates the most likely sequence of hidden states
/// that produced the given observation sequence.
/// </summary>
/// <remarks>
/// Decoding problem. Given the HMM M = (A, B, pi) and the observation sequence
/// O = {o1,o2, ..., oK}, calculate the most likely sequence of hidden states Si
/// that produced this observation sequence O. This can be computed efficiently
/// using the Viterbi algorithm.
/// </remarks>
/// <param name="observations">A sequence of observations.</param>
/// <param name="probability">The state optimized probability.</param>
/// <returns>The sequence of states that most likely produced the sequence.</returns>
public int[] Decode(int[] observations, out double probability)
{
    // Viterbi algorithm.

    int T = observations.Length;
    int states = States;
    int minState;
    double minWeight;
    double weight;

    int[,] s = new int[states, T];
    double[,] a = new double[states, T];


    // Base
    for (int i = 0; i < states; i++)
    {
        a[i, 0] = (-1.0 * System.Math.Log(pi[i])) - System.Math.Log(B[i, observations[0]]);
    }

    // Induction
    for (int t = 1; t < T; t++)
    {
        for (int j = 0; j < states; j++)
        {
            minState = 0;
            minWeight = a[0, t - 1] - System.Math.Log(A[0, j]);

            for (int i = 1; i < states; i++)
            {
                weight = a[i, t - 1] - System.Math.Log(A[i, j]);

                if (weight < minWeight)
                {
                    minState = i;
                    minWeight = weight;
                }
            }

            a[j, t] = minWeight - System.Math.Log(B[j, observations[t]]);
            s[j, t] = minState;
        }
    }


    // Find minimum value for time T-1
    minState = 0;
    minWeight = a[0, T - 1];

    for (int i = 1; i < states; i++)
    {
        if (a[i, T - 1] < minWeight)
        {
            minState = i;
            minWeight = a[i, T - 1];
        }
    }

    // Traceback
    int[] path = new int[T];
    path[T - 1] = minState;

    for (int t = T - 2; t >= 0; t--)
        path[t] = s[path[t + 1], t + 1];


    probability = System.Math.Exp(-minWeight);
    return path;
}
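A short usage sketch (again assuming hmm is an already trained model) could look like:

// Usage sketch; hmm is assumed to be an already trained model.
double probability;
int[] path = hmm.Decode(new int[] { 0, 1, 1, 1 }, out probability);

// path[t] holds the most likely hidden state at time t, and
// probability holds the likelihood of this particular state path.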


 



Learning


The third and last problem is the problem of learning the most likely parameters that best model a system, given a set of sequences originated from this system. Most implementations I have seen do not consider the problem of learning from a set of sequences, but only from a single sequence at a time. The algorithm below, however, is fully suitable to learn from a set of sequences, and also uses scaling, which is another thing I have not seen in other implementations.



The source code follows the original algorithm by Rabiner (1989). There are, however, some known issues with the algorithms detailed in Rabiner's paper. More information about those issues is available in a later section of this article entitled “Remarks”.









/// <summary>
/// Runs the Baum-Welch learning algorithm for hidden Markov models.
/// </summary>
/// <remarks>
/// Learning problem. Given some training observation sequences O = {o1, o2, ..., oK}
/// and general structure of HMM (numbers of hidden and visible states), determine
/// HMM parameters M = (A, B, pi) that best fit training data.
/// </remarks>
/// <param name="iterations">
/// The maximum number of iterations to be performed by the learning algorithm. If
/// specified as zero, the algorithm will learn until convergence of the model average
/// likelihood respecting the desired limit.
/// </param>
/// <param name="observations">
/// An array of observation sequences to be used to train the model.
/// </param>
/// <param name="tolerance">
/// The likelihood convergence limit L between two iterations of the algorithm. The
/// algorithm will stop when the change in the likelihood for two consecutive iterations
/// has not changed by more than L percent of the likelihood. If left as zero, the
/// algorithm will ignore this parameter and iterates over a number of fixed iterations
/// specified by the previous parameter.
/// </param>
/// <returns>
/// The average log-likelihood for the observations after the model has been trained.
/// </returns>
public double Learn(int[][] observations, int iterations, double tolerance)
{
    if (iterations == 0 && tolerance == 0)
        throw new ArgumentException("Iterations and limit cannot be both zero.");

    // Baum-Welch algorithm.

    // The Baum-Welch algorithm is a particular case of a generalized expectation-maximization
    // (GEM) algorithm. It can compute maximum likelihood estimates and posterior mode estimates
    // for the parameters (transition and emission probabilities) of an HMM, when given only
    // emissions as training data.

    // The algorithm has two steps:
    // - Calculating the forward probability and the backward probability for each HMM state;
    // - On the basis of this, determining the frequency of the transition-emission pair values
    //   and dividing it by the probability of the entire string. This amounts to calculating
    //   the expected count of the particular transition-emission pair. Each time a particular
    //   transition is found, the value of the quotient of the transition divided by the probability
    //   of the entire string goes up, and this value can then be made the new value of the transition.


    int N = observations.Length;
    int currentIteration = 1;
    bool stop = false;

    double[] pi = Probabilities;
    double[,] A = Transitions;


    // Initialization
    double[][,,] epsilon = new double[N][,,]; // also referred as ksi or psi
    double[][,] gamma = new double[N][,];

    for (int i = 0; i < N; i++)
    {
        int T = observations[i].Length;
        epsilon[i] = new double[T, States, States];
        gamma[i] = new double[T, States];
    }


    // Calculate initial model log-likelihood
    double oldLikelihood = Double.MinValue;
    double newLikelihood = 0;


    do // Until convergence or max iterations is reached
    {
        // For each sequence in the observations input
        for (int i = 0; i < N; i++)
        {
            var sequence = observations[i];
            int T = sequence.Length;
            double[] scaling;

            // 1st step - Calculating the forward probability and the
            //            backward probability for each HMM state.
            double[,] fwd = forward(observations[i], out scaling);
            double[,] bwd = backward(observations[i], scaling);


            // 2nd step - Determining the frequency of the transition-emission pair values
            //            and dividing it by the probability of the entire string.


            // Calculate gamma values for next computations
            for (int t = 0; t < T; t++)
            {
                double s = 0;

                for (int k = 0; k < States; k++)
                    s += gamma[i][t, k] = fwd[t, k] * bwd[t, k];

                if (s != 0) // Scaling
                {
                    for (int k = 0; k < States; k++)
                        gamma[i][t, k] /= s;
                }
            }

            // Calculate epsilon values for next computations
            for (int t = 0; t < T - 1; t++)
            {
                double s = 0;

                for (int k = 0; k < States; k++)
                    for (int l = 0; l < States; l++)
                        s += epsilon[i][t, k, l] = fwd[t, k] * A[k, l] * bwd[t + 1, l] * B[l, sequence[t + 1]];

                if (s != 0) // Scaling
                {
                    for (int k = 0; k < States; k++)
                        for (int l = 0; l < States; l++)
                            epsilon[i][t, k, l] /= s;
                }
            }

            // Compute log-likelihood for the given sequence
            for (int t = 0; t < scaling.Length; t++)
                newLikelihood += Math.Log(scaling[t]);
        }


        // Average the likelihood for all sequences
        newLikelihood /= observations.Length;


        // Check if the model has converged or we should stop
        if (checkConvergence(oldLikelihood, newLikelihood,
            currentIteration, iterations, tolerance))
        {
            stop = true;
        }
        else
        {
            // 3. Continue with parameter re-estimation
            currentIteration++;
            oldLikelihood = newLikelihood;
            newLikelihood = 0.0;


            // 3.1 Re-estimation of initial state probabilities
            for (int k = 0; k < States; k++)
            {
                double sum = 0;
                for (int i = 0; i < N; i++)
                    sum += gamma[i][0, k];
                pi[k] = sum / N;
            }

            // 3.2 Re-estimation of transition probabilities
            for (int i = 0; i < States; i++)
            {
                for (int j = 0; j < States; j++)
                {
                    double den = 0, num = 0;

                    for (int k = 0; k < N; k++)
                    {
                        int T = observations[k].Length;

                        for (int l = 0; l < T - 1; l++)
                            num += epsilon[k][l, i, j];

                        for (int l = 0; l < T - 1; l++)
                            den += gamma[k][l, i];
                    }

                    A[i, j] = (den != 0) ? num / den : 0.0;
                }
            }

            // 3.3 Re-estimation of emission probabilities
            for (int i = 0; i < States; i++)
            {
                for (int j = 0; j < Symbols; j++)
                {
                    double den = 0, num = 0;

                    for (int k = 0; k < N; k++)
                    {
                        int T = observations[k].Length;

                        for (int l = 0; l < T; l++)
                        {
                            if (observations[k][l] == j)
                                num += gamma[k][l, i];
                        }

                        for (int l = 0; l < T; l++)
                            den += gamma[k][l, i];
                    }

                    // avoid locking a parameter in zero.
                    B[i, j] = (num == 0) ? 1e-10 : num / den;
                }
            }
        }

    } while (!stop);


    // Returns the model average log-likelihood
    return newLikelihood;
}



 



Using the code


Let's suppose we have gathered some sequences from a system we wish to model. The sequences are expressed as an integer array such as:




int[][] sequences = new int[][]
{
    new int[] { 0,1,1,1,1,1,1 },
    new int[] { 0,1,1,1 },
    new int[] { 0,1,1,1,1 },
    new int[] { 0,1 },
    new int[] { 0,1,1 },
};



For us, it is easy to see that the system is outputting sequences that always start with a zero and have one or more ones at the end. But let's try to fit a Hidden Markov Model to predict those sequences.




// Creates a new Hidden Markov Model with 2 states for
// an output alphabet of two characters (zero and one)
HiddenMarkovModel hmm = new HiddenMarkovModel(2, 2);

// Try to fit the model to the data until the difference in
// the average likelihood changes only by as little as 0.01
hmm.Learn(sequences, 0.01);



Once the model is trained, let's test it to see if it recognizes some sequences:




// Calculate the probability that the given
// sequences originated from the model
double l1 = hmm.Evaluate(new int[] { 0, 1 }); // l1 = 0.9999
double l2 = hmm.Evaluate(new int[] { 0, 1, 1, 1 }); // l2 = 0.9999

double l3 = hmm.Evaluate(new int[] { 1, 1 }); // l3 = 0.0000
double l4 = hmm.Evaluate(new int[] { 1, 0, 0, 0 }); // l4 = 0.0000



Of course the model performs well, as this is a rather simple example. A more useful test case would consist of allowing for some errors in the input sequences, in the hope that the model will become more tolerant to measurement errors.




int[][] sequences = new int[][]
{
    new int[] { 0,1,1,1,1,0,1,1,1,1 },
    new int[] { 0,1,1,1,0,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
};

// Creates a new Hidden Markov Model with 3 states for
// an output alphabet of two characters (zero and one)
HiddenMarkovModel hmm = new HiddenMarkovModel(2, 3);

// Try to fit the model to the data until the difference in
// the average likelihood changes only by as little as 0.0001
hmm.Learn(sequences, 0.0001);

// Calculate the probability that the given
// sequences originated from the model
double l1 = hmm.Evaluate(new int[] { 0,1 }); // 0.9999
double l2 = hmm.Evaluate(new int[] { 0,1,1,1 }); // 0.9166

double l3 = hmm.Evaluate(new int[] { 1,1 }); // 0.0000
double l4 = hmm.Evaluate(new int[] { 1,0,0,0 }); // 0.0000

double l5 = hmm.Evaluate(new int[] { 0,1,0,1,1,1,1,1,1 }); // 0.0342
double l6 = hmm.Evaluate(new int[] { 0,1,1,1,1,1,1,0,1 }); // 0.0342



We can see that, despite having a very low probability, the likelihood values for the sequences containing a simulated measurement error are greater than the likelihoods for the sequences which do not follow the sequence structure at all.



In a subsequent article, we will see that those low values for the likelihoods will not be a problem because HMMs are often used in sets to form sequence classifiers. When used in such configurations, what really matters is which HMM returns the highest probability among others in the set.



 



Remarks


A practical issue in the use of Hidden Markov Models to model long sequences is the numerical scaling of conditional probabilities. The probability of observing a long sequence given most models is extremely small, and the use of these extremely small numbers in computations often leads to numerical instability, making application of HMMs to genome length sequences quite challenging.



There are two common approaches to dealing with small conditional probabilities. One approach is to rescale the conditional probabilities using carefully designed scaling factors, and the other approach is to work with the logarithms of the conditional probabilities. For more information on using logarithms please see the work entitled “Numerically Stable Hidden Markov Model Implementation”, by Tobias P. Mann.
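The implementation in this article takes the scaling approach, but the Evaluate method shown earlier can also return the log-likelihood directly, which is the safer quantity to compare for long sequences. A minimal sketch (the sequence variables are only placeholders):

// longSequenceA and longSequenceB are placeholder int[] sequences.
// Compare models in log-space to avoid numerical underflow.
double logL1 = hmm.Evaluate(longSequenceA, true);
double logL2 = hmm.Evaluate(longSequenceB, true);

// Log-likelihoods can be compared directly, even when the
// corresponding probabilities would underflow to zero.
bool firstIsMoreLikely = logL1 > logL2;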



Known issues


The code in this article is based on the tutorial by Rabiner. There are, however, some problems with the scaling and other algorithms. An erratum describing all issues is available on the website “An Erratum for ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’”, maintained by Ali Rahimi. I have not yet verified whether the implementation presented here also suffers from the mistakes explained there. This code has worked well in many situations, but I cannot guarantee it is free of such issues. Please use it at your own risk.



Acknowledgements


Thanks to Guilherme C. Pedroso, for the help with the Baum-Welch generalization for multiple input sequences. He has also co-written a very interesting article using hidden Markov models for gesture recognition, entitled “Automatic Recognition of Finger Spelling for LIBRAS based on a Two-Layer Architecture” published in the 25th Symposium On Applied Computing (ACM SAC 2010).



Where to find a power cord for notebook power supplies

Usually, when buying a new power supply, especially over the internet, what arrives is only the power supply itself, without the power cord that connects it to the wall outlet. In this case, it has to be purchased separately. This is not much of a nuisance as long as someone in your city knows where to find this blessed cable, which is not the case in Valinhos, for example, where I was told I would never find it.

Well, if you live in São Carlos, the good news is that it can be found at:

Eletrônica Gaspar

Av. São Carlos, 2615, São Carlos-SP
Phone: (16) 3371-4014 / 3371-3412

 

This is the second time I have posted Gaspar's address on this blog. An interesting detail is that at Gaspar you can find two types of cable: the two-pin and the three-pin one.

Two-pin cable

The two-pin cable, commonly used in audio equipment, costs R$7.00, but does not fit perfectly into the power supply. However, with some force (or some carving of the plastic connector) it is possible to make it fit. It is worthwhile if, for example, you do not have a ground wire in your house and are already using a three-pin adapter to charge your notebook. It is also an option if you cannot find the three-pin cable in your city, since this one should be available at any repair shop or electronics store.

Three-pin cable

The three-pin cable, on the other hand, fits perfectly but cost R$21.00. These prices are from February 2010 at Gaspar, but I do not believe they vary much from store to store or over the years.

 

In my case, I decided to buy the two-pin one to avoid always having to carry an adapter in my bag for the (almost all) cases in which there is no grounded outlet available in the places I go. After some work to make it fit, it works perfectly with my HP power supply imported from DealExtreme.

Wednesday, 17 March 2010

Kernel Functions for Machine Learning Applications

In recent years, Kernel methods have received major attention, particularly due to the increased popularity of the Support Vector Machines. Kernel functions can be used in many applications as they provide a simple bridge from linearity to non-linearity for algorithms which can be expressed in terms of dot products. In this article, we will list a few kernel functions and some of their properties.



Many of these functions have been incorporated into Accord.NET, an extension framework for the popular AForge.NET Framework which also includes many other statistics and machine learning tools.

Contents



  1. Kernel Methods

    1. The Kernel Trick

    2. Kernel Properties

    3. Choosing the Right Kernel



  2. Kernel Functions

    1. Linear Kernel

    2. Polynomial Kernel

    3. Gaussian Kernel

    4. Exponential Kernel

    5. Laplacian Kernel

    6. ANOVA Kernel

    7. Hyperbolic Tangent (Sigmoid) Kernel

    8. Rational Quadratic Kernel

    9. Multiquadric Kernel

    10. Inverse Multiquadric Kernel

    11. Circular Kernel

    12. Spherical Kernel

    13. Wave Kernel

    14. Power Kernel

    15. Log Kernel

    16. Spline Kernel

    17. B-Spline Kernel

    18. Bessel Kernel

    19. Cauchy Kernel

    20. Chi-Square Kernel

    21. Histogram Intersection Kernel

    22. Generalized Histogram Intersection Kernel

    23. Generalized T-Student Kernel

    24. Bayesian Kernel

    25. Wavelet Kernel



  3. Source code

  4. See also

  5. References



 



Kernel Methods



Kernel methods are a class of algorithms for pattern analysis or recognition, whose best known element is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (such as clusters, rankings, principal components, correlations, classifications) in general types of data (such as sequences, text documents, sets of points, vectors, images, graphs, etc) (Wikipedia, 2010a).



The main characteristic of Kernel Methods, however, is their distinct approach to this problem. Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data could become more easily separated or better structured. There are also no constraints on the form of this mapping, which could even lead to infinite-dimensional spaces. This mapping function, however, hardly ever needs to be computed explicitly because of a tool called the kernel trick.



The Kernel Trick


The kernel trick is a mathematical tool which can be applied to any algorithm which solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced by a kernel function. When properly applied, those candidate linear algorithms are transformed into non-linear algorithms (sometimes with little effort or reformulation). Those non-linear algorithms are equivalent to their linear originals operating in the range space of a feature space φ. However, because kernels are used, the φ function never needs to be explicitly computed. This is highly desirable because, as we noted previously, this higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute. There are also no constraints on the nature of the input vectors: dot products could be defined between any kind of structure, such as trees or strings.
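As a small illustration of the trick (not code from the framework), the sketch below builds a Gram matrix for an arbitrary kernel passed as a delegate; any algorithm written purely in terms of dot products can then operate on this matrix instead of on the raw vectors:

using System;

// Illustrative sketch: builds the Gram matrix K[i, j] = k(x_i, x_j)
// for an arbitrary kernel function supplied as a delegate.
static double[,] GramMatrix(double[][] points, Func<double[], double[], double> kernel)
{
    int n = points.Length;
    double[,] K = new double[n, n];

    for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++)
            K[i, j] = K[j, i] = kernel(points[i], points[j]); // kernels are symmetric

    return K;
}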



Kernel Properties



Kernel functions must be continuous, symmetric, and most preferably should have a positive (semi-)definite Gram matrix. Kernels which are said to satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique.



However, many kernel functions which are not strictly positive definite have also been shown to perform very well in practice. An example is the Sigmoid kernel which, despite its wide use, is not positive semi-definite for certain values of its parameters. Boughorbel (2005) also experimentally demonstrated that kernels which are only conditionally positive definite can possibly outperform most classical kernels in some applications.

Kernels can also be classified as anisotropic stationary, isotropic stationary, compactly supported, locally stationary, nonstationary or separable nonstationary. Moreover, kernels can be labeled scale-invariant or scale-dependent, which is an interesting property, as scale-invariant kernels keep the training process invariant to a scaling of the data.



Choosing the Right Kernel

Choosing the most appropriate kernel highly depends on the problem at hand - and fine tuning its parameters can easily become a tedious and cumbersome task. Automatic kernel selection is possible and is discussed in the works by Tom Howley and Michael Madden.

The choice of a kernel depends on the problem at hand because it depends on what we are trying to model. A polynomial kernel, for example, allows us to model feature conjunctions up to the order of the polynomial. Radial basis functions allow us to pick out circles (or hyperspheres), in contrast with the Linear kernel, which allows only lines (or hyperplanes) to be picked out.

The motivation behind the choice of a particular kernel can be very intuitive and straightforward depending on what kind of information we are expecting to extract about the data. Please see the final notes on this topic from Introduction to Information Retrieval, by Manning, Raghavan and Schütze for a better explanation on the subject.

 



Kernel Functions

Below is a list of some kernel functions available from the existing literature. As was the case with previous articles, the LaTeX notation for each formula below is readily available from its alternate-text HTML tag. I cannot guarantee all of them are perfectly correct, thus use them at your own risk. Most of them have links to articles where they have been originally used or proposed.



1. Linear Kernel


The Linear kernel is the simplest kernel function. It is given by the inner product <x,y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, i.e. KPCA with linear kernel is the same as standard PCA.

k(x, y) = x^T y + c
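A direct, illustrative transcription of this formula in C# (a sketch only, not the framework's implementation) could be:

// Linear kernel: inner product <x, y> plus an optional constant c.
static double Linear(double[] x, double[] y, double c)
{
    double sum = 0.0;
    for (int i = 0; i < x.Length; i++)
        sum += x[i] * y[i];
    return sum + c;
}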



2. Polynomial Kernel


The Polynomial kernel is a non-stationary kernel. Polynomial kernels are well suited for problems where all the training data is normalized.

k(x, y) = (\alpha x^T y + c)^d  
Adjustable parameters are the slope alpha, the constant term c and the polynomial degree d.
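An illustrative transcription of the polynomial kernel (again, just a sketch, not the framework's code):

// Polynomial kernel: (alpha * <x, y> + c)^d
static double Polynomial(double[] x, double[] y, double alpha, double c, double d)
{
    double sum = 0.0;
    for (int i = 0; i < x.Length; i++)
        sum += x[i] * y[i];
    return Math.Pow(alpha * sum + c, d);
}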


3. Gaussian Kernel

The Gaussian kernel is an example of radial basis function kernel.

k(x, y) = \exp\left(-\frac{ \lVert x-y \rVert ^2}{2\sigma^2}\right)

Alternatively, it could also be implemented using

k(x, y) = \exp\left(- \gamma \lVert x-y \rVert ^2 )

The adjustable parameter sigma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.
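An illustrative sketch of the Gaussian kernel in its sigma parametrization (not the framework's implementation):

// Gaussian kernel: exp(-||x - y||^2 / (2 * sigma^2))
static double Gaussian(double[] x, double[] y, double sigma)
{
    double norm = 0.0;
    for (int i = 0; i < x.Length; i++)
    {
        double d = x[i] - y[i];
        norm += d * d; // squared Euclidean distance
    }
    return Math.Exp(-norm / (2.0 * sigma * sigma));
}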



4. Exponential Kernel

The exponential kernel is closely related to the Gaussian kernel, with only the square of the norm left out. It is also a radial basis function kernel.

k(x, y) = \exp\left(-\frac{ \lVert x-y \rVert }{2\sigma^2}\right)



5. Laplacian Kernel

The Laplace kernel is completely equivalent to the exponential kernel, except for being less sensitive to changes in the sigma parameter. Being equivalent, it is also a radial basis function kernel.

k(x, y) = \exp\left(- \frac{\lVert x-y \rVert }{\sigma}\right)

It is important to note that the observations made about the sigma parameter for the Gaussian kernel also apply to the Exponential and Laplacian kernels.
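For completeness, an illustrative transcription of the Laplacian kernel (a sketch only):

// Laplacian kernel: exp(-||x - y|| / sigma)
static double Laplacian(double[] x, double[] y, double sigma)
{
    double norm = 0.0;
    for (int i = 0; i < x.Length; i++)
    {
        double d = x[i] - y[i];
        norm += d * d;
    }
    return Math.Exp(-Math.Sqrt(norm) / sigma);
}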



6. ANOVA Kernel

The ANOVA kernel is also a radial basis function kernel, just as the Gaussian and Laplacian kernels. It is said to perform well in multidimensional regression problems (Hofmann, 2008).



k(x, y) =  \sum_{k=1}^n  \exp (-\sigma (x^k - y^k)^2)^d



7. Hyperbolic Tangent (Sigmoid) Kernel

The Hyperbolic Tangent Kernel is also known as the Sigmoid Kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid Kernel comes from the Neural Networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.

k(x, y) = \tanh (\alpha x^T y + c)

It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel was quite popular for support vector machines due to its origin in neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice.

There are two adjustable parameters in the sigmoid kernel, the slope alpha and the intercept constant c. A common value for alpha is 1/N, where N is the data dimension. A more detailed study on sigmoid kernels can be found in the works by Hsuan-Tien and Chih-Jen.
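An illustrative transcription of the sigmoid kernel (a sketch only):

// Hyperbolic tangent (sigmoid) kernel: tanh(alpha * <x, y> + c)
static double Sigmoid(double[] x, double[] y, double alpha, double c)
{
    double sum = 0.0;
    for (int i = 0; i < x.Length; i++)
        sum += x[i] * y[i];
    return Math.Tanh(alpha * sum + c);
}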



8. Rational Quadratic Kernel

The Rational Quadratic kernel is less computationally intensive than the Gaussian kernel and can be used as an alternative when using the Gaussian becomes too expensive.

k(x, y) = 1 - \frac{\lVert x-y \rVert^2}{\lVert x-y \rVert^2 + c}



9. Multiquadric Kernel

The Multiquadric kernel can be used in the same situations as the Rational Quadratic kernel. As is the case with the Sigmoid kernel, it is also an example of a non-positive definite kernel.

 k(x, y) = \sqrt{\lVert x-y \rVert^2 + c^2}



10. Inverse Multiquadric Kernel

The Inverse Multiquadric kernel, as with the Gaussian kernel, results in a kernel matrix with full rank (Micchelli, 1986) and thus forms an infinite-dimensional feature space.

k(x, y) = \frac{1}{\sqrt{\lVert x-y \rVert^2 + \theta^2}}



11. Circular Kernel


The circular kernel is used in geostatistical applications. It is an example of an isotropic stationary kernel and is positive definite in R2.


k(x, y) = \frac{2}{\pi} \arccos ( - \frac{ \lVert x-y \rVert}{\sigma}) - \frac{2}{\pi} \frac{ \lVert x-y \rVert}{\sigma} \sqrt{1 - \left(\frac{ \lVert x-y \rVert}{\sigma} \right)^2}
\mbox{if}~ \lVert x-y \rVert < \sigma \mbox{, zero otherwise}



12. Spherical Kernel

The spherical kernel is similar to the circular kernel, but is positive definite in R3.

k(x, y) = 1 - \frac{3}{2} \frac{\lVert x-y \rVert}{\sigma} + \frac{1}{2} \left( \frac{ \lVert x-y \rVert}{\sigma} \right)^3

\mbox{if}~ \lVert x-y \rVert < \sigma \mbox{, zero otherwise}



13. Wave Kernel

The Wave kernel is also symmetric positive semi-definite (Huang, 2008).

k(x, y) = \frac{\theta}{\lVert x-y \rVert} \sin \frac{\lVert x-y \rVert}{\theta}



14. Power Kernel

The Power kernel is also known as the (unrectified) triangular kernel. It is an example of scale-invariant kernel (Sahbi and Fleuret, 2004) and is also only conditionally positive definite.

k(x,y) = - \lVert x-y \rVert ^d



15. Log Kernel

The Log kernel seems to be particularly interesting for images, but is only conditionally positive definite.

k(x,y) = - \log (\lVert x-y \rVert ^d + 1)



16. Spline Kernel

The Spline kernel is given as a piece-wise cubic polynomial, as derived in the works by Gunn (1998).

k(x, y) = 1 + xy + xy~min(x,y) - \frac{x+y}{2}~min(x,y)^2+\frac{1}{3}\min(x,y)^3

However, what it actually means is:

k(x,y) = \prod_{i=1}^d 1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2} \min(x_i,y_i)^2 + \frac{\min(x_i,y_i)^3}{3}

with x, y \in R^d

 



17. B-Spline (Radial Basis Function) Kernel

The B-Spline kernel is defined on the interval [−1, 1]. It is given by the recursive formula:

k(x,y) = B_{2p+1}(x-y)

\mbox{where~} p \in N \mbox{~with~} B_{i+1} := B_i \otimes  B_0.

 

In the work by Bart Hamers it is given by:

k(x, y) = \prod_{p=1}^d B_{2n+1}(x_p - y_p)

 

Alternatively, Bn can be computed using the explicit expression (Fomel, 2000):

B_n(x) = \frac{1}{n!} \sum_{k=0}^{n+1} \binom{n+1}{k} (-1)^k (x + \frac{n+1}{2} - k)^n_+

 

Where x+ is defined as the truncated power function:

 x^d_+ = \begin{cases} x^d, & \mbox{if }x > 0 \\  0, & \mbox{otherwise} \end{cases}



18. Bessel Kernel

The Bessel kernel is well known in the theory of function spaces of fractional smoothness. It is given by:

k(x, y) = \frac{J_{v+1}( \sigma \lVert x-y \rVert)}{ \lVert x-y \rVert ^ {-n(v+1)} }

where J is the Bessel function of first kind. However, in the Kernlab for R documentation, the Bessel kernel is said to be:

k(x,x') = - Bessel_{(nu+1)}^n (\sigma |x - x'|^2)



19. Cauchy Kernel

The Cauchy kernel comes from the Cauchy distribution (Basak, 2008). It is a long-tailed kernel and can be used to give long-range influence and sensitivity over the high dimension space.

k(x, y) = \frac{1}{1 + \frac{\lVert x-y \rVert^2}{\sigma^2} }



20. Chi-Square Kernel

The Chi-Square kernel comes from the Chi-Square distribution.

k(x,y) = 1 - \sum_{i=1}^n \frac{(x_i-y_i)^2}{\frac{1}{2}(x_i+y_i)}



21. Histogram Intersection Kernel

The Histogram Intersection Kernel is also known as the Min Kernel and has been proven useful in image classification.

k(x,y) = \sum_{i=1}^n \min(x_i,y_i)
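An illustrative transcription of this kernel (a sketch only):

// Histogram intersection (min) kernel: sum of element-wise minimums.
static double HistogramIntersection(double[] x, double[] y)
{
    double sum = 0.0;
    for (int i = 0; i < x.Length; i++)
        sum += Math.Min(x[i], y[i]);
    return sum;
}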



22. Generalized Histogram Intersection

The Generalized Histogram Intersection kernel is built based on the Histogram Intersection Kernel for image classification but applies in a much larger variety of contexts (Boughorbel, 2005). It is given by:

k(x,y) = \sum_{i=1}^m \min(|x_i|^\alpha,|y_i|^\beta)



23. Generalized T-Student Kernel

The Generalized T-Student Kernel has been proven to be a Mercer kernel, thus having a positive semi-definite kernel matrix (Boughorbel, 2004). It is given by:

k(x,y) = \frac{1}{1 + \lVert x-y \rVert ^d}



24. Bayesian Kernel

The Bayesian kernel could be given as:

k(x,y) = \prod_{l=1}^N \kappa_l (x_l,y_l)

where

\kappa_l(a,b) = \sum_{c \in \{0;1\}} P(Y=c \mid X_l=a) ~ P(Y=c \mid X_l=b)

However, it really depends on the problem being modeled. For more information, please see the work by Alashwal, Deris and Othman, in which they used an SVM with Bayesian kernels in the prediction of protein-protein interactions.



25. Wavelet Kernel


The Wavelet kernel (Zhang et al, 2004) comes from Wavelet theory and is given as:



k(x,y) = \prod_{i=1}^N h(\frac{x_i-c_i}{a}) \:  h(\frac{y_i-c_i}{a})

Where a and c are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification, please see the original paper for details). A translation-invariant version of this kernel can be given as:



k(x,y) = \prod_{i=1}^N h(\frac{x_i-y_i}{a})

where, in both, h(x) denotes a mother wavelet function. In the paper by Li Zhang, Weida Zhou, and Licheng Jiao, the authors suggest a possible h(x) as:



h(x) = \cos(1.75x) \exp(-\frac{x^2}{2})

They also prove this to be an admissible kernel function.



 



Source Code


The latest version of the source code for almost all of the kernels listed above is available in the Accord.NET Framework. Some are also available in the sequel of this article, Kernel Support Vector Machines for Classification and Regression in C#. They are provided together with a comprehensive and simple implementation of SVMs (Support Vector Machines) in C#. However, for the latest sources, which may contain bug fixes and other enhancements, please download the most recent version available of Accord.NET.



 


Citing this work


If you would like, please cite this work as: Souza, César R. "Kernel Functions for Machine Learning Applications." 17 Mar. 2010. Web. <http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html>.