Tuesday, 27 April 2010

Kernel Support Vector Machines for Classification and Regression in C#

Kernel methods in general have gained increased attention in recent years, partly due to the growth in popularity of the Support Vector Machines. Support Vector Machines are linear classifiers and regressors that, through the Kernel trick, operate in reproducing Kernel Hilbert spaces and are thus able to perform non-linear classification and regression in their input space.




The source code available here is distributed under a GPL license. The sequential minimal optimization for regression was implemented by following Smola’s Tutorial on Support Vector Regression; some modifications, however, were based on GPL code by Sylvain Roy.

Contents



  1. Introduction
    1. Support Vector Machines
    2. Kernel Support Vector Machines
      1. The Kernel Trick
      2. Standard Kernels
  2. Learning Algorithms
    1. Sequential Minimal Optimization
  3. Source Code
    1. Support Vector Machine
    2. Kernel Support Vector Machine
    3. Sequential Minimal Optimization
  4. Using the code
  5. Sample application
    1. Classification
    2. Regression
  6. See also
  7. References

 

Introduction

Support Vector Machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, a SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

A linear support vector machine is composed of a set of given support vectors z and a set of weights w. The computation for the output of a given SVM with N support vectors z1, z2, … , zN and weights w1, w2, … , wN is then given by:

F(x) = \sum_{i=1}^N  w_i \, \left \langle z_i,x \right \rangle  + b

Kernel Support Vector Machines

The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.

Using kernels, the formulation for an SVM with support vectors z1, z2, … , zN and weights w1, w2, … , wN becomes:

F(x) = \sum_{i=1}^N  w_i \, k(z_i,x) + b

It is also straightforward to see that, using a linear kernel of the form K(z,x) = <z,x> = z^T x, we recover the original formulation for the linear SVM.

The Kernel trick

The Kernel trick is a very interesting and powerful tool. It is powerful because it provides a bridge from linearity to non-linearity to any algorithm that solely depends on the dot product between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space.



Now, the Kernel trick is really interesting because that mapping never needs to be explicitly computed. If our algorithm can be expressed only in terms of an inner product between two vectors, all we need to do is replace this inner product with the inner product from some other suitable space. That is where the "trick" resides: wherever a dot product is used, it is replaced with a Kernel function. The kernel function denotes an inner product in feature space and is usually denoted as:

K(x,y) = <φ(x),φ(y)>

Using the Kernel function, the algorithm can then be carried into a higher-dimension space without explicitly mapping the input points into this space. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute.
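To make the trick concrete, the identity above can be checked numerically for a particular kernel. The short program below is an illustrative sketch only (it is not part of the library code presented later): it compares the homogeneous polynomial kernel k(x, y) = (x'y)² against its explicit feature map φ(x) = (x1², √2·x1x2, x2²) for two-dimensional inputs, and both sides should agree up to floating-point error.

using System;

static class KernelTrickDemo
{
    // Homogeneous polynomial kernel of degree 2: k(x, y) = (x'y)^2
    static double K(double[] x, double[] y)
    {
        double dot = x[0] * y[0] + x[1] * y[1];
        return dot * dot;
    }

    // Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    static double[] Phi(double[] x)
    {
        return new double[] { x[0] * x[0], Math.Sqrt(2) * x[0] * x[1], x[1] * x[1] };
    }

    static void Main()
    {
        double[] x = { 1.0, 2.0 };
        double[] y = { 3.0, -1.0 };

        // Inner product computed explicitly in feature space: <phi(x), phi(y)>
        double[] px = Phi(x), py = Phi(y);
        double explicitProduct = px[0] * py[0] + px[1] * py[1] + px[2] * py[2];

        // The kernel gives the same value without ever building phi
        Console.WriteLine("k(x, y)          = " + K(x, y));
        Console.WriteLine("<phi(x), phi(y)> = " + explicitProduct);
    }
}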

Standard Kernels

Some common Kernel functions include the linear kernel, the polynomial kernel and the Gaussian kernel. Below is a simple list with their most interesting characteristics.


  • Linear Kernel: the simplest kernel function. It is given by the common inner product <x,y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, i.e. KPCA with a linear kernel is equivalent to standard PCA. k(x, y) = x^T y + c
  • Polynomial Kernel: a non-stationary kernel, well suited for problems where all of the data is normalized. k(x, y) = (\alpha x^T y + c)^d
  • Gaussian Kernel: by far one of the most versatile kernels. It is a radial basis function kernel, and is the preferred kernel when we don’t know much about the data we are trying to model. k(x, y) = \exp\left(-\frac{ \lVert x-y \rVert ^2}{2\sigma^2}\right)

For more Kernel functions, check Kernel functions for Machine Learning Applications. The accompanying source code includes definitions for over 20 distinct Kernel functions, many of them detailed in the aforementioned post.
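For quick reference, the three kernels above can also be written directly as functions of two vectors. The snippet below is only a stand-alone sketch of the formulas; it is not the set of IKernel classes shipped with the accompanying source code, and the parameter names (constant, alpha, degree, sigma) are merely illustrative.

using System;

public static class ExampleKernels
{
    // Linear kernel: k(x, y) = x'y + c
    public static double Linear(double[] x, double[] y, double constant)
    {
        double sum = constant;
        for (int i = 0; i < x.Length; i++)
            sum += x[i] * y[i];
        return sum;
    }

    // Polynomial kernel: k(x, y) = (alpha * x'y + c)^d
    public static double Polynomial(double[] x, double[] y,
        double alpha, double constant, int degree)
    {
        return Math.Pow(alpha * Linear(x, y, 0.0) + constant, degree);
    }

    // Gaussian kernel: k(x, y) = exp(-||x - y||^2 / (2*sigma^2))
    public static double Gaussian(double[] x, double[] y, double sigma)
    {
        double norm = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            double d = x[i] - y[i];
            norm += d * d;
        }
        return Math.Exp(-norm / (2.0 * sigma * sigma));
    }
}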

 

Learning Algorithms

Sequential Minimal Optimization

Previous SVM learning algorithms involved the use of quadratic programming solvers. Some of them used chunking to split the problem into smaller parts which could be solved more efficiently. Platt's Sequential Minimal Optimization (SMO) algorithm takes chunking to the extreme by breaking the problem down into 2-dimensional sub-problems that can be solved analytically, eliminating the need for a numerical optimization algorithm.

The algorithm solves the optimization problem in terms of its Lagrange multipliers. Platt's algorithm is composed of three main procedures or parts:

  • run, which iterates over all points until convergence to a tolerance threshold;
  • examineExample, which finds two points to jointly optimize;
  • takeStep, which solves the 2-dimensional optimization problem analytically.

The algorithm is also governed by three extra parameters besides the Kernel function and the data points (a short configuration sketch is given after the list).

  • The parameter C controls the trade-off between allowing some training errors and forcing rigid margins. Increasing the value of C increases the cost of misclassifications, but may result in models that do not generalize well to points outside the training set.
  • The parameter ε controls the width of the ε-insensitive zone used to fit the training data. The value of ε can affect the number of support vectors used to construct the regression function: the bigger ε, the fewer support vectors are selected and the sparser the solution becomes. On the other hand, increasing ε too much will result in less accurate models.
  • The parameter T is the convergence tolerance. It is the criterion for completing the training process.
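In the SequentialMinimalOptimization class presented later in this article, these three values map directly onto the Complexity, Epsilon and Tolerance properties. A minimal configuration sketch, assuming the machine and the training data have already been created as in the usage example further below:

// Assuming 'machine', 'inputs' and 'outputs' have already been created
SequentialMinimalOptimization smo =
    new SequentialMinimalOptimization(machine, inputs, outputs);

smo.Complexity = 1.0;   // C: cost of violating the margin
smo.Epsilon    = 1e-3;  // ε: width of the insensitive zone (regression)
smo.Tolerance  = 1e-3;  // T: convergence tolerance

// Run the training and obtain the misclassification rate
double error = smo.Run();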

After the algorithm ends, a new Support Vector Machine can be created using only the points whose Lagrange multipliers are higher than zero. The expected outputs yi can be individually multiplied by their corresponding Lagrange multipliers ai to form a single weight vector w.

F(x) = \sum_{i=1}^N \alpha_i y_i \, k(z_i,x) + b = \sum_{i=1}^N w_i \, k(z_i,x) + b

Sequential Minimal Optimization for Regression

A version of SVM for regression was proposed in 1996 by Vladimir Vapnik, Harris Drucker, Chris Burges, Linda Kaufman and Alex Smola. The method was called support vector regression and, as is the case with the original Support Vector Machine formulation, depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction that is within a tolerance threshold ε.

Platt's algorithm has also been modified for regression. While it retains much of its original structure, the difference lies in the fact that the modified algorithm uses two Lagrange multipliers âi and ai for each input point i. After the algorithm ends, a new Support Vector Machine can be created using only the points whose Lagrange multipliers are greater than zero. The multipliers âi and ai are then subtracted to form a single weight vector w.

F(x) = \sum_{i=1}^N (\hat{\alpha}_i - \alpha_i) \, k(z_i,x) + b = \sum_{i=1}^N w_i \, k(z_i,x) + b

The algorithm is also governed by the same three parameters presented above. The parameter ε, however, receives a special meaning: it governs the size of the ε-insensitive tube over the regression line. The algorithm was further developed and adapted by Alex J. Smola and Bernhard Schölkopf, and further optimizations were introduced by Shevade et al. and Flake et al.
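Once such a regression machine has been learned, its output follows the formula above directly. The fragment below is only an illustrative sketch of that evaluation (it is not the regression learner itself, which is included in the downloadable source); it assumes the differences âi - ai have already been collapsed into a weights array, and it reuses the IKernel interface from the code presented in the next section.

/// <summary>
///   Illustrative evaluation of the regression output, assuming the
///   multiplier differences (âi - ai) were stored in the weights array.
/// </summary>
public static double ComputeRegressionOutput(double[] input,
    double[][] supportVectors, double[] weights, double threshold, IKernel kernel)
{
    // F(x) = sum_i w_i * k(z_i, x) + b,  with  w_i = âi - ai
    double sum = threshold;
    for (int i = 0; i < supportVectors.Length; i++)
        sum += weights[i] * kernel.Function(supportVectors[i], input);
    return sum;
}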

Source Code

Here is the class diagram for the Support Vector Machine module. As we can see, the class organization is quite simple.

Class diagram for the (Kernel) Support Vector Machines module.


Support Vector Machine

Below is the class definition for the Linear Support Vector Machine. It is pretty much self-explanatory.


    /// <summary>
/// Sparse Linear Support Vector Machine (SVM)
/// </summary>
public class SupportVectorMachine
{

private int inputCount;
private double[][] supportVectors;
private double[] weights;
private double threshold;

/// <summary>
/// Creates a new Support Vector Machine
/// </summary>
public SupportVectorMachine(int inputs)
{
this.inputCount = inputs;
}

/// <summary>
/// Gets the number of inputs accepted by this SVM.
/// </summary>
public int Inputs
{
get { return inputCount; }
}

/// <summary>
/// Gets or sets the collection of support vectors used by this machine.
/// </summary>
public double[][] SupportVectors
{
get { return supportVectors; }
set { supportVectors = value; }
}

/// <summary>
/// Gets or sets the collection of weights used by this machine.
/// </summary>
public double[] Weights
{
get { return weights; }
set { weights = value; }
}

/// <summary>
/// Gets or sets the threshold (bias) term for this machine.
/// </summary>
public double Threshold
{
get { return threshold; }
set { threshold = value; }
}

/// <summary>
/// Computes the given input to produce the corresponding output.
/// </summary>
/// <param name="input">An input vector.</param>
/// <returns>The output for the given input.</returns>
public virtual double Compute(double[] input)
{
double s = threshold;
for (int i = 0; i < supportVectors.Length; i++)
{
double p = 0;
for (int j = 0; j < input.Length; j++)
p += supportVectors[i][j] * input[j];

s += weights[i] * p;
}

return s;
}

/// <summary>
/// Computes the given inputs to produce the corresponding outputs.
/// </summary>
public double[] Compute(double[][] inputs)
{
double[] outputs = new double[inputs.Length];

for (int i = 0; i < inputs.Length; i++)
outputs[i] = Compute(inputs[i]);

return outputs;
}

}








Kernel Support Vector Machine



Here is the class definition for the Kernel Support Vector Machine. It inherits from Support Vector Machine and extends it with a Kernel property. The Compute method is also overridden to include the chosen Kernel in the model computation.






/// <summary>
/// Sparse Kernel Support Vector Machine (kSVM)
/// </summary>
/// <remarks>
/// <para>
/// The original optimal hyperplane algorithm (SVM) proposed by Vladimir Vapnik in 1963 was a
/// linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested
/// a way to create non-linear classifiers by applying the kernel trick (originally proposed
/// by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally
/// similar, except that every dot product is replaced by a non-linear kernel function.</para>
/// <para>
/// This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
/// The transformation may be non-linear and the transformed space high dimensional; thus though
/// the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in
/// the original input space.</para>
/// <para>
/// References:
/// <list type="bullet">
/// <item><description><a href="http://en.wikipedia.org/wiki/Support_vector_machine">
/// http://en.wikipedia.org/wiki/Support_vector_machine</a></description></item>
/// <item><description><a href="http://www.kernel-machines.org/">
/// http://www.kernel-machines.org/</a></description></item>
/// </list></para>
/// </remarks>
///
/// <example>
/// <code>
/// // Example XOR problem
/// double[][] inputs =
/// {
/// new double[] { 0, 0 }, // 0 xor 0: 0 (label +1)
/// new double[] { 0, 1 }, // 0 xor 1: 1 (label -1)
/// new double[] { 1, 0 }, // 1 xor 0: 1 (label -1)
/// new double[] { 1, 1 } // 1 xor 1: 0 (label +1)
/// };
///
/// // Dichotomy SVM outputs should be given as [-1;+1]
/// int[] labels =
/// {
/// // xor: 0, 1, 1, 0
/// 1, -1, -1, 1
/// };
///
/// // Create a Kernel Support Vector Machine for the given inputs
/// KernelSupportVectorMachine machine = new KernelSupportVectorMachine(new Gaussian(0.1), inputs[0].Length);
///
/// // Instantiate a new learning algorithm for SVMs
/// SequentialMinimalOptimization smo = new SequentialMinimalOptimization(machine, inputs, labels);
///
/// // Set up the learning algorithm
/// smo.Complexity = 1.0;
///
/// // Run the learning algorithm
/// double error = smo.Run();
///
/// // Compute the decision output for one of the input vectors
/// int decision = System.Math.Sign(machine.Compute(inputs[0]));
/// </code>
/// </example>
///
[Serializable]
public class KernelSupportVectorMachine : SupportVectorMachine
{

private IKernel kernel;


/// <summary>
/// Creates a new Kernel Support Vector Machine.
/// </summary>
///
/// <param name="kernel">The chosen kernel for the machine.</param>
/// <param name="inputs">The number of inputs for the machine.</param>
///
/// <remarks>
/// If the number of inputs is zero, this means the machine
/// accepts an indefinite number of inputs. This is often the
/// case for kernel vector machines using a sequence kernel.
/// </remarks>
///
public KernelSupportVectorMachine(IKernel kernel, int inputs) : base(inputs)
{
if (kernel == null) throw new ArgumentNullException("kernel");

this.kernel = kernel;
}

/// <summary>
/// Gets or sets the kernel used by this machine.
/// </summary>
///
public IKernel Kernel
{
get { return kernel; }
set { kernel = value; }
}

/// <summary>
/// Computes the given input to produce the corresponding output.
/// </summary>
///
/// <remarks>
/// For a binary decision problem, the decision for the negative
/// or positive class is typically computed by taking the sign of
/// the machine's output.
/// </remarks>
///
/// <param name="inputs">An input vector.</param>
/// <returns>The output for the given input.</returns>
///
public override double Compute(double[] inputs)
{
double s = Threshold;

for (int i = 0; i < SupportVectors.Length; i++)
s += Weights[i] * kernel.Function(SupportVectors[i], inputs);

return s;
}
}



Sequential Minimal Optimization



Here is the code for the Sequential Minimal Optimization (SMO) algorithm.






/// <summary>
/// Sequential Minimal Optimization (SMO) Algorithm
/// </summary>
///
/// <remarks>
/// <para>
/// The SMO algorithm is an algorithm for solving large quadratic programming (QP)
/// optimization problems, widely used for the training of support vector machines.
/// First developed by John C. Platt in 1998, SMO breaks up large QP problems into
/// a series of smallest possible QP problems, which are then solved analytically.</para>
/// <para>
/// This class follows the original algorithm by Platt as strictly as possible.</para>
///
/// <para>
/// References:
/// <list type="bullet">
/// <item><description>
/// <a href="http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization">
/// Wikipedia, The Free Encyclopedia. Sequential Minimal Optimization. Available on:
/// http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization </a></description></item>
/// <item><description>
/// <a href="http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf">
/// John C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support
/// Vector Machines. 1998. Available on: http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf </a></description></item>
/// <item><description>
/// <a href="http://www.idiom.com/~zilla/Work/Notes/svmtutorial.pdf">
/// J. P. Lewis. A Short SVM (Support Vector Machine) Tutorial. Available on:
/// http://www.idiom.com/~zilla/Work/Notes/svmtutorial.pdf </a></description></item>
/// </list></para>
/// </remarks>
///
/// <example>
/// <code>
/// // Example XOR problem
/// double[][] inputs =
/// {
/// new double[] { 0, 0 }, // 0 xor 0: 0 (label +1)
/// new double[] { 0, 1 }, // 0 xor 1: 1 (label -1)
/// new double[] { 1, 0 }, // 1 xor 0: 1 (label -1)
/// new double[] { 1, 1 } // 1 xor 1: 0 (label +1)
/// };
///
/// // Dichotomy SVM outputs should be given as [-1;+1]
/// int[] labels =
/// {
/// 1, -1, -1, 1
/// };
///
/// // Create a Kernel Support Vector Machine for the given inputs
/// KernelSupportVectorMachine machine = new KernelSupportVectorMachine(new Gaussian(0.1), inputs[0].Length);
///
/// // Instantiate a new learning algorithm for SVMs
/// SequentialMinimalOptimization smo = new SequentialMinimalOptimization(machine, inputs, labels);
///
/// // Set up the learning algorithm
/// smo.Complexity = 1.0;
///
/// // Run the learning algorithm
/// double error = smo.Run();
///
/// // Compute the decision output for one of the input vectors
/// int decision = System.Math.Sign(machine.Compute(inputs[0]));
/// </code>
/// </example>
///
public class SequentialMinimalOptimization : ISupportVectorMachineLearning
{
private static Random random = new Random();

// Training data
private double[][] inputs;
private int[] outputs;

// Learning algorithm parameters
private double c = 1.0;
private double tolerance = 1e-3;
private double epsilon = 1e-3;
private bool useComplexityHeuristic;

// Support Vector Machine parameters
private SupportVectorMachine machine;
private IKernel kernel;
private double[] alpha;
private double bias;


// Error cache to speed up computations
private double[] errors;


/// <summary>
/// Initializes a new instance of a Sequential Minimal Optimization (SMO) algorithm.
/// </summary>
///
/// <param name="machine">A Support Vector Machine.</param>
/// <param name="inputs">The input data points as row vectors.</param>
/// <param name="outputs">The classification label for each data point in the range [-1;+1].</param>
///
public SequentialMinimalOptimization(SupportVectorMachine machine,
double[][] inputs, int[] outputs)
{

// Initial argument checking
if (machine == null)
throw new ArgumentNullException("machine");

if (inputs == null)
throw new ArgumentNullException("inputs");

if (outputs == null)
throw new ArgumentNullException("outputs");

if (inputs.Length != outputs.Length)
throw new ArgumentException("The number of inputs and outputs does not match.", "outputs");

for (int i = 0; i < outputs.Length; i++)
{
if (outputs[i] != 1 && outputs[i] != -1)
throw new ArgumentOutOfRangeException("outputs", "One of the labels in the output vector is neither +1 nor -1.");
}

if (machine.Inputs > 0)
{
// This machine has a fixed input vector size
for (int i = 0; i < inputs.Length; i++)
if (inputs[i].Length != machine.Inputs)
throw new ArgumentException("The size of the input vectors does not match the expected number of inputs of the machine");
}


// Machine
this.machine = machine;

// Kernel (if applicable)
KernelSupportVectorMachine ksvm = machine as KernelSupportVectorMachine;
this.kernel = (ksvm != null) ? ksvm.Kernel : new Linear();


// Learning data
this.inputs = inputs;
this.outputs = outputs;

}


//---------------------------------------------


#region Properties
/// <summary>
/// Complexity (cost) parameter C. Increasing the value of C forces the creation
/// of a more accurate model that may not generalize well. Default value is the
/// number of examples divided by the trace of the kernel matrix.
/// </summary>
/// <remarks>
/// The cost parameter C controls the trade off between allowing training
/// errors and forcing rigid margins. It creates a soft margin that permits
/// some misclassifications. Increasing the value of C increases the cost of
/// misclassifying points and forces the creation of a more accurate model
/// that may not generalize well.
/// </remarks>
public double Complexity
{
get { return this.c; }
set { this.c = value; }
}


/// <summary>
/// Gets or sets a value indicating whether the Complexity parameter C
/// should be computed automatically by employing an heuristic rule.
/// </summary>
/// <value>
/// <c>true</c> if complexity should be computed automatically; otherwise, <c>false</c>.
/// </value>
public bool UseComplexityHeuristic
{
get { return useComplexityHeuristic; }
set { useComplexityHeuristic = value; }
}

/// <summary>
/// Insensitivity zone ε. Increasing the value of ε can result in fewer support
/// vectors in the created model. Default value is 1e-3.
/// </summary>
/// <remarks>
/// Parameter ε controls the width of the ε-insensitive zone, used to fit the training
/// data. The value of ε can affect the number of support vectors used to construct the
/// regression function. The bigger ε, the fewer support vectors are selected. On the
/// other hand, bigger ε-values results in more flat estimates.
/// </remarks>
public double Epsilon
{
get { return epsilon; }
set { epsilon = value; }
}

/// <summary>
/// Convergence tolerance. Default value is 1e-3.
/// </summary>
/// <remarks>
/// The criterion for completing the model training process. The default is 0.001.
/// </remarks>
public double Tolerance
{
get { return this.tolerance; }
set { this.tolerance = value; }
}
#endregion


//---------------------------------------------


/// <summary>
/// Runs the SMO algorithm.
/// </summary>
///
/// <param name="computeError">
/// True to compute error after the training
/// process completes, false otherwise. Default is true.
/// </param>
///
/// <returns>
/// The misclassification error rate of
/// the resulting support vector machine.
/// </returns>
///
public double Run(bool computeError)
{

// The SMO algorithm chooses to solve the smallest possible optimization problem
// at every step. At every step, SMO chooses two Lagrange multipliers to jointly
// optimize, finds the optimal values for these multipliers, and updates the SVM
// to reflect the new optimal values
//
// Reference: http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf


// Initialize variables
int N = inputs.Length;

if (useComplexityHeuristic)
c = computeComplexity();

// Lagrange multipliers
this.alpha = new double[N];

// Error cache
this.errors = new double[N];

// Algorithm:
int numChanged = 0;
int examineAll = 1;

while (numChanged > 0 || examineAll > 0)
{
numChanged = 0;
if (examineAll > 0)
{
// loop I over all training examples
for (int i = 0; i < N; i++)
numChanged += examineExample(i);
}
else
{
// loop I over examples where alpha is not 0 and not C
for (int i = 0; i < N; i++)
if (alpha[i] != 0 && alpha[i] != c)
numChanged += examineExample(i);
}

if (examineAll == 1)
examineAll = 0;
else if (numChanged == 0)
examineAll = 1;
}


// Store Support Vectors in the SV Machine. Only vectors which have lagrange multipliers
// greater than zero will be stored as only those are actually required during evaluation.
List<int> indices = new List<int>();
for (int i = 0; i < N; i++)
{
// Only store vectors with multipliers > 0
if (alpha[i] > 0) indices.Add(i);
}

int vectors = indices.Count;
machine.SupportVectors = new double[vectors][];
machine.Weights = new double[vectors];
for (int i = 0; i < vectors; i++)
{
int j = indices[i];
machine.SupportVectors[i] = inputs[j];
machine.Weights[i] = alpha[j] * outputs[j];
}
machine.Threshold = -bias;


// Compute error if required.
return (computeError) ? ComputeError(inputs, outputs) : 0.0;
}

/// <summary>
/// Runs the SMO algorithm.
/// </summary>
///
/// <returns>
/// The misclassification error rate of
/// the resulting support vector machine.
/// </returns>
///
public double Run()
{
return Run(true);
}

/// <summary>
/// Computes the error rate for a given set of input and outputs.
/// </summary>
///
public double ComputeError(double[][] inputs, int[] expectedOutputs)
{
// Compute errors
int count = 0;
for (int i = 0; i < inputs.Length; i++)
{
if (Math.Sign(compute(inputs[i])) != Math.Sign(expectedOutputs[i]))
count++;
}

// Return misclassification error ratio
return (double)count / inputs.Length;
}

//---------------------------------------------


/// <summary>
/// Chooses which multipliers to optimize using heuristics.
/// </summary>
///
private int examineExample(int i2)
{
double[] p2 = inputs[i2]; // Input point at index i2
double y2 = outputs[i2]; // Classification label for p2
double alph2 = alpha[i2]; // Lagrange multiplier for p2

// SVM output on p2 - y2. Check if it has already been computed
double e2 = (alph2 > 0 && alph2 < c) ? errors[i2] : compute(p2) - y2;

double r2 = y2 * e2;


// Heuristic 01 (for the first multiplier choice):
// - Testing for KKT conditions within the tolerance margin
if (!(r2 < -tolerance && alph2 < c) && !(r2 > tolerance && alph2 > 0))
return 0;


// Heuristic 02 (for the second multiplier choice):
// - Once a first Lagrange multiplier is chosen, SMO chooses the second Lagrange multiplier to
// maximize the size of the step taken during joint optimization. Now, evaluating the kernel
// function is time consuming, so SMO approximates the step size by the absolute value of the
// absolute error difference.
int i1 = -1; double max = 0;
for (int i = 0; i < inputs.Length; i++)
{
if (alpha[i] > 0 && alpha[i] < c)
{
double error1 = errors[i];
double aux = System.Math.Abs(e2 - error1);

if (aux > max)
{
max = aux;
i1 = i;
}
}
}

if (i1 >= 0 && takeStep(i1, i2)) return 1;


// Heuristic 03:
// - Under unusual circumstances, SMO cannot make positive progress using the second
// choice heuristic above. If it is the case, then SMO starts iterating through the
// non-bound examples, searching for a second example that can make positive progress.

int start = random.Next(inputs.Length);
for (i1 = start; i1 < inputs.Length; i1++)
{
if (alpha[i1] > 0 && alpha[i1] < c)
if (takeStep(i1, i2)) return 1;
}
for (i1 = 0; i1 < start; i1++)
{
if (alpha[i1] > 0 && alpha[i1] < c)
if (takeStep(i1, i2)) return 1;
}


// Heuristic 04:
// - If none of the non-bound examples make positive progress, then SMO starts iterating
// through the entire training set until an example is found that makes positive progress.
// Both the iteration through the non-bound examples and the iteration through the entire
// training set are started at random locations, in order not to bias SMO towards the
// examples at the beginning of the training set.

start = random.Next(inputs.Length);
for (i1 = start; i1 < inputs.Length; i1++)
{
if (takeStep(i1, i2)) return 1;
}
for (i1 = 0; i1 < start; i1++)
{
if (takeStep(i1, i2)) return 1;
}


// In extremely degenerate circumstances, none of the examples will make an adequate second
// example. When this happens, the first example is skipped and SMO continues with another
// chosen first example.
return 0;
}

/// <summary>
/// Analytically solves the optimization problem for two Lagrange multipliers.
/// </summary>
///
private bool takeStep(int i1, int i2)
{
if (i1 == i2) return false;

double[] p1 = inputs[i1]; // Input point at index i1
double alph1 = alpha[i1]; // Lagrange multiplier for p1
double y1 = outputs[i1]; // Classification label for p1

// SVM output on p1 - y1. Check if it has already been computed
double e1 = (alph1 > 0 && alph1 < c) ? errors[i1] : compute(p1) - y1;

double[] p2 = inputs[i2]; // Input point at index i2
double alph2 = alpha[i2]; // Lagrange multiplier for p2
double y2 = outputs[i2]; // Classification label for p2

// SVM output on p2 - y2. Check if it has already been computed
double e2 = (alph2 > 0 && alph2 < c) ? errors[i2] : compute(p2) - y2;


double s = y1 * y2;


// Compute L and H according to equations (13) and (14) (Platt, 1998)
double L, H;
if (y1 != y2)
{
// If the target y1 does not equal the target (13)
// y2, then the following bounds apply to a2:
L = Math.Max(0, alph2 - alph1);
H = Math.Min(c, c + alph2 - alph1);
}
else
{
// If the target y1 does equal the target (14)
// y2, then the following bounds apply to a2:
L = Math.Max(0, alph2 + alph1 - c);
H = Math.Min(c, alph2 + alph1);
}

if (L == H) return false;

double k11, k22, k12, eta;
k11 = kernel.Function(p1, p1);
k12 = kernel.Function(p1, p2);
k22 = kernel.Function(p2, p2);
eta = k11 + k22 - 2.0 * k12;

double a1, a2;

if (eta > 0)
{
a2 = alph2 - y2 * (e2 - e1) / eta;

if (a2 < L) a2 = L;
else if (a2 > H) a2 = H;
}
else
{
// Compute objective function Lobj and Hobj at
// a2=L and a2=H respectively, using (eq. 19)

double L1 = alph1 + s * (alph2 - L);
double H1 = alph1 + s * (alph2 - H);
double f1 = y1 * (e1 + bias) - alph1 * k11 - s * alph2 * k12;
double f2 = y2 * (e2 + bias) - alph2 * k22 - s * alph1 * k12;
double Lobj = -0.5 * L1 * L1 * k11 - 0.5 * L * L * k22 - s * L * L1 * k12 - L1 * f1 - L * f2;
double Hobj = -0.5 * H1 * H1 * k11 - 0.5 * H * H * k22 - s * H * H1 * k12 - H1 * f1 - H * f2;

if (Lobj > Hobj + epsilon) a2 = L;
else if (Lobj < Hobj - epsilon) a2 = H;
else a2 = alph2;
}

if (Math.Abs(a2 - alph2) < epsilon * (a2 + alph2 + epsilon))
return false;

a1 = alph1 + s * (alph2 - a2);

if (a1 < 0)
{
a2 += s * a1;
a1 = 0;
}
else if (a1 > c)
{
double d = a1 - c;
a2 += s * d;
a1 = c;
}


// Update threshold (bias) to reflect change in Lagrange multipliers
double b1 = 0, b2 = 0;
double new_b = 0, delta_b;

if (a1 > 0 && a1 < c)
{
// a1 is not at bounds
new_b = e1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + bias;
}
else
{
if (a2 > 0 && a2 < c)
{
// a1 is at bounds but a2 is not.
new_b = e2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + bias;
}
else
{
// Both new Lagrange multipliers are at bound. SMO algorithm
// chooses the threshold to be halfway in between b1 and b2.
b1 = e1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + bias;
b2 = e2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + bias;
new_b = (b1 + b2) / 2;
}
}

delta_b = new_b - bias;
bias = new_b;


// Update error cache using new Lagrange multipliers
double t1 = y1 * (a1 - alph1);
double t2 = y2 * (a2 - alph2);

for (int i = 0; i < inputs.Length; i++)
{
if (0 < alpha[i] && alpha[i] < c)
{
double[] point = inputs[i];
errors[i] +=
t1 * kernel.Function(p1, point) +
t2 * kernel.Function(p2, point) -
delta_b;
}
}

errors[i1] = 0.0;
errors[i2] = 0.0;


// Update lagrange multipliers
alpha[i1] = a1;
alpha[i2] = a2;


return true;
}

/// <summary>
/// Computes the SVM output for a given point.
/// </summary>
///
private double compute(double[] point)
{
double sum = -bias;
for (int i = 0; i < inputs.Length; i++)
{
if (alpha[i] > 0)
sum += alpha[i] * outputs[i] * kernel.Function(inputs[i], point);
}

return sum;
}


private double computeComplexity()
{
// Compute initial value for C as the number of examples
// divided by the trace of the input sample kernel matrix.
double sum = 0.0;
for (int i = 0; i < inputs.Length; i++)
sum += kernel.Function(inputs[i], inputs[i]);
return inputs.Length / sum;
}
}


 



Using the code



In the following example, we will be training a Polynomial Kernel Support Vector Machine to recognize the XOR classification problem. The XOR function is a classic example of a pattern classification problem that is not linearly separable.






    double[][] inputs =
{
new double[] { -1, -1},
new double[] { -1, 1},
new double[] { 1, -1},
new double[] { 1, 1}
};

int[] xor = { -1, 1, 1, -1 };





Here, remember that the SVM is a margin classifier that classifies instances as either 1 or –1. So the training and expected output for the classification task should also be in this range. There are no such requirements for the inputs, though.
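If the class labels happen to be stored as 0/1 instead of -1/+1, they can be remapped before training. The tiny helper below is hypothetical and shown only for illustration; it is not part of the library:

// Hypothetical helper: maps {0, 1} class labels into the {-1, +1} convention
static int[] ToBipolar(int[] labels)
{
    int[] result = new int[labels.Length];
    for (int i = 0; i < labels.Length; i++)
        result[i] = (labels[i] == 1) ? +1 : -1;
    return result;
}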



To create the Kernel Support Vector Machine with a Polynomial Kernel, do:






    // Create Kernel Support Vector Machine with a Polynomial Kernel of 2nd degree
KernelSupportVectorMachine machine = new KernelSupportVectorMachine(
new Polynomial(2), inputs[0].Length);





After the machine has been created, create a new Learning algorithm. As we are going to do classification, we will be using the standard SequentialMinimalOptimization algorithm.






    // Create the sequential minimal optimization teacher
SequentialMinimalOptimization learn = new SequentialMinimalOptimization(
machine, inputs, xor);

// Run the learning algorithm
learn.Run();





After the model has been trained, we can compute its outputs for the given inputs.






    double[] output = machine.Compute(inputs);





The machine should be able to correctly identify all of the input instances.
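To check this, the sign of each computed output can be compared against the corresponding expected label. A short verification sketch, reusing the machine, inputs and xor variables from the snippets above:

    // The sign of each output should match the corresponding XOR label
double[] outputs = machine.Compute(inputs);

for (int i = 0; i < inputs.Length; i++)
{
    bool correct = System.Math.Sign(outputs[i]) == xor[i];
    System.Console.WriteLine("({0}, {1}) -> {2} ({3})",
        inputs[i][0], inputs[i][1], outputs[i], correct ? "ok" : "miss");
}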



 



Sample application



The sample application is able to perform both Classification and Regression using Support Vector Machines. It can read Excel spreadsheets and determines the task to be performed from the number of columns in the sheet. If the input table contains two columns (e.g. X and Y) it will be interpreted as a regression problem X –> Y. If the input table contains three columns (e.g. x1, x2 and Y) it will be interpreted as a classification problem <x1,x2> belongs to class Y, Y being either 1 or -1.



Classification



To perform classification, load classification task data such as the Yin Yang classification problem.






Yin Yang classification problem. The goal is to create a model which best determines whether a given point belongs to class blue or green. It is a clear example of a non-linearly separable problem.



 











Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.001 and T = 0.001.



 



Classification using the created Support Vector Machine. Notice it achieves an accuracy of 97%, with sensitivity and specificity rates of 98% and 96%, respectively.
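For reference, these three rates follow directly from the classifier's confusion matrix. The helper below is only an illustrative sketch of how they could be computed; it is independent of the sample application's own code:

/// <summary>
///   Illustrative computation of accuracy, sensitivity and specificity from
///   the true/false positive/negative counts of a confusion matrix.
/// </summary>
public static class ClassificationMetrics
{
    public static double Accuracy(int tp, int tn, int fp, int fn)
    {
        return (double)(tp + tn) / (tp + tn + fp + fn);
    }

    public static double Sensitivity(int tp, int fn)  // true positive rate
    {
        return (double)tp / (tp + fn);
    }

    public static double Specificity(int tn, int fp)  // true negative rate
    {
        return (double)tn / (tn + fp);
    }
}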



Regression



To perform regression, we can load the Gaussian noise sine wave example.






Noisy sine wave regression problem.



 











Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.2 and T = 0.001.



 



After the model has been created, we can plot the model approximation for the sine wave data. The blue line shows the curve approximation for the original red training dots.





Regression using the created Kernel Support Vector Machine. Notice the coefficient of determination r² of 0.95. The closer to one, the better.






See also








References



Tuesday, 13 April 2010

Java over-engineered?

The following gem, which belongs in a Java programming best-practices book, was found in the source of Sun's JDK.




public class StubFactoryFactoryProxyImpl extends StubFactoryFactoryDynamicBase
{
public PresentationManager.StubFactory makeDynamicStubFactory(
PresentationManager pm, PresentationManager.ClassData classData,
ClassLoader classLoader )
{
return new StubFactoryProxyImpl( classData, classLoader ) ;
}
}


Well, I must confess this is almost better than the simple Hello World in Java [cached copy].

Saturday, 10 April 2010

Partial Least Squares Analysis and Regression in C#


Partial Least Squares Regression (PLS) is a technique that generalizes and combines features from principal component analysis and (multivariate) multiple regression. It has been widely adopted in the fields of chemometrics and the social sciences.




The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents



  1. Introduction

  2. Overview

    1. Multivariate Linear Regression in Latent Space

    2. Algorithm 1: Nipals

    3. Algorithm 2: Simpls



  3. Source Code

    1. Class Diagram

    2. Performing PLS using Nipals

    3. Performing PLS using Simpls

    4. Multivariate Linear Regression



  4. Using the code

  5. Sample application

  6. See also

  7. References



Introduction



Partial least squares regression (PLS-regression) is a statistical method that bears some relation to principal components regression. Its goal is to find a linear regression model by projecting the predicted variables and the observable variables to new, latent variable spaces. It was developed in the 1960s by Herman Wold to be used in econometrics. Today it is most commonly used for regression in the field of chemometrics.



In statistics, latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.



A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS-regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases.



Overview



Multivariate Linear Regression in Latent Space




linear-regression align="right" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgpSwOCSCrn9hisiwj3hRJVEkbi8gN-qKWqoaiPciV9vtOecqK_DlHszpJKQEADSl7K444skcbvfVuAj1iKl1FuXI1tN6o8X4XSGNm46EQpRHZtHkPYJBTjVwWRkI18LVs9JxzJfv-HrmL/?imgmax=800"
width="150" height="127" />
Multiple Linear Regression is a generalization
of simple linear regression
for multiple inputs. In turn,
Multivariate Linear Regression
is a generalization of Multiple Linear Regression
for multiple outputs.



The multivariate linear regression is a general linear regression model which can
map an arbitrary dimension space into another arbitrary dimension space using only
linear relationships. In the context of PLS, it is used to map the latent variable
space for X into the latent variable space for Y.



Those latent variable spaces are spanned by the loading matrices for X and Y, commonly denoted P and Q, respectively. To compute those matrices, we can use two different algorithms: NIPALS and SIMPLS.



Algorithm 1: NIPALS



The following is one of the most common and efficient variants of the NIPALS algorithm. There are, however, many variations of the algorithm, which normalize or do not normalize certain vectors.



 




Algorithm:



  • Let X be the mean-centered input matrix,

  • Let Y be the mean-centered output matrix,

  • Let P be the loadings matrix for X, and let pi
    denote the i-th column of P;

  • Let Q be the loadings matrix for Y, and let qi
    denote the i-th column of Q;

  • Let T be the score matrix for X, and ti
    denote the i-th column of T;

  • Let U be the score matrix for Y, and ui
    denote the i-th column of U;

  • Let W be the PLS weight matrix, and let wi
    denote the i-th column of W; and

  • Let B be a diagonal matrix of diagonal coefficients bi



Then:



  1. For each factor i to be calculated:

    1. Initially choose ui as the largest column vector in
      Y (having the largest sum of squares)

    2. While (ti has not converged to a desired precision)

      1. wi ∝ X'ui    
        (estimate X weights)

      2. ti ∝ Xwi     
        (estimate X factor scores)

      3. qi ∝ Y'ti    
        (estimate Y weights)

      4. ui = Yqi     
        (estimate Y scores)



    3. bi = ti'ui       
      (compute prediction coefficient b)

    4. pi = X'ti       
      (estimate X factor loadings)

    5. X = X – ti pi'     (deflate X)

    6. Y = Y – bi ti qi'     (deflate Y)






 



For the predictor variables, the amount of variance explained by each factor can
be computed as the squared sum of the corresponding column of P, i.e. as sum(pi²).
For the outputs, it can be computed as bi².



Algorithm 2: SIMPLS



In SIMPLS, the components are derived by truly maximizing the covariance criterion.
Because the construction of the weight vectors used by SIMPLS is based on the empirical
variance–covariance matrix of the joint input and output variables, outliers present
in the data will severely impact its performance.



 




Algorithm:




  • Let X be the mean-centered input matrix,


  • Let Y be the mean-centered output matrix,


  • Let P be the loadings matrix for X, and let pi denote the i-th column of P;


  • Let C be the loadings matrix for Y, and let ci denote the i-th column of C;


  • Let T be the score matrix for X, and ti denote the i-th column of T;


  • Let U be the score matrix for Y, and ui denote the i-th column of U; and


  • Let W be the PLS weight matrix, and wi denote the i-th column of W.



Then:




  1. Create the covariance matrix C = X'Y


  2. For each factor i to be calculated:



    1. Perform SVD on the covariance matrix and store the first left singular vector in wi and the first right singular vector, scaled by the first singular value, in ci.


    2. ti ∝ X*wi           (estimate X factor scores)


    3. pi = X'*ti           (estimate X factor loadings)


    4. ci = ci/norm(ti)     (estimate Y weights)


    5. wi = wi/norm(ti)     (estimate X weights)


    6. ui = Y*ci           (estimate Y scores)


    7. vi = pi              (form the basis vector vi)


    8. Make v orthogonal to the previous loadings V


    9. Make u orthogonal to the previous scores T


    10. Deflate the covariance matrix C

      1. C = C - vi*(vi'*C)








 



Source Code



Here is the realization of the algorithms in C#. The models have also been organized in an object-oriented structure and are well suited for direct binding into Windows.Forms (or WPF) controls.



Class Diagram



target="_blank">
title="pls-diagram" border="0" alt="pls-diagram" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHiDUOG2ZbtyoIX-W9Ces_yRwq5mZx69Y55ut-T2wu5woCtae_afH20FbehSWWO8Fk1G8Xlu5S0LsjJ00Xisz9S35lX-4NEo3hXE-ljraayIZSESMscenimfL_CDAN1eydnR0PcQTNrpbw/?imgmax=800"
width="509" height="488" />



Class diagram for the Partial Least Squares Analysis.



Performing PLS using NIPALS



Source code for computing PLS using the nipals algorithm:





/// <summary>
/// Computes PLS parameters using NIPALS algorithm.
/// </summary>
///
private void nipals(double[,] X, double[,] Y, int factors, double tolerance)
{
// References:
// - Hervé Abdi, http://www.utdallas.edu/~herve/abdi-wireCS-PLS2010.pdf


// Initialize and prepare the data
int rows = sourceX.GetLength(0);
int xcols = sourceX.GetLength(1);
int ycols = sourceY.GetLength(1);


// Initialize storage variables
double[,] E = (double[,])X.Clone();
double[,] F = (double[,])Y.Clone();

double[,] T = new double[rows, factors];
double[,] U = new double[rows, factors];
double[,] P = new double[xcols, factors];
double[,] C = new double[ycols, factors];
double[,] W = new double[xcols, xcols];
double[] B = new double[xcols];

double[] varX = new double[factors];
double[] varY = new double[factors];


// Initialize the algorithm
bool stop = false;

#region NIPALS
for (int factor = 0; factor < factors && !stop; factor++)
{
// Select t as the largest column from X,
double[] t = E.GetColumn(largest(E));

// Select u as the largest column from Y.
double[] u = F.GetColumn(largest(F));

// Will store weights for X and Y
double[] w = new double[xcols];
double[] c = new double[ycols];


double norm_t = Norm.Euclidean(t);


#region Iteration
while (norm_t > 1e-14)
{
// Store initial t to check convergence
double[] t0 = (double[])t.Clone();


// Step 1. Estimate w (X weights): w ∝ E'*u
// (in Abdi's paper, X is referred as E).

// 1.1. Compute w = E'*u;
w = new double[xcols];
for (int j = 0; j < w.Length; j++)
for (int i = 0; i < u.Length; i++)
w[j] += E[i, j] * u[i];

// 1.2. Normalize w (w = w/norm(w))
w = w.Divide(Norm.Euclidean(w));


// Step 2. Estimate t (X factor scores): t ∝ E*w
// (in Abdi's paper, X is referred as E).

// 2.1. Compute t = E*w
t = new double[rows];
for (int i = 0; i < t.Length; i++)
for (int j = 0; j < w.Length; j++)
t[i] += E[i, j] * w[j];

// 2.2. Normalize t: t = t/norm(t)
t = t.Divide(norm_t = Norm.Euclidean(t));


// Step 3. Estimate c (Y weights): c ∝ F't
// (in Abdi's paper, Y is referred as F).

// 3.1. Compute c = F'*t0;
c = new double[ycols];
for (int j = 0; j < c.Length; j++)
for (int i = 0; i < t.Length; i++)
c[j] += F[i, j] * t[i];

// 3.2. Normalize c: c = c/norm(c)
c = c.Divide(Norm.Euclidean(c));


// Step 4. Estimate u (Y scores): u = F*q
// (in Abdi's paper, Y is referred as F).

// 4.1. Compute u = F*q;
u = new double[rows];
for (int i = 0; i < u.Length; i++)
for (int j = 0; j < c.Length; j++)
u[i] += F[i, j] * c[j];


// Recalculate norm of the difference
norm_t = 0.0;
for (int i = 0; i < t.Length; i++)
{
double d = (t0[i] - t[i]);
norm_t += d * d;
}

norm_t = Math.Sqrt(norm_t);
}
#endregion


// Compute the value of b which is used to
// predict Y from t as b = t'u [Abdi, 2010]
double b = t.InnerProduct(u);

// Compute factor loadings for X as p = E'*t [Abdi, 2010]
double[] p = new double[xcols];
for (int j = 0; j < p.Length; j++)
for (int i = 0; i < rows; i++)
p[j] += E[i, j] * t[i];

// Perform deflaction of X and Y
for (int i = 0; i < t.Length; i++)
{
// Deflate X as X = X - t*p';
for (int j = 0; j < p.Length; j++)
E[i, j] -= t[i] * p[j];

// Deflate Y as Y = Y - b*t*q';
for (int j = 0; j < c.Length; j++)
F[i, j] -= b * t[i] * c[j];
}


// Calculate explained variances
varY[factor] = b * b;
varX[factor] = p.InnerProduct(p);


// Save iteration results
T.SetColumn(factor, t);
P.SetColumn(factor, p);
U.SetColumn(factor, u);
C.SetColumn(factor, c);
W.SetColumn(factor, w);
B[factor] = b;


// Check for residuals as stop criteria
double[] norm_x = Norm.Euclidean(E);
double[] norm_y = Norm.Euclidean(F);

stop = true;
for (int i = 0; i < norm_x.Length && stop == true; i++)
{
// If any of the residuals is higher than the tolerance
if (norm_x[i] > tolerance || norm_y[i] > tolerance)
stop = false;
}
}
#endregion
}




Performing PLS using SIMPLS



Source code for computing PLS using the simpls algorithm:






/// <summary>
/// Computes PLS parameters using SIMPLS algorithm.
/// </summary>
///
private void simpls(double[,] X, double[,] Y, int factors)
{
// References:
// - Martin Anderson, "A comparison of nine PLS1 algorithms". Journal of Chemometrics,
// 2009. Available on: http://onlinelibrary.wiley.com/doi/10.1002/cem.1248/pdf
// - Hervé Abdi, http://www.utdallas.edu/~herve/abdi-wireCS-PLS2010.pdf
// - Statsoft, http://www.statsoft.com/textbook/partial-least-squares/#SIMPLS
// - Sijmen de Jong, "SIMPLS: an alternative approach to partial least squares regression"
// - N.M. Faber and J. Ferré, “On the numerical stability of two widely used PLS algorithms,”
// J. Chemometrics, 22, pps 101-105, 2008.


// Initialize and prepare the data
int rows = sourceX.GetLength(0);
int xcols = sourceX.GetLength(1);
int ycols = sourceY.GetLength(1);

// Initialize storage variables
double[,] P = new double[xcols, factors]; // loading matrix P, the loadings for X such that X = TP + F
double[,] C = new double[ycols, factors]; // loading matrix C, the loadings for Y such that Y = TC + E
double[,] T = new double[rows, factors]; // factor score matrix T
double[,] U = new double[rows, factors]; // factor score matrix U
double[,] W = new double[xcols, factors]; // weight matrix W

double[] varX = new double[factors];
double[] varY = new double[factors];

// Orthogonal loadings
double[,] V = new double[xcols, factors];


// Create covariance matrix C = X'Y
double[,] covariance = X.TransposeAndMultiply(Y);

#region SIMPLS
for (int factor = 0; factor < factors; factor++)
{

// Step 1. Obtain the dominant eigenvector w of C'C. However, we
// can avoid computing the matrix multiplication by using the
// singular value decomposition instead, which is also more
// stable. the first weight vector w is the left singular vector
// of C=X'Y [Abdi, 2007].

var svd = new SingularValueDecomposition(covariance,
computeLeftSingularVectors: true,
computeRightSingularVectors: false,
autoTranspose: true);

double[] w = svd.LeftSingularVectors.GetColumn(0);
double[] c = covariance.TransposeAndMultiply(w);


// Step 2. Estimate X factor scores: t ∝ X*w
// Similarly to NIPALS, the T factor of SIMPLS
// is computed as T=X*W [Statsoft] [Abdi, 2010].

// 2.1. Estimate t (X factor scores): t = X*w [Abdi, 2010]
double[] t = new double[rows];
for (int i = 0; i < t.Length; i++)
for (int j = 0; j < w.Length; j++)
t[i] += X[i, j] * w[j];

// 2.2. Normalize t (X factor scores): t = t/norm(t)
double norm_t = Norm.Euclidean(t);
t = t.Divide(norm_t);


// Step 3. Estimate p (X factor loadings): p = X'*t
double[] p = new double[xcols];
for (int i = 0; i < p.Length; i++)
for (int j = 0; j < t.Length; j++)
p[i] += X[j, i] * t[j];


// Step 4. Estimate X and Y weights. Actually, the weights have
// been computed in the first step during SVD. However, since
// the X factor scores have been normalized, we also have to
// normalize weights accordingly: w = w/norm(t), c = c/norm(t)
w = w.Divide(norm_t);
c = c.Divide(norm_t);


// Step 5. Estimate u (Y factor scores): u = Y*c [Abdi, 2010]
double[] u = new double[rows];
for (int i = 0; i < u.Length; i++)
for (int j = 0; j < c.Length; j++)
u[i] += Y[i, j] * c[j];


// Step 6. Initialize the orthogonal loadings
double[] v = (double[])p.Clone();


// Step 7. Make v orthogonal to the previous loadings
// http://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process

if (factor > 0)
{
// 7.1. MGS for v [Martin Anderson, 2009]
for (int j = 0; j < factor; j++)
{
double proj = 0.0;
for (int k = 0; k < v.Length; k++)
proj += v[k] * V[k, j];

for (int k = 0; k < v.Length; k++)
v[k] -= proj * V[k, j];
}

// 7.1. MGS for u [Martin Anderson, 2009]
for (int j = 0; j < factor; j++)
{
double proj = 0.0;
for (int k = 0; k < u.Length; k++)
proj += u[k] * T[k, j];

for (int k = 0; k < u.Length; k++)
u[k] -= proj * T[k, j];
}
}

// 7.2. Normalize orthogonal loadings
v = v.Divide(Norm.Euclidean(v));


// Step 8. Deflate covariance matrix as s = s - v * (v' * s)
// as shown in simpls1 in [Martin Anderson, 2009] appendix.
double[,] cov = (double[,])covariance.Clone();
for (int i = 0; i < v.Length; i++)
{
for (int j = 0; j < v.Length; j++)
{
double d = v[i] * v[j];

for (int k = 0; k < ycols; k++)
cov[i, k] -= d * covariance[j, k];
}
}
covariance = cov;


// Save iteration
W.SetColumn(factor, w);
U.SetColumn(factor, u);
C.SetColumn(factor, c);
T.SetColumn(factor, t);
P.SetColumn(factor, p);
V.SetColumn(factor, v);

// Compute explained variance
varX[factor] = p.InnerProduct(p);
varY[factor] = c.InnerProduct(c);
}
#endregion


// Set class variables
this.scoresX = T; // factor score matrix T
this.scoresY = U; // factor score matrix U
this.loadingsX = P; // loading matrix P, the loadings for X such that X = TP + F
this.loadingsY = C; // loading matrix Q, the loadings for Y such that Y = TQ + E
this.weights = W; // the columns of R are weight vectors
this.coeffbase = W;


// Calculate variance explained proportions
this.componentProportionX = new double[factors];
this.componentProportionY = new double[factors];

double sumX = 0.0, sumY = 0.0;
for (int i = 0; i < rows; i++)
{
// Sum of squares for matrix X
for (int j = 0; j < xcols; j++)
sumX += X[i, j] * X[i, j];

// Sum of squares for matrix Y
for (int j = 0; j < ycols; j++)
sumY += Y[i, j] * Y[i, j];
}

// Calculate variance proportions
for (int i = 0; i < factors; i++)
{
componentProportionY[i] = varY[i] / sumY;
componentProportionX[i] = varX[i] / sumX;
}

}



Multivariate Linear Regression



Multivariate Linear Regression is computed in a similar manner to Multiple Linear
Regression. The only difference is that, instead of having a weight vector and an
intercept, we have a weight matrix and an intercept vector.




    /// <summary> 
/// Computes the Multivariate Linear Regression for an input vector.
/// </summary>
/// <param name="input">The input vector.</param>
/// <returns>The calculated output.</returns>
public double[] Compute(double[] input)
{
int N = input.Length;
int M = coefficients.GetLength(1);

double[] result = new double[M];
for (int i = 0; i < M; i++)
{
result[i] = intercepts[i];

for (int j = 0; j < N; j++)
result[i] += input[j] * coefficients[j, i];
}

return result;
}



The weight matrix and the intercept vector are computed in the PartialLeastSquaresAnalysis
class by the CreateRegression method. In case the analyzed data was already
mean centered before being fed to the analysis, the constructed intercept vector
will consist only of zeros.




    /// <summary> 
/// Creates a Multivariate Linear Regression model using
/// coefficients obtained by the Partial Least Squares.
/// </summary>
public MultivariateLinearRegression CreateRegression(int factors)
{
int rows = sourceX.GetLength(0);
int xcols = sourceX.GetLength(1);
int ycols = sourceY.GetLength(1);

// Compute regression coefficients B of Y on X as B = RQ'
double[,] B = new double[xcols, ycols];
for (int i = 0; i < xcols; i++)
for (int j = 0; j < ycols; j++)
for (int k = 0; k < factors; k++)
B[i, j] += coeffbase[i, k] * loadingsY[j, k];

// Divide by standard deviation if X has been normalized
if (analysisMethod == AnalysisMethod.Correlation)
for (int i = 0; i < xcols; i++)
for (int j = 0; j < ycols; j++)
B[i, j] = B[i, j] / stdDevX[i];

// Compute regression intercepts A as A = meanY - meanX'*B
double[] A = new double[ycols];
for (int i = 0; i < ycols; i++)
{
double sum = 0.0;
for (int j = 0; j < xcols; j++)
sum += meanX[j] * B[j, i];
A[i] = meanY[i] - sum;
}

return new MultivariateLinearRegression(B, A, true);
}


 



Using the code



As an example, let's consider the example data from Hervé Abdi, where the goal is to predict the subjective evaluation of a set of 5 wines. The dependent variables that we want to predict for each wine are its likeability, and how well it goes with meat or dessert (as rated by a panel of experts). The predictors are the price, the sugar, alcohol, and acidity content of each wine.



    double[,] dependent = { 
// Wine | Hedonic | Goes with meat | Goes with dessert
{ 14, 7, 8 },
{ 10, 7, 6 },
{ 8, 5, 5 },
{ 2, 4, 7 },
{ 6, 2, 4 },
};

double[,] predictors = {
// Wine | Price | Sugar | Alcohol | Acidity
{ 7, 7, 13, 7 },
{ 4, 3, 14, 7 },
{ 10, 5, 12, 5 },
{ 16, 7, 11, 3 },
{ 13, 3, 10, 3 },
};


Next, we proceed to create the Partial Least Squares Analysis using the Covariance
method (data will only be mean centered but not normalized) and using the SIMPLS
algorithm.



    // Create the analysis using the covariance method and the SIMPLS algorithm

PartialLeastSquaresAnalysis pls = new PartialLeastSquaresAnalysis(predictors, dependent,
AnalysisMethod.Covariance, PartialLeastSquaresAlgorithm.SIMPLS);

// Compute the analysis
pls.Compute();

After the analysis has been computed, we can proceed and create the regression model.

    // Create the Multivariate Linear Regression model
MultivariateLinearRegression regression = pls.CreateRegression();

// Compute the regression outputs for the predictor variables
double[][] aY = regression.Compute(predictors.ToArray());


Now that the regression has been computed, we can find out how well it has performed. The coefficient of determination r² for the variables Hedonic, Goes with Meat and Goes with Dessert can be computed by the CoefficientOfDetermination method of the MultivariateLinearRegression class and will be, respectively, 0.9999, 0.9999 and 0.8750 - the closer to one, the better.
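As a reminder of what this number measures, the coefficient of determination for a single output can be obtained as 1 - SSres/SStot. The helper below is only an illustrative stand-alone sketch; it is not the CoefficientOfDetermination method of the library itself:

/// <summary>
///   Illustrative r² computation for a single output variable:
///   r² = 1 - (residual sum of squares) / (total sum of squares).
/// </summary>
public static double RSquared(double[] expected, double[] predicted)
{
    double mean = 0.0;
    for (int i = 0; i < expected.Length; i++)
        mean += expected[i];
    mean /= expected.Length;

    double ssRes = 0.0, ssTot = 0.0;
    for (int i = 0; i < expected.Length; i++)
    {
        double residual = expected[i] - predicted[i];
        double total = expected[i] - mean;
        ssRes += residual * residual;
        ssTot += total * total;
    }

    return 1.0 - ssRes / ssTot;
}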



 



Sample application



The accompanying sample application performs Partial Least Squares Analysis and
Regression in Excel worksheets. The predictors and dependent variables can be selected
once the data has been loaded in the application.










target="_blank">
alt="Wine example from Hervé Abdi" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXAGB_e3lYChScdFcdGnBYwyqpNsjlDMc5kwUW55m46bM-9svPnBuiY8YHKrImOUOXEmmkuvavGozBgZS34rOYKe5z72h9Ed6kkKinPhjSZYA3T3XihiKBoj7d5s645Qtr4GRyqci2aq1s/?imgmax=800"
width="299" />



border="0" alt="Variance explained by PLS using the SIMPLS algorithm" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0arTljU5KQGirNJTG6DO9EBlSTwjYz2bYLsHZS_9aUWwPcuZ6f3TtifKRVoip6deKVHE-MHVYJnm2HSDCU8L6_WMLBh81S0c_MiMD2gNBrCos8gW-kYYozzcjZpKB49XxuXo2qJB8pcMG/?imgmax=800"
width="299" />



Left: Wine example from Hervé Abdi. Right: Variance explained by PLS using the SIMPLS
algorithm



 










target="_blank">
border="0" alt="Partial Least Squares Analysis results and regression coefficients for the full regression model"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-hqNVtNaS5QaxBbas90It89yFhbHeTtMzeiDWSSs3kTdp1z_Ez1MxQT9n5QNMlieou1vHVd1YDEKgcUsS9Xg1EsOSmWTpPSXFWM3YO9Vy0anCrf-phiROPmubwBbiYWtFpb_KTV6EfqaX/?imgmax=800"
width="299" />


target="_blank">
border="0" alt="Projection of the dependent and predictors variables using the three first factors"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWN5PoNBFcdltKxSBsoeELBRZV9WVUFtfUhopfJ4t3rpD617AZjdc9t9OUA8VgBNAp5d_0L6lwx2LRqUfMj9bjQFF6k-YyeZzMJc1puFl8K37GozwUW1X0n4HVNxcyy8VxsfnQqt1nRzqh/?imgmax=800"
width="299" />



Left: Partial Least Squares Analysis results and regression coefficients for the
full regression model. Right: projection of the dependent and predictors variables
using the three first factors.



 



target="_blank">
border="0" alt="Results from the Multivariate Linear Regression performed in Latent Space using three factors from PLS"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh971MdN2jqOokTOTuG2cZf6KBM7SDRZomSrqZ1DltFBiEa6d7Bb8VTAgFRdUArKJghlCCjO-kiBEscDKH6W5rCVb8KSXYy00WDaAVsiw8cNfH7W0XQu5MUIXiDtkxQtpjtc5whGY-o76lw/?imgmax=800"
width="605" />



Results from the Multivariate Linear Regression performed in Latent Space using
three factors from PLS.



 



 










target="_blank">
alt="Example data from Geladi and Kowalski in the PLS sample application" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4NNWX_xEBDJg2x7gjl9ELza1rzvXmNqb3_Q_f_g0WRbbF5yTFKiHCU84gP_brW4HhJNT4a-Grh0JDNa-PNjgFm7pHCooBW7DfSYnmzpmlu8AYe5i-_wTzeNNGR6Use0qXoEIUWF43RcBN/?imgmax=800"
width="299" />


target="_blank">
alt="Analysis results for the data using NIPALS. We can see that just two factors are enough to explain the whole variance of the set."
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBtHtylO66oYdai8xvyKEzxdqb4C9xxuDDWRr8sXcqIbOMryOTJD9hyphenhyphenbsTL96VifCigtlwnqfBqQa-RVXufHe4HBhl3qIhcP0sTXcyEIjp8USoNunzon421-F4J_AZEv8n0iHyAX13jYHR/?imgmax=800"
width="299" />



Left: Example data from Geladi and Kowalski. Right: Analysis results for the data using NIPALS. We can see that just two factors are enough to explain the whole variance of the set.



 










target="_blank">
Projections to the latent spaces detected by the Partial Least Squares Analysis src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoZSNbSYOoaWmfxPw8buWy_qB8VL3FLMgb0WJ-OH5Wt8fG5rudrE3vVkUPKtn2DjPr_1GJQa9eSB7RISkpughg6TcMGlrCp-u7oiO6WBluqQc8ZZo_PndeY0sWfyIUTxzQq6xF4YtmnG8B/?imgmax=800"
width="299" />


target="_blank">
border="0" alt="Loadings and coefficients for the two factor Multivariate Linear Regression (MLR) model discovered by Partial Least Squares (PLS)"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRIfcPO8TnNpavJg2OdLHvRq_eE9j9fSjG-KANHZS7b3wYQVQTM29_EOzj9jPOhhhtxxSO12qqxWPPw8B1QsJCgz9oOH4z5ZvHkE2uM5w73yq2UwI8E7kuVbFASIY1aY5TOri1diz_VbMT/?imgmax=800"
width="299" />



Left: Projections to the latent spaces. Right: Loadings and coefficients for the
two factor Multivariate Linear Regression model.



target="_blank">
border="0" alt="Results from the Partial Least Squares Regression showing a perfect fit using only the first two factors"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjahgSDSVTOe0BkkXgn7E55QVKiLJrlnzsJemfu-1csiZ8Z1fKSTDWUaRCJbRTB7Xev8DY798P-LThyphenhyphenDoROl1DkeNON3o1UhYRohJOF7t4Zay0aa6-UtlKhNbOxExiqC357OQKfWd9q3M_r/?imgmax=800"
width="605" />



Results from the Partial Least Squares Regression showing a perfect fit using only
the first two factors.



 



See also




 



References