Sunday, 27 June 2010

Generalized Eigenvalue Decomposition in C#

For quite some time I have been trying to find a Java or C# implementation of the Generalized Eigenvalue Decomposition. I just wanted something simple, that didn’t depend on any external libraries such as BLAS or LAPACK. As I could not find any simple and understandable implementation of this decomposition, I decided to implement one myself.





Introduction


As Wikipedia says, a generalized Eigen value problem is the problem of finding a vector v that obeys

 A\mathbf{v} = \lambda B \mathbf{v} \quad \quad

where A and B are matrices. If v obeys this equation, with some λ, then we call v the generalized eigenvector of A and B, and λ is called the generalized eigenvalue of A and B which corresponds to the generalized eigenvector v. The possible values of λ must obey the following equation

\det(A - \lambda B)=0.\,

An interesting feature of the generalized eigenvalue decomposition is that it finds the eigenvectors of the matrix B-1A even if B is singular and does not have an inverse. If B is nonsingular, the problem could be solved by reducing it to a standard eigenvalue problem of the form B-1Ax=λx. However, because B can be singular, an alternative algorithm, called the QZ method, is necessary.



Solving the problem using the QZ method is equivalent to computing the Eigen decomposition of B-1A without the need of inverting B, which could be impossible or ill-conditioned if B is singular or near-singular. It is also computationally simpler as it saves the time and memory needed for inverting and storing the inverse of B.




EISPACK


EISPACK is a software library for numerical computation of eigenvalues and eigenvectors of matrices, written in FORTRAN. It was originally written around 1972–1973, based heavily on algorithms originally implemented in ALGOL. Because those algorithms are much more human-readable than their LAPACK counterparts, I decided to port the related functions from EISPACK.



The functions which implement the Generalized Eigenvalue Decomposition in EISPACK are called QZHES, QZIT, QZVAL and QZVEC. As their name implies, they use the QZ method for finding the generalized eigenvalues of a matrix pair (A,B).



Source code

The source code is available in the download link in the upper part of this article. It can also be found in the latest versions of the Accord.NET Framework. The algorithm presented in the code is equivalent to the eig(A,B) command of MATLAB.

Because the code has been ported from FORTRAN, many sections of the algorithms have stayed as they appear in their original functions. Please be aware that it may contain lots of labels and goto's, as most structured programming features we take for granted today were not commonplace at the time.





Using the code

Using the code is much simpler than dealing with LAPACK functions. Because the functions have been implemented directly in C#, the code is somewhat more human readable and has also been wrapped in the same nice object-oriented approach as the other matrix decompositions that can be found in MAPACK and JAMA.

Using the given sources, the following MATLAB code:




     [V,D] = eig(A,B)


Becomes equivalent to the C# code:




     var gevd = new GeneralizedEigenvalueDecomposition(A,B);
     var V = gevd.Eigenvectors;
     var D = gevd.DiagonalMatrix;


in the sense that both satisfy the identity A*V = B*V*D .



References


Wednesday, 2 June 2010

RANdom Sample Consensus (RANSAC) in C#

RANSAC is an iterative method to build robust estimates for parameters of a mathematical model from a set of observed data which is known to contain outliers. The RANSAC algorithm is often used in computer vision, e.g., to simultaneously solve the correspondence problem and estimate the fundamental matrix related to a pair of stereo cameras.




This code has also been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

Introduction

RANSAC is an abbreviation for "RANdom SAmple Consensus". It is an iterative method to estimate parameters of a mathematical model from a set of observed data which may contains outliers. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles in 1981.

The basic assumption is that the data consists of "inliers", i.e., data whose distribution can be explained by some mathematical model, and "outliers" which are data that do not fit the model. Outliers could be considered points which come from noise, erroneous measurements or simply incorrect data. RANSAC also assumes that, given a set of inliers, there exists a procedure which can estimate the parameters of a model that optimally explains or fits this data.

Example: Fitting a simple linear regression

We can use RANSAC to robustly fit a linear regression model using noisy data. Consider the example below, in which we have a cloud of points that seems to belong to a line. These are the inliers of the data. The other points, which can be seem as measurement errors or extreme noise values, are points expected to be considered outliers.

ransac-7

Linear structure contained in noisy data.

RANSAC is able to automatically distinguish the inliers from the outliers through the evaluation of the linear regression model. To do so, it randomly selects subsets from the data and attempts to fit linear regression models using them. The model which best explains most of the data will then be returned as the most probably correct model fit.

The image below shows the result of fitting a linear regression directly (as shown by the red line) and using RANSAC (as shown by the blue line). We can see that the red line represents poorly the data structure because it considers all points in order to fit the regression model. The blue line seems to be a much better representation of the linear relationship hidden inside the overall noisy data.

ransac-8

Hidden linear structure inside noisy data. The red line shows the fitting of a linear regression model directly considering all data points. The blue line shows the same result using RANSAC.

Source code

The code below implements RANSAC using a generic approach. Models are considered to be of the reference type TModel and the type of data manipulated by this model is considered to be of the type TData. This approach allows for the creation of a general purpose RANSAC algorithm which can be used in very different contexts, be it the fitting of linear regression models or the estimation of homography matrices from pair of points in different images.





    /// <summary>
/// Computes the model using the RANSAC algorithm.
/// </summary>
public TModel Compute(TData[] points, out int[] inliers)
{
// We are going to find the best model (which fits
// the maximum number of inlier points as possible).
TModel bestModel = null;
int[] bestInliers = null;
int maxInliers = 0;

// For this we are going to search for random samples
// of the original points which contains no outliers.

int count = 0; // Total number of trials performed
double N = maxEvaluations; // Estimative of number of trials needed.

// While the number of trials is less than our estimative,
// and we have not surpassed the maximum number of trials
while (count < N && count < maxEvaluations)
{
int[] idx;
TModel model = null;
int samplings = 0;

// While the number of samples attempted is less
// than the maximum limit of attempts
while (samplings < maxSamplings)
{
// Select at random s datapoints to form a trial model.
idx = Statistics.Tools.Random(points.Length, s);
TData[] sample = points.Submatrix(idx);

// If the sampled points are not in a degenerate configuration,
if (!degenerate(sample))
{
// Fit model using the random selection of points
model = fitting(sample);
break; // Exit the while loop.
}

samplings++; // Increase the samplings counter
}

// Now, evaluate the distances between total points and the model returning the
// indices of the points that are inliers (according to a distance threshold t).
idx = distances(model, points, t);

// Check if the model was the model which highest number of inliers:
if (idx.Length > maxInliers)
{
// Yes, this model has the highest number of inliers.

maxInliers = idx.Length; // Set the new maximum,
bestModel = model; // This is the best model found so far,
bestInliers = idx; // Store the indices of the current inliers.

// Update estimate of N, the number of trials to ensure we pick,
// with probability p, a data set with no outliers.
double pInlier = (double)idx.Length / (double)points.Length;
double pNoOutliers = 1.0 - System.Math.Pow(pInlier, s);

N = System.Math.Log(1.0 - probability) / System.Math.Log(pNoOutliers);
}

count++; // Increase the trial counter.
}

inliers = bestInliers;
return bestModel;
}




Besides the generic parameters, the class utilizes three delegated functions during execution.



  • The Fitting function, which should accept a subset of the data and use it to fit a model of the chosen type, which should be returned by the function;

  • The Degenerate function, which should check if a subset of the training data is already known to result in a poor model, to avoid unnecessary computations; and

  • The Distance function, which should accept a model and a subset of the training data to compute the distance between the model prediction and the expected value for a given point. It should return the indices of the points only whose predicted and expected values are within a given threshold of tolerance apart.



Using the code


In the following example, we will fit a simple linear regression of the form x→y using RANSAC. The first step is to create a RANSAC algorithm passing the generic type parameters of the model to be build, i.e. SimpleLinearRegression and of the data to be fitted, i.e. a double array.


In this case we will be using a double array because the first position will hold the values for the input variable x. The second position will be holding the values for the output variables y. If you are already using .NET 4 it is possible to use the Tuple type instead.





    // Create a RANSAC algorithm to fit a simple linear regression
var ransac = new RANSAC<SimpleLinearRegression, double[]>(minSamples);
ransac.Probability = probability;
ransac.Threshold = errorThreshold;
ransac.MaxEvaluations = maxTrials;




After the creation of the RANSAC algorithm, we should set the delegate functions which will tell RANSAC how to fit a model, how to tell if a set of samples is degenerate and how to check for inliers in data.





    // Set the RANSAC functions to evaluate and test the model

ransac.Fitting = // Define a fitting function
delegate(double[][] sample)
{
// Retrieve the training data
double[] inputs = sample.GetColumn(0);
double[] outputs = sample.GetColumn(1);

// Build a Simple Linear Regression model
var r = new SimpleLinearRegression();
r.Regress(inputs, outputs);
return r;
};

ransac.Degenerate = // Define a check for degenerate samples
delegate(double[][] sample)
{
// In this case, we will not be performing such checkings.
return false;
};

ransac.Distances = // Define a inlier detector function
delegate(SimpleLinearRegression r, double[][] sample, double threshold)
{
List<int> inliers = new List<int>();
for (int i = 0; i < sample.Length; i++)
{
// Compute error for each point
double input = sample[i][0];
double output = sample[i][1];
double error = r.Compute(input) - output;

// If the squared error is below the given threshold,
// the point is considered to be an inlier.
if (error * error < threshold)
inliers.Add(i);
}
return inliers.ToArray();
};




Finally, all we have to do is call the Compute method passing the data. The best model found will be returned by the function, while the given set of inliers indices for this model will be returned as an out parameter.




    // Finally, try to fit the regression model using RANSAC
int[] idx; SimpleLinearRegression rlr = ransac.Compute(data, out idx);


Sample application


The accompanying source application demonstrates the fitting of the simple linear regression model with and without using RANSAC. The application accepts Excel worksheets containing the independent values in the first column and the dependent variables in the second column.



ransac-9



 



References