Saturday 18 February 2012

Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) for Unconstrained Optimization in C#

The Limited-memory Broyden-Fletcher-Goldfarb-Shanno method is an optimization method belonging to the family of quasi-Newton methods for unconstrained non-linear optimization. In short terms, it is an off-the-shelf optimizer for seeking either minimum or maximum points of a any differentiable and possibly non-linear function, requiring only an expression of the function and its gradient. The goal of this article is to provide and demonstrate a C# port of the L-BFGS method originally written by Jorge Nocedal in Fortran.

Introduction

Function optimization is a common problem found in many numerical applications. Suppose we have a differentiable function f : Rn → R and we would like to obtain its minimal or maximum value while traversing its space of possible input values. Those kind of problems arise often in parameter optimization of machine learning models and other statistic related applications (and in a myriad of other applications too – however, we will pay special attention to cases related to machine learning).

The problem of maximizing or minimizing a function can be simply stated as max f(x) or min f(x). When there aren’t any constraints limiting possible values for x, it is widely known from calculus that the maximum or a minimum of such a function would occur when the first derivatives f’(x) of the function f(x) in respect to x are zero. Newton’s method is a common method for finding the roots of a differentiable function and is indeed a suitable method for finding those values such that first derivatives of a function are zero.

 






Surface of the example function. Contour lines of the example function.


Example of a convex, but non-linear function f(x,y) = exp{-(x-1)²} + exp{-(y-2)²/2}. Images have been created using Wolfram Alpha.

 

However, while Newton’s method may work very well with functions of a single variable, the generalization to higher dimensions will require the replacement of the first derivative f’(x) with the function's gradient vector g of partial derivatives, and the second derivative f’’(x) with the inverse Hessian matrix H-1. The problem is that the Hessian is a square matrix with as many rows and columns as parameters of our function f, and, besides being very costly and challenging to compute, even its size may be infeasible to accommodate in main memory when dealing with very high dimensional problems.

To overcome those limitations, several methods attempt to replace the Hessian matrix with an approximation extracted from the first partial derivatives alone. Those methods are called quasi-Newton methods, as they replace H in Newton’s method by an approximation instead of using the full Hessian. In case the function being optimized has any particular form or special characteristic which could be exploited, it is also possible to use more specialized methods such as the Gauss-Newton, Levenberg-Marquardt or even lower-bound approximation methods to avoid computing the full Hessian.

The quasi-Newton BFGS method for non-linear optimization builds an approximation for the Hessian matrix based on estimates extracted solely from first order information. However, even being cheaper to compute, the memory requirements are still high as the method still needs to accommodate a dense N x N matrix for the inverse Hessian approximation (where N is the number of variables for the function). To overcome this problem, the L-BFGS method, or the limited-memory version of the BFGS method, never forms or stores the approximation matrix, but only a few vectors which can be used to represent it implicitly.

Source code

The code presented here is based upon a direct translation of the original code by Jorge Nocedal. His original code was released under the permissive BSD license, and honoring the original license and the author's considerations, this code is released under the BSD as well.



L-BFGS Class diagrams



The L-BFGS method is implemented within the BroydenFletcherGoldfarbShanno class. Upon object creation, it is possible to specify a function and its gradient through either the use of Lambda expressions or by specifying handles for named functions. After the object has been initialized, a simple call to Minimize runs the standard L-BFGS method until convergence. The details for the Minimize method is given below.

    /// <summary>
/// Optimizes the defined function.
/// </summary>
///
/// <param name="values">The initial guess values for the parameters.</param>
/// <returns>The values of the parameters which optimizes the function.</returns>
///
public unsafe double Minimize(double[] values)
{
if (values == null)
throw new ArgumentNullException("values");

if (values.Length != numberOfVariables)
throw new DimensionMismatchException("values");

if (Function == null) throw new InvalidOperationException(
"The function to be minimized has not been defined.");

if (Gradient == null) throw new InvalidOperationException(
"The gradient function has not been defined.");


// Initialization
x = (double[])values.Clone();
int n = numberOfVariables, m = corrections;


// Make initial evaluation
f = getFunction(x);
g = getGradient(x);

this.iterations = 0;
this.evaluations = 1;


// Obtain initial Hessian
double[] diagonal = null;

if (Diagonal != null)
{
diagonal = getDiagonal();
}
else
{
diagonal = new double[n];
for (int i = 0; i < diagonal.Length; i++)
diagonal[i] = 1.0;
}


fixed (double* w = work)
{
// The first N locations of the work vector are used to
// store the gradient and other temporary information.

double* rho = &w[n]; // Stores the scalars rho.
double* alpha = &w[n + m]; // Stores the alphas in computation of H*g.
double* steps = &w[n + 2 * m]; // Stores the last M search steps.
double* delta = &w[n + 2 * m + n * m]; // Stores the last M gradient diferences.


// Initialize work vector
for (int i = 0; i < g.Length; i++)
steps[i] = -g[i] * diagonal[i];

// Initialize statistics
double gnorm = Norm.Euclidean(g);
double xnorm = Norm.Euclidean(x);
double stp = 1.0 / gnorm;
double stp1 = stp;

// Initialize loop
int nfev, point = 0;
int npt = 0, cp = 0;
bool finish = false;

// Make initial progress report with initialization parameters
if (Progress != null) Progress(this, new OptimizationProgressEventArgs
(iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish));


// Start main
while (!finish)
{
iterations++;
double bound = iterations - 1;

if (iterations != 1)
{
if (iterations > m)
bound = m;

double ys = 0;
for (int i = 0; i < n; i++)
ys += delta[npt + i] * steps[npt + i];

// Compute the diagonal of the Hessian
// or use an approximation by the user.

if (Diagonal != null)
{
diagonal = getDiagonal();
}
else
{
double yy = 0;
for (int i = 0; i < n; i++)
yy += delta[npt + i] * delta[npt + i];
double d = ys / yy;

for (int i = 0; i < n; i++)
diagonal[i] = d;
}


// Compute -H*g using the formula given in:
// Nocedal, J. 1980, "Updating quasi-Newton matrices with limited storage",
// Mathematics of Computation, Vol.24, No.151, pp. 773-782.

cp = (point == 0) ? m : point;
rho[cp - 1] = 1.0 / ys;
for (int i = 0; i < n; i++)
w[i] = -g[i];

cp = point;
for (int i = 1; i <= bound; i += 1)
{
if (--cp == -1) cp = m - 1;

double sq = 0;
for (int j = 0; j < n; j++)
sq += steps[cp * n + j] * w[j];

double beta = alpha[cp] = rho[cp] * sq;
for (int j = 0; j < n; j++)
w[j] -= beta * delta[cp * n + j];
}

for (int i = 0; i < diagonal.Length; i++)
w[i] *= diagonal[i];

for (int i = 1; i <= bound; i += 1)
{
double yr = 0;
for (int j = 0; j < n; j++)
yr += delta[cp * n + j] * w[j];

double beta = alpha[cp] - rho[cp] * yr;
for (int j = 0; j < n; j++)
w[j] += beta * steps[cp * n + j];

if (++cp == m) cp = 0;
}

npt = point * n;

// Store the search direction
for (int i = 0; i < n; i++)
steps[npt + i] = w[i];

stp = 1;
}

// Save original gradient
for (int i = 0; i < g.Length; i++)
w[i] = g[i];


// Obtain the one-dimensional minimizer of f by computing a line search
mcsrch(x, ref f, ref g, &steps[point * n], ref stp, out nfev, diagonal);

// Register evaluations
evaluations += nfev;

// Compute the new step and
// new gradient differences
for (int i = 0; i < g.Length; i++)
{
steps[npt + i] *= stp;
delta[npt + i] = g[i] - w[i];
}

if (++point == m) point = 0;


// Check for termination
gnorm = Norm.Euclidean(g);
xnorm = Norm.Euclidean(x);
xnorm = Math.Max(1.0, xnorm);

if (gnorm / xnorm <= tolerance)
finish = true;

if (Progress != null) Progress(this, new OptimizationProgressEventArgs
(iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish));
}
}

return f; // return the minimum value found (at solution x)
}


The version of the code detailed here only supports Minimization problems. However, it is possible to note that any minimization problem can be converted into a maximization problem and vice-versa by taking the opposite of the function and its gradient.



Using the code



Code usage is rather simple. Suppose we would like to maximize the function g(x,y) = exp{-(x-1)²} + exp{-(y-2)²/2}:





Since we would like to perform a maximization, we first have to convert it to a minimization problem. The minimization version of the function is simply given by taking f(x,y) = –g(x,y):





Which is a convex function, as can be seen by plotting its surface. As it can be seen, the minimum of the function lies on the point (x,y) = (1,2). As expected, this point coincides with the roots of the partial derivative functions, as shown in the line plots below:



 



Example optimization cast as an minimization problem.



The minimization problem min f(x,y) = -(exp(-(x-1)²) + exp(-(y-2)²/2)), computed by Wolfram Alpha.











Partial derivative for x. Partial derivative for y.


The roots of the partial derivatives in respective to x (left) and y (right). It is possible to see that the roots occur at 1 and 2, respectively. Images computed using Wolfram Alpha. 



 



The first step in solving this optimization problem automatically is first to specify the target function. The target function f, in this case, can be specified through a lambda function in the form:




    Func<double[], double> f =
(x) => -Math.Exp(-Math.Pow(x[0] - 1, 2)) - Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2));


The next step is to specify the gradient g for the function f. In this particular case, we can manually compute the partial derivatives to be df/dx = -2 e^(-(x-1)^2) (x-1) and df/dy = -e^(-1/2 (y-2)^2) (y-2), in respect to x and y, respectively. Writing a lambda function to compute the gradient vector g, we have:



    Func<double[], double[]> g = (x) => new double[] 
{
2 * Math.Exp(-Math.Pow(x[0] - 1, 2)) * (x[0] - 1),
Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2)) * (x[1] - 2)
};


We can also note that this example was rather simple, so the gradient vector was easy to calculate. However, the gradient could also have been computed automatically using Accord.NET's FiniteDifferences class. In either case, all we have to do now is to create our L-BFGS solver and call its Minimize() method to begin optimization:




    // Create the L-BFGS solver
BroydenFletcherGoldfarbShanno lbfgs =
new BroydenFletcherGoldfarbShanno(numberOfVariables: 2, function: f, gradient: g);

// Minimize the function
double minValue = lbfgs.Minimize(); // should be -2


After the optimization is complete, the solution parameter will be available in the Solution property of the lbfgs object, and will be equal to { 1.0, 2.0 }. And also in case one is interested in progress reports (such as in the case of optimizing very large functions), it is also possible to register a listener to the Progress event handler. The complete version of the sample application accompanying the source code is given below:




    // Suppose we would like to find the minimum of the function
//
// f(x,y) = -exp{-(x-1)²} - exp{-(y-2)²/2}
//

// First we need write down the function either as a named
// method, an anonymous method or as a lambda function:

Func<double[], double> f = (x) =>
-Math.Exp(-Math.Pow(x[0] - 1, 2)) - Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2));

// Now, we need to write its gradient, which is just the
// vector of first partial derivatives del_f / del_x, as:
//
// g(x,y) = { del f / del x, del f / del y }
//

Func<double[], double[]> g = (x) => new double[]
{
// df/dx = {-2 e^(- (x-1)^2) (x-1)}
2 * Math.Exp(-Math.Pow(x[0] - 1, 2)) * (x[0] - 1),

// df/dy = {- e^(-1/2 (y-2)^2) (y-2)}
Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2)) * (x[1] - 2)
};

Console.WriteLine("Solving:");
Console.WriteLine();
Console.WriteLine(" min f(x,y) = -exp{-(x-1)²} - exp{-(y-2)²/2}");
Console.WriteLine();

// Finally, we can create the L-BFGS solver, passing the functions as arguments
var lbfgs = new BroydenFletcherGoldfarbShanno(numberOfVariables: 2, function: f, gradient: g);

// And then minimize the function:
double minValue = lbfgs.Minimize();
double[] solution = lbfgs.Solution;

// The resultant minimum value should be -2, and the solution
// vector should be { 1.0, 2.0 }. The answer can be checked on
// Wolfram Alpha by clicking the following the link:

// http://www.wolframalpha.com/input/?i=maximize+%28exp%28-%28x-1%29%C2%B2%29+%2B+exp%28-%28y-2%29%C2%B2%2F2%29%29

Console.WriteLine("The minimum value of {0:0} occurs at the solution point ({1:0},{2:0})",
minValue, solution[0], solution[1]);



Conclusion



In this post, we showed how to use a reduced set of the Accord.NET Framework's to perform non-linear optimization. This routine is also used by the Conditional Random Fields and Hidden Conditional Random Fields trainers to optimize parameters for such models. The solver routines have been adapted from the original Fortran's source code from Nocedal, which, tracing back to a 2001 message from the author, also have been reported to be available under the public domain. Nevertheless, the reduced set of the framework available for download within this post is available under a BSD license, as an alternative to the version available within the Accord.NET Framework which is available only under a LGPL license.



As always, I hope someone will find it useful :-)



References



No comments:

Post a Comment