From: spellucci@mathematik.tu-darmstadt.de (Peter Spellucci)
Subject: Re: nonlinear least squares (levenberg-marquardt)
Date: 16 Mar 2001 17:16:31 GMT
Newsgroups: sci.math.num-analysis
Summary: How is the Levenberg-Marquardt process used for fitting to data?

In article <98ropv01t59@news1.newsguy.com>, student@maths.edu (student) writes:
|> You say not to use normal equations for least squares, yet
|> LM requires normal equations (or so I understand). So what
|> you are saying then is do not use LM? This is very confusing,
|> since the texts I have looked at (not just NR) recommend LM
|> for nonlinear least squares. Are they all wrong?

No, although in most of these texts an important point is overlooked (see below).

What a modern LM version does:

Problem:   ||F(x)||_2^2 = min_x

where F is a smooth function with m components in n unknowns x, usually with m >> n; m >= n is definitely required. For a fit problem, the fit data are incorporated into the components, i.e. into the index "i", e.g.

  F_i(x) = y_i - ( x(1)*exp(-x(2)*t_i) + x(3) )

where (t_i, y_i) are the given data to be fitted by one exponential term plus an asymptotic term. It is simpler to describe everything in terms of F.

Given a guess x^k for the optimal x* (questions of local versus global optimization set aside), one linearizes the problem under a trust-region constraint:

step k:  minimize   ||F(x^k) + J_F(x^k)*d||_2^2
         subject to ||d|| <= Delta_k

The norm is typically a scaled Euclidean norm; I will write the constraint from now on as

  d'*B_k'*B_k*d <= Delta_k^2

with a regular matrix B_k, where the prime denotes transposition and J_F is the Jacobian matrix of F.

The solution process is as follows. First compute the minimizer without the constraint. This is an ordinary linear least squares problem, which can be solved via the SVD (minimum norm solution) even if J_F is rank deficient. If this d happens to satisfy the constraint, we take it as d^k. Otherwise it follows that the constraint is active. Then the multiplier rule of Kuhn and Tucker says that there is a positive lambda such that the d^k sought satisfies

  (J_F(x^k)'*J_F(x^k) + lambda*B_k'*B_k)*d^k = -J_F(x^k)'*F(x^k)      (*)

This lambda is unique. Hence the problem boils down to a one-dimensional zero-finding problem: find lambda such that (*) holds and in addition

  d'*B_k'*B_k*d = Delta_k^2      (observe: d = d(lambda))

Given a guess for lambda, we can compute d^k from (*) as a linear least squares problem

  || J_F(x^k)*d         + F(x^k) ||
  || sqrt(lambda)*B_k*d + O      ||  = min_d

where O is a zero vector of dimension n. Hence here again there is a point in using the SVD (although other solutions, e.g. a QR decomposition, would also be feasible).

With d^k computed we now test whether we can accept it: we check whether the

  actual reduction:     ||F(x^k)||^2 - ||F(x^k+d^k)||^2

is sufficiently positive and compare it with the

  predicted reduction:  ||F(x^k)||^2 - ||F(x^k)+J_F(x^k)*d^k||^2   (> 0)

If the quotient of the two is
  >= rho_0 (e.g. 10^{-4}):          accept x^k + d^k as x^{k+1};
  <  rho_0:                         reject, reduce Delta_k and repeat this step;
  in [rho_0, rho_1]:                keep Delta_{k+1} = Delta_k;
  >= rho_1 (typically 0.75 or so):  increase, Delta_{k+1} = 2*Delta_k.
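To make this concrete, here is a minimal Python sketch of one such trust-region LM step. It assumes for simplicity B_k = I (plain Euclidean norm) and uses a simple bracketing root finder (scipy's brentq) for the zero finding in lambda; all names are illustrative, and the numerically delicate points are ignored.

import numpy as np
from scipy.optimize import brentq

def lm_trust_region_step(F, J, x, Delta, rho0=1e-4, rho1=0.75):
    # One trust-region LM step as described above (sketch only: B_k = I,
    # no scaling, no safeguards).  Returns (x_new, Delta_new).
    Fx = F(x)
    Jx = J(x)
    n = x.size

    # unconstrained linearized problem: minimum norm solution via the SVD
    d = -np.linalg.pinv(Jx) @ Fx

    if np.linalg.norm(d) > Delta:
        # constraint active: find lambda > 0 with ||d(lambda)|| = Delta,
        # where d(lambda) solves the augmented least squares problem
        #   || J_F(x^k)*d       + F(x^k) ||
        #   || sqrt(lambda)*d   + O      ||  = min_d
        def d_of(lam):
            A = np.vstack([Jx, np.sqrt(lam) * np.eye(n)])
            b = np.concatenate([-Fx, np.zeros(n)])
            return np.linalg.lstsq(A, b, rcond=None)[0]

        def phi(lam):
            return np.linalg.norm(d_of(lam)) - Delta

        lam_hi = 1.0
        while phi(lam_hi) > 0.0:       # bracket the root: phi(0) > 0 here,
            lam_hi *= 10.0             # and phi -> -Delta as lambda grows
        lam = brentq(phi, 0.0, lam_hi)
        d = d_of(lam)

    # actual versus predicted reduction of ||F||^2
    ared = Fx @ Fx - F(x + d) @ F(x + d)
    pred = Fx @ Fx - (Fx + Jx @ d) @ (Fx + Jx @ d)
    ratio = ared / pred

    if ratio < rho0:
        return x, Delta / 2.0          # reject the step, shrink Delta
    if ratio >= rho1:
        return x + d, 2.0 * Delta      # very good model agreement: enlarge
    return x + d, Delta                # accept, keep Delta

One would call this in a loop, starting from some x^0 and Delta_0 > 0, until ||J_F(x^k)'*F(x^k)|| is small.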
There is a point here: for a bad fit problem it might be that, due to the neglect of the true second order information in the model (the part sum_i Hessian(F_i(x))*F_i(x) is missing), we never get a quotient > 0.75. If we once get a small Delta_k, it then remains small all the time, and a small Delta_k may force a lambda > 0; but a lambda > 0 produces slow progress (like a pure gradient step), and this in turn will produce a huge number of steps until the constraint becomes inactive.

The zero problem with respect to lambda can be solved in a variety of ways, for example by a special adaptation of Newton's method as proposed by Hebden; I propose simply Brent-Dekker. (There are some subtle points in numerically evaluating the expressions above, which I leave out.) Hence you see: the SVD is indeed helpful here, in order to avoid ill-conditioning which would result in a corrupted direction of correction.

|>
|> Also, could you recommend an on-line source that explains
|> (in layman's terms) how one would go about using SVD to solve
|> a nonlinear least-squares problem without forming the normal
|> equations? And if you solve by that method, do you still get
|> a covariance matrix for error propagation analysis?

If the problem is not "almost linear", i.e. if ||sum_i Hessian(F_i(x))*F_i(x)|| is not small compared to the smallest eigenvalue of J_F(x*)'*J_F(x*), you must compute the Hessian of ||F(x)||^2 at x* anyway, since otherwise your error propagation analysis will be corrupted. That is, J_F'*J_F does not suffice in general. If you want to use it nevertheless, you also get inverse(J_F'*J_F) from the SVD of J_F quite easily: if

  J_F = U*S*V'

with U, V unitary and S of the same shape as J_F, with the singular values of J_F along its diagonal and zero otherwise, then

  inverse(J_F'*J_F) = V*inverse(S'*S)*V'   and   inverse(S'*S) = diag(1/(singular values)^2).

hope that helps
peter
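PS: in case it helps, a small Python sketch of that last formula (names are again only illustrative); it drops negligible singular values so that near rank deficiency of J_F does not blow up the result:

import numpy as np

def covariance_from_svd(Jx, rcond=1e-12):
    # inverse(J_F'*J_F) = V * diag(1/s^2) * V'  from  J_F = U*S*V'
    U, s, Vt = np.linalg.svd(Jx, full_matrices=False)
    keep = s > rcond * s[0]          # guard against (near) rank deficiency
    V = Vt[keep].T
    return V @ np.diag(1.0 / s[keep]**2) @ V.T

Remember the caveat above: this is only the (unscaled) covariance estimate for the "almost linear" case.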