From: orsier@cuisun38.unige.ch (Bruno Orsier)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Questions about my new NN program
Date: 14 Mar 1996 08:47:56 GMT

>>>>> "Emmanuel" == Emmanuel Mogenet writes:

>> 3: What algorithm would you recommend? Is the Powell algorithm in
>> Numerical Recipes for C (p. 417) a good one? Are there others in books or
>> on the net that might be better? Does Nash's book _Compact Numerical
>> Methods for Computers..._ come with a disk? Is it better?

Emmanuel> I have been trying various optimization methods. Here are my
Emmanuel> humble personal conclusions:

[ ... ]

Emmanuel> - Second order methods. Better. I have been happily using SCG
Emmanuel> (scaled conjugate gradient). The original paper by M. Moller is
Emmanuel> extremely clear. It has the *rare* property for a research
Emmanuel> paper that you can code the algorithm from the paper and it works
Emmanuel> right away.

Emmanuel> The original article can be found at:
Emmanuel> ftp://ftp.funet.fi/pub/sci/neural/neuroprose/moller.conjugate-gradient.ps.gz

I wholly agree. I think it is worth "advertising" this algorithm. So, here
are some more refs on SCG:

@PhdThesis{MOLLER93-phd,
  author = "Martin F. Moller",
  title  = "Efficient training of feed-forward neural networks",
  school = "Computer Science Dept., Aarhus University",
  year   = 1993,
  key    = "DAIMI PB - 464",
  month  = dec
}

(also extremely clear and useful)

@Article{MOLLER93,
  author  = "Martin F. Moller",
  title   = "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning",
  journal = "Neural Networks",
  year    = 1993,
  volume  = 6,
  pages   = "525-533"
}

@Book{BISH95,
  author    = "Christopher M. Bishop",
  title     = "Neural Networks for Pattern Recognition",
  publisher = "Oxford University Press",
  year      = 1995
}

@Book{WASSERMAN95,
  author    = "Philip D. Wasserman",
  title     = "Advanced Methods in Neural Computing",
  publisher = "Van Nostrand Reinhold",
  year      = 1995
}

Emmanuel> You can find an implementation of SCG in SNNS, which
Emmanuel> can be found at ftp.informatik.uni-stuttgart.de:/pub/SNNS. Careful
Emmanuel> though: I didn't get it to work.

I am the author of this implementation in SNNS. Several people have reported
to me that they get good results with it.

[...]

Emmanuel> Anyway, apart from simulated annealing or genetic algorithms, very
Emmanuel> few optimization methods will spare you the trouble of local
Emmanuel> minima. And both of those methods are slow when used in their raw
Emmanuel> form.

Emmanuel> I think (haven't tried it) the best way (if there's any such thing)
Emmanuel> would be a hybrid method combining a second order method and
Emmanuel> simulated annealing.

I have developed such a hybrid method, named P*/SCG.
Also available for SNNS (software + my test problems).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See URL http://cuiwww.unige.ch/AI-group/staff/orsier-include-3.html for
details. Here is the abstract of a technical report available at this URL:

"Another hybrid algorithm for finding a global minimum of MLP error functions"
Technical Report UNIGE-AI-95-6
Bruno ORSIER, CUI, University of Geneva --- orsier@cui.unige.ch

ABSTRACT: This report presents \pstar, a new global optimization method for
training multilayered perceptrons. Instead of local minima, global minima of
the error function are found. This new method is hybrid in the sense that it
combines three very different optimization techniques: Random Line Search,
Scaled Conjugate Gradient and a 1-dimensional minimization algorithm named
P$^*$. The best points of each component are retained by the hybrid method:
simplicity of Random Line Search, efficiency of Scaled Conjugate Gradient,
efficiency and convergence toward a global minimum for P$^*$. \pstar\ is
empirically shown to perform better or much better than three other global
random optimization methods and a global deterministic optimization method.

Bruno Orsier            E-mail: orsier@cui.unige.ch
University of Geneva    WWW: http://cuiwww.unige.ch/AI-group/staff/orsier.html
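For anyone who wants to see what Moller's trick looks like in code, here is a
rough C sketch of the curvature estimate and step size at the heart of SCG.
It is not the full algorithm -- Moller also rescales lambda after each step
using a comparison parameter, which is omitted here -- and grad(), the
constant 1e-4 and the factor 4.0 are only illustrative placeholders:

#include <math.h>
#include <stdlib.h>

/* user-supplied: writes dE/dw into g for weight vector w of length n */
extern void grad(const double *w, double *g, int n);

/* Returns a step size alpha along search direction p.
 * g is the already-computed gradient at w; lambda is the scaling
 * parameter that keeps the estimated curvature positive.            */
double scg_step(const double *w, const double *g, const double *p,
                double lambda, int n)
{
    double *wtry = malloc(n * sizeof *wtry);
    double *gtry = malloc(n * sizeof *gtry);
    double pnorm2 = 0.0, mu = 0.0, curv = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        pnorm2 += p[i] * p[i];
        mu     -= p[i] * g[i];        /* mu = p . r, with r = -gradient */
    }

    /* finite-difference estimate of  p . H p  without ever forming H:
       H p  ~=  (E'(w + sigma p) - E'(w)) / sigma                      */
    double sigma = 1e-4 / sqrt(pnorm2);
    for (i = 0; i < n; i++)
        wtry[i] = w[i] + sigma * p[i];
    grad(wtry, gtry, n);
    for (i = 0; i < n; i++)
        curv += p[i] * (gtry[i] - g[i]) / sigma;

    /* Levenberg-Marquardt style scaling: if the curvature estimate is
       not positive, raise lambda until delta = p.Hp + lambda|p|^2 > 0.
       (Moller's paper uses a specific formula for this adjustment.)   */
    double delta = curv + lambda * pnorm2;
    while (delta <= 0.0) {
        lambda *= 4.0;
        delta = curv + lambda * pnorm2;
    }

    free(wtry);
    free(gtry);
    return mu / delta;                /* alpha = mu / delta */
}

The point of the sketch is that the product H.p is estimated from two
gradient evaluations, so the Hessian itself is never stored.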
==============================================================================

From: jdadson@ix.netcom.com (Jive Dadson)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Moller's Scaled conjugate gradient (Re: Questions about my new NN program
Date: 15 Mar 1996 08:36:06 GMT

B.D. Ripley's new book _Pattern Recognition and Neural Networks_ appears to
be exceedingly well researched, and mentions Moller only in passing.
Page 345:

    Moller (1993) has developed one particular version of
    conjugate gradients which seems well known in the neural
    network field; it uses the out-dated Hestenes-Stiefel
    formula for [beta-sub-i] with a particular line-search
    algorithm.

He does not give the Hestenes-Stiefel formula, but lists two others,
Polak-Ribiere and Fletcher-Reeves, saying Polak-Ribiere is generally
preferred. I have used the Polak-Ribiere method from Numerical Recipes in C
with very poor results. The Davidon-Fletcher-Powell algorithm, which builds
up an approximation to the Hessian, is much faster on the little problems I
have tested on.

- Jive
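For reference, here are the three beta formulas mentioned above, written as
small C helpers (the names are illustrative). On a quadratic with exact line
searches the three coincide; with inexact line searches they can behave quite
differently, which is why Polak-Ribiere is usually recommended in practice:

/* beta for the conjugate-gradient direction update
 *     p_new = -g_new + beta * p_old
 * g_old/g_new are the gradients before and after the line search,
 * p_old is the previous search direction, n the number of weights.  */
static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Fletcher-Reeves */
double beta_fr(const double *g_new, const double *g_old, int n)
{
    return dot(g_new, g_new, n) / dot(g_old, g_old, n);
}

/* Polak-Ribiere */
double beta_pr(const double *g_new, const double *g_old, int n)
{
    return (dot(g_new, g_new, n) - dot(g_new, g_old, n))
         / dot(g_old, g_old, n);
}

/* Hestenes-Stiefel */
double beta_hs(const double *g_new, const double *g_old,
               const double *p_old, int n)
{
    double y_dot_gnew = dot(g_new, g_new, n) - dot(g_new, g_old, n);
    double y_dot_p    = dot(g_new, p_old, n) - dot(g_old, p_old, n);
    return y_dot_gnew / y_dot_p;
}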
==============================================================================

From: adams@bellini.eecs.berkeley.edu (Adam L. Schwartz)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Moller's Scaled conjugate gradient (Re: Questions about my new NN program
Date: 15 Mar 1996 17:08:00 GMT

Jive Dadson wrote:
> I have used the Polak-Ribiere method from Numerical Recipes in C with
> very poor results. The Davidon-Fletcher-Powell algorithm, which
> builds up an approximation to the Hessian, is much faster on the
> little problems I have tested on.

The conjugate-gradient method is not superlinearly convergent like the DFP
(quasi-Newton) method. Its advantage is that the operations in each iteration
are much faster; this shows up when solving problems with a large number of
decision variables. There is also an algorithm called the limited-memory BFGS
algorithm, which is sort of half-way between a conjugate-gradient method and
a quasi-Newton method. I've used it with much success.

--
Adam Schwartz            | The opinions expressed here are my own
adams@eecs.berkeley.edu  | and those of anyone who agrees with me.
WWW Home Page: http://robotics.eecs.berkeley.edu/~adams

==============================================================================

From: Emmanuel Mogenet
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Moller's Scaled conjugate gradient
Date: Sat, 16 Mar 1996 20:26:10 -0800

Jive Dadson wrote:
>
> B.D. Ripley's new book _Pattern Recognition and Neural Networks_
> appears to be exceedingly well researched, and mentions Moller only
> in passing. Page 345:
>
>     Moller (1993) has developed one particular version of
>     conjugate gradients which seems well known in the neural
>     network field; it uses the out-dated Hestenes-Stiefel
>     formula for [beta-sub-i] with a particular line-search
>     algorithm.
>
> He does not give the Hestenes-Stiefel formula, but lists two others,
> Polak-Ribiere and Fletcher-Reeves, saying Polak-Ribiere is generally
> preferred. I have used the Polak-Ribiere method from Numerical Recipes
> in C with very poor results. The Davidon-Fletcher-Powell algorithm,
> which builds up an approximation to the Hessian, is much faster on the
> little problems I have tested on.

Yep. Little. DFP builds an approximation to the Hessian, and that is O(n^2)
in space -- extremely expensive as soon as you scale up. M. Moller's SCG just
builds H.v, i.e. an approximation to the Hessian times a vector. Much
cheaper: space consumption is O(n).

I also tried to hack M. Moller's algorithm to replace this approximation of
the Hessian by an exact calculation, but it doesn't train networks as fast as
SCG. I think that's because SCG approximates the Hessian of ``where we want
to go'' while my method exactly calculates the Hessian of ``where we are''.
But I also noticed that the networks trained with my adaptation of SCG
generalized much better (and networks that don't generalize well are nothing
but a memory).

The method to exactly calculate the product of the Hessian and an arbitrary
vector is given in yet another (IMHO) very good article: "Fast Exact
Multiplication by the Hessian" by Barak A. Pearlmutter. I think this article
is also available in ftp://ftp.funet.fi/pub/sci/neural/neuroprose/ but I
don't remember the name of the file.

- Mgix
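To make the O(n) versus O(n^2) point concrete, here is the exact
Hessian-vector product for the deliberately trivial case of a single linear
unit trained with squared error. This is not Pearlmutter's general R-operator
for multilayer nets, just an illustration, and all names are made up:

/* Exact Hessian-vector product for one linear unit with squared error:
 *     E(w) = 1/2 * sum_i (x_i.w - t_i)^2
 * Here H = sum_i x_i x_i^T, so H.v = sum_i x_i (x_i . v) can be
 * accumulated in O(n) extra space without ever forming the n x n
 * Hessian.                                                            */
void hessian_times_vector(const double *x,  /* npat x n inputs, row-major */
                          const double *v,  /* vector to multiply, length n */
                          double *hv,       /* result H.v, length n */
                          int npat, int n)
{
    for (int j = 0; j < n; j++)
        hv[j] = 0.0;

    for (int i = 0; i < npat; i++) {
        const double *xi = x + i * n;
        double xv = 0.0;                    /* x_i . v */
        for (int j = 0; j < n; j++)
            xv += xi[j] * v[j];
        for (int j = 0; j < n; j++)
            hv[j] += xi[j] * xv;            /* += x_i * (x_i . v) */
    }
}

Pearlmutter's paper shows how to get the same kind of exact product for a
full multilayer network at roughly the cost of one extra forward and backward
pass, still without ever forming the Hessian.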