From: orsier@cuisun38.unige.ch (Bruno Orsier)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Questions about my new NN program
Date: 14 Mar 1996 08:47:56 GMT

>>>>> "Emmanuel" == Emmanuel Mogenet writes:

>> 3: What algorithm would you recommend? Is the Powell algorithm in
>> Numerical Recipes for C (p. 417) a good one? Are there others in books or
>> on the net that might be better? Does Nash's book _Compact Numerical
>> Methods for Computers..._ come with a disk? Is it better?

Emmanuel> I have been trying various optimization methods. Here are my
Emmanuel> humble personal conclusions:

[ ... ]

Emmanuel> - Second order methods. Better. I have been happily using SCG
Emmanuel> (scaled conjugate gradient). The original paper by M. Moller is
Emmanuel> extremely clear. It has the *rare* property for a research
Emmanuel> paper that you can code the algorithm from the paper and it works
Emmanuel> right away.

Emmanuel> The original article can be found at:
Emmanuel> ftp://ftp.funet.fi/pub/sci/neural/neuroprose/moller.conjugate-gradient.ps.gz

I wholly agree. I think it is worth "advertising" this algorithm. So, here
are some more refs on SCG:

@PhdThesis{MOLLER93-phd,
  author = "Martin F. Moller",
  title  = "Efficient training of feed-forward neural networks",
  school = "Computer Science Dept., Aarhus University",
  year   = 1993,
  key    = "DAIMI PB - 464",
  month  = dec
}

(also extremely clear and useful)

@Article{MOLLER93,
  author  = "Martin F. Moller",
  title   = "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning",
  journal = "Neural Networks",
  year    = 1993,
  volume  = 6,
  pages   = "525-533"
}

@Book{BISH95,
  author    = "Christopher M. Bishop",
  title     = "Neural Networks for Pattern Recognition",
  publisher = "Oxford University Press",
  year      = 1995
}

@Book{WASSERMAN95,
  author    = "Philip D. Wasserman",
  title     = "Advanced Methods in Neural Computing",
  publisher = "Van Nostrand Reinhold",
  year      = 1995
}

Emmanuel> You can find an implementation of SCG in SNNS, which
Emmanuel> can be found at ftp.informatik.uni-stuttgart.de:/pub/SNNS. Careful
Emmanuel> though: I didn't get it to work.

I am the author of this implementation in SNNS. Several people have reported
to me that they get good results with it.

[...]

Emmanuel> Anyway, apart from simulated annealing or genetic algorithms, very
Emmanuel> few optimization methods will spare you the trouble of local
Emmanuel> minima. And both of those methods are slow when used in their raw
Emmanuel> form.

Emmanuel> I think (haven't tried it) the best way (if there's any such thing)
Emmanuel> would be a hybrid method combining a second order method and
Emmanuel> simulated annealing.

I have developed such a hybrid method, named P*/SCG.
Also available for SNNS (software + my test problems).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See URL http://cuiwww.unige.ch/AI-group/staff/orsier-include-3.html for
details. Here is the abstract of a technical report available at this URL:

"Another hybrid algorithm for finding a global minimum of MLP error functions"
Technical Report UNIGE-AI-95-6
Bruno ORSIER, CUI, University of Geneva --- orsier@cui.unige.ch

ABSTRACT: This report presents \pstar, a new global optimization method for
training multilayered perceptrons. Instead of local minima, global minima of
the error function are found. This new method is hybrid in the sense that it
combines three very different optimization techniques: Random Line Search,
Scaled Conjugate Gradient and a 1-dimensional minimization algorithm named
P$^*$. The best points of each component are retained by the hybrid method:
simplicity of Random Line Search, efficiency of Scaled Conjugate Gradient,
efficiency and convergence toward a global minimum for P$^*$. \pstar\ is
empirically shown to perform better or much better than three other global
random optimization methods and a global deterministic optimization method.

Bruno Orsier            E-mail: orsier@cui.unige.ch
University of Geneva    WWW: http://cuiwww.unige.ch/AI-group/staff/orsier.html
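For anyone who wants to see what Moller's trick looks like in code, here is a
rough C sketch of the curvature estimate and step size at the heart of SCG.
It is not the full algorithm -- Moller also rescales lambda after each step
using a comparison parameter, which is omitted here -- and grad(), the
constant 1e-4 and the factor 4.0 are only illustrative placeholders:

#include <math.h>
#include <stdlib.h>

/* user-supplied: writes dE/dw into g for weight vector w of length n */
extern void grad(const double *w, double *g, int n);

/* Returns a step size alpha along search direction p.
 * g is the already-computed gradient at w; lambda is the scaling
 * parameter that keeps the estimated curvature positive.            */
double scg_step(const double *w, const double *g, const double *p,
                double lambda, int n)
{
    double *wtry = malloc(n * sizeof *wtry);
    double *gtry = malloc(n * sizeof *gtry);
    double pnorm2 = 0.0, mu = 0.0, curv = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        pnorm2 += p[i] * p[i];
        mu     -= p[i] * g[i];        /* mu = p . r, with r = -gradient */
    }

    /* finite-difference estimate of  p . H p  without ever forming H:
       H p  ~=  (E'(w + sigma p) - E'(w)) / sigma                      */
    double sigma = 1e-4 / sqrt(pnorm2);
    for (i = 0; i < n; i++)
        wtry[i] = w[i] + sigma * p[i];
    grad(wtry, gtry, n);
    for (i = 0; i < n; i++)
        curv += p[i] * (gtry[i] - g[i]) / sigma;

    /* Levenberg-Marquardt style scaling: if the curvature estimate is
       not positive, raise lambda until delta = p.Hp + lambda|p|^2 > 0.
       (Moller's paper uses a specific formula for this adjustment.)   */
    double delta = curv + lambda * pnorm2;
    while (delta <= 0.0) {
        lambda *= 4.0;
        delta = curv + lambda * pnorm2;
    }

    free(wtry);
    free(gtry);
    return mu / delta;                /* alpha = mu / delta */
}

The point of the sketch is that the product H.p is estimated from two
gradient evaluations, so the Hessian itself is never stored.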
==============================================================================

From: jdadson@ix.netcom.com (Jive Dadson)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Moller's Scaled conjugate gradient (Re: Questions about my new NN program
Date: 15 Mar 1996 08:36:06 GMT

B.D. Ripley's new book _Pattern Recognition and Neural Networks_ appears to
be exceedingly well researched, and mentions Moller only in passing.
Page 345:

    Moller (1993) has developed one particular version of
    conjugate gradients which seems well known in the neural
    network field; it uses the out-dated Hestenes-Stiefel
    formula for [beta-sub-i] with a particular line-search
    algorithm.

He does not give the Hestenes-Stiefel formula, but lists two others,
Polak-Ribiere and Fletcher-Reeves, saying Polak-Ribiere is generally
preferred. I have used the Polak-Ribiere method from Numerical Recipes in C
with very poor results. The Davidon-Fletcher-Powell algorithm, which builds
up an approximation to the Hessian, is much faster on the little problems I
have tested on.

- Jive
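For reference, here are the three beta formulas mentioned above, written as
small C helpers (the names are illustrative). On a quadratic with exact line
searches the three coincide; with inexact line searches they can behave quite
differently, which is why Polak-Ribiere is usually recommended in practice:

/* beta for the conjugate-gradient direction update
 *     p_new = -g_new + beta * p_old
 * g_old/g_new are the gradients before and after the line search,
 * p_old is the previous search direction, n the number of weights.  */
static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Fletcher-Reeves */
double beta_fr(const double *g_new, const double *g_old, int n)
{
    return dot(g_new, g_new, n) / dot(g_old, g_old, n);
}

/* Polak-Ribiere */
double beta_pr(const double *g_new, const double *g_old, int n)
{
    return (dot(g_new, g_new, n) - dot(g_new, g_old, n))
         / dot(g_old, g_old, n);
}

/* Hestenes-Stiefel */
double beta_hs(const double *g_new, const double *g_old,
               const double *p_old, int n)
{
    double y_dot_gnew = dot(g_new, g_new, n) - dot(g_new, g_old, n);
    double y_dot_p    = dot(g_new, p_old, n) - dot(g_old, p_old, n);
    return y_dot_gnew / y_dot_p;
}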
==============================================================================

From: adams@bellini.eecs.berkeley.edu (Adam L. Schwartz)
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Moller's Scaled conjugate gradient (Re: Questions about my new NN program
Date: 15 Mar 1996 17:08:00 GMT

Jive Dadson wrote:
> I have used the Polak-Ribiere method from Numerical Recipes in C with
> very poor results. The Davidon-Fletcher-Powell algorithm, which
> builds up an approximation to the Hessian, is much faster on the
> little problems I have tested on.

The conjugate-gradient method is not superlinearly convergent like the DFP
(quasi-Newton) method. Its advantage is that the operations in each iteration
are much faster; this shows up when solving problems with a large number of
decision variables. There is also an algorithm called the limited-memory BFGS
algorithm, which is sort of half-way between a conjugate-gradient method and
a quasi-Newton method. I've used it with much success.

--
Adam Schwartz            | The opinions expressed here are my own
adams@eecs.berkeley.edu  | and those of anyone who agrees with me.
WWW Home Page: http://robotics.eecs.berkeley.edu/~adams

==============================================================================

From: Emmanuel Mogenet
Newsgroups: sci.math.num-analysis,comp.ai.neural-nets,sci.stat.math
Subject: Re: Moller's Scaled conjugate gradient
Date: Sat, 16 Mar 1996 20:26:10 -0800

Jive Dadson wrote:
>
> B.D. Ripley's new book _Pattern Recognition and Neural Networks_
> appears to be exceedingly well researched, and mentions Moller only
> in passing. Page 345:
>
>     Moller (1993) has developed one particular version of
>     conjugate gradients which seems well known in the neural
>     network field; it uses the out-dated Hestenes-Stiefel
>     formula for [beta-sub-i] with a particular line-search
>     algorithm.
>
> He does not give the Hestenes-Stiefel formula, but lists two others,
> Polak-Ribiere and Fletcher-Reeves, saying Polak-Ribiere is generally
> preferred. I have used the Polak-Ribiere method from Numerical Recipes
> in C with very poor results. The Davidon-Fletcher-Powell algorithm,
> which builds up an approximation to the Hessian, is much faster on the
> little problems I have tested on.

Yep. Little. DFP builds an approximation to the Hessian, and that is O(n^2)
in space -- extremely expensive as soon as you scale up. M. Moller's SCG just
builds H.v, i.e. an approximation to the Hessian times a vector. Much
cheaper: space consumption is O(n).

I also tried to hack M. Moller's algorithm to replace this approximation of
the Hessian by an exact calculation, but it doesn't train networks as fast as
SCG. I think that's because SCG approximates the Hessian of ``where we want
to go'' while my method exactly calculates the Hessian of ``where we are''.
But I also noticed that the networks trained with my adaptation of SCG
generalized much better (and networks that don't generalize well are nothing
but a memory).

The method to exactly calculate the product of the Hessian and an arbitrary
vector is given in yet another (IMHO) very good article: "Fast Exact
Multiplication by the Hessian" by Barak A. Pearlmutter. I think this article
is also available in ftp://ftp.funet.fi/pub/sci/neural/neuroprose/ but I
don't remember the name of the file.

- Mgix
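To make the O(n) versus O(n^2) point concrete, here is the exact
Hessian-vector product for the deliberately trivial case of a single linear
unit trained with squared error. This is not Pearlmutter's general R-operator
for multilayer nets, just an illustration, and all names are made up:

/* Exact Hessian-vector product for one linear unit with squared error:
 *     E(w) = 1/2 * sum_i (x_i.w - t_i)^2
 * Here H = sum_i x_i x_i^T, so H.v = sum_i x_i (x_i . v) can be
 * accumulated in O(n) extra space without ever forming the n x n
 * Hessian.                                                            */
void hessian_times_vector(const double *x,  /* npat x n inputs, row-major */
                          const double *v,  /* vector to multiply, length n */
                          double *hv,       /* result H.v, length n */
                          int npat, int n)
{
    for (int j = 0; j < n; j++)
        hv[j] = 0.0;

    for (int i = 0; i < npat; i++) {
        const double *xi = x + i * n;
        double xv = 0.0;                    /* x_i . v */
        for (int j = 0; j < n; j++)
            xv += xi[j] * v[j];
        for (int j = 0; j < n; j++)
            hv[j] += xi[j] * xv;            /* += x_i * (x_i . v) */
    }
}

Pearlmutter's paper shows how to get the same kind of exact product for a
full multilayer network at roughly the cost of one extra forward and backward
pass, still without ever forming the Hessian.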