From: ionels@mailcity.com (Ionel Santa)
Subject: Re: ECM factoring
Date: 5 Mar 00 23:11:52 GMT
Newsgroups: sci.math.numberthy

Some days ago, I asked something about the elliptic curve method of
factoring large numbers.

Paul Zimmermann recommended to me the paper "Factorization of the tenth
Fermat number" by Richard Brent, available at:

ftp://ftp.comlab.ox.ac.uk/pub/Documents/techpapers/Richard.Brent/rpb161.ps.gz

Also, a lot of information about ECM and gmp-ecm (a fast ECM
implementation by Paul Zimmermann) is available at:

http://www.loria.fr/~zimmerma/records/ecmnet.html

And last but not least, here is part of the answers that Gerhard Niklasch,
one of the authors of the PARI package, kindly sent me in a couple of
emails.

+-------------------------+
| Choosing the exponent
+-------------------------+

> What I do for the moment: supposing I try to factor N, which has n
> bits, I generate x0, y0, A, B such that y0^2 = x0^3 + A*x0 + B (mod N).
> I know how to calculate (x0,y0)^M (in the group structure which
> results), except that I don't know exactly how to optimally choose M
> (for different n values).

There's a fair amount of theory and some experimentation here, but mainly
M should be chosen highly composite: the product of all prime powers not
exceeding some bound B1 depending on the size of N.  And instead of
computing (x0,y0)^M in one swoop, it is better to apply the prime factors
of the desired M one after another (minor gains are ultimately possible
by combining factors depending on their bit patterns, but that is an
advanced optimization).

> ... if I understood correctly: if I want for example to search for a
> factor of length 30, since I've found on Paul Zimmermann's page that
> the optimal B1 for 30 is about 250,000,

This depends slightly on the implementation, but his numbers are
certainly good starting points.

> I calculate M = product(p^e_p) where e_p = floor(log_p(B1)), and then
> (x0,y0)^M in the group.  But it is better, instead of pre-calculating
> M, to calculate (((x0,y0)^(2^e_2))^(3^e_3) ... )^(Pmax^e_Pmax), where
> Pmax is the largest prime not exceeding B1.

Yes.

+-------------------------+
| Optimisations
+-------------------------+

> Also, I don't know Peter Montgomery's fast modular multiplication
> algorithm for speeding up the process.

One part of it is working on several curves in parallel, which allows you
to trade divisions mod N for multiplications mod N, 3 of the latter for
one of the former.  The other is a fancy way of decomposing the (x0,y0)^p
operation into additions and doublings of points on the curve in such a
way as to keep the number of operations low.  It is good, but not always
optimal; on the other hand, it is far simpler than the optimal procedure
(for which the order of operations essentially needs to be determined in
advance for each single case, and stored in a table).

> For calculating (x0,y0)^n, currently I decompose n in base 2, and I
> make one addition when there is a 1 in the base-2 representation, and
> one doubling for every step.

Right.
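[To make the first-phase organisation concrete, here is a minimal Python
sketch of my own (not PARI's or gmp-ecm's actual code): every prime power
p^e_p <= B1 is applied to the point one prime at a time, each [p]P being
computed by the plain binary double-and-add method, and the point
arithmetic uses the affine addition formulas quoted further down in this
post.  The names stage1, scalar_mul, ec_add, inv_mod and FactorFound are
mine; sympy is assumed only for enumerating primes.  A non-invertible
denominator mod N is exactly the event that hands us a factor.]

    from math import gcd
    from sympy import primerange

    class FactorFound(Exception):
        """A non-invertible denominator mod N has revealed a factor of N."""
        def __init__(self, d):
            self.factor = d

    def inv_mod(a, N):
        d = gcd(a % N, N)
        if d != 1:
            raise FactorFound(d)    # this "failure" is what ECM is waiting for
        return pow(a, -1, N)

    def ec_add(P, Q, N, A):
        # Affine addition/doubling on y^2 = x^3 + A*x + B mod N, using the
        # slope formulas quoted later in this post; None = point at infinity.
        if P is None:
            return Q
        if Q is None:
            return P
        (x1, y1), (x2, y2) = P, Q
        if (x1 - x2) % N == 0 and (y1 + y2) % N == 0:
            return None             # P + (-P) = point at infinity
        if (x1 - x2) % N == 0:
            m = (3 * x1 * x1 + A) * inv_mod(2 * y1, N) % N   # doubling
        else:
            m = (y2 - y1) * inv_mod(x2 - x1, N) % N          # generic addition
        x3 = (m * m - x1 - x2) % N
        y3 = (-(y1 + m * (x3 - x1))) % N
        return (x3, y3)

    def scalar_mul(n, P, N, A):
        # [n]P by the plain binary double-and-add method.
        R = None
        while n:
            if n & 1:
                R = P if R is None else ec_add(R, P, N, A)
            n >>= 1
            if n:
                P = ec_add(P, P, N, A)
        return R

    def stage1(P, N, A, B1):
        # Apply each prime power p^e_p <= B1 to P, one prime at a time,
        # instead of forming the huge exponent M first.
        for p in primerange(2, B1 + 1):
            pe = p
            while pe * p <= B1:     # largest power of p not exceeding B1
                pe *= p
            while pe >= p:
                P = scalar_mul(p, P, N, A)
                pe //= p
        return P

[In an actual run one would wrap this in a loop over many random curves:
pick x0, y0, A mod N at random, set B = y0^2 - x0^3 - A*x0 mod N so that
the point lies on the curve, call stage1, and catch FactorFound to read
off the divisor.]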
There's a constant factor to be gained here on average.  One strategy is
to write (x0,y0)^n = A^d + B^e (eek, this power notation is ghastly -
most of the literature would use [n]P to denote the result of multiplying
the point P by the integer n, thus [2]P is doubling on the curve), where
A, B, d, e are initialized in some simple way, more or less A = (x0,y0)
and B = [2]A and d = n and e = 0, and then transformed in a sequence of
simple steps until one of A, B contains the result.

This is _one_ of Montgomery's many contributions!  The simple steps here
depend on properties of the current d, e mod 2 and mod 3, and amount to
things like doubling one of A, B, replacing one of them by their sum on
the curve, etc.  I don't remember the details offhand (and they depend on
the way points on the curve are represented - the algorithm is more
complex when homogeneous coordinates (x,y,z) are used, but can then
dispense with most divisions mod N; in the PARI environment, we use
affine coordinates (x,y) and a simpler form of the algorithm).

This Montgomery trick gains, on average, a constant factor over the
obvious binary shift algorithm, in terms of the total number of doublings
and additions performed.  (And by the way, it works only when the
multiplier is prime.)  Slightly more can be squeezed out by constructing
really optimal sequences of additions and doublings for each possible
multiplier.  One may need more than two intermediate points for these,
though (instead of just the A, B in Montgomery's method).

_Another_ optimization is to observe that, when using affine coordinates,
the computation time is likely to be dominated by the divisions mod N,
and when N is at all large, a single division takes more time than three
multiplications mod N.  On the other hand, we want to look at very many
curves (with the same M, before switching to a larger M).  Montgomery's
next trick here is to observe that you can compute a^(-1), b^(-1), ...,
x^(-1) mod N simultaneously, with just one division mod N, trading three
multiplications for each of the other divisions.  Compute and remember
ab, abc, ..., abc...x mod N, then take (abc...wx)^(-1), then recover
x^(-1) from (abc...w)*(abc...wx)^(-1) and (abc...w)^(-1) from
x*(abc...wx)^(-1), and proceed backwards until you have a^(-1).

> The fast modular multiplication which I didn't know about was:

This is _yet_ _another_ of Montgomery's bright ideas!  (Also starring in
Knuth, TAOCP Vol. 2, 4.3.1, exercise 41.)

> If we want to do multiplication mod N, where N has n bits and is odd,
> we use the relations:
>
>   A  -> A' = A*2^n (mod N)
>   A' [*] B' = A'*B'*2^(-n) (mod N)
>   A' -> A  = A' [*] 1
>
> where [*] is a new operation, and * the usual product.  I knew about
> the relations, but I couldn't figure out how this might help, until I
> realized that for calculating A' [*] B', we calculate A'*B' and then,
> by adding proper multiples of N, we can make the last n bits equal to
> 0, so we no longer need any division after the product.  For my
> implementation this resulted in a gain of a factor of 30 in the speed
> of multiplication mod N.

The magnitude of the gain I find surprising - perhaps the
division-with-remainder implementation wasn't quite optimal...?  (Part of
the gain should come from the possibility of conserving space, by
interlacing the additions which make up the multiplication with fixing up
those low-order bits which are already determined.)

> You were right.  After I looked over the division function, I found
> that I was making quite a lot of useless shifts.  After fixing that,
> the factor gained was ~ 8.
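[Both of Montgomery's ideas above are short enough to sketch.  The
following Python fragment continues the earlier sketch and is again only
an illustration - batch_inverse, montgomery_setup and redc are names I
invented, not anything from PARI or gmp-ecm.  It shows the simultaneous
inversion of a whole batch of values with a single division mod N, and
then the [*] operation, usually called Montgomery reduction, for an odd
n-bit modulus.]

    def batch_inverse(vals, N):
        # Invert a, b, ..., x mod N with a single modular inverse:
        # remember the prefix products ab, abc, ..., ab...x, invert the
        # last one, then walk backwards, trading one division for about
        # three multiplications per element.
        prefix = [1]
        for a in vals:
            prefix.append(prefix[-1] * a % N)
        inv = inv_mod(prefix[-1], N)        # the single expensive inversion
        out = [0] * len(vals)
        for i in range(len(vals) - 1, -1, -1):
            out[i] = prefix[i] * inv % N    # (ab...w) * (ab...wx)^(-1) = x^(-1)
            inv = inv * vals[i] % N         # now inv = (ab...w)^(-1)
        return out

    def montgomery_setup(N):
        # N odd with n bits; R = 2^n > N, and Nprime = -N^(-1) mod R.
        n = N.bit_length()
        R = 1 << n
        return n, (-pow(N, -1, R)) % R

    def redc(T, N, n, Nprime):
        # The [*]-reduction: return T * 2^(-n) mod N.  Adding the multiple
        # ((T*Nprime) mod 2^n) * N clears the low n bits of T, so the
        # "division" is just a shift.
        mask = (1 << n) - 1
        m = (T & mask) * Nprime & mask
        t = (T + m * N) >> n
        return t - N if t >= N else t

    # Tiny usage example (toy modulus):
    N = 1000003
    n, Nprime = montgomery_setup(N)
    a, b = 123456, 654321
    aR, bR = (a << n) % N, (b << n) % N     # A' = A*2^n mod N, B' likewise
    abR = redc(aR * bR, N, n, Nprime)       # A' [*] B'
    assert redc(abR, N, n, Nprime) == a * b % N   # A' [*] 1 converts back

[In an affine-coordinate ECM implementation, batch_inverse would be fed
the denominators x'-x of one addition step taken across many curves at
once, and redc replaces each "multiply, then divide by N with remainder"
by a multiply plus a shift.]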
+-------------------------+
| Second phase of ECM
+-------------------------+

> ... and I don't know which is supposed to be the second phase of ECM.

The second phase consists in observing that many composite numbers have
only one large prime factor, and the second largest is a lot smaller.
Therefore it is advantageous to shoot for curves and points (x0,y0) on
them whose orders are not divisors of M but rather of the form
(divisor of M) * (one prime between B1 and B2), where B2 is typically
chosen about 100*B1.  This can be checked a _lot_ faster than the first
phase.  Instead of powering ((x0,y0)^M)^p for each prime in the range, we
proceed from ((x0,y0)^M)^p to ((x0,y0)^M)^p' by "multiplying" the former
with ((x0,y0)^M)^(p'-p).  (Actually, this amounts to an addition on the
curve.)  (x0,y0)^M remains fixed throughout this phase, and since p'-p
frequently takes small even values, and many of these repeatedly, a lot
of work is saved by precomputing these.

Moreover, we do not actually need to finish the computation of any of the
((x0,y0)^M)^p; we only need to determine whether one of them will give us
a factor of N.  Thus it is enough to compute denominators and forget
about the numerators in the addition formula.

> I use for additions the formulas:
>
>   x'' = -(x+x') + m^2,
>   y'' = -[y + m*(x''-x)],
>
> where:
>
>   m = (y'-y)/(x'-x)   if y <> y'
>   m = (3x^2+A)/(2y)   if y = y' (when it is the same point)
>
> What does it mean to forget about the numerators?

If you want to know whether the computation of P+Q on the curve will
reveal a factor of N, you don't have to compute P+Q.  You only need to
work out x'-x (and take its gcd with N).  Provided you don't need P+Q for
later computations, of course; but during the second phase most such
sum-points will indeed never be needed again.

If I have already computed (in the first phase) Q=[M]P from my original
P, with B1=181 say (just a toy example), and now want to check whether
any of [191]Q, [193]Q, [197]Q, [199]Q, [211]Q, [223]Q, [229]Q, [233]Q
will lead to a factor of N, I compute [211]Q along with a table of small
even multiples of Q, and then check the x-x' arising from

  [211]Q-[20]Q, [211]Q-[18]Q, [211]Q-[14]Q, [211]Q-[12]Q,
  [211]Q+[12]Q, [211]Q+[18]Q, [211]Q+[22]Q

without working out the x,y coordinates of any of these points.  (In
fact, I only need a table of x-coordinates of small even multiples of Q.)

(And you don't want to compute gcd(x-x',N) for each individual pair x,x'.
Instead, accumulate a product of a few hundred x-x' mod N, and then take
the gcd of the product with N.  If small prime factors, less than 10^8
say, have been removed from N by trial division and Pollard rho before
launching ECM, the chances that you might miss a factorization N=RS by
picking up one x-x' divisible by R and another divisible by S are going
to be remote, and if it happens, you still have the option to go back and
take individual gcds, or simply to throw the current curves away and
start with a fresh set of curves.)
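[Putting the pieces together, here is a hedged Python sketch of this
second phase in the spirit of the toy example above.  It reuses ec_add,
scalar_mul, gcd and FactorFound from the earlier sketches; the block
width D, the batch size and the choice of pivots are my own assumptions,
not anything recommended in the emails.  One pivot [c]Q per block is
fully computed, a table holds the x-coordinates of the small even
multiples of Q, and only the differences x - x' are accumulated, with a
batched gcd.]

    def stage2(Q, N, A, B1, B2, D=210, batch=500):
        # Table of x-coordinates of [2]Q, [4]Q, ..., [D]Q.
        twoQ = ec_add(Q, Q, N, A)
        xtab, T = {}, twoQ
        for j in range(2, D + 1, 2):
            xtab[j] = T[0]
            T = ec_add(T, twoQ, N, A)
        acc, cnt = 1, 0
        c = (B1 // D) * D + 1            # odd pivot, slid forward in steps of D
        Pc = scalar_mul(c, Q, N, A)      # [c]Q: the only fully computed points
        PD = scalar_mul(D, Q, N, A)      # [D]Q, used to slide the pivot
        for p in primerange(B1 + 1, B2 + 1):
            while p > c + D // 2:        # keep |p - c| <= D/2 (and even)
                Pc = ec_add(Pc, PD, N, A)   # a factor hiding in the pivot
                c += D                      # already shows up here
            j = abs(p - c)
            if j == 0:
                continue                 # p is the pivot itself; see above
            acc = acc * (Pc[0] - xtab[j]) % N   # only the denominator x - x'
            cnt += 1
            if cnt == batch:             # batched gcd, as described above
                d = gcd(acc, N)
                if 1 < d < N:
                    raise FactorFound(d)
                acc, cnt = 1, 0
        d = gcd(acc, N)                  # don't forget the last partial batch
        if 1 < d < N:
            raise FactorFound(d)

[A complete, if slow, factoring attempt then amounts to: pick a random
curve and point, run stage1, feed the result to stage2, and catch
FactorFound along the way.]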
+-------------------------+
| Miscellaneous
+-------------------------+

> I've found somewhere that it is possible to write the same curve in
> different ways.  Is it better to use other formulas?

That depends highly on the properties of your implementation.
Conventional wisdom has it that projective coordinates should lead to a
faster algorithm, mainly by getting rid of almost all the divisions mod N
at the cost of an increased number of multiplications, but when I
re-wrote the PARI implementation a couple of years ago, I found this was
not borne out in practice, given PARI's underlying long-integer and
modular arithmetic.

Another aspect is that something can be gained by working only with
curves where one knows in advance that the order of the initial point
will be divisible by 4; however, the price comes in the form of a more
complicated equation for the curve, and thus more complicated addition
and doubling formulas.  (More work on each curve, but you expect that you
won't have to check as many curves before you hit upon a factor of N.)

> Or is it better to work with fractions like xi=xni/xdi, yi=yni/ydi?

If you do it consistently, then you are working with projective
coordinates on the curve. ;)

For an implementation along these lines, check out the file
src/basemath/ifactor1.c in the PARI/GP distribution
(http://www.parigp-home.de/ and follow your nose from there :) ...
and by all means check out Paul's code for comparison!  The PARI code is
heavily commented, but of course it is quite a mouthful...

Standard reference for much of this, besides Knuth, is Henri Cohen's
"A Course in Computational Algebraic Number Theory", Springer GTM 138.