From: stevenj@alum.mit.edu (Steven G. Johnson)
Newsgroups: sci.math
Subject: Re: primitive roots and parallel FFT
Date: Sat, 25 Oct 1997 20:36:39 -0400
kpolit@irio.pz.zgora.pl (Kamil Politowicz) wrote:
>2. The only FFT algorithm I know requires that the vector's length n
>is power of 2. Is there any other algorithm suitable without this
>limitation?
Many other FFT algorithms exist, most of which are not limited to powers of
two. In fact, there are even FFT algorithms for prime lengths.
The most commonly-implemented FFT algorithm is Cooley-Tukey, which takes a
DFT of composite size N = N1 * N2 and expresses it as N2 DFTs of size N1
followed by N1 DFTs of size N2 (with some multiplications by phase factors
in between). You then do this recursively to get an O(N lg N) algorithm.
The algorithm you are thinking of is just Cooley-Tukey in the special case
of N2 = 2, which is known as the "radix-2" Cooley-Tukey algorithm.
Implementations which handle other factors are typically called
"mixed-radix" FFTs.
Of course, the recursion in Cooley-Tukey has to stop when you reach a prime
length. Most implementations require that N be factorizable into small
primes (e.g. < 10), and simply use the O(N^2) algorithm when those sizes
are reached.
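To make the recursion concrete, here is a minimal sketch (my own illustrative code, not any particular library's) of a mixed-radix Cooley-Tukey transform that splits on the smallest prime factor and falls back to the naive O(N^2) DFT at prime lengths. A real implementation would precompute twiddle factors and work in place rather than allocating lists at every level:

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, used as the base case at prime lengths."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def fft(x):
    """Mixed-radix Cooley-Tukey: split N = n1 * n2 on the smallest
    prime factor n1, recurse on the strided subsequences, then
    combine the results with twiddle factors."""
    n = len(x)
    if n == 1:
        return list(x)
    # smallest prime factor of n (or n itself if n is prime)
    n1 = next((p for p in range(2, int(n ** 0.5) + 1) if n % p == 0), n)
    if n1 == n:            # prime length: fall back to the naive DFT
        return dft(x)
    n2 = n // n1
    # n1 sub-transforms of size n2 over the strided subsequences
    subs = [fft(x[r::n1]) for r in range(n1)]
    # recombine: X[k] = sum_r e^{-2 pi i r k / n} * subs[r][k mod n2]
    return [sum(cmath.exp(-2j * cmath.pi * r * k / n) * subs[r][k % n2]
                for r in range(n1))
            for k in range(n)]
```

With N a power of two this reduces to the familiar radix-2 recursion; for composite N with larger factors it splits off one small factor per level.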
However, there are algorithms to compute FFTs of prime lengths in O(N lg N)
time. These are typically faster than the O(N^2) algorithm for anything
more than very small primes, but are much slower than FFTs of nearby
composite lengths. The most commonly-implemented algorithm of this sort is
known as the "chirp-z" algorithm and is due to Bluestein.
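The idea behind Bluestein's algorithm, sketched below in my own illustrative (unoptimized) code: rewrite the DFT exponent using nk = (n^2 + k^2 - (k-n)^2)/2, which turns the length-N DFT into a convolution against a "chirp" e^{-i pi n^2 / N}. That convolution can be zero-padded to any convenient length >= 2N-1 (here, a power of two) and evaluated exactly with an ordinary radix-2 FFT, so prime N is no obstacle:

```python
import cmath

def fft_pow2(x):
    """Radix-2 Cooley-Tukey; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_pow2(x[0::2]), fft_pow2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft_pow2(x):
    """Inverse transform via conjugation."""
    y = fft_pow2([v.conjugate() for v in x])
    return [v.conjugate() / len(x) for v in y]

def bluestein(x):
    """Exact DFT of arbitrary (e.g. prime) length via the chirp-z trick."""
    n = len(x)
    c = [cmath.exp(-1j * cmath.pi * k * k / n) for k in range(n)]
    m = 1                      # pad to a power of two >= 2n - 1
    while m < 2 * n - 1:
        m *= 2
    a = [x[k] * c[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m               # conjugate chirp, wrapped for negative lags
    for k in range(n):
        b[k] = c[k].conjugate()
        if k:
            b[m - k] = c[k].conjugate()
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = ifft_pow2([fa[k] * fb[k] for k in range(m)])
    return [c[k] * conv[k] for k in range(n)]
```

The cost is two or three FFTs of roughly twice the length, which is why prime-length transforms, while still O(N lg N), are noticeably slower than composite lengths nearby.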
There are many other FFT algorithms, e.g. the prime-factor algorithm,
Winograd's algorithm, Bruun's algorithm, and the QFT algorithm, to name a
few. It is actually possible...and potentially useful...to mix several of
these algorithms in a single program. I would recommend that you stick to
Cooley-Tukey in the beginning, though; you can always implement other
algorithms later.
(Note that all the algorithms I am alluding to here are exact...I am not
referring to any sort of zero-padding, which introduces errors into the
spectrum.)
Most parallel 1D FFT implementations that I am aware of do something like
the following:
Do a standard Cooley-Tukey algorithm using N1 = sqrt(N) (or the closest
factor of N to sqrt(N)):
1) Do the N2 FFTs of size N1. You can set up the array so that these
transforms are all uniprocessor transforms.
2) Perform a global transpose of the array (viewed as an N2 by N1 2D
array). This requires communication among the processors.
3) Multiply by the phase factors ("twiddle factors").
4) Do the N1 FFTs of size N2. Again, because of the transpose, these will
all be uniprocessor transforms, and require no communication.
5) Optionally, transpose the array back to its original order.
This method is sometimes called the "4-step" FFT, and was first described
explicitly by Bailey. It is, however, essentially just a specialized form
of Cooley & Tukey's 1965 FFT algorithm.
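In serial form (ignoring the data distribution across processors, and using a naive DFT for the uniprocessor sub-transforms so the sketch stays short), the steps above look roughly like this; the variable names are my own:

```python
import cmath

def dft(x):
    """Naive DFT, standing in for the uniprocessor sub-transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length N = n1*n2 transform via the transpose method.
    Input x is in natural order; so is the output."""
    assert len(x) == n1 * n2
    n = n1 * n2
    # Step 1: n2 transforms of size n1 over the strided "columns"
    # (in parallel, each column lives on one processor)
    cols = [dft([x[j * n2 + i] for j in range(n1)]) for i in range(n2)]
    # Steps 2-3: after the global transpose, scale by twiddle factors
    # (the transpose is implicit here since everything is in one memory)
    for i in range(n2):
        for k1 in range(n1):
            cols[i][k1] *= cmath.exp(-2j * cmath.pi * i * k1 / n)
    # Step 4: n1 transforms of size n2, again communication-free
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[i][k1] for i in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]   # step 5's reordering, done here
    return out
```

In a real distributed-memory code the only interprocessor communication is the global transpose between steps 1 and 4, which is the whole point of the decomposition.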
You can find links to more information on FFTs, along with source code, at
the FFTW home page:
http://theory.lcs.mit.edu/~fftw
FFTW, a C FFT library (not limited to powers of two), also contains
parallel, distributed memory, multi-dimensional FFT code which could be
adapted to do parallel 1D FFTs (by the above method) without great
difficulty.
I would recommend that you make use of some pre-existing FFT
implementation, rather than writing one yourself from scratch, as
implementing a fast FFT program turns out to be non-trivial. If FFTW
doesn't suit your needs for some reason, the FFTW home page contains links
to many other sources of FFT code on the web.
Good luck!
Cordially,
Steven G. Johnson
PS. Unfortunately, the information available on-line regarding FFT
algorithms other than radix-2 Cooley-Tukey is extremely poor. If you
really want to get into this, you'll have to dig into the print literature.
As a starting point, you might look at the references listed in our paper
on FFTW (available at the above web site).