From: fred@genesis.demon.co.uk (Lawrence Kirby)
Subject: Re: Faster integer square root -- modified merge of two routines.
Date: Tue, 11 Jan 2000 12:32:51 GMT
Newsgroups: comp.graphics.algorithms,sci.math.num-analysis,comp.lang.c
Summary: Coding integer square root algorithms

In article <MPG.12e16fc23fecaeea9896c4@news.pacbell.net>
           DONT.qed.SPAM.ME@pobox.com "Paul Hsieh" writes:

>Whoops!  I decided to dig a little deeper into this and look at the real 
>performance of these loops.  It looks like Dann's solution is not too 
>bad.  It really seems to exploit the low decree of accuracy required for 
>integer results.  
>
>In any event, I was able to modify my solution to produce something that 
>is indeed faster, does not have divide by zero problems (*ahem*) and 
>which is not off by one with repect to the input (*ahem* again.)  My 
>own testing using WATCOM C/C++ on an Athlon system shows my code to be 
>consistently faster than Dann's.
>
>int isqrt0(int r) {

There's an immediate problem here in that this only works for signed
integer values. To make this equivalent to Dann's code you'll need
to fix it to work on the full range of 32 bit unsigned integers.

>double x,y;
>float tempf;
>unsigned long *tfptr = (unsigned long *)&tempf;
>int is;
>
>    tempf = r;
>    *tfptr=(0xbe6f0000-*tfptr)>>1;

Obviously this is not portable. I guess it should be fairly portable
between IEEE systems where unsigned long is 32 bits wide, since the
byte order of integers and floating point representations is usually
consistent.

There is a problem with this as far as C is concerned because you are
strictly not allowed to access, for example, a float defined object
using an lvalue with other than float or a character type. A C compiler
could, for example, no bother to reload tempf because it knows
that tempf cannot have been legitimately changed by the expression
above. Using volatile may be an improvement but that may of course
interfere with the compiler's optimisations.

>    x=tempf;
>    y=r*0.5; 
>    x*=1.5-x*x*y; 
>    if( r>101123 ) x*=1.5-x*x*y;
>    x*=r;
>
>    /* This method of float=>int conversion is much faster than the 
>       standard C cast, however it round instead of truncates. */

Doesn't that depend on the rounding mode that happens to have been set
in the FPU (i.e. you can't assume that it won't truncate)?

>    _asm {
>        fld x
>        fistp is
>    }
>
>    if( is*is>r ) is--;
>
>    return is;
>}

Bearing in mind that "portable" C is desirable (at least because this
is cross-posed to comp.lang.c) I've included is a portable (as far as
possible) test program for the algorithms in the thread, plus
a couple more. Here are two sets of results

1. Compiled by an oldish gcc (2.7.2.1) running on a 200Mhz PPro system
   gcc -O2 -fomit-frame-pointer

RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=200.0MHz

Function      Overall Time    Clocks per call   Checksum

 dummy            0.470              9.2        00000000
 ldiv            10.500            205.1        004d1d7a
 Dann             8.640            168.8        004d1d7a
 Fred             4.140             80.9        004d1d7a
 Stephen          8.940            174.6        004d1d7a
 Paul             8.680            169.5        004d1d7a
 Paul2            8.170            159.6        004d1d7a
 float            8.400            164.1        004d1d7a
 floati           7.560            147.7        004d1d7a

1. Compiled by MSVC6 running on a dual PIII 450Mhz system.
   cl -O2

RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=450.0MHz

Function      Overall Time    Clocks per call   Checksum

 dummy            0.203              8.9        00000000
 ldiv             4.734            208.0        005334f9
 Dann             3.922            172.4        005334f9
 Fred             1.829             80.4        005334f9
 Stephen          3.860            169.6        005334f9
 Paul             3.547            155.9        005334f9
 Paul2            3.859            169.6        005334f9
 float            3.484            153.1        005334f9
 floati           2.984            131.1        005334f9

And here's the code


#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>


#define CPU_CLOCK    450.0             /* The clock rate of your CPU (MHz) */
#define REPEAT_COUNT 10000L            /* Times to run through all the data */
#define RANDARRAY_SIZE 1024            /* Number of random test values */


static void init_randarray(unsigned long *);
static void test(unsigned long *,
                 const char *,
                 unsigned (*)(unsigned long));
static void test_correct(const char *,
                         unsigned (*)(unsigned long),
                         unsigned long);

static unsigned dummy(unsigned long);
static unsigned ldiv_sqrt(unsigned long);
static unsigned dann_sqrt(unsigned long);
static unsigned fred_sqrt(unsigned long);
static unsigned stephen_sqrt(unsigned long);
static unsigned paul_sqrt(unsigned long);
static unsigned paul2_sqrt(unsigned long);
static unsigned float_sqrt(unsigned long);
static unsigned floati_sqrt(unsigned long);


/***************************************************************************
 */

int main(void)
{
    unsigned long randarray[RANDARRAY_SIZE];

    srand((unsigned)time(NULL));
    init_randarray(randarray);

    printf("RANDARRAY_SIZE=%d REPEAT_COUNT=%ld CPU_CLOCK=%.1fMHz\n\n",
           RANDARRAY_SIZE, REPEAT_COUNT, (double)CPU_CLOCK);
    printf("Function      Overall Time    Clocks per call   Checksum\n\n");

    test(randarray, "dummy", dummy);
    test(randarray, "ldiv", ldiv_sqrt);
    test(randarray, "Dann", dann_sqrt);
    test(randarray, "Fred", fred_sqrt);
    test(randarray, "Stephen", stephen_sqrt);
    test(randarray, "Paul", paul_sqrt);
    test(randarray, "Paul2", paul2_sqrt);
    test(randarray, "float", float_sqrt);
    test(randarray, "floati", floati_sqrt);

#if TEST_CORRECT
    test_correct("fred_sqrt", fred_sqrt, 0xffffffff);
#endif

    return 0;
}


/***************************************************************************
 * Creates an array of random numbers for quick access. The distribution
 * is intended to produce a good mix of small and large numbers amd
 * make life difficult for Dann's algorithm.
 */

static void init_randarray(unsigned long *randarray)
{
    const double k = log(0xffffffff) / (RAND_MAX + 1.0);
    size_t i;

    for (i = 0; i < RANDARRAY_SIZE; i++) {
        unsigned long value = exp(k * rand());

        if (value >= 0x80000000)        /* For signed integer algs */
            value /= 2;
        randarray[i] = value;
    }
}


/***************************************************************************
 * Test and time the specified function and output the results. The
 * checksum is to do something with the result and give a rough indication
 * that the functions are producing the same vslues.
 */

static void test(
    unsigned long *randarray,
    const char *name,
    unsigned (*func)(unsigned long))
{
    long rc;
    unsigned long checksum;
    clock_t start, end;
    double interval, clocks_per_call;

    fflush(stdout);
    
    end = clock();
    while ((start = clock()) == end)        /* For very granular clocks */
        ;

    for (rc = 0; rc < REPEAT_COUNT; rc++) {
        size_t i;

        checksum = 0;
        for (i = 0; i < RANDARRAY_SIZE; i++)
            checksum += (*func)(randarray[i]);
    }

    end = clock();

    interval = (double)(end-start)/(double)CLOCKS_PER_SEC;
    clocks_per_call = interval * CPU_CLOCK * 1000000
                                          / REPEAT_COUNT / RANDARRAY_SIZE;

    printf(" %-10s   %9.3f          %7.1f        %.8lx\n",
             name,  interval,  clocks_per_call,  checksum);
}


/***************************************************************************
 * Test the correctness of a function over its entire input range.
 * This does give some time statistics however it can take a long time
 * and on some systems it is possible for clock() to wrap before
 * execution is complete giving a nonsense result
 */

static void test_correct(
    const char *name,
    unsigned (*func)(unsigned long),
    unsigned long highval)
{
    unsigned long i;
    clock_t start, end;
    double interval, clocks_per_call;

    printf("Testing %s for correctness\n", name);
    fflush(stdout);
    
    end = clock();
    while ((start = clock()) == end)        /* For very granular clocks */
        ;

    i = 0;
    do {
        unsigned long result = fred_sqrt(i);
        unsigned long lobound = result * result;
        unsigned long hibound = lobound + 2*result;

        if (result > 65535 || i < lobound || i > hibound)
            printf("Error: %lu %lu %.3f\n", i, result, sqrt(i));
    } while (++i <= highval && i != 0);

    end = clock();

    interval = (double)(end-start)/(double)CLOCKS_PER_SEC;
    clocks_per_call = interval * CPU_CLOCK * 1000000 / (highval + 1.0);

    printf(" %-10s   %9.3f          %7.1f\n", name, interval, clocks_per_call);
}


/***************************************************************************
 * Bare function to test loop and function call overhead
 */

static unsigned dummy(unsigned long x)
{
    return 0;
}


/***************************************************************************
 * An algorithm based on an old "long division" pen and paper technique
 */

static unsigned ldiv_sqrt(unsigned long n)
{
    int i;
    unsigned result;
    unsigned long tmp, acc;

    result = 0;
    acc = 0;
    for (i = 30; i >= 0; i -= 2) {
        acc = acc*4 + ((n>>i)&3);
        tmp = result*4UL + 1;
        result <<= 1;
        if (acc >= tmp) {
            acc -= tmp;
            result++;
        }
    }
    return result;
}


/***************************************************************************
 * Dann Corbitt's function "fixed" and pulled into one function (which
 * appears a little faster in my tests)
 */

static unsigned dann_sqrt(unsigned long x)
{
    static const unsigned char sqq_table[] = {
        0, 16, 22, 27, 32, 35, 39, 42, 45, 48, 50, 53, 55, 57, 59, 61, 64, 65,
        67, 69, 71, 73, 75, 76, 78, 80, 81, 83, 84, 86, 87, 89, 90, 91, 93, 94,
        96, 97, 98, 99, 101, 102, 103, 104, 106, 107, 108, 109, 110, 112, 113,
        114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 128,
        128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141,
        142, 143, 144, 144, 145, 146, 147, 148, 149, 150, 150, 151, 152, 153,
        154, 155, 155, 156, 157, 158, 159, 160, 160, 161, 162, 163, 163, 164,
        165, 166, 167, 167, 168, 169, 170, 170, 171, 172, 173, 173, 174, 175,
        176, 176, 177, 178, 178, 179, 180, 181, 181, 182, 183, 183, 184, 185,
        185, 186, 187, 187, 188, 189, 189, 190, 191, 192, 192, 193, 193, 194,
        195, 195, 196, 197, 197, 198, 199, 199, 200, 201, 201, 202, 203, 203,
        204, 204, 205, 206, 206, 207, 208, 208, 209, 209, 210, 211, 211, 212,
        212, 213, 214, 214, 215, 215, 216, 217, 217, 218, 218, 219, 219, 220,
        221, 221, 222, 222, 223, 224, 224, 225, 225, 226, 226, 227, 227, 228,
        229, 229, 230, 230, 231, 231, 232, 232, 233, 234, 234, 235, 235, 236,
        236, 237, 237, 238, 238, 239, 240, 240, 241, 241, 242, 242, 243, 243,
        244, 244, 245, 245, 246, 246, 247, 247, 248, 248, 249, 249, 250, 250,
        251, 251, 252, 252, 253, 253, 254, 254, 255
    };

    unsigned long xn;

    if (x >= 0x10000)
        if (x >= 0x1000000)
            if (x >= 0x10000000)
                if (x >= 0x40000000)
                    xn = (sqq_table[x >> 24] << 8);
                else
                    xn = (sqq_table[x >> 22] << 7);
            else
                if (x >= 0x4000000)
                    xn = (sqq_table[x >> 20] << 6);
                else
                    xn = (sqq_table[x >> 18] << 5);
        else
            if (x >= 0x100000)
                if (x >= 0x400000)
                    xn = (sqq_table[x >> 16] << 4);
                else
                    xn = (sqq_table[x >> 14] << 3);
            else
                if (x >= 0x40000)
                    xn = (sqq_table[x >> 12] << 2);
                else
                    xn = (sqq_table[x >> 10] << 1);
    else
        if (x >= 0x100)
            if (x >= 0x1000)
                if (x >= 0x4000)
                    xn = (sqq_table[x >>  8] >> 0);
                else
                    xn = (sqq_table[x >>  6] >> 1);
            else
                if (x >= 0x400)
                    xn = (sqq_table[x >>  4] >> 2);
                else
                    xn = (sqq_table[x >>  2] >> 3);
        else
            if (x >= 0x10)
                if (x >= 0x40)
                    xn = (sqq_table[x >>  0] >> 4);
                else
                    xn = (sqq_table[x <<  2] >> 5);
            else
                if (x >= 0x4)
                    xn = (sqq_table[x <<  4] >> 6);
                else {
                    if (x <= 1)
                        return x;
                    xn = (sqq_table[x <<  6] >> 7);
                }

    xn = (xn + x / xn) / 2;
    xn = (xn + x / xn) / 2;
    xn = (xn + x / xn) / 2;

    if (xn * xn > x)
        xn--;

    return xn;
}


/***************************************************************************
 * Once Dann's algorithm is in one function it becomes clear that it
 * is doing too much work in many cases (e.g. performing the NR step
 * 3 times even for small numbers. Since the if-else tree has already
 * established magnitude information we can use this to tailor the
 * subsequent work to be done, for a significant performance gain.
 * I'm sure that further tweaking is possible but this version is
 * at least proved correct (by the TEST_CORRECT code in main).
 */

static unsigned fred_sqrt(unsigned long x)
{
    static const unsigned char sqq_table[] = {
       0,  16,  22,  27,  32,  35,  39,  42,  45,  48,  50,  53,  55,  57,
      59,  61,  64,  65,  67,  69,  71,  73,  75,  76,  78,  80,  81,  83,
      84,  86,  87,  89,  90,  91,  93,  94,  96,  97,  98,  99, 101, 102,
     103, 104, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118,
     119, 120, 121, 122, 123, 124, 125, 126, 128, 128, 129, 130, 131, 132,
     133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 144, 145,
     146, 147, 148, 149, 150, 150, 151, 152, 153, 154, 155, 155, 156, 157,
     158, 159, 160, 160, 161, 162, 163, 163, 164, 165, 166, 167, 167, 168,
     169, 170, 170, 171, 172, 173, 173, 174, 175, 176, 176, 177, 178, 178,
     179, 180, 181, 181, 182, 183, 183, 184, 185, 185, 186, 187, 187, 188,
     189, 189, 190, 191, 192, 192, 193, 193, 194, 195, 195, 196, 197, 197,
     198, 199, 199, 200, 201, 201, 202, 203, 203, 204, 204, 205, 206, 206,
     207, 208, 208, 209, 209, 210, 211, 211, 212, 212, 213, 214, 214, 215,
     215, 216, 217, 217, 218, 218, 219, 219, 220, 221, 221, 222, 222, 223,
     224, 224, 225, 225, 226, 226, 227, 227, 228, 229, 229, 230, 230, 231,
     231, 232, 232, 233, 234, 234, 235, 235, 236, 236, 237, 237, 238, 238,
     239, 240, 240, 241, 241, 242, 242, 243, 243, 244, 244, 245, 245, 246,
     246, 247, 247, 248, 248, 249, 249, 250, 250, 251, 251, 252, 252, 253,
     253, 254, 254, 255
    };

    unsigned long xn;

    if (x >= 0x10000)
        if (x >= 0x1000000)
            if (x >= 0x10000000)
                if (x >= 0x40000000) {
                    if (x >= 65535UL*65535UL)
                        return 65535;
                    xn = sqq_table[x>>24] << 8;
                } else
                    xn = sqq_table[x>>22] << 7;
            else
                if (x >= 0x4000000)
                    xn = sqq_table[x>>20] << 6;
                else
                    xn = sqq_table[x>>18] << 5;
        else {
            if (x >= 0x100000)
                if (x >= 0x400000)
                    xn = sqq_table[x>>16] << 4;
                else
                    xn = sqq_table[x>>14] << 3;
            else
                if (x >= 0x40000)
                    xn = sqq_table[x>>12] << 2;
                else
                    xn = sqq_table[x>>10] << 1;

            goto nr1;
        }
    else
        if (x >= 0x100) {
            if (x >= 0x1000)
                if (x >= 0x4000)
                    xn = (sqq_table[x>>8] >> 0) + 1;
                else
                    xn = (sqq_table[x>>6] >> 1) + 1;
            else
                if (x >= 0x400)
                    xn = (sqq_table[x>>4] >> 2) + 1;
                else
                    xn = (sqq_table[x>>2] >> 3) + 1;

            goto adj;
        } else
            return sqq_table[x] >> 4;

    xn = (xn + 1 + x / xn) / 2;
nr1:
    xn = (xn + 1 + x / xn) / 2;
adj:
    if (xn * xn > x)
        xn--;

    return xn;
}


/***************************************************************************
 * This algorithm is based on the code posted by Stephen Bye
 */

static unsigned stephen_sqrt(unsigned long n)
{
    int i;
    unsigned long trial;
    unsigned d = 1U<<15;
    unsigned result = 0;
    unsigned long result2 = 0;

    for (i = 15; i >= 0; i--) {
        trial = result2 + ((2UL*result + d) << i);
        if (trial <= n) {
            result += d;
            result2 = trial;
        }
        d >>= 1;
    }

    return result;
}


/***************************************************************************
 * Use the built-in floating point sqrt() function
 */

static unsigned float_sqrt(unsigned long n)
{
    return sqrt(n);
}


/***************************************************************************
 * Use the built-in sqrt() function converting between floating point and
 * signed integers. Has reduced range but can be faster than float_sqrt().
 */

static unsigned floati_sqrt(unsigned long n)
{
    return (long)sqrt((long)n);
}


/***************************************************************************
 * Essentially the algorithm posted by Paul Hsieh. This is limited to
 * signed values and I've had to remove the inline assembly by default
 * for portability between compilers. It still makes non-portable
 * assumptions however.
 */

static unsigned paul_sqrt(unsigned long r)
{
    double x, y;
    float tempf;
    unsigned long *tfptr = (unsigned long *)&tempf;
    long is;

    tempf = (long)r;
    *tfptr = (0xbe6f0000-*tfptr)>>1;
    x = tempf;
    y = r*0.5;
    x *= 1.5-x*x*y;
    if (r >101123)
        x *= 1.5-x*x*y;
    x *= r;
    
#if USE_ASM
    asm {
        fld x
        fistp is
    }
#else
    is = x + 0.5;
#endif

    if (is*is > r)
        is--;

    return is;
}
   

/***************************************************************************
 * As paul_sqrt except that the float/unsigned long aliasing is done
 * directly through casts rather than using a separate pointer variable.
 */

static unsigned paul2_sqrt(unsigned long r)
{
    double x, y;
    float tempf;
    long is;

    tempf = (long)r;
    *(unsigned long *)&tempf = (0xbe6f0000-*(unsigned long *)&tempf)>>1;
    x = tempf;
    y = r*0.5;
    x *= 1.5-x*x*y;
    if (r >101123)
        x *= 1.5-x*x*y;
    x *= r;
    
#if USE_ASM
    asm {
        fld x
        fistp is
    }
#else
    is = x + 0.5;
#endif

    if (is*is > r)
        is--;

    return is;
}

-- 
-----------------------------------------
Lawrence Kirby | fred@genesis.demon.co.uk
Wilts, England | 70734.126@compuserve.com
-----------------------------------------
==============================================================================

From: DONT.qed.SPAM.ME@pobox.com (Paul Hsieh)
Subject: Re: Faster integer square root -- modified merge of two routines.
Date: Wed, 12 Jan 2000 09:53:58 -0800
Newsgroups: comp.graphics.algorithms,sci.math.num-analysis,comp.lang.c

Lawrence Kirby wrote:
> DONT.qed.SPAM.ME@pobox.com "Paul Hsieh" writes:
> >    x=tempf;
> >    y=r*0.5; 
> >    x*=1.5-x*x*y; 
> >    if( r>101123 ) x*=1.5-x*x*y;
> >    x*=r;
> >
> >    /* This method of float=>int conversion is much faster than the 
> >       standard C cast, however it round instead of truncates. */
> 
> Doesn't that depend on the rounding mode that happens to have been set
> in the FPU (i.e. you can't assume that it won't truncate)?

The rounding mode I was referring to was the default one.  The rounding 
mode is not typically changed at run time -- however the nVidia code 
appears to have coded a work around in their solution that could be used.
(I've mirrored their code at: http://www.pobox.com/~qed/fastmath.cpp 
since its kind of hard to find on their site.)

> >    _asm {
> >        fld x
> >        fistp is
> >    }

So as nVidia suggests, perhaps this could be changed to

  _asm {
    fld x
    frndint
    fistp is
  }

to more robust in light of dynamic rounding mode changes.  In any event I 
was mostly interested in removing the dilapidated float => int code that 
the compiler is forced to use to work around the x87's inane design.

> >
> >    if( is*is>r ) is--;
> >
> >    return is;
> >}
> 
> Bearing in mind that "portable" C is desirable (at least because this
> is cross-posed to comp.lang.c) I've included is a portable (as far as
> possible) test program for the algorithms in the thread, plus
> a couple more. Here are two sets of results
> 
> 1. Compiled by an oldish gcc (2.7.2.1) running on a 200Mhz PPro system
>    gcc -O2 -fomit-frame-pointer
> 
> RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=200.0MHz
> 
> Function      Overall Time    Clocks per call   Checksum
> 
>  dummy            0.470              9.2        00000000
>  ldiv            10.500            205.1        004d1d7a
>  Dann             8.640            168.8        004d1d7a
>  Fred             4.140             80.9        004d1d7a
>  Stephen          8.940            174.6        004d1d7a
>  Paul             8.680            169.5        004d1d7a
>  Paul2            8.170            159.6        004d1d7a
>  float            8.400            164.1        004d1d7a
>  floati           7.560            147.7        004d1d7a

The measurable difference between "Paul" and "Paul2" is telling of the 
quality of the gcc compiler here.
 
> 1. Compiled by MSVC6 running on a dual PIII 450Mhz system.
>    cl -O2
> 
> RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=450.0MHz
> 
> Function      Overall Time    Clocks per call   Checksum
> 
>  dummy            0.203              8.9        00000000
>  ldiv             4.734            208.0        005334f9
>  Dann             3.922            172.4        005334f9
>  Fred             1.829             80.4        005334f9
>  Stephen          3.860            169.6        005334f9
>  Paul             3.547            155.9        005334f9
>  Paul2            3.859            169.6        005334f9
>  float            3.484            153.1        005334f9
>  floati           2.984            131.1        005334f9

Dude, you do know that the PPro and P-!!! are pretty much the same cores 
right?  I don't consider these two as seperate data points.  (Except for
watching MSVC suffering from the opposite problem as gcc -- *sigh*.)

I recently got some mail from Aleks Jakulin that appears to have 
vindicated me at least somewhat.  Here are his results on a DEC Alpha:

  RANDARRAY_SIZE=1024 REPEAT_COUNT=10000

  Function      Overall Time    Clocks per call   Checksum

   dummy            0.309             13.6        00000000
   ldiv             3.802            167.1        0057ddd9
   Dann             7.265            319.3        0057ddd9
   Fred             3.220            141.5        0057ddd9
   Stephen          3.600            158.2        0057ddd9
   Paul             2.681            117.8        0057ddd9
   Paul2            2.680            117.8        0057ddd9
   float           21.424            941.5        0057ddd9
   floati          21.407            940.7        0057ddd9

On my Athlon 500mhz using WATCOM C/C++ as well as some code modifications 
I get:

   dummy            0.220              9.7        00000000
   ldiv             4.390            192.9        0049848d
   Dann             3.350            147.2        0049848d
   Fred             1.490             65.5        0049848d
   Stephen          4.400            193.4        0049848d
   Paul             3.680            161.7        0049848d
   Paul2            3.680            161.7        0049848d
   Paul3            2.960            130.1        0049848d
   float            2.470            108.5        0049848d
   floati           1.430             62.8        0049848d
   floati2          0.760             33.4        0049848d

(Notice how Paul and Paul2 have the same performance in the above two 
results.)

The Athlon's square root hardware laughs in all our collective faces.  Oh 
well. :o)  The "Fred" solution looks like its pretty consistently good as
well, so I will have to add it to my square root page (if anyone cares 
its at http://www.pobox.com/~qed/sqroot.html )

My changes amounted to adding Paul3 and floati2:

static unsigned paul3_sqrt(unsigned long r)
{
    double x, y;
    float tempf;
    long is;

    tempf = (long)r;
    *(unsigned long *)&tempf = (0xbe6f0000-*(unsigned long *)&tempf)>>1;
    x = tempf;
    y = r*0.5;
    x *= 1.5-x*x*y;
    if (r > 101123)
        x *= 1.5-x*x*y;
    x *= r;
    
#if defined(__WIN32__) | defined(__WATCOMC__)
    _asm {
        fld x
        fistp is
    }
#else
    is = x+0.5;
#endif
    is += (((signed int)(r-is*is))>>31);

    return is;
}

// Drum roll please ...

static unsigned floati_sqrt2(unsigned long n)
{
#if defined(__WIN32__) | defined(__WATCOMC__)
    float half=0.4999999;
    int r;

    _asm {
        fild n
        fsqrt
        fsub half
        fistp r
    }
    return r;
#else
    return (long)sqrt((long)n);
#endif
}

--
Paul Hsieh
http://www.pobox.com/~qed/mailme.html