From: fred@genesis.demon.co.uk (Lawrence Kirby) Subject: Re: Faster integer square root -- modified merge of two routines. Date: Tue, 11 Jan 2000 12:32:51 GMT Newsgroups: comp.graphics.algorithms,sci.math.num-analysis,comp.lang.c Summary: Coding integer square root algorithms In article DONT.qed.SPAM.ME@pobox.com "Paul Hsieh" writes: >Whoops! I decided to dig a little deeper into this and look at the real >performance of these loops. It looks like Dann's solution is not too >bad. It really seems to exploit the low decree of accuracy required for >integer results. > >In any event, I was able to modify my solution to produce something that >is indeed faster, does not have divide by zero problems (*ahem*) and >which is not off by one with repect to the input (*ahem* again.) My >own testing using WATCOM C/C++ on an Athlon system shows my code to be >consistently faster than Dann's. > >int isqrt0(int r) { There's an immediate problem here in that this only works for signed integer values. To make this equivalent to Dann's code you'll need to fix it to work on the full range of 32 bit unsigned integers. >double x,y; >float tempf; >unsigned long *tfptr = (unsigned long *)&tempf; >int is; > > tempf = r; > *tfptr=(0xbe6f0000-*tfptr)>>1; Obviously this is not portable. I guess it should be fairly portable between IEEE systems where unsigned long is 32 bits wide, since the byte order of integers and floating point representations is usually consistent. There is a problem with this as far as C is concerned because you are strictly not allowed to access, for example, a float defined object using an lvalue with other than float or a character type. A C compiler could, for example, no bother to reload tempf because it knows that tempf cannot have been legitimately changed by the expression above. Using volatile may be an improvement but that may of course interfere with the compiler's optimisations. > x=tempf; > y=r*0.5; > x*=1.5-x*x*y; > if( r>101123 ) x*=1.5-x*x*y; > x*=r; > > /* This method of float=>int conversion is much faster than the > standard C cast, however it round instead of truncates. */ Doesn't that depend on the rounding mode that happens to have been set in the FPU (i.e. you can't assume that it won't truncate)? > _asm { > fld x > fistp is > } > > if( is*is>r ) is--; > > return is; >} Bearing in mind that "portable" C is desirable (at least because this is cross-posed to comp.lang.c) I've included is a portable (as far as possible) test program for the algorithms in the thread, plus a couple more. Here are two sets of results 1. Compiled by an oldish gcc (2.7.2.1) running on a 200Mhz PPro system gcc -O2 -fomit-frame-pointer RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=200.0MHz Function Overall Time Clocks per call Checksum dummy 0.470 9.2 00000000 ldiv 10.500 205.1 004d1d7a Dann 8.640 168.8 004d1d7a Fred 4.140 80.9 004d1d7a Stephen 8.940 174.6 004d1d7a Paul 8.680 169.5 004d1d7a Paul2 8.170 159.6 004d1d7a float 8.400 164.1 004d1d7a floati 7.560 147.7 004d1d7a 1. Compiled by MSVC6 running on a dual PIII 450Mhz system. cl -O2 RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=450.0MHz Function Overall Time Clocks per call Checksum dummy 0.203 8.9 00000000 ldiv 4.734 208.0 005334f9 Dann 3.922 172.4 005334f9 Fred 1.829 80.4 005334f9 Stephen 3.860 169.6 005334f9 Paul 3.547 155.9 005334f9 Paul2 3.859 169.6 005334f9 float 3.484 153.1 005334f9 floati 2.984 131.1 005334f9 And here's the code #include #include #include #include #include #define CPU_CLOCK 450.0 /* The clock rate of your CPU (MHz) */ #define REPEAT_COUNT 10000L /* Times to run through all the data */ #define RANDARRAY_SIZE 1024 /* Number of random test values */ static void init_randarray(unsigned long *); static void test(unsigned long *, const char *, unsigned (*)(unsigned long)); static void test_correct(const char *, unsigned (*)(unsigned long), unsigned long); static unsigned dummy(unsigned long); static unsigned ldiv_sqrt(unsigned long); static unsigned dann_sqrt(unsigned long); static unsigned fred_sqrt(unsigned long); static unsigned stephen_sqrt(unsigned long); static unsigned paul_sqrt(unsigned long); static unsigned paul2_sqrt(unsigned long); static unsigned float_sqrt(unsigned long); static unsigned floati_sqrt(unsigned long); /*************************************************************************** */ int main(void) { unsigned long randarray[RANDARRAY_SIZE]; srand((unsigned)time(NULL)); init_randarray(randarray); printf("RANDARRAY_SIZE=%d REPEAT_COUNT=%ld CPU_CLOCK=%.1fMHz\n\n", RANDARRAY_SIZE, REPEAT_COUNT, (double)CPU_CLOCK); printf("Function Overall Time Clocks per call Checksum\n\n"); test(randarray, "dummy", dummy); test(randarray, "ldiv", ldiv_sqrt); test(randarray, "Dann", dann_sqrt); test(randarray, "Fred", fred_sqrt); test(randarray, "Stephen", stephen_sqrt); test(randarray, "Paul", paul_sqrt); test(randarray, "Paul2", paul2_sqrt); test(randarray, "float", float_sqrt); test(randarray, "floati", floati_sqrt); #if TEST_CORRECT test_correct("fred_sqrt", fred_sqrt, 0xffffffff); #endif return 0; } /*************************************************************************** * Creates an array of random numbers for quick access. The distribution * is intended to produce a good mix of small and large numbers amd * make life difficult for Dann's algorithm. */ static void init_randarray(unsigned long *randarray) { const double k = log(0xffffffff) / (RAND_MAX + 1.0); size_t i; for (i = 0; i < RANDARRAY_SIZE; i++) { unsigned long value = exp(k * rand()); if (value >= 0x80000000) /* For signed integer algs */ value /= 2; randarray[i] = value; } } /*************************************************************************** * Test and time the specified function and output the results. The * checksum is to do something with the result and give a rough indication * that the functions are producing the same vslues. */ static void test( unsigned long *randarray, const char *name, unsigned (*func)(unsigned long)) { long rc; unsigned long checksum; clock_t start, end; double interval, clocks_per_call; fflush(stdout); end = clock(); while ((start = clock()) == end) /* For very granular clocks */ ; for (rc = 0; rc < REPEAT_COUNT; rc++) { size_t i; checksum = 0; for (i = 0; i < RANDARRAY_SIZE; i++) checksum += (*func)(randarray[i]); } end = clock(); interval = (double)(end-start)/(double)CLOCKS_PER_SEC; clocks_per_call = interval * CPU_CLOCK * 1000000 / REPEAT_COUNT / RANDARRAY_SIZE; printf(" %-10s %9.3f %7.1f %.8lx\n", name, interval, clocks_per_call, checksum); } /*************************************************************************** * Test the correctness of a function over its entire input range. * This does give some time statistics however it can take a long time * and on some systems it is possible for clock() to wrap before * execution is complete giving a nonsense result */ static void test_correct( const char *name, unsigned (*func)(unsigned long), unsigned long highval) { unsigned long i; clock_t start, end; double interval, clocks_per_call; printf("Testing %s for correctness\n", name); fflush(stdout); end = clock(); while ((start = clock()) == end) /* For very granular clocks */ ; i = 0; do { unsigned long result = fred_sqrt(i); unsigned long lobound = result * result; unsigned long hibound = lobound + 2*result; if (result > 65535 || i < lobound || i > hibound) printf("Error: %lu %lu %.3f\n", i, result, sqrt(i)); } while (++i <= highval && i != 0); end = clock(); interval = (double)(end-start)/(double)CLOCKS_PER_SEC; clocks_per_call = interval * CPU_CLOCK * 1000000 / (highval + 1.0); printf(" %-10s %9.3f %7.1f\n", name, interval, clocks_per_call); } /*************************************************************************** * Bare function to test loop and function call overhead */ static unsigned dummy(unsigned long x) { return 0; } /*************************************************************************** * An algorithm based on an old "long division" pen and paper technique */ static unsigned ldiv_sqrt(unsigned long n) { int i; unsigned result; unsigned long tmp, acc; result = 0; acc = 0; for (i = 30; i >= 0; i -= 2) { acc = acc*4 + ((n>>i)&3); tmp = result*4UL + 1; result <<= 1; if (acc >= tmp) { acc -= tmp; result++; } } return result; } /*************************************************************************** * Dann Corbitt's function "fixed" and pulled into one function (which * appears a little faster in my tests) */ static unsigned dann_sqrt(unsigned long x) { static const unsigned char sqq_table[] = { 0, 16, 22, 27, 32, 35, 39, 42, 45, 48, 50, 53, 55, 57, 59, 61, 64, 65, 67, 69, 71, 73, 75, 76, 78, 80, 81, 83, 84, 86, 87, 89, 90, 91, 93, 94, 96, 97, 98, 99, 101, 102, 103, 104, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 128, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 144, 145, 146, 147, 148, 149, 150, 150, 151, 152, 153, 154, 155, 155, 156, 157, 158, 159, 160, 160, 161, 162, 163, 163, 164, 165, 166, 167, 167, 168, 169, 170, 170, 171, 172, 173, 173, 174, 175, 176, 176, 177, 178, 178, 179, 180, 181, 181, 182, 183, 183, 184, 185, 185, 186, 187, 187, 188, 189, 189, 190, 191, 192, 192, 193, 193, 194, 195, 195, 196, 197, 197, 198, 199, 199, 200, 201, 201, 202, 203, 203, 204, 204, 205, 206, 206, 207, 208, 208, 209, 209, 210, 211, 211, 212, 212, 213, 214, 214, 215, 215, 216, 217, 217, 218, 218, 219, 219, 220, 221, 221, 222, 222, 223, 224, 224, 225, 225, 226, 226, 227, 227, 228, 229, 229, 230, 230, 231, 231, 232, 232, 233, 234, 234, 235, 235, 236, 236, 237, 237, 238, 238, 239, 240, 240, 241, 241, 242, 242, 243, 243, 244, 244, 245, 245, 246, 246, 247, 247, 248, 248, 249, 249, 250, 250, 251, 251, 252, 252, 253, 253, 254, 254, 255 }; unsigned long xn; if (x >= 0x10000) if (x >= 0x1000000) if (x >= 0x10000000) if (x >= 0x40000000) xn = (sqq_table[x >> 24] << 8); else xn = (sqq_table[x >> 22] << 7); else if (x >= 0x4000000) xn = (sqq_table[x >> 20] << 6); else xn = (sqq_table[x >> 18] << 5); else if (x >= 0x100000) if (x >= 0x400000) xn = (sqq_table[x >> 16] << 4); else xn = (sqq_table[x >> 14] << 3); else if (x >= 0x40000) xn = (sqq_table[x >> 12] << 2); else xn = (sqq_table[x >> 10] << 1); else if (x >= 0x100) if (x >= 0x1000) if (x >= 0x4000) xn = (sqq_table[x >> 8] >> 0); else xn = (sqq_table[x >> 6] >> 1); else if (x >= 0x400) xn = (sqq_table[x >> 4] >> 2); else xn = (sqq_table[x >> 2] >> 3); else if (x >= 0x10) if (x >= 0x40) xn = (sqq_table[x >> 0] >> 4); else xn = (sqq_table[x << 2] >> 5); else if (x >= 0x4) xn = (sqq_table[x << 4] >> 6); else { if (x <= 1) return x; xn = (sqq_table[x << 6] >> 7); } xn = (xn + x / xn) / 2; xn = (xn + x / xn) / 2; xn = (xn + x / xn) / 2; if (xn * xn > x) xn--; return xn; } /*************************************************************************** * Once Dann's algorithm is in one function it becomes clear that it * is doing too much work in many cases (e.g. performing the NR step * 3 times even for small numbers. Since the if-else tree has already * established magnitude information we can use this to tailor the * subsequent work to be done, for a significant performance gain. * I'm sure that further tweaking is possible but this version is * at least proved correct (by the TEST_CORRECT code in main). */ static unsigned fred_sqrt(unsigned long x) { static const unsigned char sqq_table[] = { 0, 16, 22, 27, 32, 35, 39, 42, 45, 48, 50, 53, 55, 57, 59, 61, 64, 65, 67, 69, 71, 73, 75, 76, 78, 80, 81, 83, 84, 86, 87, 89, 90, 91, 93, 94, 96, 97, 98, 99, 101, 102, 103, 104, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 128, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 144, 145, 146, 147, 148, 149, 150, 150, 151, 152, 153, 154, 155, 155, 156, 157, 158, 159, 160, 160, 161, 162, 163, 163, 164, 165, 166, 167, 167, 168, 169, 170, 170, 171, 172, 173, 173, 174, 175, 176, 176, 177, 178, 178, 179, 180, 181, 181, 182, 183, 183, 184, 185, 185, 186, 187, 187, 188, 189, 189, 190, 191, 192, 192, 193, 193, 194, 195, 195, 196, 197, 197, 198, 199, 199, 200, 201, 201, 202, 203, 203, 204, 204, 205, 206, 206, 207, 208, 208, 209, 209, 210, 211, 211, 212, 212, 213, 214, 214, 215, 215, 216, 217, 217, 218, 218, 219, 219, 220, 221, 221, 222, 222, 223, 224, 224, 225, 225, 226, 226, 227, 227, 228, 229, 229, 230, 230, 231, 231, 232, 232, 233, 234, 234, 235, 235, 236, 236, 237, 237, 238, 238, 239, 240, 240, 241, 241, 242, 242, 243, 243, 244, 244, 245, 245, 246, 246, 247, 247, 248, 248, 249, 249, 250, 250, 251, 251, 252, 252, 253, 253, 254, 254, 255 }; unsigned long xn; if (x >= 0x10000) if (x >= 0x1000000) if (x >= 0x10000000) if (x >= 0x40000000) { if (x >= 65535UL*65535UL) return 65535; xn = sqq_table[x>>24] << 8; } else xn = sqq_table[x>>22] << 7; else if (x >= 0x4000000) xn = sqq_table[x>>20] << 6; else xn = sqq_table[x>>18] << 5; else { if (x >= 0x100000) if (x >= 0x400000) xn = sqq_table[x>>16] << 4; else xn = sqq_table[x>>14] << 3; else if (x >= 0x40000) xn = sqq_table[x>>12] << 2; else xn = sqq_table[x>>10] << 1; goto nr1; } else if (x >= 0x100) { if (x >= 0x1000) if (x >= 0x4000) xn = (sqq_table[x>>8] >> 0) + 1; else xn = (sqq_table[x>>6] >> 1) + 1; else if (x >= 0x400) xn = (sqq_table[x>>4] >> 2) + 1; else xn = (sqq_table[x>>2] >> 3) + 1; goto adj; } else return sqq_table[x] >> 4; xn = (xn + 1 + x / xn) / 2; nr1: xn = (xn + 1 + x / xn) / 2; adj: if (xn * xn > x) xn--; return xn; } /*************************************************************************** * This algorithm is based on the code posted by Stephen Bye */ static unsigned stephen_sqrt(unsigned long n) { int i; unsigned long trial; unsigned d = 1U<<15; unsigned result = 0; unsigned long result2 = 0; for (i = 15; i >= 0; i--) { trial = result2 + ((2UL*result + d) << i); if (trial <= n) { result += d; result2 = trial; } d >>= 1; } return result; } /*************************************************************************** * Use the built-in floating point sqrt() function */ static unsigned float_sqrt(unsigned long n) { return sqrt(n); } /*************************************************************************** * Use the built-in sqrt() function converting between floating point and * signed integers. Has reduced range but can be faster than float_sqrt(). */ static unsigned floati_sqrt(unsigned long n) { return (long)sqrt((long)n); } /*************************************************************************** * Essentially the algorithm posted by Paul Hsieh. This is limited to * signed values and I've had to remove the inline assembly by default * for portability between compilers. It still makes non-portable * assumptions however. */ static unsigned paul_sqrt(unsigned long r) { double x, y; float tempf; unsigned long *tfptr = (unsigned long *)&tempf; long is; tempf = (long)r; *tfptr = (0xbe6f0000-*tfptr)>>1; x = tempf; y = r*0.5; x *= 1.5-x*x*y; if (r >101123) x *= 1.5-x*x*y; x *= r; #if USE_ASM asm { fld x fistp is } #else is = x + 0.5; #endif if (is*is > r) is--; return is; } /*************************************************************************** * As paul_sqrt except that the float/unsigned long aliasing is done * directly through casts rather than using a separate pointer variable. */ static unsigned paul2_sqrt(unsigned long r) { double x, y; float tempf; long is; tempf = (long)r; *(unsigned long *)&tempf = (0xbe6f0000-*(unsigned long *)&tempf)>>1; x = tempf; y = r*0.5; x *= 1.5-x*x*y; if (r >101123) x *= 1.5-x*x*y; x *= r; #if USE_ASM asm { fld x fistp is } #else is = x + 0.5; #endif if (is*is > r) is--; return is; } -- ----------------------------------------- Lawrence Kirby | fred@genesis.demon.co.uk Wilts, England | 70734.126@compuserve.com ----------------------------------------- ============================================================================== From: DONT.qed.SPAM.ME@pobox.com (Paul Hsieh) Subject: Re: Faster integer square root -- modified merge of two routines. Date: Wed, 12 Jan 2000 09:53:58 -0800 Newsgroups: comp.graphics.algorithms,sci.math.num-analysis,comp.lang.c Lawrence Kirby wrote: > DONT.qed.SPAM.ME@pobox.com "Paul Hsieh" writes: > > x=tempf; > > y=r*0.5; > > x*=1.5-x*x*y; > > if( r>101123 ) x*=1.5-x*x*y; > > x*=r; > > > > /* This method of float=>int conversion is much faster than the > > standard C cast, however it round instead of truncates. */ > > Doesn't that depend on the rounding mode that happens to have been set > in the FPU (i.e. you can't assume that it won't truncate)? The rounding mode I was referring to was the default one. The rounding mode is not typically changed at run time -- however the nVidia code appears to have coded a work around in their solution that could be used. (I've mirrored their code at: http://www.pobox.com/~qed/fastmath.cpp since its kind of hard to find on their site.) > > _asm { > > fld x > > fistp is > > } So as nVidia suggests, perhaps this could be changed to _asm { fld x frndint fistp is } to more robust in light of dynamic rounding mode changes. In any event I was mostly interested in removing the dilapidated float => int code that the compiler is forced to use to work around the x87's inane design. > > > > if( is*is>r ) is--; > > > > return is; > >} > > Bearing in mind that "portable" C is desirable (at least because this > is cross-posed to comp.lang.c) I've included is a portable (as far as > possible) test program for the algorithms in the thread, plus > a couple more. Here are two sets of results > > 1. Compiled by an oldish gcc (2.7.2.1) running on a 200Mhz PPro system > gcc -O2 -fomit-frame-pointer > > RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=200.0MHz > > Function Overall Time Clocks per call Checksum > > dummy 0.470 9.2 00000000 > ldiv 10.500 205.1 004d1d7a > Dann 8.640 168.8 004d1d7a > Fred 4.140 80.9 004d1d7a > Stephen 8.940 174.6 004d1d7a > Paul 8.680 169.5 004d1d7a > Paul2 8.170 159.6 004d1d7a > float 8.400 164.1 004d1d7a > floati 7.560 147.7 004d1d7a The measurable difference between "Paul" and "Paul2" is telling of the quality of the gcc compiler here. > 1. Compiled by MSVC6 running on a dual PIII 450Mhz system. > cl -O2 > > RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 CPU_CLOCK=450.0MHz > > Function Overall Time Clocks per call Checksum > > dummy 0.203 8.9 00000000 > ldiv 4.734 208.0 005334f9 > Dann 3.922 172.4 005334f9 > Fred 1.829 80.4 005334f9 > Stephen 3.860 169.6 005334f9 > Paul 3.547 155.9 005334f9 > Paul2 3.859 169.6 005334f9 > float 3.484 153.1 005334f9 > floati 2.984 131.1 005334f9 Dude, you do know that the PPro and P-!!! are pretty much the same cores right? I don't consider these two as seperate data points. (Except for watching MSVC suffering from the opposite problem as gcc -- *sigh*.) I recently got some mail from Aleks Jakulin that appears to have vindicated me at least somewhat. Here are his results on a DEC Alpha: RANDARRAY_SIZE=1024 REPEAT_COUNT=10000 Function Overall Time Clocks per call Checksum dummy 0.309 13.6 00000000 ldiv 3.802 167.1 0057ddd9 Dann 7.265 319.3 0057ddd9 Fred 3.220 141.5 0057ddd9 Stephen 3.600 158.2 0057ddd9 Paul 2.681 117.8 0057ddd9 Paul2 2.680 117.8 0057ddd9 float 21.424 941.5 0057ddd9 floati 21.407 940.7 0057ddd9 On my Athlon 500mhz using WATCOM C/C++ as well as some code modifications I get: dummy 0.220 9.7 00000000 ldiv 4.390 192.9 0049848d Dann 3.350 147.2 0049848d Fred 1.490 65.5 0049848d Stephen 4.400 193.4 0049848d Paul 3.680 161.7 0049848d Paul2 3.680 161.7 0049848d Paul3 2.960 130.1 0049848d float 2.470 108.5 0049848d floati 1.430 62.8 0049848d floati2 0.760 33.4 0049848d (Notice how Paul and Paul2 have the same performance in the above two results.) The Athlon's square root hardware laughs in all our collective faces. Oh well. :o) The "Fred" solution looks like its pretty consistently good as well, so I will have to add it to my square root page (if anyone cares its at http://www.pobox.com/~qed/sqroot.html ) My changes amounted to adding Paul3 and floati2: static unsigned paul3_sqrt(unsigned long r) { double x, y; float tempf; long is; tempf = (long)r; *(unsigned long *)&tempf = (0xbe6f0000-*(unsigned long *)&tempf)>>1; x = tempf; y = r*0.5; x *= 1.5-x*x*y; if (r > 101123) x *= 1.5-x*x*y; x *= r; #if defined(__WIN32__) | defined(__WATCOMC__) _asm { fld x fistp is } #else is = x+0.5; #endif is += (((signed int)(r-is*is))>>31); return is; } // Drum roll please ... static unsigned floati_sqrt2(unsigned long n) { #if defined(__WIN32__) | defined(__WATCOMC__) float half=0.4999999; int r; _asm { fild n fsqrt fsub half fistp r } return r; #else return (long)sqrt((long)n); #endif } -- Paul Hsieh http://www.pobox.com/~qed/mailme.html