Post your speeds - Calculate 50 Million Factorials

lundy · September 25, 2003 6:59PM

Quote:

Originally posted by Programmer

Okay guys, try this.

Won't compile for me with the default flags and Xcode.

cd /Users/lundy/Programming/vectors

/Users/lundy/Programming/vectors/main.c:7: error: parse error before "fixed64"

/Users/lundy/Programming/vectors/main.c:8: error: syntax error before '{' token

/Users/lundy/Programming/vectors/main.c:11: error: parse error before ':' token

/Users/lundy/Programming/vectors/main.c:12: warning: type defaults to `int' in declaration of `fixed64'

/Users/lundy/Programming/vectors/main.c:12: error: parse error before '&' token

/Users/lundy/Programming/vectors/main.c:13: error: parse error before '&' token

/Users/lundy/Programming/vectors/main.c:14: error: parse error before '&' token

/Users/lundy/Programming/vectors/main.c:15: error: parse error before '&' token

/Users/lundy/Programming/vectors/main.c:16: error: parse error before '&' token

/Users/lundy/Programming/vectors/main.c:17: error: parse error before ':' token

/Users/lundy/Programming/vectors/main.c:21: error: parse error before "operator"

/Users/lundy/Programming/vectors/main.c: In function `TransformVectors':

/Users/lundy/Programming/vectors/main.c:37: error: `for' loop initial declaration used outside C99 mode

/Users/lundy/Programming/vectors/main.c: In function `main':

/Users/lundy/Programming/vectors/main.c:58: error: `for' loop initial declaration used outside C99 mode

/Users/lundy/Programming/vectors/main.c:70: error: `for' loop initial declaration used outside C99 mode

/Users/lundy/Programming/vectors/main.c:81: warning: int format, long int arg (arg 2)

programmer · September 25, 2003 11:06PM

Quote:

Originally posted by lundy

Won't compile for me with the default flags and Xcode.

Make sure you give it the extension ".cpp" -- it is C++ code, not C code.
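
For anyone hitting the same wall of errors: the pattern (parse errors before "fixed64", '&', ':' and "operator") is what a C compiler produces when it meets a class with reference parameters and overloaded operators. The original main.cpp is not reproduced in this thread, so the following is only a guess at the general shape of such a type. The 16-bit fraction, the member names and CheckFixed64 are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical fixed-point type; NOT the original fixed64 from the benchmark.
// Assumption: 64-bit storage with 16 fractional bits.
class fixed64
{
public:
    fixed64() : raw(0) {}
    fixed64(double d) : raw(static_cast<long long>(d * kOne)) {}

    // Reference parameters and 'operator' are C++-only syntax, which is
    // why a C compiler reports parse errors before '&' and "operator".
    fixed64& operator+=(const fixed64& rhs) { raw += rhs.raw; return *this; }
    fixed64& operator*=(const fixed64& rhs) { raw = (raw * rhs.raw) / kOne; return *this; }

    double to_double() const { return static_cast<double>(raw) / kOne; }

private:
    static const long long kOne = 65536;  // 1.0 with 16 fractional bits
    long long raw;
};

inline fixed64 operator+(fixed64 a, const fixed64& b) { return a += b; }
inline fixed64 operator*(fixed64 a, const fixed64& b) { return a *= b; }

// Small self-check using values that are exact in this format.
void CheckFixed64()
{
    assert((fixed64(1.5) + fixed64(2.0)).to_double() == 3.5);
    assert((fixed64(1.5) * fixed64(2.0)).to_double() == 3.0);
}
```

Saved as a .c file, none of this parses; saved as .cpp it compiles as ordinary C++.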

lundy · September 26, 2003 2:46AM

Quote:

Originally posted by Programmer

Make sure you give it the extension ".cpp" -- it is C++ code, not C code.

D'oh!

programmer · September 26, 2003 6:20PM

Somebody want to try this on a G5, please? Remember, save as a .cpp file.

tidris · September 26, 2003 11:29PM

Quote:

Originally posted by Programmer

Somebody want to try this on a G5, please? Remember, save as a .cpp file.

On a dual G5 I got a result of 46. On a 500 MHz G3 PowerBook I got a result of 425. I optimized for speed.

tidris · September 26, 2003 11:43PM

Quote:

Originally posted by Tidris

On a dual G5 I got a result of 46. On a 500 MHz G3 PowerBook I got a result of 425. I optimized for speed.

Ok, here is a more detailed report.

fixed64 typedef: 48

float typedef: 47

double typedef: 47 (sometimes 46)

That was on a dual G5.

programmer · September 27, 2003 12:23AM

Hmmm, that's interesting... can somebody send me their G5 so I can run a profile on the code and see why it isn't significantly faster than my G4 running at half the clock rate? Thanks.

lundy · September 27, 2003 1:24AM

Quote:

Originally posted by Programmer

Somebody want to try this on a G5, please? Remember, save as a .cpp file.

The first segment of the Xcode disk image is corrupt - at least for me it crashes Disk Utility. Another dude confirmed this.

Don't want to go back to Project Builder and the only way to get gcc 3.3 on Panther is to install Xcode.

tidris · September 27, 2003 8:58AM

Quote:

Originally posted by Tidris

Ok, here is a more detailed report.

fixed64 typedef: 48

float typedef: 47

double typedef: 47 (sometimes 46)

That was on a dual G5.

The floating point numbers improved somewhat by turning off profiling and using -O3 instead of -fast:

float typedef: 42

double typedef: 40

I am using OSX 10.2.7, ProjectBuilder 2.1, gcc-3.3, in case that matters.

zapchud · September 27, 2003 9:36AM

How about testing it with the XLC++-compiler?

tidris · September 27, 2003 1:08PM

Quote:

Originally posted by Zapchud

How about testing it with the XLC++-compiler?

I have been experimenting with that this morning but I must be doing something wrong because the result is worse than with gcc-3.3. For example, for the double typedef the best I can get with xlc is 45 versus 40 with gcc-3.3.

lundy · September 27, 2003 2:32PM

Quote:

Originally posted by Tidris

I have been experimenting with that this morning but I must be doing something wrong because the result is worse than with gcc-3.3. For example, for the double typedef the best I can get with xlc is 45 versus 40 with gcc-3.3.

59 seconds on the dual G5 EDIT: with "double".

42 seconds with G5 optimization.

50 seconds with long long ints.

I made a custom Build Style in Xcode called G5-Optimized, but Xcode shows this on the detail line:

Building target "vectors" with build style "G5-Optimized" (optimization:level "size", debug-symbols:on)

Optimization level:"size"???

Anybody else getting anything different? Xcode is really not that easy to get a handle on. A simple menu with the choices would be easier, for chrissakes.

cubedude · September 27, 2003 4:10PM

Quote:

ryan% ./unthreaded_factorial

Start: 1064696504 End: 1064696580

i= 50000001

Time=76

Quote:

ryan% ./threaded_factorial

Creating Thread Number: 0

Creating Thread Number: 1

Loop Done; Time=89 secs for thread#:0, Loops=25000000

Loop Done; Time=89 secs for thread#:1, Loops=25000000

G4 Cube 450

tidris · September 30, 2003 1:05AM

Quote:

Originally posted by Tidris

The floating point numbers improved somewhat by turning off profiling and using -O3 instead of -fast:

float typedef: 42

double typedef: 40

I am using OSX 10.2.7, ProjectBuilder 2.1, gcc-3.3, in case that matters.

I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:

fixed64 typedef: 28 seconds

float typedef: 12 seconds

double typedef: 12 seconds

I used the -fast option for gcc-3.3. That is very weird!

Edit:

Another way to get similarly fast results is to leave num_vectors at 4096 but make the size of the va and vb arrays be 4098.

Edit:

Changing the declaration of m, va, vb to be as follows also does the trick:

numerictype va[num_vectors][4];

numerictype m[4][4] = {{0.1,0.2,0.3,0.0},{0.7,0.8,0.9,0.0},{0.4,0.5,0.6, 0.0},{0.3,0.1,0.6,1.0}};

numerictype vb[num_vectors][4];

tidris · September 30, 2003 1:42AM

Quote:

Originally posted by Tidris

I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:

fixed64 typedef: 28 seconds

float typedef: 12 seconds

double typedef: 12 seconds

I used the -fast option for gcc-3.3. That is very weird!

Here are additional double typedef results with other num_vector values:

1023 vectors: 3 seconds

1024 vectors: 3 seconds

1025 vectors: 3 seconds

2046 vectors: 6 seconds

2047 vectors: 8 seconds

2048 vectors: 24 seconds

2049 vectors: 8 seconds

2050 vectors: 7 seconds

This smells a lot like a compiler bug...

programmer · September 30, 2003 9:43AM

Quote:

Originally posted by Tidris

This smells a lot like a compiler bug...

Or a limitation of the G5's L1 cache "way-ness". I deliberately sized those arrays to overflow the L1 cache by about 2x, so I don't want to shrink them. Let's leave the number of array entries unchanged but use your version with the re-ordered data arrays, which seems to avoid the problem on the G5. It would be interesting if somebody could use the CHUD tools to verify the source of the problem, however.
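
To make the cache "way-ness" idea concrete, here is a small sketch of how a set-associative cache picks a set from an address. The geometry (32 KB, 2-way, 128-byte lines) is what I believe the 970's L1 data cache uses, but treat it as an assumption, along with the helper names and the made-up layout where va starts at offset 0. It only shows that two arrays spaced a power-of-two number of bytes apart land on the same sets, while a 128-byte gap (a 4x4 matrix of doubles) shifts them by one set. It does not prove that this is what slowed the benchmark.

```cpp
#include <cassert>
#include <cstddef>

struct CacheGeometry
{
    std::size_t total_bytes;
    std::size_t ways;
    std::size_t line_bytes;
};

// Assumed G5 (PowerPC 970) L1 data cache: 32 KB, 2-way, 128-byte lines.
const CacheGeometry kAssumedG5L1D = { 32 * 1024, 2, 128 };

// Set an address maps to: (address / line size) modulo number of sets.
std::size_t CacheSetIndex(const CacheGeometry& c, std::size_t address)
{
    assert(c.ways > 0 && c.line_bytes > 0);
    const std::size_t sets = c.total_bytes / (c.ways * c.line_bytes);
    return (address / c.line_bytes) % sets;
}

// Hypothetical layout: va starts at offset 0 and vb starts 'gap_bytes' later.
// Returns true when the first element of each array maps to the same set.
bool SameSet(std::size_t gap_bytes)
{
    return CacheSetIndex(kAssumedG5L1D, 0) == CacheSetIndex(kAssumedG5L1D, gap_bytes);
}
```

With the double typedef and 4096 vectors, each array is 4096 * 32 = 131072 bytes, so SameSet(131072) is true. Putting the 128-byte matrix between the two arrays gives SameSet(131072 + 128), which is false.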

My 1 GHz dual MDD G4 does this test (regardless of declaration order) in approximately:

fixed64 372

float 56

double 59

The numbers you posted for your G5 are pretty much exactly what I would have expected for a 2 GHz G5. The fixed64 number is especially interesting since it shows a 13.3x speedup thanks to the 970 being a 64-bit processor. The float numbers show both the clock rate doubling plus twice the number of FPUs, plus a bit more due to better FPU resources.

Your fiddling with the code having such a huge impact does demonstrate how fragile performance on high speed processors can be, and why code should be profiled.

tidris · September 30, 2003 1:58PM

Quote:

Originally posted by Tidris

I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:

fixed64 typedef: 28 seconds

float typedef: 12 seconds

double typedef: 12 seconds

I used the -fast option for gcc-3.3. That is very weird!

If you liked those numbers, you'll like these even more:

float typedef: 7 seconds

double typedef: 7 seconds

I got those with num_vectors at 4096 by changing TransformVectors() to be as follows:

void TransformVectors (unsigned int count, numerictype m[4][4], numerictype in[][4], numerictype out[][4])
{
    numerictype m00=m[0][0], m01=m[0][1], m02=m[0][2], m03=m[0][3];
    numerictype m10=m[1][0], m11=m[1][1], m12=m[1][2], m13=m[1][3];
    numerictype m20=m[2][0], m21=m[2][1], m22=m[2][2], m23=m[2][3];
    numerictype m30=m[3][0], m31=m[3][1], m32=m[3][2], m33=m[3][3];

    for (unsigned int i = 0; i < count; ++i)
    {
#if 0
        // Old slow way.
        out[i][0] = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
        out[i][1] = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
        out[i][2] = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
        out[i][3] = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
#else
        // New fast way.
        numerictype out0 = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
        numerictype out1 = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
        numerictype out2 = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
        numerictype out3 = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
        out[i][0] = out0;
        out[i][1] = out1;
        out[i][2] = out2;
        out[i][3] = out3;
#endif
    }
}

The idea behind the change is to avoid intermixing accesses to the input and output arrays.

lundy · September 30, 2003 2:30PM

Here is the multithreaded version of the original code. I just put a thread wrapper around the whole shebang.

http://www.johnnylundy.com/MPvectors.cpp.zip

I get 26 seconds on the dual G5:

Creating Thread Number: 0

Creating Thread Number: 1

Loop Done; Time = 26 secs for thread#: 0, Loops = 50000

Loop Done; Time = 26 secs for thread#: 1, Loops = 50000

vectors has exited with status 0.
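
For readers who cannot fetch the zip, a thread wrapper of this kind might look roughly like the sketch below. It is not lundy's code: it uses C++11 std::thread where 2003-era code would have used pthreads, and RunLoops is a made-up stand-in for the benchmark loop. In this sketch each thread runs the whole loop count and reports how many iterations it finished.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the benchmark body; returns iterations completed.
unsigned long RunLoops(unsigned long loops)
{
    volatile double sink = 0.0;  // volatile so the loop is not optimized away
    unsigned long done = 0;
    for (unsigned long i = 0; i < loops; ++i)
    {
        sink = sink + 1.0;
        ++done;
    }
    return done;
}

// Runs RunLoops on 'thread_count' threads and returns each thread's count.
std::vector<unsigned long> RunOnThreads(std::size_t thread_count, unsigned long loops)
{
    assert(thread_count > 0);
    std::vector<unsigned long> done(thread_count, 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < thread_count; ++t)
    {
        // Each thread writes only its own slot, so no locking is needed.
        threads.emplace_back([&done, t, loops] { done[t] = RunLoops(loops); });
    }
    for (std::size_t t = 0; t < threads.size(); ++t)
        threads[t].join();
    return done;
}
```

Calling RunOnThreads(2, 50000) mirrors the two-thread run shown above. On some toolchains the build needs the -pthread flag.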

programmer · September 30, 2003 11:05PM

Quote:

Originally posted by Tidris

If you liked those numbers, you'll like these even more:

float typedef: 7 seconds

double typedef: 7 seconds

...

The idea behind the change is to avoid intermixing accesses to the input and output arrays.

Wow, that's pretty good.

Somebody want to run this on an Intel and an AMD so we can have a non-Mac frame of reference?

Strange that the intermixing has such an effect, especially since it's all in cache. I wonder if it is simply a matter of writing results to registers rather than forcing the compiler to write back to memory to avoid potential aliasing problems with the following math.

EDIT: It helps the G4 substantially too -- from 59 down to 38.
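
The aliasing guess can be shown with a toy case. In the sketch below (the function names, the test matrix and numerictype as double are assumptions, not the benchmark's code), the store-as-you-go form and the compute-then-store form agree when in and out are different arrays but disagree when they are the same array. Since the compiler usually cannot rule that case out, it has to keep the loads and stores of the first form in order.

```cpp
#include <cassert>

typedef double numerictype;  // assumption: the "double typedef" configuration

// Store-as-you-go: each result is written before the next row is computed.
void TransformInterleaved(unsigned int count, numerictype m[4][4],
                          numerictype in[][4], numerictype out[][4])
{
    for (unsigned int i = 0; i < count; ++i)
        for (unsigned int r = 0; r < 4; ++r)
            out[i][r] = m[r][0]*in[i][0] + m[r][1]*in[i][1]
                      + m[r][2]*in[i][2] + m[r][3]*in[i][3];
}

// Compute-then-store: all four results go to locals before any store.
void TransformBuffered(unsigned int count, numerictype m[4][4],
                       numerictype in[][4], numerictype out[][4])
{
    for (unsigned int i = 0; i < count; ++i)
    {
        numerictype t[4];
        for (unsigned int r = 0; r < 4; ++r)
            t[r] = m[r][0]*in[i][0] + m[r][1]*in[i][1]
                 + m[r][2]*in[i][2] + m[r][3]*in[i][3];
        for (unsigned int r = 0; r < 4; ++r)
            out[i][r] = t[r];
    }
}

// Returns true when running both forms in place (in == out) gives different answers.
bool InPlaceResultsDiffer()
{
    numerictype swap_xy[4][4] = {{0,1,0,0},{1,0,0,0},{0,0,1,0},{0,0,0,1}};
    numerictype a[1][4] = {{1,2,3,4}};
    numerictype b[1][4] = {{1,2,3,4}};
    TransformInterleaved(1, swap_xy, a, a);  // a becomes {2,2,3,4}
    TransformBuffered(1, swap_xy, b, b);     // b becomes {2,1,3,4}
    assert(b[0][0] == 2 && b[0][1] == 1);
    return a[0][1] != b[0][1];
}
```

If the arrays are known never to overlap, GCC's __restrict__ qualifier on the pointer parameters is one way to tell the compiler so, though nobody in this thread tested that.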

lundy · October 1, 2003 2:22PM

Quote:

Originally posted by Programmer

Wow, that's pretty good. Somebody want to run this on an Intel and an AMD so we can have a non-Mac frame of reference?

Strange that the intermixing has such an effect, especially since it's all in cache. I wonder if it is simply a matter of writing results to registers rather than forcing the compiler to write back to memory to avoid potential aliasing problems with the following math.

EDIT: It helps the G4 substantially too -- from 59 down to 38.

I'm slogging through the Programming Environments Manual for the PowerPC family - very very fascinating stuff.

Little-endian is supported by a mode bit.

But I can't seem to get the Xcode debugger (really gdb) to show me the product of C=A*B where all are long long ints, and compiler flags set for G5.

Is it normal for gcc to give a warning on a source statement

static long long int A=0xFFFFFFFFFFFFFFFF;

that the constant is too big for a long int? Well duh, it's not a long int.
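
On the warning: as far as I know it is about the literal, not the variable. An unsuffixed constant gets its type before the assignment happens, and in C89/C++98 mode the compiler only tries the int and long family, so a 64-bit value does not fit where long is 32 bits. Adding an LL or ULL suffix is the usual fix. A minimal sketch, with constant and function names invented for the example:

```cpp
#include <cassert>

// ULL gives the literal type unsigned long long, so nothing is truncated.
static const unsigned long long kAllOnes = 0xFFFFFFFFFFFFFFFFULL;

// The same bit pattern as a signed value is simply -1.
static const long long kMinusOne = -1LL;

// 64-bit product C = A * B; unsigned arithmetic wraps modulo 2^64.
unsigned long long Multiply64(unsigned long long a, unsigned long long b)
{
    return a * b;
}

void CheckLongLong()
{
    assert(static_cast<unsigned long long>(kMinusOne) == kAllOnes);
    assert(Multiply64(kAllOnes, kAllOnes) == 1ULL);  // (2^64 - 1)^2 mod 2^64 == 1
}
```

To look at a long long without the debugger, printf with the %lld format (or %llu for the unsigned type) prints the full 64-bit value.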
