Hmmm, that's interesting... can somebody send me their G5 so I can run a profile on the code and see why it isn't significantly faster than my G4 running at half the clock rate? Thanks.
Comments
Originally posted by Programmer
Okay guys, try this.
Won't compile for me with the default flags and Xcode.
cd /Users/lundy/Programming/vectors
/Users/lundy/Programming/vectors/main.c:7: error: parse error before "fixed64"
/Users/lundy/Programming/vectors/main.c:8: error: syntax error before '{' token
/Users/lundy/Programming/vectors/main.c:11: error: parse error before ':' token
/Users/lundy/Programming/vectors/main.c:12: warning: type defaults to `int' in declaration of `fixed64'
/Users/lundy/Programming/vectors/main.c:12: error: parse error before '&' token
/Users/lundy/Programming/vectors/main.c:13: error: parse error before '&' token
/Users/lundy/Programming/vectors/main.c:14: error: parse error before '&' token
/Users/lundy/Programming/vectors/main.c:15: error: parse error before '&' token
/Users/lundy/Programming/vectors/main.c:16: error: parse error before '&' token
/Users/lundy/Programming/vectors/main.c:17: error: parse error before ':' token
/Users/lundy/Programming/vectors/main.c:21: error: parse error before "operator"
/Users/lundy/Programming/vectors/main.c: In function `TransformVectors':
/Users/lundy/Programming/vectors/main.c:37: error: `for' loop initial declaration used outside C99 mode
/Users/lundy/Programming/vectors/main.c: In function `main':
/Users/lundy/Programming/vectors/main.c:58: error: `for' loop initial declaration used outside C99 mode
/Users/lundy/Programming/vectors/main.c:70: error: `for' loop initial declaration used outside C99 mode
/Users/lundy/Programming/vectors/main.c:81: warning: int format, long int arg (arg 2)
Originally posted by lundy
Won't compile for me with the default flags and Xcode.
Make sure you give it the extension ".cpp" -- it is C++ code, not C code.
Originally posted by Programmer
Make sure you give it the extension ".cpp" -- it is C++ code, not C code.
D'oh!
Originally posted by Programmer
Somebody want to try this on a G5, please? Remember, save as a .cpp file.
On a dual G5 I got a result of 46. On a 500 MHz G3 PowerBook I got a result of 425. I optimized for speed.
Originally posted by Tidris
On a dual G5 I got a result of 46. On a 500 MHz G3 PowerBook I got a result of 425. I optimized for speed.
Ok, here is a more detailed report.
fixed64 typedef: 48
float typedef: 47
double typedef: 47 (sometimes 46)
That was on a dual G5.
Originally posted by Programmer
Somebody want to try this on a G5, please? Remember, save as a .cpp file.
The first segment of the Xcode disk image is corrupt - at least for me it crashes Disk Utility. Another dude confirmed this.
Don't want to go back to Project Builder and the only way to get gcc 3.3 on Panther is to install Xcode.
Originally posted by Tidris
Ok, here is a more detailed report.
fixed64 typedef: 48
float typedef: 47
double typedef: 47 (sometimes 46)
That was on a dual G5.
The floating point numbers improved somewhat by turning off profiling and using -O3 instead of -fast:
float typedef: 42
double typedef: 40
I am using OSX 10.2.7, ProjectBuilder 2.1, gcc-3.3, in case that matters.
Originally posted by Zapchud
How about testing it with the XLC++-compiler?
I have been experimenting with that this morning but I must be doing something wrong because the result is worse than with gcc-3.3. For example, for the double typedef the best I can get with xlc is 45 versus 40 with gcc-3.3.
Originally posted by Tidris
I have been experimenting with that this morning but I must be doing something wrong because the result is worse than with gcc-3.3. For example, for the double typedef the best I can get with xlc is 45 versus 40 with gcc-3.3.
59 seconds on the dual G5 EDIT: with "double".
42 seconds with G5 optimization.
50 seconds with long long ints.
I made a custom Build Style in Xcode called G5-Optimized, but Xcode shows this on the detail line:
Building target "vectors" with build style "G5-Optimized" (optimization:level "size", debug-symbols:on)
Optimization level:"size"???
Anybody else getting anything different? Xcode is really not that easy to get a handle on. A simple menu with the choices would be easier, for chrissakes.
ryan% ./unthreaded_factorial
Start: 1064696504 End: 1064696580
i= 50000001
Time=76
ryan% ./threaded_factorial
Creating Thread Number: 0
Creating Thread Number: 1
Loop Done; Time=89 secs for thread#:0, Loops=25000000
Loop Done; Time=89 secs for thread#:1, Loops=25000000
G4 Cube 450
Originally posted by Tidris
The floating point numbers improved somewhat by turning off profiling and using -O3 instead of -fast:
float typedef: 42
double typedef: 40
I am using OSX 10.2.7, ProjectBuilder 2.1, gcc-3.3, in case that matters.
I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:
fixed64 typedef: 28 seconds
float typedef: 12 seconds
double typedef: 12 seconds
I used the -fast option for gcc-3.3. That is very weird!
Edit:
Another way to get similarly fast results is to leave num_vectors at 4096 but make the size of the va and vb arrays be 4098.
Edit:
Changing the declaration of m, va, vb to be as follows also does the trick:
numerictype va[num_vectors][4];
numerictype m[4][4] = {{0.1,0.2,0.3,0.0},{0.7,0.8,0.9,0.0},{0.4,0.5,0.6, 0.0},{0.3,0.1,0.6,1.0}};
numerictype vb[num_vectors][4];
Originally posted by Tidris
I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:
fixed64 typedef: 28 seconds
float typedef: 12 seconds
double typedef: 12 seconds
I used the -fast option for gcc-3.3. That is very weird!
Here are additional double typedef results with other num_vector values:
1023 vectors: 3 seconds
1024 vectors: 3 seconds
1025 vectors: 3 seconds
2046 vectors: 6 seconds
2047 vectors: 8 seconds
2048 vectors: 24 seconds
2049 vectors: 8 seconds
2050 vectors: 7 seconds
This smells a lot like a compiler bug...
Originally posted by Tidris
This smells a lot like a compiler bug...
Or a limitation of the G5's L1 cache "way-ness". I deliberately sized those arrays to overflow the L1 cache by about 2x, so I don't want to shrink them. Let's leave the number of array entries unchanged but use your version with the re-ordered data arrays, which seems to avoid the problem on the G5. It would be interesting if somebody could use the CHUD tools to verify the source of the problem, however.
My 1 GHz dual MDD G4 does this test (regardless of declaration order) in approximately:
fixed64 372
float 56
double 59
The numbers you posted for your G5 are pretty much exactly what I would have expected for a 2 GHz G5. The fixed64 number is especially interesting since it shows a 13.3x speedup thanks to the 970 being a 64-bit processor. The float numbers show both the clock rate doubling plus twice the number of FPUs, plus a bit more due to better FPU resources.
Your fiddling with the code having such a huge impact does demonstrate how fragile performance on high speed processors can be, and why code should be profiled.
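A quick bit of arithmetic on the array sizes may help anyone who wants to chase this with a profiler. The sketch below works out how many bytes one [num_vectors][4] array occupies and checks whether that size is an exact multiple of a power-of-two granule. Two things here are assumptions, not facts from the thread: that vb is laid out immediately after va (so the array size is also the byte distance between va[i] and vb[i]), and the 64 KB granule itself, which was picked only because it happens to separate the slow sizes from the fast ones in the timings posted above. The names ArrayBytes, kGranule and SameLowAddressBits are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by one array declared as numerictype v[num_vectors][4].
constexpr std::size_t ArrayBytes(std::size_t num_vectors, std::size_t elem_size)
{
    return num_vectors * 4 * elem_size;
}

// Hypothetical granule (an assumption, not a documented G5 parameter).
constexpr std::size_t kGranule = 64 * 1024;

// True when va[i] and vb[i] would sit an exact multiple of kGranule apart,
// ASSUMING vb is placed directly after va in memory.
constexpr bool SameLowAddressBits(std::size_t num_vectors, std::size_t elem_size)
{
    return ArrayBytes(num_vectors, elem_size) % kGranule == 0;
}

// Compile-time sanity check: 4096 vectors of 4 doubles is 128 KB.
static_assert(ArrayBytes(4096, 8) == 128 * 1024, "4096 x 4 doubles is 128 KB");
```

Under those assumptions the check is true for 4096 floats, 4096 doubles and 2048 doubles (the slow cases reported) and false for 1023 to 1025, 2046, 2047, 2049, 2050, 4094 and 4098 doubles (the fast ones). That is only a correlation; it would still take CHUD or a similar tool to confirm what the hardware is actually doing.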
Originally posted by Tidris
I was experimenting with this again tonight and I found that if I change num_vectors from 4096 to 4094 or 4098, the results become:
fixed64 typedef: 28 seconds
float typedef: 12 seconds
double typedef: 12 seconds
I used the -fast option for gcc-3.3. That is very weird!
If you liked those numbers, you'll like these even more:
float typedef: 7 seconds
double typedef: 7 seconds
I got those with num_vectors at 4096 by changing TransformVectors() to be as follows:
void TransformVectors (unsigned int count, numerictype m[4][4], numerictype in[][4], numerictype out[][4])
{
    numerictype m00=m[0][0], m01=m[0][1], m02=m[0][2], m03=m[0][3];
    numerictype m10=m[1][0], m11=m[1][1], m12=m[1][2], m13=m[1][3];
    numerictype m20=m[2][0], m21=m[2][1], m22=m[2][2], m23=m[2][3];
    numerictype m30=m[3][0], m31=m[3][1], m32=m[3][2], m33=m[3][3];

    for (unsigned int i = 0; i < count; ++i)
    {
#if 0
        // Old slow way.
        out[i][0] = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
        out[i][1] = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
        out[i][2] = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
        out[i][3] = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
#else
        // New fast way.
        numerictype out0 = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
        numerictype out1 = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
        numerictype out2 = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
        numerictype out3 = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
        out[i][0] = out0;
        out[i][1] = out1;
        out[i][2] = out2;
        out[i][3] = out3;
#endif
    }
}
The idea behind the change is to avoid intermixing accesses to the input and output arrays.
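One way to see that the two loop bodies are not interchangeable from the compiler's point of view is to call them with the same array as input and output. The sketch below is a minimal illustration, not the benchmark code: the function names, the single-vector test data and the swap matrix are all made up for the example. With separate arrays both forms agree. In place, the interleaved form overwrites in[i][0] before the second row reads it, so the answers differ. Because the compiler generally cannot rule that case out, it has to keep the loads and stores of the interleaved form in program order.

```cpp
#include <cassert>

typedef double numerictype;  // the thread also tested float and a 64-bit fixed type

// Interleaved form: every result is stored before the next row is computed.
void TransformInterleaved(unsigned int count, numerictype m[4][4],
                          numerictype in[][4], numerictype out[][4])
{
    for (unsigned int i = 0; i < count; ++i)
        for (int r = 0; r < 4; ++r)
            out[i][r] = m[r][0]*in[i][0] + m[r][1]*in[i][1]
                      + m[r][2]*in[i][2] + m[r][3]*in[i][3];
}

// Buffered form: all four results go into locals first, then get stored.
void TransformBuffered(unsigned int count, numerictype m[4][4],
                       numerictype in[][4], numerictype out[][4])
{
    for (unsigned int i = 0; i < count; ++i) {
        numerictype t[4];
        for (int r = 0; r < 4; ++r)
            t[r] = m[r][0]*in[i][0] + m[r][1]*in[i][1]
                 + m[r][2]*in[i][2] + m[r][3]*in[i][3];
        for (int r = 0; r < 4; ++r)
            out[i][r] = t[r];
    }
}

// Illustrative matrix that swaps the first two components.
static numerictype kSwapXY[4][4] = {{0,1,0,0},{1,0,0,0},{0,0,1,0},{0,0,0,1}};

static bool SameVector(const numerictype a[4], const numerictype b[4])
{
    for (int k = 0; k < 4; ++k)
        if (a[k] != b[k]) return false;
    return true;
}

// Separate input and output arrays: both forms compute the same thing.
bool SeparateArraysAgree()
{
    numerictype in[1][4] = {{1, 2, 3, 4}};
    numerictype a[1][4], b[1][4];
    TransformInterleaved(1, kSwapXY, in, a);
    TransformBuffered(1, kSwapXY, in, b);
    return SameVector(a[0], b[0]);
}

// Same array as input and output: the interleaved form reads its own stores.
bool InPlaceAgrees()
{
    numerictype a[1][4] = {{1, 2, 3, 4}};
    numerictype b[1][4] = {{1, 2, 3, 4}};
    TransformInterleaved(1, kSwapXY, a, a);  // a ends up as {2, 2, 3, 4}
    TransformBuffered(1, kSwapXY, b, b);     // b ends up as {2, 1, 3, 4}
    return SameVector(a[0], b[0]);
}
```

SeparateArraysAgree() returns true and InPlaceAgrees() returns false. This says nothing about how large the speed difference should be on any given CPU; it only shows why the rewrite gives the compiler more freedom.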
http://www.johnnylundy.com/MPvectors.cpp.zip
I get 26 seconds on the dual G5:
Creating Thread Number: 0
Creating Thread Number: 1
Loop Done; Time = 26 secs for thread#: 0, Loops = 50000
Loop Done; Time = 26 secs for thread#: 1, Loops = 50000
vectors has exited with status 0.
Originally posted by Tidris
If you liked those numbers, you'll like these even more:
float typedef: 7 seconds
double typedef: 7 seconds
...
The idea behind the change is to avoid intermixing accesses to the input and output arrays.
Wow, that's pretty good. Somebody want to run this on an Intel and an AMD so we can have a non-Mac frame of reference?
Strange that the intermixing has such an effect, especially since it's all in cache. I wonder if it is simply a matter of writing results to registers rather than forcing the compiler to write back to memory to avoid potential aliasing problems with the following math.
EDIT: It helps the G4 substantially too -- from 59 down to 38.
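If aliasing really is what holds the compiler back, another way to test the idea is to promise the compiler that the arrays do not overlap. The sketch below assumes a compiler that accepts the __restrict extension (GCC, Clang and MSVC do; it is not standard C++). It uses flat pointers instead of the [][4] arrays from the benchmark to keep the declarations simple. The function names are made up, and whether this matches the hand-buffered version in speed on a G4 or G5 is untested here.

```cpp
#include <cassert>

typedef float numerictype;

// __restrict tells the compiler that m, in and out never overlap, so it is
// free to keep loaded inputs in registers across the stores to out.
// Calling this with overlapping arrays breaks that promise.
void TransformVectorsNoAlias(unsigned int count,
                             const numerictype* __restrict m,   // 16 values, row-major
                             const numerictype* __restrict in,  // count * 4 values
                             numerictype* __restrict out)       // count * 4 values
{
    for (unsigned int i = 0; i < count; ++i) {
        const numerictype* v = in + 4 * i;
        numerictype* r = out + 4 * i;
        for (int row = 0; row < 4; ++row) {
            const numerictype* mr = m + 4 * row;
            r[row] = mr[0]*v[0] + mr[1]*v[1] + mr[2]*v[2] + mr[3]*v[3];
        }
    }
}

// Small self-check: an identity matrix must return the input unchanged.
bool IdentityRoundTrips()
{
    const numerictype identity[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
    const numerictype in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    numerictype out[8] = {0};
    TransformVectorsNoAlias(2, identity, in, out);
    for (int k = 0; k < 8; ++k)
        if (out[k] != in[k]) return false;
    return true;
}
```

Timing this against both loop bodies from the post above would help separate the aliasing explanation from a cache or memory-layout effect.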
Originally posted by Programmer
Wow, that's pretty good. Somebody want to run this on an Intel and an AMD so we can have a non-Mac frame of reference?
Strange that the intermixing has such an effect, especially since it's all in cache. I wonder if it is simply a matter of writing results to registers rather than forcing the compiler to write back to memory to avoid potential aliasing problems with the following math.
EDIT: It helps the G4 substantially too -- from 59 down to 38.
I'm slogging through the Programming Environments Manual for the PowerPC family - very very fascinating stuff.
Little-endian is supported by a mode bit.
But I can't seem to get the Xcode debugger (really gdb) to show me the product of C=A*B where all are long long ints, and compiler flags set for G5.
Is it normal for gcc to give a warning on a source statement
static long long int A=0xFFFFFFFFFFFFFFFF;
that the constant is too big for a long int? Well duh, it's not a long int.
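On the warning: as far as I understand it, older gcc in its default mode picks the type of an unsuffixed constant from the int and long family only, so a 64-bit value draws the warning no matter what type the variable on the left has. Adding an explicit suffix to the constant is the usual way around it. The sketch below assumes the intent was an all-ones 64-bit pattern and uses unsigned long long so the multiply wraps modulo 2^64 in a well-defined way. B, Product and PrintProduct are made-up names for the example. Printing the value from code is also a fallback when the debugger will not display it.

```cpp
#include <cassert>
#include <cstdio>

// The ULL suffix makes the constant itself unsigned long long, so its type
// no longer depends on how the compiler treats unsuffixed literals.
static unsigned long long A = 0xFFFFFFFFFFFFFFFFULL;
static unsigned long long B = 2ULL;  // illustrative second operand

// Unsigned multiplication wraps modulo 2^64, which is well defined.
unsigned long long Product()
{
    return A * B;
}

// %llx is the printf conversion for unsigned long long in hexadecimal.
void PrintProduct()
{
    std::printf("C = 0x%llx\n", Product());
}
```

With these operands Product() returns 0xFFFFFFFFFFFFFFFE, the low 64 bits of the full product.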