It's the binary -- it is trying to dynamically link against some library that isn't part of the standard Mac OS X install.
My 1 GHz eMac got 22 seconds on the first binary that was posted here. I didn't feel like installing the developer tools so I'm glad you turned it into a binary to make it easier to run.
Okay guys, try this. I haven't verified that the math does anything useful except keep the processor busy for a while without crashing (56 seconds in the case of my 1 GHz G4). I've also left multi-threading & SIMD implementations as an exercise for the reader. There are 3 typedefs for numerictype which you can change the commenting on to get it to use different types (float, double, 64-bit integer fixed point). Somebody make & post some binaries.
Comments
Originally posted by lundy
I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip
compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"
I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.
As programmer has pointed out, most of the time is spent on function calls. So, it shouldn't matter all that much if you use doubles or Int64s. Besides, unless you change your algorithm, a compiler won't be able to generate code that uses more than one processor unit.
Originally posted by lundy
I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip
compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"
I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.
As usual, right click, save, chmod 755 and run.
G4 Dual 1GHz:
Code:
Creating Thread Number: 0
Creating Thread Number: 1
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 1, Loops=25000000
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 0, Loops=25000000
Either I'm having Terminal problems or there's something fishy with the binary. When I chmod 755 I get:
usage: chmod [-R [-H | -L | -P]] mode file ...
and then when I try to run it I get:
dyld: ./dftf-g5 can't open library: /System/Library/PrivateFrameworks/ZeroLink.framework/Versions/A/ZeroLink (No such file or directory, errno = 2)
Trace/BPT trap
So is it me or the binary?
So is it me or the binary?
It could very well be you... you have to specify a filename after "chmod 755". As suggested before, you can drag the binary off the desktop into the Terminal to have it show up... I'm not on my Mac right now so I can't try it.
Cheers
Originally posted by AsLan^
It could very well be you... you have to specify a filename after "chmod 755". As suggested before, you can drag the binary off the desktop into the Terminal to have it show up... I'm not on my Mac right now so I can't try it.
Cheers
That's what I did and no go. I'll keep playing around though...
I think somebody ought to come up with an "evaluation algorithm" that can be optimized and will take advantage of the G5's better execution resources. Eliminate the recursion and so much more math (relative to the loop & function overhead). Some kind of long polynomial evaluation with lots of multiply-add operations per iteration. C'mon, I know one of you guys can do it...
Originally posted by Gabid
That's what I did and no go. I'll keep playing around though...
Yeah - Apple's new ZeroLink feature waits until runtime to look for the libraries, to save compile-execute development time.
I did not notice a problem on my machine since I have the Tools installed and it can find its library at runtime.
Let me try and set it to turn off ZeroLink.
EDIT: OK, now it crashes on my G4 so it ought to be G5-only. Turned off ZeroLink so it is presumably linked.
Try the executable now.
Originally posted by lundy
I never called it a benchmark, I don't think, and if I did I didn't mean to.
You mean ans=1; for (i=16; i>1; i--) ans=ans*i; -- that kind of iteration?
Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.
With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.
In the real world, neither recursion nor iteration would be the correct way to calculate a factorial (just use a formula). However, kudos for being willing to solve a problem with recursion (even if calculating a factorial is the classic intro-to-computer-science recursion example). You would be surprised by how many programmers won't or can't use recursion to solve a problem.
I would encourage you to try to use recursion in your normal problem solving. Sometimes, it is easier to solve a problem recursively than it is iteratively. You just have to be sure that you won't blow the stack, but there are strategies for maximizing the number of recursive calls.
Originally posted by lundy
EDIT: OK, now it crashes on my G4 so it ought to be G5-only. Turned off ZeroLink so it is presumably linked.
Try the executable now.
Will do, as soon as I get home tonight.
Originally posted by 123
As programmer has pointed out, most of the time is spent on function calls. So, it shouldn't matter all that much if you use doubles or Int64s. Besides, unless you change your algorithm, a compiler won't be able to generate code that uses more than one processor unit.
Do you think having four threads would get all 4 FPUs involved? Or simpler to just have 2 separate FP multiply instructions in the function so that the instructions can be scheduled by the 970 in parallel to both FPUs?
Originally posted by lundy
Do you think having four threads would get all 4 FPUs involved? Or simpler to just have 2 separate FP multiply instructions in the function so that the instructions can be scheduled by the 970 in parallel to both FPUs?
There are only 2 processors, so 2 threads will fully involve them. Each thread should execute code which has a lengthy series of calculations, otherwise you aren't going to see the advantages of a pipelined FPU. If there are fewer than 12 FPU instructions (preferably multiply-add pairs) then you aren't seeing the potential of the processor.
Creating Thread Number: 0
Creating Thread Number: 1
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done; Time = 11 secs for thread#: 1, Loops = 25000000
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done; Time = 11 secs for thread#: 0, Loops = 25000000
This was consistent over multiple runs. Of note, I don't think I ever heard the extra fans gunning up, though I did when I ran the previous binaries.
Originally posted by Gabid
Excellent. Everything works nicely now, so I can give you results (again!) from my stock G5 1.8:
This was consistent over multiple runs. Of note, I don't think I ever heard the extra fans gunning up, though I did when I ran the previous binaries.
My G5 dualie arrived today, so I'll have some numbers this week.
And any good optimizer will just optimize out the whole loop
Originally posted by alcimedes
ok, i'm lazy and too tired to read this whole thread. is there something i can download to test my new dualie? (as in which one of these links?)
The G5-specific one is http://www.johnnylundy.com/dftf-G5.zip
Right-click and save, unzip, chmod 755 dftf-G5 in Terminal, and then ./dftf-G5 from the directory it is in.
I got my dualie G5 but I'm still getting it set up.
Best time so far is 4 seconds with the Energy Saver set on "Highest".
Originally posted by Anonymous Karma
Edited: Whoops! That's what I get for not reading the thread. As Programmer said most of this test is function call overhead.
And any good optimizer will just optimize out the whole loop
Unfortunately it can't even do that because the original program uses recursion.
Originally posted by Programmer
Unfortunately it can't even do that because the original program uses recursion.
Seems quite provable to me that the defined factorial function has no side effects and will terminate when called with the value 16... thus the call to factorial can be replaced with the static value, and then the whole loop optimized out. Or perhaps I'm just used to saner languages than C?
Originally posted by Anonymous Karma
Seems quite provable to me that the defined factorial function has no side effects and will terminate when called with the value 16... thus the call to factorial can be replaced with the static value, and then the whole loop optimized out. Or perhaps I'm just used to saner languages than C?
Well, we know it doesn't have any side-effects, but the gcc compiler doesn't know that.
I mean, obviously it is a trivial app. But if you optimize everything out, there is nothing left to generate some CPU cycles. I deliberately tried NOT to give the compiler anything to optimize. I always thought that having a compiler recognize a loop that didn't change data was silly anyway - how often do programmers write those kinds of loops (unless they are trying to make busy work for the processors, as we are here - in which case you want the compiler to leave it alone).
But threaded is faster than nonthreaded, by about 2x on dual machines. FP arithmetic seems to be a little slower than integer. Pretty much makes sense. Sure easier to interpret than "benchmarks" using multi-megabyte bloated apps that nobody knows whether they use both processors or how they are compiled, or anything.
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <pthread.h> // not used yet; threading is left as an exercise
long long fixed64one = 1LL << 32;
class fixed64
{
public:
fixed64 (void) : value(0) {};
fixed64 (double f) : value((long long)(f*(double)fixed64one)) {};
fixed64 (const fixed64& f) : value(f.value) {};
const fixed64& operator= (const fixed64& f) { value = f.value; return(*this); };
const fixed64& operator+= (const fixed64& f) { value += f.value; return(*this); };
const fixed64& operator-= (const fixed64& f) { value -= f.value; return(*this); };
const fixed64& operator*= (const fixed64& f) { value *= f.value; value >>= 32; return(*this); }; // note: the 64-bit product truncates; a full 32.32 multiply needs a 128-bit intermediate
private:
long long value;
};
fixed64 operator+ (const fixed64& a, const fixed64& b) { fixed64 result = a; result += b; return(result); }
fixed64 operator- (const fixed64& a, const fixed64& b) { fixed64 result = a; result -= b; return(result); }
fixed64 operator* (const fixed64& a, const fixed64& b) { fixed64 result = a; result *= b; return(result); }
//-------------------------------------------------------------------------------------------------------------------
//typedef fixed64 numerictype;
//typedef float numerictype;
typedef double numerictype;
void TransformVectors (unsigned int count, numerictype m[4][4], numerictype in[][4], numerictype out[][4])
{
numerictype m00=m[0][0], m01=m[0][1], m02=m[0][2], m03=m[0][3];
numerictype m10=m[1][0], m11=m[1][1], m12=m[1][2], m13=m[1][3];
numerictype m20=m[2][0], m21=m[2][1], m22=m[2][2], m23=m[2][3];
numerictype m30=m[3][0], m31=m[3][1], m32=m[3][2], m33=m[3][3];
for (unsigned int i = 0; i < count; ++i)
{
out[i][0] = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
out[i][1] = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
out[i][2] = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
out[i][3] = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
}
}
int main (void)
{
const unsigned int num_iterations = 100000;
const unsigned int num_vectors = 4096;
numerictype m[4][4] = {{0.1,0.2,0.3,0.0},{0.7,0.8,0.9,0.0},{0.4,0.5,0.6, 0.0},{0.3,0.1,0.6,1.0}};
numerictype va[num_vectors][4], vb[num_vectors][4];
for (unsigned int count = 0; count < num_vectors; ++count)
{
va[count][0] = count/10.0;
va[count][1] = va[count][0]*va[count][0];
va[count][2] = va[count][0]+va[count][1];
va[count][3] = va[count][0]+va[count][1]*va[count][2];
}
time_t start, end;
time(&start); //Get starting time
for (unsigned int i = 0; i < num_iterations; ++i)
{
// these are split into halves which could be done by separate threads...
TransformVectors(num_vectors/2, m, va, vb);
TransformVectors(num_vectors/2, m, va+num_vectors/2, vb+num_vectors/2);
TransformVectors(num_vectors/2, m, vb, va);
TransformVectors(num_vectors/2, m, vb+num_vectors/2, va+num_vectors/2);
}
time(&end); //Get ending time for thread
printf("Time = %ld\n", (long)(end - start));
return 0;
}