Seems like we're quietly retracing the evolution of benchmarking here. I hope it's obvious that this won't tell us nearly as much, even, as SPEC numbers...
Comments
Originally posted by Gabid
Looks like 10 secs is my limit! Running Eugene's binaries (which now work: thanks again!) I got 10 secs for everything, both threaded and unthreaded. And this was G5 optimised? Interestingly, I got the same speed with the binary that wasn't specially for the G5. I wonder if there is any other way to optimise things...
I looked at the code and the "64-bit" switches didn't do anything. So either I missed something or the compiler actually does not generate 64-bit arithmetic code yet.
Time to read thru the Macintoshian Achaia threads at Arse Technica...
Originally posted by Transcendental Octothorpe
With that change, I am now getting 32 seconds.
Which still seems pretty good, for an 800MHz PIII.
I think that on PCs these results are explained by the compilers' aggressive optimization combined with the fact that the tested code is very simple. If all you do is repeat a fixed calculation in a loop, the compiler may be able to predict the result, so you end up measuring the loop overhead rather than the actual processing power. But I am not sure this really happens that way. That's why I suggested testing less trivial code.
-- Mark
Originally posted by Stoo
Daarrr! I keep forgetting that long is the same size as int on x86 (4 bytes).
Using __int64 I get a much more respectable 20s from my Duron 1300, still using Visual Studio 98.
Which seems a bit too fast.
Yeah, it is kind of weird, but MS doesn't want longs to be the machine word size. Go figure (it probably breaks MFC which explicitly assumes that DWORDs are 32 bit).
Sadly, this means that it is much more difficult to make a 64-bit app for Windows - you can't just recompile it with a 64-bit flag, you must replace the appropriate longs with __int64. Yes, this is a pain.
Hey, this was post 486 for me. I wonder what I'll be writing for post 586, 601, 603, 604, etc.
Originally posted by PB
If all you do is repeat a fixed calculation in a loop, the compiler may be able to predict the result, so you end up measuring the loop overhead rather than the actual processing power
Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.
Originally posted by lundy
Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.
Except that the compiler can decide on its own to make the function call a series of inline operations, then optimize from there...
-- Mark
http://homepage.mac.com/tadunne/unth...factorial.java
Don't know how to do threads yet, so maybe someone can help me with that..
Or maybe the JVM is clever enough to use both procs? (Top said that the Java console program was using 9 threads) .. I don't know..
On my eMac 700, 640 MB:
unthreaded_factorial.c got 41 secs
unthreaded_factorial.java got 78 secs (I need to redo this test as I have a BitTorrent download that's eating 15% of my CPU)
Please forgive my bad code, it's my first Java porting job!
Originally posted by qazII
Maybe the function call should take the factorial of a random number. (Of course, the seed will be preset to some fixed value).
OK, here is an attempt.
-------------------
// factorial.c -- calculate factorials of pseudo-random arguments (1 million loop passes)
#include <stdio.h>  // for printf()
#include <stdlib.h> // for srandom() and random()
#include <time.h>   // for time()

double factorial (long long value)
{
    if (value == 1) {
        return (1);
    } else {
        return (value * factorial (value - 1));
    }
} // factorial

int main (int argc, char *argv[])
{
    long i, j, dif;
    time_t start, end;
    double result = 0;

    time(&start); // Get starting time
    for (i = 1; i < 1000001; i++) {
        srandom(i);
        j = random() % 30;
        if (j >= 15 && j <= 20) {
            result = factorial(j); // Crunch
        }
    }
    time(&end); // Get ending time

    printf("Start: %ld End: %ld\n\n", (long)start, (long)end);
    dif = (long)(end - start);
    printf("i=%ld Time=%ld secs Last result=%e\n\n", i, dif, result);
    return (0);
} // main
-------------------
Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.
Originally posted by PB
Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.
Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.
Originally posted by PB
Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.
One guy just posted 5 seconds for his new dualie G5:
http://www.dslreports.com/forum/rema...e=flat#8013450
This was with Eugene's binary at -O5, using the threaded version of my app.
EDIT: Then he set the processor Energy Saver to "Highest":
V-Cs-Computer:~/Desktop/g5factorial] vc% ./tf-G5-O5
Creating Thread Number: 0
Creating Thread Number: 1
Loop Done; Time=4 secs for thread#:0, Loops=25000000
Loop Done; Time=4 secs for thread#:1, Loops=25000000
Originally posted by Programmer
Another version which uses double precision floating point would be interesting as well.
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Originally posted by lundy
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):
Dual 1 GHz MDD, 768 MB RAM, 10.2.6, a couple of minor apps in the background not doing anything...
Threaded integer: 11 secs
Unthreaded integer: 22 secs
Threaded double: 14 secs
Unthreaded double: 26 secs
Somebody with a G5 run these...
Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.
Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do...
Originally posted by Programmer
Somebody with a G5 run these...
If the file has been compiled and can be run from the terminal I'll run it on my G5 1.8 sometime after 7 EST when I get home tonight.
Originally posted by Programmer
Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):
I never called it a benchmark, I don't think, and if I did I didn't mean to.
Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.
You mean ans=1; for (i=16; i>=2; i--) ans=ans*i; -- that kind of iteration?
Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.
With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.
Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do...
For the hell of it I tried
return (16*15*14*13*12*11*10*9*8*7*6*5*4*3*2) also, but the compiler was way too smart for that.
Originally posted by lundy
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to compile myself!
Originally posted by lundy
You mean ans=1; for (i=16; i>=2; i--) ans=ans*i; -- that kind of iteration?
Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.
With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.
Yes, that's what I meant. By forcing the function call and the recursion you are limiting what the compiler can do, and forcing more use of the stack memory. This is a disadvantage for the PowerPC processors because it involves more memory operations and severely limits the number of values that can be carried in registers at any point in time. Interesting problems do much more calculation per function call than this little program, and would result in more interesting comparisons. Somebody ought to post a simple fractal calculator, for example.
Originally posted by Gabid
Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to compile myself!
I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip
compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"
I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.
As usual, right click, save, chmod 755 and run.
G4 Dual 1GHz:
Creating Thread Number: 0
Creating Thread Number: 1
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 1, Loops=25000000
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 0, Loops=25000000