Seems like we're quietly retracing the evolution of benchmarking here. I hope it's obvious that this won't tell us nearly as much, even, as SPEC numbers...
Comments
Originally posted by Gabid
Looks like 10 secs is my limit! Running Eugene's binaries (which now work: thanks again!) I got 10 secs for everything, both threaded and unthreaded. And this was G5 optimised? Interestingly, I got the same speed with the binary that wasn't specially for the G5. I wonder if there is any other way to optimise things...
I looked at the code and the "64-bit" switches didn't do anything. So either I missed something or the compiler actually does not generate 64-bit arithmetic code yet.
Time to read thru the Macintoshian Achaia threads at Arse Technica...
Originally posted by Transcendental Octothorpe
With that change, I am now getting 32 seconds.
Which still seems pretty good, for an 800MHz PIII.
I think that on PCs these results are explained by the compilers' aggressive optimization combined with the fact that the tested code is very simple. If all you do is repeat a fixed calculation in a loop, the compiler may be able to predict the result, so you end up measuring the loop overhead rather than the actual processing power. But I am not sure this really happens that way. That's why I suggested testing less trivial code.
-- Mark
Originally posted by Stoo
Daarrr! I keep forgetting that long is the same size as int on x86 (4 bytes).
Using __int64 I get a much more respectable 20s from my Duron 1300, still using Visual Studio 98.
Which seems a bit too fast.
Yeah, it is kind of weird, but MS doesn't want longs to be the machine word size. Go figure (it probably breaks MFC which explicitly assumes that DWORDs are 32 bit).
Sadly, this means that it is much more difficult to make a 64-bit app for Windows - you can't just recompile it with a 64-bit flag, you must replace the appropriate longs with __int64. Yes, this is a pain.
Hey, this was post 486 for me. I wonder what I'll be writing for post 586, 601, 603, 604, etc.
Originally posted by PB
If all you do is repeat a fixed calculation in a loop, the compiler may be able to predict the result, so you end up measuring the loop overhead rather than the actual processing power
Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.
Originally posted by lundy
Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.
Except that the compiler can decide on its own to make the function call a series of inline operations, then optimize from there...
-- Mark
http://homepage.mac.com/tadunne/unth...factorial.java
Don't know how to do threads yet, so maybe someone can help me with that..
Or maybe the JVM is clever enough to use both procs? (Top said that the Java console program was using 9 threads) .. I don't know..
On my eMac 700, 640 MB:
unthreaded_factorial.c got 41 secs
unthreaded_factorial.java got 78 secs (I need to redo this test as I have a BitTorrent download that's eating 15% of my CPU)
Please forgive my bad code, it's my first Java porting job!
Originally posted by qazII
Maybe the function call should take the factorial of a random number. (Of course, the seed will be preset to some fixed value).
OK, here is an attempt.
-------------------
// factorial.c -- calculate factorials of pseudo-random arguments (1 million loop passes)
#include <stdio.h>  // for printf()
#include <stdlib.h> // for srandom() and random()
#include <time.h>   // for time()

double factorial (long long value)
{
    if (value == 1) {
        return (1);
    } else {
        return (value * factorial (value - 1));
    }
} // factorial

int main (int argc, char *argv[])
{
    long i, j, dif;
    time_t start, end;
    double result = 0;

    time(&start); // Get starting time
    for (i = 1; i < 1000001; i++) {
        srandom(i);
        j = random() % 30;
        if (j >= 15 && j <= 20) {
            result = factorial(j); // Crunch
        }
    }
    time(&end); // Get ending time

    printf("Start: %ld End: %ld\n\n", (long)start, (long)end);
    dif = (long)(end - start);
    printf("i=%ld Time=%ld secs Last result=%e\n\n", i, dif, result);
    return (0);
} // main
-------------------
Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.
Originally posted by PB
Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.
Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.
Originally posted by PB
Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.
One guy just posted 5 seconds for his new dualie G5:
http://www.dslreports.com/forum/rema...e=flat#8013450
This was with Eugene's binary at -O5, using the threaded version of my app.
EDIT: Then he set the processor Energy Saver to "Highest":
V-Cs-Computer:~/Desktop/g5factorial] vc% ./tf-G5-O5
Creating Thread Number: 0
Creating Thread Number: 1
Loop Done; Time=4 secs for thread#:0, Loops=25000000
Loop Done; Time=4 secs for thread#:1, Loops=25000000
Originally posted by Programmer
Another version which uses double precision floating point would be interesting as well.
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Originally posted by lundy
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):
Dual 1 GHz MDD, 768 MB RAM, 10.2.6, a couple of minor apps in the background not doing anything...
Threaded integer: 11 secs
Unthreaded integer: 22 secs
Threaded double: 14 secs
Unthreaded double: 26 secs
Somebody with a G5 run these...
Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.
Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do...
Originally posted by Programmer
Somebody with a G5 run these...
If the file has been compiled and can be run from the terminal I'll run it on my G5 1.8 sometime after 7 EST when I get home tonight.
Originally posted by Programmer
Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):
I never called it a benchmark, I don't think, and if I did I didn't mean to.
Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.
You mean ans=1; for (i=16; i>=2; i--) ans=ans*i; -- that kind of iteration?
Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.
With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.
Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do...
For the hell of it I tried
return (16*15*14*13*12*11*10*9*8*7*6*5*4*3*2) also, but the compiler was way too smart for that.
Originally posted by lundy
Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to compile myself!
Originally posted by lundy
You mean ans=1; for (i=16; i>=2; i--) ans=ans*i; -- that kind of iteration?
Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.
With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.
Yes, that's what I meant. By forcing the function call and the recursion you are limiting what the compiler can do, and forcing more use of the stack memory. This is a disadvantage for the PowerPC processors because it involves more memory operations and severely limits the number of values that can be carried in registers at any point in time. Interesting problems do much more calculation per function call than this little program, and would result in more interesting comparisons. Somebody ought to post a simple fractal calculator, for example.
Originally posted by Gabid
Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to compile myself!
I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip
compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"
I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.
As usual, right click, save, chmod 755 and run.
G4 Dual 1GHz:
Creating Thread Number: 0
Creating Thread Number: 1
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 1, Loops=25000000
Factorial result= 2.092279e+13
G5 Double-Precision Float Loop Done;
Time=18 secs for thread#: 0, Loops=25000000