Post your speeds - Calculate 50 Million Factorials

123578

Comments

  • Reply 81 of 146
    gabidgabid Posts: 477member
    Looks like 10 secs is my limit! Running Eugene's binaries (which now work: thanks again!) I got 10 secs for everything, both threaded and unthreaded. And this was G5 optimised? Interestingly, I got the same speed with the binary that wasn't specially for the G5. I wonder if there is any other way to optimise things...
  • Reply 82 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by Gabid

    Looks like 10 secs is my limit! Running Eugene's binaries (which now work: thanks again!) I got 10 secs for everything, both threaded and unthreaded. And this was G5 optimised? Interestingly, I got the same speed with the binary that wasn't specially for the G5. I wonder if there is any other way to optimise things...



    I looked at the code and the "64-bit" switches didn't do anything. So either I missed something or the compiler actually does not generate 64-bit arithmetic code yet.



    Time to read thru the Macintoshian Achaia threads at Arse Technica...
  • Reply 83 of 146
    pbpb Posts: 4,255member
    Quote:

    Originally posted by Transcendental Octothorpe

    With that change, I am now getting 32 seconds.



    Which still seems pretty good, for an 800MHz PIII.




    I think that in PC's, these results are explained by the high optimization of the compilers combined with the fact that the tested code is very simple. If all you are doing is to repeat in a loop a fixed calculation, the compiler is perhaps able to predict in some sense the calculation and consequently you measure rather the loop capabilities than the actual processing power. But I am not sure if this really happens that way. That's why I suggested to test a less trivial code.
  • Reply 84 of 146
    Seems like we're quietly retracing the evolution of benchmarking here. I hope it's obvious that this won't tell us nearly as much, even, as SPEC numbers...



    -- Mark
  • Reply 85 of 146
    yevgenyyevgeny Posts: 1,148member
    Quote:

    Originally posted by Stoo

    Daarrr! I keep forgetting that long is the same size as int on x86 (4 bytes).



    Using __int 64 I get a much more respectable 20s from my Duron 1300, still using Visual Studio 98.



    Which seems a bit too fast.




    Yeah, it is kind of weird, but MS doesn't want longs to be the machine word size. Go figure (it probably breaks MFC which explicitly assumes that DWORDs are 32 bit).



    Sadly, this means that it is much more difficult to make a 64 bit app for windows- you can't just recompiler it with a 64 bit flag, you must replace the appropriate longs with __int64. Yes, this is a pain.



    Hey, this was post 486 for me. I wonder what I'll be writing for post 586, 601, 603, 604, etc.
  • Reply 86 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by PB

    If all you are doing is to repeat in a loop a fixed calculation, the compiler is perhaps able to predict in some sense the calculation and consequently you measure rather the loop capabilities than the actual processing power



    Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.
  • Reply 87 of 146
    Quote:

    Originally posted by lundy

    Right ... but it's a function call, so I don't see how the compiler can move it outside the loop - the compiler has no way to know whether the function (factorial) is going to return a different number each time, even given the same argument.



    Except that the compiler can decide on its own to make the function call a series of inline operations, then optimize from there...



    -- Mark
  • Reply 88 of 146
    Maybe the function call should take the factorial of a random number. (Of course, the seed will be preset to some fixed value).
  • Reply 89 of 146
    I'm currently at home ill, I'm also learning Java so just to be sad I've ported the unthreaded version...



    http://homepage.mac.com/tadunne/unth...factorial.java



    Dont know how do threads yet, so maybe someone can help me with that..



    Or maybe the JVM is clever enough to use both procs? (Top said that the Java console program was using 9 threads) .. I don't know..



    On my eMac 700 640 Mb:



    unthreaded_factorial.c got 41 secs



    unthreaded_factorial.java got 78 sec (I need to redo this test as I have A Bittorrent downloading thats eating 15% of my CPU)



    Please forgive my bad code, it's my first Java porting job!



  • Reply 90 of 146
    pbpb Posts: 4,255member
    Quote:

    Originally posted by qazII

    Maybe the function call should take the factorial of a random number. (Of course, the seed will be preset to some fixed value).



    OK, here is an attempt.



    -------------------



    // factorial.c -- calculate factorials recursively 50 million times



    #include <stdio.h> // for printf()

    #include <stdlib.h>

    #include <time.h>

    #include <math.h>



    double factorial (long long value)

    {

    if (value == 1) {

    return (1);

    } else {

    return (value * factorial (value - 1));

    }

    } // factorial





    int main (int argc, char *argv[])

    {

    long i,j,start,end,dif;

    long long result;

    time(&start); //Get starting time





    for (i=1;i<1000001;i++){

    srandom( i );

    j = random() % 30;

    if( j >= 15 && j <= 20 ){

    result = factorial( j ); //Crunch

    }}

    time(&end); //Get ending time



    printf("Start: %ld End: %ld\

    \

    ", start, end);

    dif = end - start;

    printf ("i= %ld\

    Time=%.00ld\

    \

    ",i,dif);



    return (0);



    } // main



    -------------------



    Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.
  • Reply 91 of 146
    pbpb Posts: 4,255member
    Quote:

    Originally posted by PB



    Certainly not the best one can do, but it is a little more complicated and must be noticeably slower than the original.




    Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.
  • Reply 92 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by PB

    Oh, I forgot to point out that the original has 50000000 steps in the loop, while the current one has only 1000000.



    One guy just posted 5 seconds for his new dualie G5:



    http://www.dslreports.com/forum/rema...e=flat#8013450



    This was with Eugene's binary at -O5, using the threaded version of my app.



    EDIT: Then he set the processor Energy Saver to "Highest":



    V-Cs-Computer:~/Desktop/g5factorial] vc% ./tf-G5-O5

    Creating Thread Number: 0

    Creating Thread Number: 1

    Loop Done; Time=4 secs for thread#:0, Loops=25000000

    Loop Done; Time=4 secs for thread#:1, Loops=25000000
  • Reply 93 of 146
    Another version which uses double precision floating point would be interesting as well.
  • Reply 94 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by Programmer

    Another version which uses double precision floating point would be interesting as well.



    Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save
  • Reply 95 of 146
    Quote:

    Originally posted by lundy

    Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save



    Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):



    Dual 1 GHz MDD, 768 MB RAM, 10.2.6, a couple of minors apps in the background not doing anything...



    Threaded integer: 11 secs

    Unthreaded integer: 22 secs

    Threaded double: 14 secs

    Unthreaded double: 26 secs





    Somebody with a G5 run these...



    Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.



    Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do...
  • Reply 96 of 146
    Quote:

    Originally posted by Programmer





    Somebody with a G5 run these...







    If the file has been compiled and can be run from the terminal I'll run it on my G5 1.8 sometime after 7 EST when I get home tonight.
  • Reply 97 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by Programmer

    Well since I asked I guess I should post results (after noting that this benchmark isn't testing much that is particularly interesting and cannot really be used to extrapolate about the performance of other software):



    I never called it a benchmark, I don't think, and if I did I didn't mean to.



    Quote:

    Looking at the code, however, I see that you've used tail end recursion which eliminates most of the compiler's ability to perform optimizations (unless it is quite smart, and GCC isn't). A more interesting test would be to rewrite it to use iteration instead.



    You mean ans=1; for (i=16;2;i--); ans=ans*i; -- that kind of iteration?



    Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.



    With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.



    Quote:

    Edit: I did this locally and got a result of 3 seconds on the same machine. Just goes to show what tweaking the algorithm can do... [/B]



    For the hell of it I tried



    return (16*15*14*13*12*11*10*9*8*7*6*5*4*3*2) also, but the compiler was way too smart for that.
  • Reply 98 of 146
    gabidgabid Posts: 477member
    Quote:

    Originally posted by lundy

    Simple enough. Changed arithmetic to double-precision float; here is double float threaded factorial. right click and save



    Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to complie myself!
  • Reply 99 of 146
    Quote:

    Originally posted by lundy

    You mean ans=1; for (i=16;2;i--); ans=ans*i; -- that kind of iteration?



    Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.



    With unrolling or inlining, then I would just have had to add back more main iterations to get above 1 second on the time.





    Yes, that's what I meant. By forcing the function call and the recursion you are limiting what the compiler can do, and forcing more use of the stack memory. This is a disadvantage for the PowerPC processors because it involves more memory operations and severely limits the number of values that can be carried in registers at any point in time. Interesting problems do much more calculation per function call than this little program, and would result in more interesting comparisons. Somebody ought to post a simple fractal calculator, for example.
  • Reply 100 of 146
    lundylundy Posts: 4,466member
    Quote:

    Originally posted by Gabid

    Still no one's had a chance to compile a binary I can run via the Terminal? Too bad. Some day I'll really have to learn how to complie myself!



    I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip



    compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"



    I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.



    As usual, right click, save, chmod 755 and run.





    G4 Dual 1gHz:

    Code:




    Creating Thread Number: 0

    Creating Thread Number: 1

    Factorial result= 2.092279e+13



    G5 Double-Precision Float Loop Done;

    Time=18 secs for thread#: 1, Loops=25000000



    Factorial result= 2.092279e+13



    G5 Double-Precision Float Loop Done;

    Time=18 secs for thread#: 0, Loops=25000000





Sign In or Register to comment.