Post your speeds - Calculate 50 Million Factorials


Comments

  • Reply 101 of 146
    123 Posts: 278 member
    Quote:

    Originally posted by lundy

    I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip



    compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"



    I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.







    As Programmer has pointed out, most of the time is spent on function calls. So, it shouldn't matter all that much if you use doubles or Int64s. Besides, unless you change your algorithm, a compiler won't be able to generate code that uses more than one processor unit.
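
    For illustration only, a minimal sketch (not code anyone posted in this thread) of what "changing the algorithm" could mean here: the factorial's multiply chain is one long dependency, so each multiply waits for the previous result, whereas splitting the product into two independent accumulators gives the scheduler work for more than one FPU at a time.

    Code:

    // Serial dependency chain: every multiply needs the previous result,
    // so only one FPU pipeline slot is busy at a time.
    double serial_product(int n)
    {
        double ans = 1.0;
        for (int i = 2; i <= n; ++i)
            ans *= i;
        return ans;
    }

    // Two independent partial products: the multiplies in 'a' and 'b' have no
    // dependency on each other, so they can be issued to separate FPUs.
    double split_product(int n)
    {
        double a = 1.0, b = 1.0;
        int i = 2;
        for (; i + 1 <= n; i += 2)
        {
            a *= i;
            b *= i + 1;
        }
        if (i <= n)
            a *= i;
        return a * b;
    }
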
  • Reply 102 of 146
    gabid Posts: 477 member
    Quote:

    Originally posted by lundy

    I have a binary up now, http://www.johnnylundy.com/dftf-G5.zip



    compiled with the flags "-fast -mcpu=970 -mtune=970 -mpowerpc64"



    I thought one of those switches would make the binary crash on the G4, but it runs there also. Shark sampling shows floating-point arithmetic being used.



    As usual, right click, save, chmod 755 and run.





    G4 Dual 1GHz:

    Code:




    Creating Thread Number: 0
    Creating Thread Number: 1
    Factorial result= 2.092279e+13

    G5 Double-Precision Float Loop Done;
    Time=18 secs for thread#: 1, Loops=25000000

    Factorial result= 2.092279e+13

    G5 Double-Precision Float Loop Done;
    Time=18 secs for thread#: 0, Loops=25000000









    Either I'm having Terminal problems or there's something fishy with the binary. When I chmod 755 I get:



    Quote:

    usage: chmod [-R [-H | -L | -P]] mode file ...



    and then when I try to run it I get:



    Quote:

    dyld: ./dftf-g5 can't open library: /System/Library/PrivateFrameworks/ZeroLink.framework/Versions/A/ZeroLink (No such file or directory, errno = 2)

    Trace/BPT trap



    So is it me or the binary?
  • Reply 103 of 146
    Quote:

    So is it me or the binary?



    It could very well be you... you have to specify a filename after "chmod 755". As suggested before, you could drag the binary off the desktop into the Terminal window to have it show up... I'm not on my Mac right now so I can't try it.



    Cheers
  • Reply 104 of 146
    gabid Posts: 477 member
    Quote:

    Originally posted by AsLan^

    It could very well be you... you have to specify a filename after "chmod 755". As suggested before, you could drag the binary off the desktop into the Terminal window to have it show up... I'm not on my Mac right now so I can't try it.



    Cheers




    That's what I did and no go. I'll keep playing around though...
  • Reply 105 of 146
    It's the binary -- it is trying to dynamically link against some library that isn't part of the standard Mac OS X install.





    I think somebody ought to come up with an "evaluation algorithm" that can be optimized and will take advantage of the G5's better execution resources. Eliminate the recursion and do much more math (relative to the loop & function overhead). Some kind of long polynomial evaluation with lots of multiply-add operations per iteration. C'mon, I know one of you guys can do it...
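
    As a rough sketch of the kind of "evaluation algorithm" Programmer is describing (illustrative only - the coefficients, loop counts, and names below are made up, not anything posted in this thread): evaluate a polynomial at several points per pass, so every iteration carries several independent multiply-add chains.

    Code:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        // Hypothetical coefficients of a degree-7 polynomial.
        const double c[8] = {1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125};
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

        time_t start, end;
        time(&start);
        for (long i = 0; i < 10000000L; ++i)
        {
            // Four independent Horner chains per iteration: every step is a
            // multiply-add, and the four chains can overlap in the FPU pipelines.
            double x0 = (double)(i & 255) / 256.0;
            double x1 = x0 + 0.25, x2 = x0 + 0.50, x3 = x0 + 0.75;
            double p0 = c[7], p1 = c[7], p2 = c[7], p3 = c[7];
            for (int k = 6; k >= 0; --k)
            {
                p0 = p0 * x0 + c[k];
                p1 = p1 * x1 + c[k];
                p2 = p2 * x2 + c[k];
                p3 = p3 * x3 + c[k];
            }
            s0 += p0; s1 += p1; s2 += p2; s3 += p3;
        }
        time(&end);
        printf("sum = %g, Time = %ld secs\n", s0 + s1 + s2 + s3, (long)(end - start));
        return 0;
    }
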
  • Reply 106 of 146
    lundy Posts: 4,466 member
    Quote:

    Originally posted by Gabid

    That's what I did and no go. I'll keep playing around though...



    Yeah - Apple's new ZeroLink feature waits until runtime to look for the libraries, to save compile-execute development time.



    I did not notice a problem on my machine since I have the Tools installed and it can find its library at runtime.



    Let me try and set it to turn off ZeroLink.



    EDIT: OK, now it crashes on my G4 so it ought to be G5-only. Turned off ZeroLink so it is presumably linked.



    Try the executable now.
  • Reply 107 of 146
    yevgeny Posts: 1,148 member
    Quote:

    Originally posted by lundy

    I never called it a benchmark, I don't think, and if I did I didn't mean to.







    You mean ans=1; for (i=16; i>1; i--) ans=ans*i; -- that kind of iteration?



    Well, yeah, I didn't want gcc to be able to move any code out or to inline it, so I figured a function call, even with a fixed argument, would work.



    With unrolling or inlining, I would just have had to add more main-loop iterations to get the time above 1 second.




    In the real world, neither recursion nor iteration would be the correct way to calculate a factorial (just use a formula). However, kudos for being willing to solve a problem with recursion (even if calculating a factorial is the classic intro-to-computer-science recursion example). You would be surprised by how many programmers won't or can't use recursion to solve a problem.



    I would encourage you to try to use recursion in your normal problem solving. Sometimes, it is easier to solve a problem recursively than it is iteratively. You just have to be sure that you won't blow the stack, but there are strategies for maximizing the number of recursive calls.
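
    For readers following along, a minimal sketch of the two styles under discussion (illustrative only - this is not the source of dftf-G5):

    Code:

    // Recursive version: one function call per level, so most of the time
    // goes to call overhead rather than arithmetic.
    double factorial_recursive(int n)
    {
        return (n <= 1) ? 1.0 : (double)n * factorial_recursive(n - 1);
    }

    // Iterative version: the same multiply chain without the call overhead.
    double factorial_iterative(int n)
    {
        double ans = 1.0;
        for (int i = n; i > 1; --i)
            ans *= i;
        return ans;
    }
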
  • Reply 108 of 146
    gabid Posts: 477 member
    Quote:

    Originally posted by lundy



    EDIT: OK, now it crashes on my G4 so it ought to be G5-only. Turned off ZeroLink so it is presumably linked.



    Try the executable now.




    Will do, as soon as I get home tonight.
  • Reply 109 of 146
    lundy Posts: 4,466 member
    Quote:

    Originally posted by 123

    As Programmer has pointed out, most of the time is spent on function calls. So, it shouldn't matter all that much if you use doubles or Int64s. Besides, unless you change your algorithm, a compiler won't be able to generate code that uses more than one processor unit.



    Do you think having four threads would get all 4 FPUs involved? Or would it be simpler to just have 2 separate FP multiply instructions in the function, so that the instructions can be scheduled by the 970 in parallel on both FPUs?
  • Reply 110 of 146
    Quote:

    Originally posted by lundy

    Do you think having four threads would get all 4 FPUs involved? Or would it be simpler to just have 2 separate FP multiply instructions in the function, so that the instructions can be scheduled by the 970 in parallel on both FPUs?



    There are only 2 processors, so 2 threads will fully involve them. Each thread should execute code that has a lengthy series of calculations, otherwise you aren't going to see the advantages of a pipelined FPU. If there are fewer than 12 FPU instructions in flight (preferably multiply-add pairs), then you aren't seeing the potential of the processor.
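
    A rough illustration of that point (not posted code; the dozen-operation count simply follows Programmer's figure above): a loop body whose iterations each carry many independent FPU operations, so a pipelined FPU always has work available.

    Code:

    // Six multiplies and six adds per iteration, all independent of one another,
    // so roughly a dozen FPU operations can be in flight at once.
    double busy_kernel(const double in[], unsigned int count)
    {
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0, a4 = 0.0, a5 = 0.0;
        for (unsigned int i = 0; i + 5 < count; i += 6)
        {
            a0 += in[i + 0] * 1.000001;
            a1 += in[i + 1] * 1.000002;
            a2 += in[i + 2] * 1.000003;
            a3 += in[i + 3] * 1.000004;
            a4 += in[i + 4] * 1.000005;
            a5 += in[i + 5] * 1.000006;
        }
        return a0 + a1 + a2 + a3 + a4 + a5;
    }
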
  • Reply 111 of 146
    gabid Posts: 477 member
    Excellent. Everything works nicely now, so I can give you results (again!) from my stock G5 1.8:



    Quote:

    Creating Thread Number: 0

    Creating Thread Number: 1

    Factorial result= 2.092279e+13



    G5 Double-Precision Float Loop Done; Time = 11 secs for thread#: 1, Loops = 25000000



    Factorial result= 2.092279e+13



    G5 Double-Precision Float Loop Done; Time = 11 secs for thread#: 0, Loops = 25000000



    This was consistent over multiple runs. Of note, I don't think I ever heard the extra fans gunning up, though I did when I ran the previous binaries.
  • Reply 112 of 146
    lundy Posts: 4,466 member
    Quote:

    Originally posted by Gabid

    Excellent. Everything works nicely now, so I can give you results (again!) from my stock G5 1.8:







    This was consistent over multiple runs. Of note, I don't think I ever heard the extra fans gunning up, though I did when I ran the previous binaries.




    My G5 dualie arrived today, so I'll have some numbers this week.
  • Reply 113 of 146
    Edited: Whoops! That's what I get for not reading the thread. As Programmer said, most of this test is function call overhead.



    And any good optimizer will just optimize out the whole loop.
  • Reply 114 of 146
    OK, I'm lazy and too tired to read this whole thread. Is there something I can download to test my new dualie? (As in, which one of these links?)
  • Reply 115 of 146
    lundy Posts: 4,466 member
    Quote:

    Originally posted by alcimedes

    OK, I'm lazy and too tired to read this whole thread. Is there something I can download to test my new dualie? (As in, which one of these links?)





    The G5-specific one is http://www.johnnylundy.com/dftf-G5.zip



    Right-click and save, unzip, chmod 755 dftf-G5 in Terminal, and then ./dftf-G5 from the directory it is in.



    I got my dualie G5 but I'm still getting it set up.



    Best time so far is 4 seconds with the Energy Saver set on "Highest".
  • Reply 116 of 146
    Quote:

    Originally posted by Anonymous Karma

    Edited: Whoops! That's what I get for not reading the thread. As Programmer said, most of this test is function call overhead.



    And any good optimizer will just optimize out the whole loop.




    Unfortunately it can't even do that because the original program uses recursion.
  • Reply 117 of 146
    Quote:

    Originally posted by Programmer

    Unfortunately it can't even do that because the original program uses recursion.



    Seems quite provable to me that the defined factorial function has no side effects and will terminate when called with the value 16... thus the call to factorial can be replaced with the static value, and then the whole loop optimized out. Or perhaps I'm just used to saner languages than C?
  • Reply 118 of 146
    lundy Posts: 4,466 member
    Quote:

    Originally posted by Anonymous Karma

    Seems quite provable to me that the defined factorial function has no side effects and will terminate when called with the value 16... thus the call to factorial can be replaced with the static value, and then the whole loop optimized out. Or perhaps I'm just used to saner languages than C?



    Well, we know it doesn't have any side-effects, but the gcc compiler doesn't know that.



    I mean, obviously it is a trivial app. But if you optimize everything out, there is nothing left to generate CPU cycles. I deliberately tried NOT to give the compiler anything to optimize. I always thought that having a compiler recognize a loop that doesn't change any data was silly anyway - how often do programmers write that kind of loop, unless they are trying to make busy work for the processors, as we are here, in which case you want the compiler to leave it alone?
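
    To make the disagreement concrete, a hedged sketch of the situation (not lundy's actual source): a side-effect-free function called with a constant argument is, in principle, a candidate for constant folding, though whether a given gcc release actually folds the recursive call depends on its inlining and constant-propagation behaviour.

    Code:

    // If the optimizer proves factorial(16) is a compile-time constant, the whole
    // timing loop below can legally collapse to a single store.
    static double factorial(int n)
    {
        return (n <= 1) ? 1.0 : (double)n * factorial(n - 1);
    }

    double run_loop(long loops)
    {
        // 'volatile' is one way to insist that the repeated work is kept.
        volatile double sink = 0.0;
        for (long i = 0; i < loops; ++i)
            sink = factorial(16);   // constant argument: a folding candidate
        return sink;
    }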



    But threaded is faster than nonthreaded, by about 2x on dual machines. FP arithmetic seems to be a little slower than integer. Pretty much makes sense. It's sure easier to interpret than "benchmarks" using multi-megabyte bloated apps where nobody knows whether they use both processors, how they were compiled, or anything else.
  • Reply 119 of 146
    luca Posts: 3,833 member
    My 1 GHz eMac got 22 seconds on the first binary that was posted here. I didn't feel like installing the developer tools so I'm glad you turned it into a binary to make it easier to run.
  • Reply 120 of 146
    Okay guys, try this. I haven't verified that the math does anything useful except keep the processor busy for a while without crashing (56 seconds in the case of my 1 GHz G4). I've also left multi-threading & SIMD implementations as an exercise for the reader. There are 3 typedefs for numerictype; change which one is commented out to make it use a different type (float, double, or 64-bit integer fixed point). Somebody make & post some binaries.



    Code:




    #include <stdio.h>
    #include <math.h>
    #include <time.h>
    #include <pthread.h>

    long long fixed64one = 1LL << 32;

    // 32.32 fixed-point number stored in a 64-bit integer.
    class fixed64
    {
    public:
        fixed64 (void) : value(0) {};
        fixed64 (double f) : value((long long)(f*(double)fixed64one)) {};
        fixed64 (const fixed64& f) : value(f.value) {};
        const fixed64& operator= (const fixed64& f) { value = f.value; return(*this); };
        const fixed64& operator+= (const fixed64& f) { value += f.value; return(*this); };
        const fixed64& operator-= (const fixed64& f) { value -= f.value; return(*this); };
        const fixed64& operator*= (const fixed64& f) { value *= f.value; value >>= 32; return(*this); };
    private:
        long long value;
    };

    fixed64 operator+ (const fixed64& a, const fixed64& b) { fixed64 result = a; result += b; return(result); }
    fixed64 operator- (const fixed64& a, const fixed64& b) { fixed64 result = a; result -= b; return(result); }
    fixed64 operator* (const fixed64& a, const fixed64& b) { fixed64 result = a; result *= b; return(result); }

    //-------------------------------------------------------------------------------------------------------------------

    // Pick the arithmetic type by changing which typedef is commented out.
    //typedef fixed64 numerictype;
    //typedef float numerictype;
    typedef double numerictype;

    // Multiply 'count' 4-element vectors by the 4x4 matrix m.
    void TransformVectors (unsigned int count, numerictype m[4][4], numerictype in[][4], numerictype out[][4])
    {
        numerictype m00=m[0][0], m01=m[0][1], m02=m[0][2], m03=m[0][3];
        numerictype m10=m[1][0], m11=m[1][1], m12=m[1][2], m13=m[1][3];
        numerictype m20=m[2][0], m21=m[2][1], m22=m[2][2], m23=m[2][3];
        numerictype m30=m[3][0], m31=m[3][1], m32=m[3][2], m33=m[3][3];

        for (unsigned int i = 0; i < count; ++i)
        {
            out[i][0] = m00*in[i][0]+m01*in[i][1]+m02*in[i][2]+m03*in[i][3];
            out[i][1] = m10*in[i][0]+m11*in[i][1]+m12*in[i][2]+m13*in[i][3];
            out[i][2] = m20*in[i][0]+m21*in[i][1]+m22*in[i][2]+m23*in[i][3];
            out[i][3] = m30*in[i][0]+m31*in[i][1]+m32*in[i][2]+m33*in[i][3];
        }
    }

    int main (void)
    {
        const unsigned int num_iterations = 100000;
        const unsigned int num_vectors = 4096;

        numerictype m[4][4] = {{0.1,0.2,0.3,0.0},{0.7,0.8,0.9,0.0},{0.4,0.5,0.6,0.0},{0.3,0.1,0.6,1.0}};
        numerictype va[num_vectors][4], vb[num_vectors][4];

        for (unsigned int count = 0; count < num_vectors; ++count)
        {
            va[count][0] = count/10.0;
            va[count][1] = va[count][0]*va[count][0];
            va[count][2] = va[count][0]+va[count][1];
            va[count][3] = va[count][0]+va[count][1]*va[count][2];
        }

        time_t start, end;
        time(&start); // Get starting time

        for (unsigned int i = 0; i < num_iterations; ++i)
        {
            // These are split into halves which could be done by separate threads...
            TransformVectors(num_vectors/2, m, va, vb);
            TransformVectors(num_vectors/2, m, va+num_vectors/2, vb+num_vectors/2);

            TransformVectors(num_vectors/2, m, vb, va);
            TransformVectors(num_vectors/2, m, vb+num_vectors/2, va+num_vectors/2);
        }

        time(&end); // Get ending time
        printf("Time = %ld\n", (long)(end - start));
        return 0;
    }
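
    Following up on the "exercise for the reader" above, a minimal sketch of how the two halves in the main loop could be handed to separate pthreads (an assumption about how one might split the work, not code Programmer posted; the HalfJob struct and transform_half name are made up):

    Code:

    #include <pthread.h>

    // Assumes numerictype and TransformVectors from the listing above are
    // compiled and linked in; the prototype is repeated here for clarity.
    typedef double numerictype;
    void TransformVectors (unsigned int count, numerictype m[4][4],
                           numerictype in[][4], numerictype out[][4]);

    // Everything one worker thread needs to transform its half of the vectors.
    struct HalfJob
    {
        unsigned int   count;    // number of vectors in this half
        numerictype  (*m)[4];    // the 4x4 matrix
        numerictype  (*in)[4];   // first input vector of this half
        numerictype  (*out)[4];  // first output vector of this half
    };

    static void* transform_half (void* arg)
    {
        HalfJob* job = (HalfJob*)arg;
        TransformVectors(job->count, job->m, job->in, job->out);
        return 0;
    }

    // In the main loop, each pair of back-to-back TransformVectors calls could become:
    //
    //   HalfJob jobs[2] = { { num_vectors/2, m, va,               vb               },
    //                       { num_vectors/2, m, va+num_vectors/2, vb+num_vectors/2 } };
    //   pthread_t threads[2];
    //   pthread_create(&threads[0], 0, transform_half, &jobs[0]);
    //   pthread_create(&threads[1], 0, transform_half, &jobs[1]);
    //   pthread_join(threads[0], 0);
    //   pthread_join(threads[1], 0);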




