Benchmark Mac OS X

Posted in macOS, edited January 2014
I wrote a simple benchmarking program for a class and thought I would share it here for anybody who may be interested.



The program and instructions are here: http://homepage.mac.com/rogue27/benchmark/



I am interested in seeing other people's results posted in this thread, especially to compare performance between 10.1 and 10.2, G3 and G4, and single vs dual.



For comparison, here are my results: (I added the hardware information myself - the program doesn't do that)

[quote]

Hardware:

400 MHz G3

384 MB RAM

OS X.2.4



Compiler version: (can be found by typing c++ -v)

Thread model: posix

Apple Computer, Inc. GCC version 1175, based on gcc version 3.1 20020420 (prerelease)



Test 1: a=0

0.005082 microseconds each



Test 2: getpid()

0.053682 microseconds each



Test 3: malloc() and free()

1.434426 microseconds each



Test 4: new and delete

1.517845 microseconds each



Test 5: pthread_create() and pthread_join()

280.024360 microseconds each



Test 6: fork() and exit()

3319.775100 microseconds each

[/quote]



Another result:

[quote]

Hardware:

Dual 1 GHz Pentium III

512 MB RAM

Red Hat Linux 8.0



Compiler:

Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --host=i386-redhat-linux

Thread model: posix

gcc version 3.2.1 20030202 (Red Hat Linux 8.0 3.2.1-7)



Test 1: a=0

0.002730 microseconds each



Test 2: getpid()

0.430145 microseconds each



Test 3: malloc() and free()

0.201980 microseconds each



Test 4: new and delete

0.260633 microseconds each



Test 5: pthread_create() and pthread_join()

10.667020 microseconds each



Test 6: fork() and exit()

812.037700 microseconds each

[/quote]



More tests will be added in the future.



[ 03-09-2003: Message edited by: rogue27 ]

Comments

  • Reply 1 of 12
    serrano Posts: 1,806 member
    iBook 700 (Opaque 16)



    [quote]

    Test 1: a=0

    0.001445 microseconds each



    Test 2: getpid()

    0.029638 microseconds each



    Test 3: malloc() and free()

    0.810515 microseconds each



    Test 4: new and delete

    0.859413 microseconds each



    Test 5: pthread_create() and pthread_join()

    123.473320 microseconds each



    Test 6: fork() and exit()

    2417.014600 microseconds each

    [/quote]



    [quote]

    Thread model: posix

    Apple Computer, Inc. GCC version 1175, based on gcc version 3.1 20020420 (prerelease)

    [/quote]



    [ 03-10-2003: Message edited by: serrano ]
  • Reply 2 of 12
    rogue27 Posts: 607 member
    It looks like the tests are all proportionally faster on your computer due to the clockspeed change, except for test 1 which is accelerated by more than just clockspeed. There must have been some significant improvement to the G3 to cause that. I noticed test 1 also runs twice as fast (takes half as much time) on a G4 of equal speed.



    The memory operations (malloc(), free(), new, and delete) run much faster on Pentiums. I don't know if there is much that can be done about that, because I think most of the speed advantage there is in hardware. However, improvements to the compiler and the operating system should still help these somewhat.
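
    Roughly, the malloc()/free() test is a loop along these lines (just a sketch, not my exact source - the block size and loop count here are made up):

    Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main()
    {
        const long N = 1000000;              // number of malloc()/free() pairs (made-up count)
        struct timeval start, stop;

        gettimeofday(&start, 0);
        for (long i = 0; i < N; i++) {
            void *p = malloc(1024);          // grab a fixed-size block...
            free(p);                         // ...and release it right away
        }
        gettimeofday(&stop, 0);

        double total = (stop.tv_sec - start.tv_sec) * 1e6
                     + (stop.tv_usec - start.tv_usec);
        printf("%f microseconds each\n", total / N);
        return 0;
    }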



    pthread_create() and pthread_join() time how long it takes to create a thread and wait for it to finish. Improvements to the operating system should allow this to get faster.
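
    Stripped down, that test is basically this (a sketch with an arbitrary loop count, not necessarily my exact code):

    Code:

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>

    // Thread body that does nothing, so the loop mostly measures create/join overhead.
    static void *noop(void *arg)
    {
        return arg;
    }

    int main()
    {
        const long N = 1000;                 // number of create/join cycles (made-up count)
        struct timeval start, stop;

        gettimeofday(&start, 0);
        for (long i = 0; i < N; i++) {
            pthread_t t;
            pthread_create(&t, 0, noop, 0);  // start a thread
            pthread_join(t, 0);              // wait for it to finish
        }
        gettimeofday(&stop, 0);

        double total = (stop.tv_sec - start.tv_sec) * 1e6
                     + (stop.tv_usec - start.tv_usec);
        printf("%f microseconds each\n", total / N);
        return 0;
    }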



    fork() and exit() time how long it takes to create and destroy a process. Improvements to the operating system should also allow these to run faster.
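
    And the fork() test looks roughly like this (again a sketch; the child calls _exit() here to skip stdio cleanup, which may differ from my actual code):

    Code:

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main()
    {
        const long N = 1000;                 // number of fork()/exit() cycles (made-up count)
        struct timeval start, stop;

        gettimeofday(&start, 0);
        for (long i = 0; i < N; i++) {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);                    // child quits immediately
            waitpid(pid, 0, 0);              // parent reaps the child before the next iteration
        }
        gettimeofday(&stop, 0);

        double total = (stop.tv_sec - start.tv_sec) * 1e6
                     + (stop.tv_usec - start.tv_usec);
        printf("%f microseconds each\n", total / N);
        return 0;
    }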



    The first test, a=0, just times a very simple instruction that can be done on the processor. For comparison, getpid() times a very simple system call. The difference between the two is mostly the time needed for a process to hand control of the CPU to the OS, for the OS to handle a simple request, and then to give the CPU back to the process. I would have expected this to be faster on a PC, since a Pentium has fewer registers to dump, but for some reason a significantly faster PC did this significantly slower than my laptop.
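
    The comparison boils down to something like this (a sketch only; the iteration count is arbitrary, and the volatile qualifiers are added here just so the compiler can't optimize the loop bodies away - the real program doesn't necessarily do that):

    Code:

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main()
    {
        const long N = 10000000;             // iterations (made-up count)
        volatile int a = 1;                  // volatile keeps the store inside the loop
        volatile pid_t p;
        struct timeval t0, t1, t2;

        gettimeofday(&t0, 0);
        for (long i = 0; i < N; i++)
            a = 0;                           // Test 1: plain assignment
        gettimeofday(&t1, 0);
        for (long i = 0; i < N; i++)
            p = getpid();                    // Test 2: about the cheapest system call there is
        gettimeofday(&t2, 0);

        double assign_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        double getpid_us = (t2.tv_sec - t1.tv_sec) * 1e6 + (t2.tv_usec - t1.tv_usec);

        // The gap between the two is roughly the cost of handing the CPU to the
        // kernel and getting it back.
        printf("a=0:      %f microseconds each\n", assign_us / N);
        printf("getpid(): %f microseconds each\n", getpid_us / N);
        return 0;
    }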



    When Mac OS X.3 comes out and we have the GCC 3.3 compiler, I would expect the last four tests to run measurably faster on the same hardware.
  • Reply 3 of 12
    gspotter Posts: 342 member
    G4 Dual 1 GHz MDD:



    Test 1: a=0

    0.002005 microseconds each



    Test 2: getpid()

    0.037230 microseconds each



    Test 3: malloc() and free()

    0.574456 microseconds each



    Test 4: new and delete

    0.605735 microseconds each



    Test 5: pthread_create() and pthread_join()

    228.008900 microseconds each



    Test 6: fork() and exit()

    1653.814800 microseconds each



    Thread model: posix

    Apple Computer, Inc. GCC version 1175, based on gcc version 3.1 20020420 (prerelease)



    [ 03-11-2003: Message edited by: GSpotter ]
  • Reply 4 of 12
    rogue27 Posts: 607 member
    Hmm... that's odd that an iBook would be faster on tests 1, 2, and 5.



    Were you doing stuff in other programs when you did this?



    The memory operations were much better on the MDD machine, but still 3x slower than a dual Pentium III.
  • Reply 5 of 12
    majormatt Posts: 1,077 member
    Hmm, these tests seem to be directly proportional to clock speed between Mac and PC. I thought Macs wouldn't do this poorly.
  • Reply 6 of 12
    amorph Posts: 7,112 member
    Integer ops scale linearly with clockspeed.



    Also, malloc() and new are bandwidth limited, and the Macs are running into the MaxBus (on G4s) or the 60x bus (on G3s), which don't have so much bandwidth to play with.



    The last three definitely look like areas where gcc's PowerPC code generation and Darwin's code could stand some work.



    [ 03-11-2003: Message edited by: Amorph ]
  • Reply 7 of 12
    rogue27 Posts: 607 member
    [quote]Originally posted by Amorph:

    Integer ops scale linearly with clockspeed.[/quote]



    But the 700 MHz iBook beat the dual 1 GHz G4 on the first test.



    [quote]Also, malloc() and new are bandwidth limited, and the Macs are running into the MaxBus (on G4s) or the 60x bus (on G3s), which don't have so much bandwidth to play with.[/quote]



    More efficient code could help this a little, and I don't think the PC tested had DDR anyway, but I was told that Pentiums do such things about 4x faster than PowerPCs of equal clockspeed.





    [quote]The last three definitely look like areas where gcc's PowerPC code generation and Darwin's code could stand some work.[/quote]



    The last two really need some improvement. I think the fourth one is weak for the same reasons the third one is. I hope 10.3 and GCC 3.3 will make the last four tests go faster, but I won't be using those for quite a while.
  • Reply 8 of 12
    rogue27 Posts: 607 member
    I changed the source code a bit (still available at the URL in my first post), and the changes seem to yield slightly better overall results.



    It looks like OS X isn't too shabby, but it could really use some improvement to thread creation and joining, which is embarrassingly slow at the moment. (POSIX threads, anyway - I'm not sure if Apple normally uses some other thread model.)



    I'm trying to get a friend of mine to post some results from a Windows machine if he can get my code to compile with DJGPP.



    I should be adding some filesystem performance tests within the next few weeks.



    The only bad thing is that the tests are all being done on different hardware. Eventually, I want to benchmark every version of OS X on the same machine and put up some graphs.
  • Reply 9 of 12
    123 Posts: 278 member
    Quote:

    Originally posted by rogue27

    I should be adding some filesystem performance tests within the next few weeks.




    Before you do that, think about your benchmarks. What are they supposed to tell us? (If you don't know, you can't do a meaningful analysis: well, getpid() is faster on this machine... so what? What does that mean?) How can you be sure they do what you want? (Did you look at the assembler code? You're using -O3 after all.) Isn't there a better, more direct way to benchmark some things?



    For example: Test 1 is useless. Even without optimization, all you're benchmarking is the for loop (and maybe memory access). If you do it right, you will get the theoretical number (1/clockspeed), which is also not really useful, but at least it's a correct number. So, at the very least, subtract the for loop and make sure li rX,0 is actually executed.
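
    Something along these lines would isolate the assignment (a rough sketch; the volatile qualifiers and the iteration count are arbitrary choices, and pipelining still makes the subtraction only approximate):

    Code:

    #include <stdio.h>
    #include <sys/time.h>

    // Microseconds between two gettimeofday() samples.
    static double elapsed(const struct timeval &a, const struct timeval &b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
    }

    int main()
    {
        const long N = 100000000;            // iteration count (arbitrary)
        volatile int a = 1;                  // volatile keeps the store inside the loop
        struct timeval t0, t1, t2;

        gettimeofday(&t0, 0);
        for (volatile long i = 0; i < N; i++)
            ;                                // empty loop: pure loop overhead
        gettimeofday(&t1, 0);
        for (volatile long i = 0; i < N; i++)
            a = 0;                           // same loop, plus the assignment under test
        gettimeofday(&t2, 0);

        // Subtract the empty-loop time so only the assignment itself remains.
        double per_op = (elapsed(t1, t2) - elapsed(t0, t1)) / N;
        printf("assignment alone: %f microseconds each\n", per_op);
        return 0;
    }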
  • Reply 10 of 12
    rogue27 Posts: 607 member
    Quote:

    Originally posted by 123

    getpid() is faster on this machine... so what? What does that mean?



    It means that the time it takes for a process to hand control to the OS, and for the OS to hand control back to the process, is shorter. getpid() is essentially the simplest system call you can make, so what is meaningful is not the speed of getpid() itself, but the speed of the context switching. I explained that in one of my posts above.



    Quote:

    How can you be sure they do what you want? (Did you look at the assembler code? You're using -O3 after all.) Isn't there a better, more direct way to benchmark some things?



    I can't be sure, since the assembler code wouldn't make too much sense to me if I did look at it. I also know that, in part, I am benchmarking the compiler itself as much as I am benchmarking certain system functions. However, since the OS is compiled with the compiler, it is a fair enough test. Also, the loops are needed, because without a high number of tests, the results wouldn't be as accurate due to the time spent making the time function calls. With a higher number of tests, the time spent making the time calls can be divided over a large number of trials, until its effect is practically nonexistent.



    Quote:

    For example: Test 1 is useless. Even without optimization, all you're benchmarking is the for loop (and maybe memory access). If you do it right, you will get the theoretical number (1/clockspeed), which is also not really useful, but at least it's a correct number. So, at the very least, subtract the for loop and make sure li rX,0 is actually executed.



    I agree that Test 1 is pretty useless; however, it was required on the homework assignment I did this for.



    Anyway, what the numbers mean:



    Test 1: a=0

    This test supposedly tests integer assignment. Most of the time in this test is spent in the for loop, and for the most part, this test scales with clockspeed. However, a G4 of equal speed to my G3 did this test twice as fast. The compiler probably does some loop unrolling to make this faster. It does take about twice as long without optimizations.



    Test 2: getpid()

    This test does a very simple system call, to measure how much overhead is involved in doing the context switches needed to make a system call. This works because the time spent running the loop and getting the process ID is negligible compared to the time spent doing the context switches (passing control of the CPU to the OS, and then having the OS pass control of the CPU back to the user process).



    Tests 3 and 4:

    These tests show how long it takes to allocate and deallocate space in memory.



    Test 5: Threads

    This test times how long it takes to create and join a thread.



    Test 6: Fork and Exit

    This test times how long it takes to create and exit a process. The main purpose of this is to show how much faster it is to use threads than to create additional processes.



    Test 4 will not be in the next version, because I don't think comparing two different ways to allocate memory is very important, since everybody knows C is faster than C++. The tests will also not be numbered in the next version.
  • Reply 11 of 12
    wmf Posts: 1,164 member
    I think I'll stick with lmbench.
  • Reply 12 of 12
    123 Posts: 278 member
    Quote:

    Originally posted by rogue27



    I am benchmarking the compiler itself as much as I am benchmarking certain system functions. However, since the OS is compiled with the compiler, it is a fair enough test.



    The compiler probably does some loop unrolling to make this faster.





    So the purpose of this test is to see whether a compiler does loop unrolling or not? (They almost never do; they mostly do software pipelining.) Fair enough, but then you shouldn't say that "this test supposedly tests integer assignment"; call it a compiler optimization test.



    Code:




    _main:

    ....

    lis r24,0x5f5 ; l = ..... hi

    bl L_sleep$stub

    ori r29,r24,57600 ; l = ..... lo

    addi r3,r1,64

    li r4,0

    bl L_gettimeofday$stub

    mtctr r29 ; l -> loop var

    L53:

    bdnz L53 ; if (l-- != 0) then goto L53

    addi r27,r1,80

    li r4,0 ; a = 0

    mr r3,r27

    addi r26,r1,64

    bl L_gettimeofday$stub

    .....









    As you can see, the assignment (li r4,0) is executed only ONCE (you've told the compiler to do as much optimization as it can!).





    Quote:



    Also, the loops are needed, because without a high number of tests, the results wouldn't be as accurate due to the time spent making the time function calls




    I didn't say you don't need the loop, but you have to subtract the time you spend in the loop. The easiest way (though not necessarily entirely accurate, because of pipelining) to do this is to measure how much time is spent in an empty loop (actually, as you can see from above, you're already doing this). If you do it right, you'll find out that the assignment is done in one clock cycle. But then again, your interpretation will probably be wrong, because loading 0 is not the same thing as loading a bigger 32-bit integer constant, and you probably didn't know that. Also, for example, a 0 assignment is the same on the 970 and the G4, while assigning a 64-bit integer constant is faster on the G4... All I want to say is: many benchmarks you can find on the net have design flaws, or the result is hard to interpret, or it doesn't tell the whole story and is useless... It's not entirely trivial to come up with a good benchmark.



    Quote:



    This test times how long it takes to create and exit a process. The main purpose of this is to show how much faster it is to use threads than to create additional processes.





    To USE threads? You'd also have to compare single address space vs. IPC (pipes, shared memory etc.), copy-on-write, user-level scheduling vs. LWP kernel vs. HW kernel scheduling etc. for several thread and process models...



    Quote:



    These tests show how long it takes to allocate and deallocate space in memory.



    because I think most of the speed advantage there is in hardware





    I'm curious what you think you are measuring here... If you benchmark an algorithm, it's quite bad to test it for one case only. The size should be variable, and the malloc() and free() calls should occur in random order.
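
    For example, something like this exercises the allocator a bit more realistically (a rough sketch; the sizes and counts are arbitrary):

    Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    // Microseconds between two gettimeofday() samples.
    static double elapsed(const struct timeval &a, const struct timeval &b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
    }

    int main()
    {
        const int N = 100000;                // number of live blocks (arbitrary)
        static void *blocks[N];
        struct timeval t0, t1, t2, t3;

        // Allocate blocks of varying size instead of one fixed size.
        gettimeofday(&t0, 0);
        for (int i = 0; i < N; i++)
            blocks[i] = malloc(16 + rand() % 4096);
        gettimeofday(&t1, 0);

        // Shuffle the pointers (not timed) so the frees don't happen in allocation order.
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            void *tmp = blocks[i];
            blocks[i] = blocks[j];
            blocks[j] = tmp;
        }

        // Free everything in that random order.
        gettimeofday(&t2, 0);
        for (int i = 0; i < N; i++)
            free(blocks[i]);
        gettimeofday(&t3, 0);

        double total = elapsed(t0, t1) + elapsed(t2, t3);
        printf("%f microseconds per malloc()/free() pair\n", total / N);
        return 0;
    }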