The usefulness of a programming paradigm is independent of the ratio of talented to sloppy programmers/designers. Knuckling under to what the lowest common denominator can handle does nobody any good and is no way to get ahead in the field. Let the lazy fall to the business version of Darwin. I haven't met an interactive software project yet that wouldn't readily avail itself of significant threading, but you can't just add it in after the fact; by then it's too late.
It's better to hire a pair of tens and teach them to be good designers, letting the rest of the team fill in the code, rather than spend twice as long on the project at double the cost with a staff of 5-7's who don't realize the utility of good design. I have brilliant compatriots who should be able to code circles around me but can't complete quality projects, and now complain when they have to work 80-hour weeks on a product six months overdue on a timeline that had already slipped. They did very little design--and it shows.
You cannot talk of thread switch costs without comparing them to process context switches; to do so completely ignores the entire reason for more than one thread to be allowed to exist per process. Threads do not force processor context switches; processes do. The costs are significantly different, and the ability to execute an in-process thread switch eliminates an unnecessary "extra" process context switch. Process context switches forced by a scheduler are a rough wash, as they happen in roughly the same number of time-slices under either paradigm. More threads in any particular process will add a few words each during the switch, but it takes far less time to transfer those words than the in-process switch saves by avoiding an extra blocked-process switch.
I agree in general with the first sentence of your final paragraph, but those distinctions are most important for describing optimal performance, not better-than performance. Your following additional points on varying cores and thread execution resources per core actually make it more advantageous to have a properly threaded application even if it is not optimally threaded for that architecture, because utilizing more of the available resources will just about always be more efficient than leaving them idle. And if there are fewer resources available, then it is even more important to avoid performance-killing blocked execution.
Quote:
Originally posted by Hiro
The usefulness of a programming paradigm is independent of the ratio of talented to sloppy programmers/designers. Knuckling under to what the lowest common denominator can handle does nobody any good and is no way to get ahead in the field. Let the lazy fall to the business version of Darwin. I haven't met an interactive software project yet that wouldn't readily avail itself of significant threading, but you can't just add it in after the fact; by then it's too late.
After reading this I really got to wondering where you got the idea that I'm "anti-threading", so I read back over this topic to see where, and I noticed that a couple of people seem to have misread my earlier post. I did not take a position against multiple processors or multi-threading -- I was simply arguing that the original post's idea of using multiple chips was not a particularly good idea. I am most emphatically not against multiple cores or SMT, and therefore it follows that I am not anti-threading. Considering the position I've taken in other threads on this forum, it is amusing to see people thinking that I'm somehow anti-multiprocessor.
I do stand by my original position that adding additional chips to systems is exactly the opposite of the direction computer development has been taking for the past 30-odd years. Instead, everything is becoming more integrated, not less. Also, inter-chip communication is relatively slow and high-latency compared to on-chip communication, so the benefits of multiple processors at the motherboard level are substantially less than at the chip level.
Quote:
It's better to hire a pair of tens and teach them to be good designers, letting the rest of the team fill in the code, rather than spend twice as long on the project at double the cost with a staff of 5-7's who don't realize the utility of good design. I have brilliant compatriots who should be able to code circles around me but can't complete quality projects, and now complain when they have to work 80-hour weeks on a product six months overdue on a timeline that had already slipped. They did very little design--and it shows.
I am in complete agreement. The opening line of my previous post was meant to describe the reality of the software industry at large (as you point out here), not the preferred way of operating. You'll note the smiley attached to my statement. FWIW, I am firmly in the design-is-important camp and have been for 15+ years.
Quote:
You cannot talk of thread switch costs without comparing them to process context switches; to do so completely ignores the entire reason for more than one thread to be allowed to exist per process. Threads do not force processor context switches; processes do. The costs are significantly different, and the ability to execute an in-process thread switch eliminates an unnecessary "extra" process context switch. Process context switches forced by a scheduler are a rough wash, as they happen in roughly the same number of time-slices under either paradigm. More threads in any particular process will add a few words each during the switch, but it takes far less time to transfer those words than the in-process switch saves by avoiding an extra blocked-process switch.
Heh. Of course you can talk about the cost of thread switching without comparing it to the cost of process switching. Run in a single-process system and such a comparison is meaningless! Most embedded systems (at least small ones) and all game consoles are entirely single-process, multi-threaded environments.
The cost of a thread switch is (typically) the cost of saving the processor state, the impact doing so has on the caches, the effects of the interrupt causing the switch (such as blowing the pipelines and/or the branch predictor), and the cost of the thread scheduler code in the kernel. Sure that all amounts to a mere fraction of a process switch (if you have such a thing), but since I was ignoring that cost in the first place that comparison isn't really relevant.
Quote:
I agree in general with the first sentence of your final paragraph, but those distinctions are most important for describing optimal performance, not better-than performance. Your following additional points on varying cores and thread execution resources per core actually make it more advantageous to have a properly threaded application even if it is not optimally threaded for that architecture, because utilizing more of the available resources will just about always be more efficient than leaving them idle. And if there are fewer resources available, then it is even more important to avoid performance-killing blocked execution.
Well, as you pointed out:
Quote:
And yes sometimes the proper design may be a single thread, but that is becoming increasingly rare.
My statement about "you got it backwards" was directed at your "well designed threaded code will almost always outperform less threaded code except in the design corner not requiring shared resources or i/o". Your statement implies that threading will do better than non-threading except where there are no shared resources or I/O. I said this was backwards because it is when you hit shared resources and I/O that you have synchronization points, which tend to slow you down -- it is in cases where you have pure computational threads running on completely independent calculations that multi-processors achieve their maximal performance gain over single-threaded code.
Your earlier statement...
Quote:
But the lion's share of things which can be threaded actually don't need constant multiple references to shared data. They merely separate the code likely to block from code that should never, ever be blocked for that reason.
... is interesting because lots of code needs constant multiple references to shared data. Fortunately, read access to shared data is, in most systems, free because unmodified data can be held in the caches of multiple processors without any problems. It is only when it has to be modified that things get a bit troublesome. Frequently this occurs in code that doesn't need to block but does benefit from being distributed across multiple threads to attack a problem in a divide-and-conquer manner. Consider something like Photoshop, where a filter is being applied to a multi-megabyte source image to generate a new image. The filter may draw on a large area of the image to compute each output pixel, but each pixel is output only once. The inputs for successive pixels overlap greatly, but the outputs don't overlap at all (not even at the cacheline level if you divide up the work properly). Each thread is assigned a set of pixels to compute, and will run until finished with no blocking required.

How many threads to use? In the absence of SMT it is probably best to have 1 thread per core -- more threads than cores will just result in extra overhead and slow down the overall calculation. If SMT is present, whether you want to take advantage of that will depend on how many stalls you are getting in your execution pipelines and how much pressure running two threads will put on the caches and memory subsystem. If other processes are running in the system then it might make sense to use fewer threads so that the OS doesn't take away any of the execution resources you are using to service the other processes, which would delay the overall calculation, pollute your caches, etc.
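To sketch that divide-and-conquer shape concretely (modern C++ shorthand for illustration; the era's equivalent would be pthreads, and the filter kernel here is just a stand-in):

Code:
#include <algorithm>
#include <thread>
#include <vector>

// Each worker writes a disjoint range of output rows, so no locking is needed.
void filter_rows(const float* src, float* dst, int width,
                 int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            dst[y * width + x] = src[y * width + x] * 0.5f;  // stand-in kernel
}

void filter_image(const float* src, float* dst, int width, int height) {
    // One thread per core, per the non-SMT rule of thumb above.
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    int rows_per = (height + static_cast<int>(n) - 1) / static_cast<int>(n);
    for (unsigned i = 0; i < n; ++i) {
        int begin = static_cast<int>(i) * rows_per;
        int end = std::min(height, begin + rows_per);
        if (begin >= end) break;
        workers.emplace_back(filter_rows, src, dst, width, begin, end);
    }
    for (std::thread& t : workers) t.join();  // each runs to completion, never blocks
}

Dividing the work by rows rather than interleaving pixels is what keeps the outputs from ever sharing a cacheline.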
I missed wizard69's post earlier, sorry. I partially responded in the preceding response to Hiro, but this is interesting enough that I'll reply in detail here as well...
Quote:
Originally posted by wizard69
I'm not convinced that one would have to go that far. All you really need is high integration on the chips themselves and high speed communications between chips. We are nearing the stage where it is possible to contain an entire computing system on a die, so for many applications where communications does not have a huge impact these are very valuable approaches. Sure this isn't the cure for every possible computing illness but will work well for many. AMD is one to watch here, as they are close to implementing what I would consider ideal systems for the future.
As I said above, I completely agree. Highly integrated chips which contain multiple cores and I/O devices are definitely the future. Your point about high degrees of integration reducing the chip count which in turn allows the chip count to be slightly increased to add more cores is a reasonable one -- this is exactly why I expect Apple to ship a quad core machine when a dual core 970 is available. They could reduce the chip count, but since they're already paying for the two chips why not keep paying and get a lot more bang for the buck. Similarly, this could be extended to the northbridge: get an on-chip memory controller w/ HT interface and instead of 2 970's + northbridge you can fit 3 dual core 970's. IMO this is still a long way from a massively parallel PowerMac built out of lots of chips. I don't see an increase in chip count being practical.
Quote:
It is not like I would want to see these simple CPUs give up the instruction set that we currently have. Just that I would rather see many cores on a chip and take a tiny performance hit, as opposed to having just two. Thus my statement about a 603-class chip without a super AltiVec unit, containing instead a simpler implementation of the AltiVec instructions. I believe that the instruction set needs to be balanced more so than processor performance.
This is interesting coming from the same guy who was arguing for AltiVec2, more instructions, bigger registers, etc in the other (970GX) thread.
Quote:
While your statements are not wrong as stated above, I'm not sure I'd take the same meaning from them that you do. Many common implementations today are not symmetric at all when looked at from the standpoint of thread performance. SMT and all of the other thread-processor approaches do apply priorities to threads, so threads really are not running with symmetric capabilities. Same with multiprocessor systems, where in some cases one processor runs at a disadvantage to another. The operating systems of today have to take knowledge about thread performance into account when scheduling, so the complexity issue is already there. Frankly, putting as many processors as possible on one chip brings some symmetry back into the equation (at least for the processors on that chip).
The nice thing about SMT priorities is that they can be dynamically balanced based on many factors. Having some processors without AltiVec would be doable, but would introduce complications and inefficiencies. The OS would probably catch the "invalid instruction" interrupt, check if it was an AltiVec instruction, and then move the offending thread over to an AltiVec-equipped processor. This just complicates life, however, and it's my belief that we'd just be better off with a uniform set of cores that all implement the same instruction set -- hence part of my argument in the other thread for not changing that instruction set unless there is a very significant advantage to doing so. These days AltiVec isn't such a huge chunk of chip real estate (10-20 million transistors?), and a surprising amount of code uses it (including frequently called OS functions).
The non-symmetric parts of the chip can just be kept out of the PowerPC user model -- either accessed through OS calls, or as memory mapped I/O. The majority of software doesn't know or care how the new hardware capabilities work, they just do. If programmable features are added, that can be done through special APIs and languages (such as OpenGL's shader language, for example).
Quote:
Apple could get a market advantage if they could convince IBM to go along by implementing many cores on a chip. It may very well be possible to implement as many as four cores on one chip while the competition is trying to cram two of their cores on a chip. If such a chip were to exist tomorrow, Apple would be able to implement it well immediately with the current system software they have. I think you would see very few complaints about systems with 4 or 8 processors at today's prices.
I don't think IBM requires any convincing. None at all. The POWER4 started life as a dual core design. The POWER5 added SMT to the dual core design. IBM is working on Cell with Sony. MS is rumoured to be getting IBM to build a 3 core SMT PowerPC chip for them. 5+ years ago IBM was putting out papers about what they were going to do when they could put a billion transistors on a chip (in 2010) and it was all about a grid of processors on one chip. IBM is leading on this one, we just haven't seen it reach Apple's chips... yet. They are all set up for it though -- the 970FX is about half the transistor count of Prescott, and is designed to work well in SMP systems.
We are mostly in agreement, and your examples point out the design constraints of finding any particular application's best design, including threading. Go ahead and ignore process vs. thread switching cost at your own performance peril. That's your choice, and one I don't agree with. We just seem to have worked on different enough projects to have a different opinion of which end of the spectrum to start at when we begin the design phase.
I don't understand your last paragraph, though. If you spread a single process across multiple processors, and you arrived at that point via an intelligent design, you should have already made the choices to minimize problems and maximize throughput. Locks for critical-region code will preclude access to cached data, read or write, for the duration of the lock, so read/write cache issues are relatively moot unless you do something that thrashes those memory values. That kind of bad design is an indication of poor threading choices.
I am working from an assumption that work will be put in to make the appropriate choices and the code will be reasonably well written, something greatly helped by good clean design. When discussing the merits of something like threading, it doesn't help to be scared off by what can happen if you make poor decisions, unless you plan on making poor decisions. I can create plenty of examples where a violation of good design and coding principles can completely hose things up; that doesn't change what the best design was, though, it just presents data points of where that best design was missed. Heck, look at the Finder or Safari: we have great examples of where the best design was missed. Tab A should never block completion of Tab B, and navigating in Column View shouldn't freeze the whole app while accessing the next nested file list.
Quick question on "threading" and multitasking (not sure if I'm using the correct terminology, but I'll try). Is it possible to design a computer system that would allow multiple operations to be performed IN the same application at the same time? For example, would it be possible to be applying a filter to a huge PS file and, while that is taking place, open another PS file and start working on that, saving other open files, etc.? Right now I can easily switch between apps while the system is working on one process in a number of apps, but not work on multiple processes in the same program. Is this possible? Is there a name for this type of multitasking?
Thanks for any info on this. I apologize for my lack of correct technical lingo...
Quote:
Originally posted by mikenap
Quick question on "threading" and multitasking (not sure if I'm using the correct terminology, but I'll try). Is it possible to design a computer system that would allow multiple operations to be performed IN the same application at the same time? For example, would it be possible to be applying a filter to a huge PS file and, while that is taking place, open another PS file and start working on that, saving other open files, etc.? Right now I can easily switch between apps while the system is working on one process in a number of apps, but not work on multiple processes in the same program. Is this possible? Is there a name for this type of multitasking?
Absolutely, this is a key use of multi-threading. It is up to the application developer to design his software this way.
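A rough sketch of what that looks like from the developer's side (modern C++ for illustration; apply_filter and the event pump are hypothetical placeholders, not Photoshop's actual code):

Code:
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> filter_done{false};

void apply_filter() {   // hypothetical long-running Photoshop-style filter
    // ... minutes of number crunching ...
    filter_done = true;
}

int main() {
    std::thread worker(apply_filter);   // the filter grinds away in the background
    while (!filter_done) {
        // The main thread stays free: open another file, edit it, save others.
        // handle_next_ui_event();      // hypothetical event pump
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    worker.join();
}

The OS schedules the two threads onto whatever processors are available, so on a multi-CPU machine the filter doesn't even slow the rest of the app down.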
Here's a question for you all. On CNN Money they said that the Cell chips would be made at the East Fishkill plant in NY and would be ten times faster than the ones currently used. The thing I saw that really shocked me was that they said in the article that IBM committed to put them in a Sony computer first. Where was Apple in all of this? What will we have that will keep us afloat if these things will be in PCs first?
Quote:
Originally posted by synp
To handle this, they have to allocate multiple buffers instead of using static memory ...
LOL. You're funny.
Quote:
Originally posted by synp
... and they have to use serialization like semaphores or mutexes. All of these add a lot of complexity to all functions. OTOH all of the things needed to be event-driven can be concentrated into a nice well-debugged library.
This is worse!
Quote:
Originally posted by synp
Unless you need very high performance, and unless your job is very amenable to being split up into chunks, it is far better to use an event model rather than multiple threads.
Threading is useful when an application has to do multiple, logically separate things, such as a mail client displaying email while retrieving new mail, or Word running the spell checker in a separate thread. None of this is "high performance", and the job is not split into "chunks" as it is in rendering. These are logically separate tasks, so they deserve separate threads.
Quote:
Originally posted by synp
Only apps that can really benefit (think Photoshop and rendering apps) will use the high parallelism.
MS Word benefits too but for different reasons than Photoshop/Max/Lightwave do. Once you understand why this is so then you're on your way to competently designing MT apps.
Quote:
Originally posted by synp
There's really no need to make a thread and have it sleep waiting for a network event.
Yes there is. Event loops usually require one switch statement, which gets very big very quickly in any non-trivial app. Do you know why the early Java UI API changed its event handling? Because they implemented what you're advocating, and the real world caught up with them, so they had to change it.
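To illustrate the difference (a hedged sketch, not the actual AWT API): the single switch accretes a case for every event type in the app, while listener registration lets each source own its handlers.

Code:
#include <vector>

enum EventType { CLICK, KEY, NET_DATA };
struct Event { EventType type; };

// The old way: one switch that grows with every new event type.
void dispatch_switch(const Event& e) {
    switch (e.type) {
        case CLICK:    /* handle click */     break;
        case KEY:      /* handle keystroke */ break;
        case NET_DATA: /* handle net data */  break;
        // ...dozens more cases in any non-trivial app
    }
}

// The listener direction Java moved toward: sources call registered handlers.
struct Listener {
    virtual ~Listener() {}
    virtual void handle(const Event& e) = 0;
};
std::vector<Listener*> listeners;
void dispatch_listeners(const Event& e) {
    for (Listener* l : listeners) l->handle(e);
}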
Quote:
Originally posted by synp
Even high-performance servers like Apache use only a few threads.
A little knowledge is a dangerous thing. Apache and IIS use a "few threads" (Apache on *nix fork()s as well) because they are not CPU-bound but IO-bound. They use thread pooling, so after a thread issues an IO request (disk or network) it can continue servicing other pending requests.
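The pooling shape is simple enough to sketch (modern C++ stand-in for whatever primitives those servers actually use): a fixed set of workers pulls requests off a shared queue, so a thread never sits idle just because one client's IO is slow.

Code:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;   // pending client requests
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                        if (stopping && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();   // service one request, then grab the next
                }
            });
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_all();
        for (std::thread& w : workers) w.join();
    }
};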
I can only conclude from your comments (the first one was particularly revealing) that you're still a noob when it comes to programming.
Anyway, Cell will be awesome. TheRegister was reporting 2 TFlops. The processor is by IBM, Toshiba and Sony. No wonder Sony, and not Apple, will see it first.
Are there many technical differences in the Cell processor vs. the PowerPC we use today, such as the G5? Would it be possible for Apple to plug in these chips without software developers having to jump through any hoops? I thought I read there is no AltiVec on board; if the chip is that fast, does it matter?
Wouldn't it be in IBM's best interest to sell these chips to Apple?
MS Word benefits too but for different reasons than Photoshop/Max/Lightwave do. Once you understand why this is so then you're on your way to competently designing MT apps.
Thank you! I was starting to think people didn't realize that threading had more than one purpose.
1) The one everyone knows about, and the one which requires all the synchronization, is parallelism... breaking up one big task into multiple smaller ones so that it can be completed (much) more quickly.
2) The other purpose seems to have been lost on some people: splitting independent tasks off into their own threads to avoid resource contention, and to keep the app running full tilt. In the case of a GUI app, this includes keeping the UI fully responsive, regardless of how busy the app is. In other cases, like Apache, it allows multiple clients to make use of the server at once.
And you don't need to be a design guru to code the second kind of threading. Sure, you have to know how threads work, and what kind of things to split off, but they're really pretty easy to work with.
Quote:
Originally posted by TrevorD
Thank you! I was starting to think people didn't realize that threading had more than one purpose.
LOL You're welcome!
As you say threads are pretty easy to work with if you've got the data flow right. Some OSs provide thread pools where you just give the OS a function pointer and it will handle all the threading stuff automatically.
When people think of high performance they immediately jump to more GHz, multi-threading, SMP, more cores, etc. Sometimes the biggest jump in performance can be had by simply changing the algorithm. Nothing beats going from an O(n^2) algorithm to an O(log n) one...
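A toy illustration of the point: testing one list's values for membership in another is O(n^2) with nested loops, but sort once and each lookup drops to a binary search.

Code:
#include <algorithm>
#include <vector>

// O(n^2): compare every element against every other.
bool contains_any_slow(const std::vector<int>& a, const std::vector<int>& b) {
    for (int x : a)
        for (int y : b)
            if (x == y) return true;
    return false;
}

// O(n log n) total: sort once, then each lookup is O(log n).
bool contains_any_fast(const std::vector<int>& a, std::vector<int> b) {
    std::sort(b.begin(), b.end());
    for (int x : a)
        if (std::binary_search(b.begin(), b.end(), x))
            return true;
    return false;
}

No amount of extra cores makes the first version competitive once n gets large.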
Quote:
Originally posted by Hiro
We are mostly in agreement, and your examples point out the design constraints of finding any particular application's best design, including threading. Go ahead and ignore process vs. thread switching cost at your own performance peril. That's your choice, and one I don't agree with. We just seem to have worked on different enough projects to have a different opinion of which end of the spectrum to start at when we begin the design phase.
Hmmm. Perhaps you misunderstand why I'm ignoring the process vs. thread switching cost. I'm talking about scenarios where processes don't exist! In these cases there is no peril in ignoring processes. If there are processes (and you are compelled to use them), then absolutely you need to weigh the tradeoffs of threads vs. processes. Most PC software runs in a single process, however, and virtually all embedded and game software systems don't support multiple processes.
Quote:
I don't understand your last paragraph, though. If you spread a single process across multiple processors, and you arrived at that point via an intelligent design, you should have already made the choices to minimize problems and maximize throughput. Locks for critical-region code will preclude access to cached data, read or write, for the duration of the lock, so read/write cache issues are relatively moot unless you do something that thrashes those memory values. That kind of bad design is an indication of poor threading choices.
I was trying to point out that there are hardware issues which significantly complicate how you can achieve maximum performance on hardware that you're not certain of when building the software. This is not an issue of bad design; it is that there is no perfect design possible that will cover all the varieties of hardware/OS in the market. Achieving a respectable design on known hardware is hard enough in practice (especially under realistic project conditions in the industry); doing so for wildly varying hardware is pretty much impossible.
Quote:
I am working from an assumption that work will be put in to make the appropriate choices and the code will be reasonably well written, something greatly helped by good clean design. When discussing the merits of something like threading, it doesn't help to be scared off by what can happen if you make poor decisions, unless you plan on making poor decisions. I can create plenty of examples where a violation of good design and coding principles can completely hose things up; that doesn't change what the best design was, though, it just presents data points of where that best design was missed. Heck, look at the Finder or Safari: we have great examples of where the best design was missed. Tab A should never block completion of Tab B, and navigating in Column View shouldn't freeze the whole app while accessing the next nested file list.
I'm all for good clean design, but no design is ever perfect. And the less you know at design time, the less likely that your design is going to be perfect. Toss in the typical project timelines, changing requirements, development practicalities, etc.
At this point I can't even remember what we're arguing about. The original post said "why not put lots of chips in a Mac?", and that's what I suggested was not going to happen.
I see your points a bit more clearly now. We are coming at it from different directions, and that colors our perceptions differently. We are in a half-full/half-empty discussion, I guess. Where you see potential limitations, I see design considerations. They are closely related, technically identical even, but how those semantics affect the creative problem solving in design is often quite substantial. That's OK, it's just different views of the same thing.
Quote:
Originally posted by Hiro
I see your points a bit more clearly now. We are coming at it from different directions, and that colors our perceptions differently. We are in a half-full/half-empty discussion, I guess. Where you see potential limitations, I see design considerations. They are closely related, technically identical even, but how those semantics affect the creative problem solving in design is often quite substantial. That's OK, it's just different views of the same thing.
Ah, bliss. For what it's worth, I call all of these things design considerations as well. Strangely, most of our discussion arose from my objection to the part of your initial statement about where threading will do better than not threading, not whether threading is a good thing. And this was all an aside from the original post's topic.
All apps with GUIs have multiple threads. A simple Aqua application that I wrote (just some buttons and simple calculations) uses nine threads, while Calculator uses three. This is forced by the operating system and the frameworks you use to write your application.
The question is where the app does its work. It might look like a great idea to split off a thread to check the POP3 account while having another thread handle GUI requests, but there's a price to pay. There's bound to be some in-memory database of mail messages. As soon as the thread that's communicating with the POP3 account finds new mail, it has to update this database. The GUI thread has to do the same (deleting messages, marking them as read, flagging them). A background thread that deletes 30-day old trashcan items also needs to do this. To enable all this to be done by different threads, you need to have serialization on the database.
In the simple case, that will be a global database lock. This means that while the POP3 thread is adding new messages, the GUI cannot update. That's exactly how Mail.app feels, even on a dual-CPU machine. Suddenly the cog widget appears and starts spinning (that's what the GUI threads are doing - spinning the widget while other threads do real work). For the few seconds that it is spinning you can't do a thing, although the OS queues your keystrokes (that's another thing GUI threads are doing).
This could be improved by having a more sophisticated database with record-level locks, but apparently Apple does not think it's worth it for Mail.app, and they're probably right.
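The coarse-lock pattern I'm describing looks roughly like this (a simplified sketch, not Mail.app's actual code):

Code:
#include <mutex>
#include <string>
#include <vector>

std::mutex db_lock;                  // one global lock for the whole message DB
std::vector<std::string> messages;   // in-memory message database (simplified)

void pop3_thread_add_mail(const std::vector<std::string>& fetched) {
    std::lock_guard<std::mutex> hold(db_lock);   // held for the entire import...
    for (const std::string& msg : fetched)
        messages.push_back(msg);                 // ...so the GUI waits (cog spins)
}

void gui_thread_mark_read(int index) {
    std::lock_guard<std::mutex> hold(db_lock);   // blocks until the import is done
    // flag messages[index] as read, refresh the display ...
}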
Now consider an event model for the same app. Here's a chain of events:
- User clicked a message line. The app reads the message from the database, displays it, and marks the file as read, while updating the display as appropriate.
- 5-minute timer elapsed. The app begins a TCP connection with the POP3 server.
- 3-way handshake is done with POP3 server. The app sends a request.
- some data arrives from the POP3 server. The app updates the database and display as necessary.
- The user clicks a message line. The app reads the message and displays it.
- some more data arrives. The app updates the display and database.
- The last of the data arrives. The app updates again and closes the connection.
The event model gives better GUI responsiveness, because there is no danger to database integrity - the worker thread never needs to worry about locks.
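Sketched in code (event names are illustrative, not any real framework's), the whole chain above runs on one thread, so database updates are serialized for free:

Code:
#include <queue>

enum EventKind { USER_CLICKED_MESSAGE, POP3_TIMER_ELAPSED, POP3_DATA_ARRIVED };
struct Event { EventKind kind; /* payload omitted */ };

std::queue<Event> pending;   // filled by the OS: clicks, timers, socket readiness

void run_event_loop() {
    while (!pending.empty()) {
        Event e = pending.front();
        pending.pop();
        switch (e.kind) {              // one thread touches the database,
        case USER_CLICKED_MESSAGE:     // so no locks are ever taken
            /* read message from DB, display it, mark it as read */ break;
        case POP3_TIMER_ELAPSED:
            /* start a non-blocking TCP connection to the POP3 server */ break;
        case POP3_DATA_ARRIVED:
            /* parse the data, update DB and display */ break;
        }
    }
}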
So when is threading good? Well, let's add message signing and encryption to our application. RSA signing and encryption are heavily CPU-bound, and can take a sizable fraction of a second for a good-sized mail message. This would hurt responsiveness, so it is a very good idea to split it off into a separate thread regardless of the number of CPUs.
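Splitting that off is cheap. A sketch (sign_message is a hypothetical stand-in; a real version would call a crypto library):

Code:
#include <future>
#include <string>

// Hypothetical CPU-bound RSA work.
std::string sign_message(const std::string& body) {
    return "sig(" + body + ")";
}

void on_send_clicked(const std::string& body) {
    // Run the signature on another thread; the main thread keeps handling events.
    std::future<std::string> sig =
        std::async(std::launch::async, sign_message, body);
    // ... later, when the signed message is actually needed:
    std::string signature = sig.get();   // waits only if not yet finished
}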
Just creating OS threads for structural beauty, as some people are advocating, is wrong. It promises race conditions and data integrity failures.
Umm... synp... I don't think you have a good understanding of how threading should be done, and I think you're confusing "threaded" with "no events". Threaded apps will most likely be very event-driven, but not in the sense of having a single thread handling all events and freezing whenever a long-running task is required.
In the case of the threaded email app, where a separate thread is doing the TCP communication... who says the mail database has to be locked the entire time the thread is running? Avoiding locking is the exact reason I'm suggesting threading! In fact, that's the primary purpose of user-interaction-related threads. If someone spawns a thread, and then has it lock the UI the entire time the thread is running... well, frankly, they're an idiot. Pardon my French.
Locking would easily occur in a single-threaded (event-driven...?) application any time the network bogs down, or a large attachment is coming in, because the same thread that runs the GUI and the rest of the app is stuck waiting for the data to come in. And it doesn't matter if it's "event driven", because the event that says "data is coming in... read me!" has taken all of the single thread's attention. If you're doing some whiz-bang fancy code to juggle user interactions and data loading at the same time, not only is your code becoming exceedingly complex (and begging for bugs), but guess what? You're also going to have to deal with database integrity issues. Maybe not the same kind of issues, but issues nonetheless.
A well-written threaded app (and I'm not talking "written by gurus", I'm talking "written by anyone with a clue") could do any number of things to avoid excessive locking of the mail database (and therefore, the portions of the app that are dependent on it). It could either collect all of the messages in a separate storage area, and then dump them all into the main mail database in one fell swoop (locking it for, oh, a couple milliseconds), or update the mail database after each message comes in, locking and UNlocking the database each time. The result? A mail application that NEVER locks up or becomes unresponsive, regardless of how busy it is. As for system integrity? You don't need to be a brainiac to keep all the data safe, believe me. And system locks? Trust me, I think the average user would rather deal with a few millisecond-long locks from a background thread (which is only locking at the moment it dumps its data into the main database) than having their entire app freeze during the entire duration of a (potentially slow) network transaction.
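The "one fell swoop" version is only a few lines (a sketch, continuing the simplified message-database idea from above): the fetch thread accumulates into private storage and takes the lock only for the splice.

Code:
#include <mutex>
#include <string>
#include <vector>

std::mutex db_lock;
std::vector<std::string> messages;   // shared message database (simplified)

void pop3_fetch_batched() {
    std::vector<std::string> batch;   // private to this thread: no lock needed
    // ... slow network loop appends each fetched message to 'batch' ...
    std::lock_guard<std::mutex> hold(db_lock);   // lock only for the splice
    messages.insert(messages.end(), batch.begin(), batch.end());
}   // held for milliseconds, not for the whole network transaction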
As for Apple Mail.app's performance? That's an entirely different issue. I haven't seen the code, so I don't know what they're doing, but it sure locks up less than what I use at work.
And no, I'm not advocating threading for structural beauty. I'm advocating it for a far superior user experience.
I'm not advocating a single thread either. I'm saying you should have a thread only where it makes sense. I do, however, think that the main logic of the program, be it a mail app or a firewall or a database, should be in a single thread.
Tasks should be relegated to a different thread when there is a good reason. Cryptographic operations are long and should be separated. Image processing should be separated. Anything that takes more than a few milliseconds should not run on the main thread.
Because of the way GUIs work, it is almost always a good idea to separate the GUI into its own thread.
I believe that things that do not separate well are not good candidates for threading; otherwise you're begging to have rare and weird race conditions. It is far better to have a good library that makes things like network access and database access truly event-based (as in, you don't get a "data available" event unless the data is available now). That library may use threads, but then only that library has to be thread-aware. None of the program-logic code needs to be thread-aware, which means it's much simpler and less hazardous. With a good library, you never get locked.
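Such a library might be shaped like this (a hedged sketch, no real API implied): an internal thread does the blocking reads and queues completed chunks, so by the time the single-threaded program logic sees a "data available" event, the data really is available now.

Code:
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Stand-in for a blocking socket read; a real version would call recv().
static std::string read_from_socket() { return ""; }

class NetworkEvents {                 // only this class is thread-aware
    std::queue<std::string> ready;    // completed chunks: available *now*
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
    std::thread reader;
public:
    NetworkEvents() : reader([this] {
        for (;;) {
            std::string chunk = read_from_socket();   // blocks inside the library
            std::lock_guard<std::mutex> lock(m);
            if (chunk.empty()) { closed = true; cv.notify_all(); return; }
            ready.push(chunk);
            cv.notify_one();
        }
    }) {}
    // Program logic calls this; all locking stays hidden inside the library.
    std::string next_event() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return closed || !ready.empty(); });
        if (ready.empty()) return "";   // connection closed
        std::string chunk = ready.front();
        ready.pop();
        return chunk;
    }
    ~NetworkEvents() { reader.join(); }
};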
Of course, it is possible that I think this way because the company I work for works this way, but this does make sense to me.
The only real problem with remaining in that mindset is that the unabated speed advances of monolithic processors are just about over. Future significant speed gains will be made by software that can best take advantage of multiple threads. That is the architecture the chip designers are now pushing, and will keep pushing until someone can figure out how to unwind quantum computing, and that's going to be awhile. So if the short- to mid-term architecture advancements are pushing simultaneous execution up from 1 to 4-16 threads per box, the reality is software designers have to figure out how to most effectively use that architecture. That can't be done with the same software and conservative mindset developed over the past 30 years.
Do you think Apple will be able to get access to this processor, seeing that the 970 hit so many snags at IBM?
tom's hardware
Quote:
Originally posted by synp
There is no good reason to have multiple executing threads in a mail application.
And yet, Mail.app is heavily threaded.
Originally posted by Hiro
We are mostly in agreement, and your examples point out the design constraints of finding any particular application's best design, including threading. Go ahead and ignore process vs. thread switching cost at your own performance peril. Your choice, and one I don't agree with. We just seem to have worked on different enough projects to have different opinions about which end of the spectrum to start at when we begin the design phase.
Hmmm. Perhaps you misunderstand why I'm ignoring the process vs. thread switching cost. I'm talking about scenarios where processes don't exist! In those cases there is no peril in ignoring processes. If there are processes (and you are compelled to use them), then absolutely you need to weigh the tradeoffs of threads vs. processes. Most PC software runs in a single process, however, and virtually all embedded and game software systems don't support multiple processes.
I don't understand your last paragraph, though. If you spread a single process across multiple processors, and you arrived at that point via an intelligent design, you should already have made the choices that minimize problems and maximize throughput. Locks on critical-region code will preclude access to the cached data, read or write, for the duration of the lock, so read/write cache issues are relatively moot unless you do something that thrashes those memory values. And that bad design is an indication of poor threading choices.
I was trying to point out that there are hardware issues which significantly complicate how you can achieve maximum performance on hardware you're not certain of when building the software. This is not an issue of bad design; it is that no perfect design is possible that covers all the varieties of hardware and OS on the market. Achieving a respectable design on known hardware is hard enough in practice (especially under realistic project conditions in the industry); doing so for wildly varying hardware is pretty much impossible.
I am working from the assumption that the work will be put in to make the appropriate choices and that the code will be reasonably well written, something greatly helped by good, clean design. When discussing the merits of something like threading, it doesn't help to be scared off by what can happen if you make poor decisions, unless you plan on making poor decisions. I can create plenty of examples where a violation of good design and coding principles completely hoses things up; that doesn't change what the best design was, it just provides data points showing where that best design was missed. Heck, look at the Finder or Safari: we have great examples of where the best design was missed. Tab A should never block completion of Tab B, and navigating in Column View shouldn't freeze the whole app while it accesses the next nested file list.
I'm all for good, clean design, but no design is ever perfect. And the less you know at design time, the less likely it is that your design will be perfect. Toss in the typical project timelines, changing requirements, development practicalities, etc.
At this point I can't even remember what we're arguing about.
Originally posted by Hiro
I see your points a bit more clearly now. We are coming at it from different directions and that colors our perceptions differently. We are in a half-full/half-empty discussion, I guess. Where you see potential limitations, I see design considerations. They are closely related, technically identical even, but how those semantics affect the creative problem-solving in design is often quite substantial. That's OK, it's just different views of the same thing.
Ah, bliss.
Thanks for the discussion.
Originally posted by Amorph
And yet, Mail.app is heavily threaded.
All apps with GUIs have multiple threads. A simple Aqua application that I wrote (just some buttons and simple calculations) uses nine threads, while Calculator uses three. This is forced by the operating system and the frameworks you use to write your application.
The question is where the app does its work. It might look like a great idea to split off a thread to check the POP3 account while another thread handles GUI requests, but there's a price to pay. There's bound to be some in-memory database of mail messages. As soon as the thread that's communicating with the POP3 account finds new mail, it has to update this database. The GUI thread has to do the same (deleting messages, marking them as read, flagging them). A background thread that deletes 30-day-old trashcan items also needs to do this. To allow all of this to be done by different threads, you need serialization on the database.
In the simple case, that will be a global database lock. This means that while the POP3 thread is adding new messages, the GUI cannot update. That's exactly how Mail.app feels, even on a dual-CPU machine. Suddenly the cog widget appears and starts spinning (that's what the GUI threads are doing: spinning the widget while other threads do the real work). For the few seconds it is spinning you can't do a thing, although the OS queues your keystrokes (that's another thing GUI threads do).
This could be improved with a more sophisticated database that has record-level locks, but apparently Apple does not think it's worth it for Mail.app, and they're probably right.
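For what it's worth, here's a minimal sketch of the two schemes (MailDatabase and its methods are made up for illustration): under the global lock, the GUI blocks whenever the POP3 thread is writing; with per-mailbox locks, a writer to one mailbox only blocks readers of that same mailbox.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MailDatabase {
    // Scheme 1: one global lock serializes every reader and writer.
    private final Object globalLock = new Object();
    private final List<String> allMessages = new ArrayList<>();

    public void addMessageGlobal(String msg) {
        synchronized (globalLock) { allMessages.add(msg); }
    }

    // Scheme 2: finer-grained locking, one lock (here, one list) per mailbox,
    // so the POP3 thread writing to "INBOX" doesn't block the GUI reading "Sent".
    private final Map<String, List<String>> mailboxes = new ConcurrentHashMap<>();

    public void addMessage(String mailbox, String msg) {
        List<String> box = mailboxes.computeIfAbsent(mailbox, k -> new ArrayList<>());
        synchronized (box) { box.add(msg); }
    }

    public int count(String mailbox) {
        List<String> box = mailboxes.getOrDefault(mailbox, new ArrayList<>());
        synchronized (box) { return box.size(); }
    }
}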
Now consider an event model for the same app. Here's a chain of events:
- The user clicks a message line. The app reads the message from the database, displays it, and marks it as read, updating the display as appropriate.
- The 5-minute timer elapses. The app begins a TCP connection to the POP3 server.
- The 3-way handshake with the POP3 server completes. The app sends a request.
- Some data arrives from the POP3 server. The app updates the database and display as necessary.
- The user clicks a message line. The app reads the message and displays it.
- Some more data arrives. The app updates the display and database.
- The last of the data arrives. The app updates once more and closes the connection.
The event model gives better GUI responsiveness, because there is no danger to database integrity: only one thread ever touches the database, so it never needs to worry about locks.
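A minimal sketch of what I mean, with made-up event types (the point is the shape, not the API): everything funnels through one queue, and the single loop thread is the only code that ever touches the mail database.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventLoopSketch {
    enum Kind { USER_CLICK, TIMER, POP3_DATA }

    record Event(Kind kind, Object payload) {}

    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
    private final List<String> mailDb = new ArrayList<>(); // no lock needed: one thread only

    // I/O sources and timers enqueue events; only this loop touches mailDb.
    void run() throws InterruptedException {
        while (true) {
            Event e = queue.take();
            switch (e.kind()) {
                case USER_CLICK -> display((String) e.payload());
                case TIMER      -> startPop3Fetch();
                case POP3_DATA  -> mailDb.add((String) e.payload());
            }
        }
    }

    void display(String msg) { System.out.println("showing: " + msg); }
    void startPop3Fetch()    { /* begin a non-blocking connect; data arrives later as POP3_DATA events */ }

    // Called by whatever produces events (timer, socket readiness, UI).
    void post(Event e) { queue.add(e); }
}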
So when is threading good? Well, let's add message signing and encryption to our application. RSA signing and encryption are heavily CPU-bound and can take a sizable fraction of a second for a good-sized mail message. That would hurt responsiveness, so it is a very good idea to split it off into a separate thread regardless of the number of CPUs.
Just creating OS threads for structural beauty, as some people are advocating, is wrong. It invites race conditions and data-integrity failures.
In the case of the threaded email app, where a separate thread is doing the TCP communication: who says the mail database has to be locked the entire time the thread is running? Avoiding locking is the exact reason I'm suggesting threading! In fact, that's the primary purpose of user-interaction-related threads. If someone spawns a thread and then has it lock the UI the entire time the thread is running... well, frankly, they're an idiot. Pardon my French.
Locking would easily occur in a single-threaded (event-driven...?) application any time the network bogs down or a large attachment is coming in, because the same thread that runs the GUI and the rest of the app is stuck waiting for the data to arrive. And it doesn't matter that it's "event driven", because the event that says "data is coming in... read me!" has taken all of the single thread's attention. If you're writing whiz-bang fancy code to juggle user interactions and data loading at the same time, not only is your code becoming exceedingly complex (and begging for bugs), but guess what? You're also going to have to deal with database integrity issues. Maybe not the same kind of issues, but issues nonetheless.
A well-written threaded app (and I'm not talking "written by gurus", I'm talking "written by anyone with a clue") could do any number of things to avoid excessive locking of the mail database (and therefore of the portions of the app that depend on it). It could collect all of the messages in a separate storage area and then dump them into the main mail database in one fell swoop (locking it for, oh, a couple of milliseconds), or update the mail database after each message comes in, locking and UNlocking the database each time. The result? A mail application that NEVER locks up or becomes unresponsive, regardless of how busy it is.

As for data integrity? You don't need to be a brainiac to keep all the data safe, believe me. And locks? Trust me, the average user would rather deal with a few millisecond-long locks from a background thread (which only locks at the moment it dumps its data into the main database) than have their entire app freeze for the duration of a (potentially slow) network transaction.
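The first option looks something like this sketch (fetchNextMessage is a stand-in for the real network code): the fetch thread accumulates messages in a private list, and the shared database lock is held only for the instant it takes to publish the batch.

import java.util.ArrayList;
import java.util.List;

public class BatchingFetcherSketch {
    private final List<String> mailDb = new ArrayList<>();   // shared, guarded by its own monitor

    // Runs on a background thread: the slow network work happens with NO lock held.
    void fetchMail() {
        List<String> batch = new ArrayList<>();              // private to this thread
        String msg;
        while ((msg = fetchNextMessage()) != null) {         // slow, lock-free
            batch.add(msg);
        }
        // Lock only for the brief moment it takes to publish the batch.
        synchronized (mailDb) {
            mailDb.addAll(batch);
        }
    }

    // Stand-in for the real network call; returns null when the mailbox is drained.
    String fetchNextMessage() { return null; }
}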
As for Apple Mail.app's performance? That's an entirely different issue. I haven't seen the code, so I don't know what they're doing, but it sure locks up less than what I use at work.
And no, I'm not advocating threading for structural beauty. I'm advocating it for a far superior user experience.
Tasks should be relegated to a different thread when there is a good reason. Cryptographic operations are long and should be separated. Image processing should be separated. Anything that takes more than a few milliseconds should not run on the main thread.
Because of the way GUIs work, it is almost always a good idea to separate the GUI into its own thread.
I believe that things that do not separate well are not good candidates for threading; otherwise you're begging for rare and weird race conditions. It is far better to have a good library that makes things like network access and database access truly event-based (as in, you don't get a "data available" event unless the data is available now). That library may use threads internally, but then only the library has to be thread-aware. None of the program-logic code needs to be thread-aware, which makes it much simpler and less hazardous. With a good library, you never get locked.
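Here's a minimal sketch of that library shape (all names hypothetical): the library owns the worker threads, and application callbacks only ever run on the single event thread, so the program logic stays lock-free.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class EventNetLibSketch {
    // Completed results wait here until the event thread runs them.
    private final BlockingQueue<Runnable> completions = new LinkedBlockingQueue<>();

    // Library API: the app passes a callback; it will ONLY run on the event
    // thread, so the application code never needs locks of its own.
    public void fetchAsync(String url, Consumer<String> onData) {
        new Thread(() -> {                       // hidden worker thread
            String data = blockingFetch(url);    // slow I/O off the event thread
            completions.add(() -> onData.accept(data));
        }).start();
    }

    // The app's single logic thread: the only place callbacks execute.
    public void runEventLoop() throws InterruptedException {
        while (true) completions.take().run();
    }

    private String blockingFetch(String url) {   // stand-in for a real network call
        return "response from " + url;
    }
}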
Of course, it is possible that I think this way because the company I work for works this way, but this does make sense to me.