Why dual processor


Comments

  • Reply 41 of 64
    thttht Posts: 5,619member
    Quote:

    Originally posted by Programmer

    Not true -- the PowerPC will do a direct processor-to-processor transfer if one processor holds a cache line that the other wants. It's part of the MERSI standard. If only the MESI standard is supported, then the situation is worse because the cache line has to go back to memory and then from there to the requesting processor.



    Slight correction. The "S" is for shared cache coherency, where there are direct processor-to-processor cache transfers. So if a processor has the MESI cache coherency protocol, it would be fine for SMP configs. When it has MEI only, like the PPC 603 and 750, it will be terrible at SMP. The 7410, 744x, 745x and 970 processors are all MESI.



    The "R" protocol is a Moto-specific shared intervention implementation; it only existed on the 7400. Not quite sure how it works; primarily useful for 4-way or more SMP systems, from what I recollect.
  • Reply 42 of 64
    Quote:

    Originally posted by UnixPoet

    If two threads are heavily accessing the same cache-line then the code was probably written badly and they should be using synchronization anyway.

    Maybe they are using a global variable? And if the access are reads there will be no impact.




    Unfortunately it is pretty easy for this kind of mistake to slip in -- especially with 128-byte cache lines. Two threads modifying separate variables in the same cache line would trade the line back and forth. Badly written? That depends on your definition. If somebody wrote it as perfectly reasonable code for a dual G4, they would be displeased with the performance on a G5 (32- to 128-byte cache lines).
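    The "two variables, one line" situation can be sketched in a few lines of C++ (the struct names and the 128-byte figure are illustrative, matching the G5 line size mentioned above):

    ```cpp
    #include <cassert>

    // Two counters packed next to each other land on the same 128-byte
    // cache line, so two threads each writing only its own counter
    // still trade the line back and forth (false sharing).
    struct Packed {
        long a;  // thread 1 writes this
        long b;  // thread 2 writes this -- same line as 'a'
    };

    // alignas pushes each counter onto its own 128-byte line, so the
    // two threads no longer contend. Code padded only to a G4's
    // 32-byte lines would still share a line on a G5.
    struct Padded {
        alignas(128) long a;
        alignas(128) long b;
    };

    static_assert(sizeof(Packed) <= 128, "a and b fit on one G5 line");
    static_assert(sizeof(Padded) == 256, "a and b sit on separate lines");
    ```

    Nothing in the source marks the two structs as different, which is exactly why this slips into otherwise reasonable code.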



    Quote:



    Oh dear! Atomic operations are called such because, even if they are converted into multiple micro-ops (on a CISC CPU), the operation is not interruptible and is guaranteed to finish. It's an all-or-nothing operation - think of it as a transaction.



    On the PowerPC arch you need to use lwarx and stwcx. which, according to http://publibn.boulder.ibm.com/doc_l...C050214982JEFF can be used for "atomically loading and replacing a word in storage". The two ops basically define transaction boundaries, so from the point of view of an OS/application, taken together they are atomic.




    Right, that's what I was referring to. And if you remember, I was trying to convey that these were expensive operations. Considering that the PowerPC primitives use multiple instructions and (typically) a loop, I would say that an atomic operation is considerably more expensive than a normal memory access. Having to do this a lot introduces overhead into a multiprocessor system.
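    A portable sketch of the lwarx/stwcx. pattern being discussed: an atomic increment written as a load / compute / conditional-store loop. On PowerPC a compare-exchange compiles down to exactly that reserve-and-store-conditional sequence, and the loop retries when another processor touched the line in between -- the multiple-instruction round trip that makes an atomic op cost well more than a plain store. (The function name is illustrative; this is a sketch, not anyone's library code.)

    ```cpp
    #include <atomic>
    #include <cassert>

    long atomic_increment(std::atomic<long>& counter) {
        long old = counter.load(std::memory_order_relaxed);
        // Equivalent of stwcx. failing: the reservation was lost,
        // 'old' is refreshed with the current value, and we retry.
        while (!counter.compare_exchange_weak(old, old + 1)) {
        }
        return old + 1;
    }
    ```

    Even uncontended, that is a load, a compare-exchange, and a potential loop -- several instructions where a non-atomic increment would be one or two.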



    Quote:



    On the x86, unless you're implying that it isnt a modern processor, atomic operations do not need to be emulated - they are directly supported. See CMPXCHG, etc.




    Heh, while it is tempting to say yes... no, they are modern and they do crack the instructions internally and (worse yet from a performance perspective) tend to cause a lot of stalls in the pipeline.



    Quote:



    The only cross-process, and not just cross-thread, user space synch primitives I know of is the FUTEX subsystem in linux kernel 2.6. Please correct me if wrong.








    How many times do I have to say this: I'm not considering the inter-process case.



    Quote:

    You were using incorrect terminology and some of the statements you made were just plain false. Memory controllers do not "open" pages, they do not just have 4 or 8 or whatever pages open, etc.



    Forgive me then; that is the terminology I have encountered in hardware documentation describing the technology. If you have better terminology, please share. Go read the RAMBus documentation for a clear example of holding pages open (a very similar thing applies to conventional DRAM).



    Quote:

    You've got to remember that people might be coming across this board in the future and, unless corrected, will take the contents of a post as fact. The wisdom of relying on assertions made in forums as facts is another matter...



    Stay out of lecture mode, you really make yourself sound like an ass.
  • Reply 43 of 64
    Quote:

    Originally posted by THT

    Slight correction. The "S" is for shared cache coherency, where there are direct processor-to-processor cache transfers. So if a processor has the MESI cache coherency protocol, it would be fine for SMP configs. When it has MEI only, like the PPC 603 and 750, it will be terrible at SMP. The 7410, 744x, 745x and 970 processors are all MESI.



    The "R" protocol is a Moto-specific shared intervention implementation; it only existed on the 7400. Not quite sure how it works; primarily useful for 4-way or more SMP systems, from what I recollect.




    I'm going from memory here. I am pretty sure that the 74xx, 745x, and 970/970FX all support MERSI. I just came across a reference to it in the newly released 970FX documentation.



    The "S" shared bit means that read-only copies of the cacheline can exist in multiple caches at the same time. This is definitely critical to SMP operation.



    The G4 introduced the "R" (Reserved? I can't remember), but it also introduced the ability to immediately intervene when a request for a modified cache line is made. In this case the processor holding the modified cache line will send it directly to the requesting one, rather than going through the writeback/read cycle. I don't think the "R" actually had anything directly to do with that, although I might be mistaken. I certainly gave the wrong impression by mentioning MESI vs. MERSI.
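    The writeback/read cycle versus direct intervention can be captured in a toy cost model (a sketch of the behaviour described here, not the real bus protocol; the function name and transfer counts are illustrative):

    ```cpp
    #include <cassert>

    // MESI states a cache line can be in on the owning processor.
    enum class State { Modified, Exclusive, Shared, Invalid };

    // Processor B reads a line that processor A holds. Without
    // intervention, a read that hits a Modified line elsewhere is
    // retried while the owner pushes the line back to memory: two
    // bus transfers. With intervention, the owner forwards the data
    // straight to the requester: one transfer.
    int bus_transfers_for_remote_read(State owner, bool intervention) {
        if (owner == State::Modified)
            return intervention ? 1 : 2;
        return 1;  // clean data can be supplied without a writeback
    }
    ```

    The point of intervention is simply to collapse the Modified case from two transfers (and a retried transaction) down to one.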



    And except in the perfect universe that some people code in, this does happen at times.
  • Reply 44 of 64
    thttht Posts: 5,619member
    Quote:

    Originally posted by Programmer

    I'm going from memory here. I am pretty sure that the 74xx, 745x, and 970/970FX all support MERSI. I just came across a reference to it in the newly released 970FX documentation.



    I'm pretty sure that the 7400 is the only G4 processor that supports the "R" protocol. It looks like the 7410, the 180 nm version of the 7400, does as well, but for the 7450-based processors (all 744x and 745x CPUs), it's only MESI.



    Not sure about the 970, but I don't recall an "R" or equivalent protocol.
  • Reply 45 of 64
    amorphamorph Posts: 7,112member
    For what it's worth, searching for MERSI at freescale.com brings up MPC7400 and MPC7410 product pages. Searching for MERSI at ibm.com/us turns up no results.
  • Reply 46 of 64
    Quote:

    Originally posted by Programmer

    Two threads modifying separate variables in the same cache line ...



    Coming from a supposed programmer this is astounding! In case you forgot, different threads get separate stacks. Therefore there is no way that two different variables, each local to its thread, can possibly be sharing a cache line.



    If the threads are accessing global variables that is simply bad design.



    If two threads are heavily doing reads and writes to the same block of memory then why split the logic up into two threads? Again, bad design.



    I did not intend to come across as lecturing, but damn, you need to do your homework! And stop going from memory; it tends to be unreliable, so you may say things which may, ahem, give the wrong impression.



    This will be my last post on this matter.
  • Reply 47 of 64
    slugheadslughead Posts: 1,169member
    The internet is on computers these days.



    Hey I'm a pretty good programmer and I don't know this stuff, don't make fun of us software techies ! :,(
  • Reply 48 of 64
    Have you read this?



    http://apnews.excite.com/article/20041129/D86LPO183.html



    The race is on folks!
  • Reply 49 of 64
    hirohiro Posts: 2,663member
    The threading concept seems to be a bit overcharged and obfuscated right now. Is this place always so defensive with its alpha-hierarchy posturing? It seems to rival the spanking discussion in its enthusiasm. Entering a fray instead of a discussion just isn't worth it.
  • Reply 50 of 64
    Quote:

    Originally posted by UnixPoet

    Coming from a supposed programmer this is astounding! In case you forgot, different threads get separate stacks. Therefore there is no way that two different variables, each local to its thread, can possibly be sharing a cache line.



    What on earth are you talking about, man? Any non-stack variable is a candidate to be shared between threads. Or do you not use a dynamic heap?



    Quote:

    If the threads are accessing global variables that is simply bad design.



    Static & global variables exist, and they aren't always bad. Sharing them between objects and threads is bad design, but I didn't say sharing variables, I said variables sharing cache lines. Most development environments aren't smart enough to separate them, so it's easy for this to happen accidentally -- you can argue that this is bad design as well, but the world isn't perfect.



    Quote:

    I did not intend to come across as lecturing, but damn, you need to do your homework! And stop going from memory; it tends to be unreliable, so you may say things which may, ahem, give the wrong impression.



    I just double checked, and the 970FX manual (from the IBM site) does indeed state that the 970FX's L2 cache supports the MERSI protocol. If I read the 7450 manual correctly, while it doesn't support 'R', it does support direct intervention between processors.



    I don't mind being corrected, but I have yet to be wrong on anything you've pointed at.



    Quote:

    This will be my last post on this matter.



    Fine by me.
  • Reply 51 of 64
    slugheadslughead Posts: 1,169member
    Quote:

    Originally posted by Programmer

    Fine by me.



    I enjoyed this discussion thoroughly.



    I understood about 75% of it, unfortunately it was mainly just 75% of each sentence, so in reality I had absolutely no idea what anyone was talking about 90% of the time.



    the 10% was good though, kudos to unix poet for giving programmer a run for his money



    in the end I think you both had a bunch of things wrong, but truncating and combining your opinions at will would probably yield some truth.



    UP was probably right about multithreading, and programmer was probably right about all the stuff I don't understand. And no, I'm not in a position to argue.
  • Reply 52 of 64
    Quote:

    Originally posted by slughead

    I enjoyed this discussion thoroughly.



    I understood about 75% of it, unfortunately it was mainly just 75% of each sentence, so in reality I had absolutely no idea what anyone was talking about 90% of the time.



    the 10% was good though, kudos to unix poet for giving programmer a run for his money



    in the end I think you both had a bunch of things wrong, but truncating and combining your opinions at will would probably yield some truth.



    UP was probably right about multithreading, and programmer was probably right about all the stuff I don't understand. And no, I'm not in a position to argue.




    DITTO!



    No one here minds a spirited debate, as long as we also try to respect each other's opinions with an open mind.
  • Reply 53 of 64
    Quote:

    Originally posted by UnixPoet

    You were using incorrect terminology and some of the statements you made were just plain false. Memory controllers do not "open" pages, they do not just have 4 or 8 or whatever pages open, etc.



    And just because this one was bugging me, I did a single simple Google search that turned up these (from a very, very long list):



    http://www.dewassoc.com/performance/...pel_rambus.htm

    http://www.ntsi.com/DDRRam_Explained.htm

    http://www.ntsi.com/DDRRam_Explained_p2.htm

    http://www.glue.umd.edu/~ramvinod/dram_rowpredict.pdf

    http://www.miro.pair.com/tweakbios/tweakdoc.html

    http://www.lostcircuits.com/memory/ddrii/

    http://www.lostcircuits.com/memory/ddr2/

    http://www.hardocp.com/article.html?art=MjM4LDI=

    http://www.realworldtech.com/page.cf...WT101602161848

    http://www.sharkyextreme.com/forums_spotlight/5/2.shtml

    http://www.ece.neu.edu/students/dmor...dram040611.pdf



    So if I am incorrect in my terminology, I am at least in very good company. The next time you want to correct me, at least put a little bit of effort into it, okay? (And if anything I was overly optimistic in how many pages are held open by a typical memory controller)
  • Reply 54 of 64
    Quote:

    Originally posted by Programmer

    And just because this one was bugging me, I did a single simple Google search that turned up these (from a very, very long list):



    http://www.dewassoc.com/performance/...pel_rambus.htm

    http://www.ntsi.com/DDRRam_Explained.htm

    http://www.ntsi.com/DDRRam_Explained_p2.htm

    http://www.glue.umd.edu/~ramvinod/dram_rowpredict.pdf

    http://www.miro.pair.com/tweakbios/tweakdoc.html

    http://www.lostcircuits.com/memory/ddrii/

    http://www.lostcircuits.com/memory/ddr2/

    http://www.hardocp.com/article.html?art=MjM4LDI=

    http://www.realworldtech.com/page.cf...WT101602161848

    http://www.sharkyextreme.com/forums_spotlight/5/2.shtml

    http://www.ece.neu.edu/students/dmor...dram040611.pdf



    So if I am incorrect in my terminology, I am at least in very good company. The next time you want to correct me, at least put a little bit of effort into it, okay? (And if anything I was overly optimistic in how many pages are held open by a typical memory controller)






    Geez Sid, you're supposed to remove the patches slowly
  • Reply 55 of 64
    Quote:

    Originally posted by Programmer

    And just because this one was bugging me, I did a single simple Google search that turned up these (from a very, very long list):

    ...





    LOL



    I promised I was not gonna reply, but this one is too good to miss. Yeah - that statement I made on memory controllers was from the hip, so I fscked up on it. In my defense I want to say that the memory controller is the least of your worries (as opposed to memory locality of reference, which is crucial) when it comes to extracting the most out of multi-threaded designs or, for that matter, any high-performance code. There is nothing a programmer can do about the controller; it's out of his control. Good design, however, is not.



    I could point out again the bloopers you made (or maybe you just hadn't read my posts properly...) but then we'd restart another round of posting. No use beating a dead horse.



    The real point of all this: what irked me about your first post was your portrayal of multiple threads as being complex, with too many dependencies, etc. It painted threads in too negative a light, which I think is misleading; after all, they are just another tool, like C++ or Java. Threads can bite, but not if used wisely.



    IIRC, the original post was: if dual is better, then why not go dual? I hope we both agree that more is better.
  • Reply 56 of 64
    amorphamorph Posts: 7,112member
    Quote:

    Originally posted by UnixPoet

    LOL



    I promised I was not gonna reply, but this one is too good to miss. Yeah - that statement I made on memory controllers was from the hip, so I fscked up on it. In my defense I want to say that the memory controller is the least of your worries (as opposed to memory locality of reference, which is crucial) when it comes to extracting the most out of multi-threaded designs or, for that matter, any high-performance code. There is nothing a programmer can do about the controller; it's out of his control. Good design, however, is not.




    Any good design takes into account the things that the designer can do nothing about.



    Clearly, the design of the memory controllers in question assumes locality of reference to a considerable degree by only allowing a handful of hardware pages to remain open at any given time. So the issue becomes not only keeping any given locality within the limits of the available CPU caches (which the programmer can also do nothing about), but also keeping the number of localities in use at any given moment from causing the memory controller to thrash.
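    The "handful of open pages" behaviour at issue can be sketched with a toy model (the 4-row limit and the relative costs are assumptions for illustration, not any specific controller's numbers):

    ```cpp
    #include <array>
    #include <cassert>
    #include <cstdint>

    // Toy model of an open-page memory controller: a small number of
    // DRAM rows ("pages") are held open per bank. Touching an open
    // row is cheap; touching a closed one pays precharge + activate.
    struct OpenPageModel {
        static constexpr int kOpenRows = 4;     // assumed controller limit
        std::array<int64_t, kOpenRows> open{};  // open row numbers, -1 = empty
        int next = 0;                           // round-robin eviction victim

        OpenPageModel() { open.fill(-1); }

        // Returns a relative cost: 1 for a row hit, 4 for a miss
        // (precharge, activate, then the access itself).
        int access(int64_t row) {
            for (int64_t r : open)
                if (r == row) return 1;        // row buffer hit
            open[next] = row;                  // close one row, open this one
            next = (next + 1) % kOpenRows;
            return 4;                          // row miss penalty
        }
    };
    ```

    With more active localities than open-row slots, every access becomes a miss in this model -- the thrashing described above.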



    Re: "bloopers," you allow for the possibility that maybe you just misread. I think this thread could, in general, do with the assumption that an apparently stupid or outrageous statement is a misreading or a misunderstanding. It would produce much more light and much less heat, especially for our less technically inclined readers.



    Quote:

    The real point of all this: what irked me about your first post was your portrayal of multiple threads as being complex, with too many dependencies, etc. It painted threads in too negative a light, which I think is misleading; after all, they are just another tool, like C++ or Java. Threads can bite, but not if used wisely.



    Given that the question Programmer responded to, from an apparent non-programmer, wondered what the dependencies were that had to be considered, it's not surprising that his answer dwelt on problems, edge cases, and examples (like blocking on disk I/O) meant to be illustrative.



    Mindful of the context of this thread, he talked about the issues of threading in a significantly (if not massively) MP environment, in which hardware latency, overhead, coherency and synchronization issues suddenly become serious design issues (scaling up with the amount of parallelism).



    Quote:

    IIRC, the original post was: if dual is better, then why not go dual? I hope we both agree that more is better.



    Which is why you can spit and hit a dual-processor computer, right?



    The questions are: Under what circumstances, and to what extent, is MP better? What are the costs? what are the tradeoffs? How well do the available solutions scale? What has to be done to get existing software to take best advantage of the new architecture?
  • Reply 57 of 64
    Thank you, Amorph, excellent post.



    I knew I could tempt our poet into another post...



    Quote:

    Originally posted by UnixPoet

    I promised I was not gonna reply, but this one is too good to miss. Yeah - that statement I made on memory controllers was from the hip, so I fscked up on it. In my defense I want to say that the memory controller is the least of your worries (as opposed to memory locality of reference, which is crucial) when it comes to extracting the most out of multi-threaded designs or, for that matter, any high-performance code. There is nothing a programmer can do about the controller; it's out of his control. Good design, however, is not.



    And you berate me for shooting from the hip?



    I fail to see how the memory controller's behaviour is the least of my worries. Certainly it is not the only worry, but if you've got your design right and are trying to determine how many pieces to chop your computation into so that you can utilize multiple processor cores then it becomes very relevant. I wouldn't have mentioned it if we hadn't run headlong into this issue more than once.



    Quote:

    The real point of all this: what irked me about your first post was your portrayal of multiple threads as being complex, with too many dependencies, etc. It painted threads in too negative a light, which I think is misleading; after all, they are just another tool, like C++ or Java. Threads can bite, but not if used wisely.



    Your position is plenty clear, but I've been doing this a long time and I must say that threads are one of the more dangerous tools available. They represent dynamic, timing specific behaviour which is among the most difficult things to deal with. They can bite even when the wise use them, and they ought to get the respect that they deserve. That is why I paint a complex picture of them -- they are more complex than the naive might think, and they should be approached with caution.



    Think of it like dynamite -- you can use it to make really big holes, but you need to take a lot more care than you would with something like a shovel or a hammer.



    That said, I am a huge fan of multi-threaded programming and have been for more than 15 years. I've written my own threaded kernels, including a pre-emptive one that ran under MacOS System 6 back in the 80s. I've been a firm believer for almost as long that multiprocessors were the future and that eventually most software would be multithreaded. Hopefully we'll eventually have the tools to make parallel programming a little less treacherous than it currently is -- C/C++ has little in the way of protective measures, and Java is only slightly better. Until then, be careful out there.



    Quote:

    IIRC, the original post was: if dual is better, then why not go dual? I hope we both agree that more is better.



    Lots more.
  • Reply 58 of 64
    mcqmcq Posts: 1,543member
    Can I be lazy and ask what the R in MERSI does for cache coherency? I was just excited to hear about MESI in a comp architecture class ("Wow, I read this term in some heated debate on a forum!"). I didn't really understand a lot of it since I was tired halfway through, but since I heard nothing about MERSI just thought I'd ask. Didn't seem to get a whole lot via a Google search outside of product summaries saying a processor supported it.



    Thanks
  • Reply 59 of 64
    Quote:

    Originally posted by MCQ

    Can I be lazy and ask what the R in MERSI does for cache coherency? I was just excited to hear about MESI in a comp architecture class ("Wow, I read this term in some heated debate on a forum!"). I didn't really understand a lot of it since I was tired halfway through, but since I heard nothing about MERSI just thought I'd ask. Didn't seem to get a whole lot via a Google search outside of product summaries saying a processor supported it.



    From the PPC7400 technical manual:



    Quote:

    Supports data intervention (cache-to-cache data-only operations) in multi-processor systems (Five-state MERSI). This includes shared intervention. Note that general intervention is possible when using MESI protocol.



    and



    Quote:

    For example, MPX bus supports data intervention. On the 60x bus, if one processor does a read of data that is marked Modified in another processor's cache, the transaction is retried and the data is pushed to memory, after which the transaction is restarted. MPX bus allows the data to be forwarded directly to the requesting processor from the processor that has it cached. (MPC7400 also supports intervention for data marked Exclusive or Shared.)



    Does that help?
  • Reply 60 of 64
    But........ what does the "R" stand for?