Why does software crash, freeze and hang?

in macOS edited January 2014
Hands up who has never had a desktop application fatally crash, temporarily freeze or permanently hang on them? Apart from the abacus users over there we've all had it happen to us. But why?

Is it just sloppy programming or is there something fundamentally broken with the way applications and operating systems are coded?

After upgrading to OS X 10.4.3, Mail started consistently crashing a few seconds after launching. After digging around the forums I got a possible fix: remove all stored email messages and configuration/preference files. It worked, I'm happy to say. I got a virgin Mail application that was seemingly stable. Luckily, I use IMAP email - all my mail is stored on my server. So, after setting up the account details in Mail - yet again - I just synced Mail with the IMAP server and I had all my email local again - nice.

Now then, this debacle isn't new. I've had to do this all before. There's a pattern here perhaps. When the chips are down or when configuration files are corrupted or not understood by the new version of the software why does the application have to crash on me? Why is this such a fragile world?

What's the basis for my thinking that our software could be built to not crash, freeze or hang? I'll tell you...

Take the scenario where you're browsing the Web, hopping from one website to another - and then you click on a link to a website that doesn't exist. How do you know? Well, the browser says "can't find website" or some such thing. It doesn't crash, freeze or throw a wobbly - hopefully.

Now, what's special about the browser is that it is generally robust enough not to crash when it comes across websites that don't exist or don't adhere 100% to the HTML standard specifications - hopefully. In fact, a lot of expectation that websites won't exist or won't conform to web standards is built into the browser.

Now, what would happen if Mail found that its configuration files were corrupt, formatted incorrectly or of an old version? Could it just say "can't read configuration files - shall I try to repair them or just set up brand new ones?" rather than crashing, all forlorn like? Flump! I know Mail is currently capable of setting up a new configuration file because that's what it did when I removed them.

What would happen if Mail were to pretty much expect configuration files not to conform to standard (like the web browser and its websites)? If, rather than crashing, Mail just asked politely "old preference files not readable - would you like to start afresh?" Wouldn't that be much nicer?
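To make that "expect the worst" idea concrete, here's a minimal Python sketch. Everything in it is invented for illustration - the file path, the JSON format and the default keys are not how Mail actually stores its preferences - but it shows the shape of the approach: treat an unreadable preference file as an expected case with a fallback, not a fatal one.

```python
import json

def load_prefs(path):
    """Load preferences, falling back to defaults instead of crashing.

    Hypothetical sketch: the path, JSON format and default keys are
    illustrative, not Mail's actual preference format.
    """
    defaults = {"accounts": [], "check_interval": 300}
    try:
        with open(path) as f:
            prefs = json.load(f)
    except (OSError, ValueError):
        # File missing, unreadable or malformed: start afresh with
        # defaults and tell the caller, rather than aborting the app.
        return dict(defaults), "could not read preferences; starting afresh"
    # Fill in any keys an older version of the app didn't write.
    for key, value in defaults.items():
        prefs.setdefault(key, value)
    return prefs, None
```

The caller can then surface the second return value to the user as the polite "would you like to start afresh?" dialogue rather than a crash.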

And another thing... What about that spinning coloured ball or watch (for pre-OS X users) or sand timer (for Windows users), eh? That long pause. Why is my interface frozen when the application is waiting for data or running a calculation? That just seems archaic. A good (or bad) case in point is Spotlight searches. Why does Finder freeze with that spinning ball when it does a Spotlight search? And why can't I open any other Finder windows while it's frozen?

So, something seems wrong at the core here. Spinning balls shouldn't exist. What we could be talking about here is decoupling the desktop interface from the collection and management of the data, so that the interface never crashes or stalls but can report if it's not finding the data where, or in the form, it expects. And the interface never momentarily freezes when accessing information or performing a calculation - it just displays a progress bar and lets you get on with other stuff.
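The decoupling idea can be sketched in a few lines of Python: a background thread does the slow search while the "interface" thread keeps polling a queue, never blocking indefinitely. The file names and timings are made up; the point is only the structure.

```python
import queue
import threading
import time

def slow_search(results):
    """Stand-in for a Spotlight-style search running off the interface thread."""
    for item in ["a.txt", "b.txt"]:
        time.sleep(0.01)   # pretend to do slow disk I/O
        results.put(item)
    results.put(None)      # sentinel: the search has finished

results = queue.Queue()
threading.Thread(target=slow_search, args=(results,), daemon=True).start()

found = []
while True:
    # The interface thread polls with a timeout instead of blocking forever;
    # between polls it could keep handling clicks and redraws.
    try:
        item = results.get(timeout=1.0)
    except queue.Empty:
        continue
    if item is None:
        break
    found.append(item)
```

In a real app the polling loop would be the event loop, and "continue" is where clicks and redraws get serviced - which is exactly why the ball never needs to spin.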

Wouldn't this be much better - much more robust? Applications would be built to handle corrupt, malformed or missing data gracefully and also handle waiting for data to sync up by showing the user what's going on. And never crash or hang the user interface.

Developers could split the interface from the data processing part of their applications and enable these two components to message each other. Or an application could be split into many more components all messaging each other. The contents of each message would be checked before ingesting so that it couldn't hang or crash the recipient component.
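A minimal sketch of that message-checking idea, assuming an invented `{"cmd": ..., "arg": ...}` schema (a real application would define its own): the recipient validates every message before acting on it, so a malformed one is rejected rather than crashing the component.

```python
def handle_message(msg):
    """Validate an inter-component message before acting on it.

    The schema here ({"cmd": ..., "arg": ...}) is invented for
    illustration; a real app would define its own.
    """
    if not isinstance(msg, dict) or "cmd" not in msg:
        return "rejected: malformed message"
    if msg["cmd"] == "open" and isinstance(msg.get("arg"), str):
        return f"opening {msg['arg']}"
    return f"rejected: unknown command {msg.get('cmd')!r}"
```

Because every reply is a plain string, even a rejection is just data the sender can act on - nothing in the exchange can hang or crash either side.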

I must admit that sometimes applications seem to cope well but there just seems to be far too many times when they don't.

Now, am I dreaming or is there fundamentally something amiss in the way applications and operating systems are built today? Couldn't all this be much better? What do you reckon, eh?

Cheers Daniel


  • Reply 1 of 3
    Coding to handle errors and problems is not an easy thing to do. There are so many things that can go wrong, and to check for all of them takes up too many resources and makes everything go too slowly.

    Corrupted files are probably the most common, and hardest issue to deal with. First, any corrupted code files used by the application are fatal. Reading in corrupted preferences/database files should be handled more gracefully, but then what do you do? It is easy to validate things that are true/false, or simple numbers, but when you read in strings/numbers that don't terminate due to a missing keyword/return, the rest of the file is often junk, but you can't know it without a lot of checking. Say you have a database reference. How do you know if it's good? Validating every array index before using it is a LOT of overhead. And who says that the length you are comparing it to is even correct?

This is not to say that there aren't ways of trying and catching failures more gracefully than crashing, but I expect often the best result you can hope for (especially on loading the program) is that you have a program that reports an unknown error, or a corrupted file, and then has to close because it can't work anyhow. No sense adding all the extra code (and possible errors) to end up with the same thing: a useless program.
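The truncation problem described above can be made concrete with a small Python sketch. The format (a 1-byte length prefix followed by a payload) is invented for illustration, but it shows the one cheap check that is always possible: does the stated length actually fit in the data you have?

```python
def read_record(data):
    """Parse a length-prefixed record, reporting corruption instead of crashing.

    Hypothetical format (1-byte length + payload), invented for illustration.
    Returns (payload, None) on success or (None, error_message) on failure.
    """
    if len(data) < 1:
        return None, "corrupted file: empty record"
    length = data[0]
    payload = data[1:1 + length]
    if len(payload) != length:
        # Truncated: the stated length runs past the end of the data,
        # so the rest of the file can't be trusted.
        return None, "corrupted file: truncated record"
    return payload, None
```

This catches truncation cheaply, but - as the reply says - it can't tell you whether a record that is the *right length* contains garbage; that's where validation costs start to climb.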
  • Reply 2 of 3
    You have never written anything more complicated than "Hello World" before have you? Never written anything multithreaded? Once you have, then come back and we will have a better conversation.

    It is complex to write complex applications, and you are always under the gun to get things done, to "ship". And there is always some assumption that seemed perfectly reasonable at the time that turns out not to be the case. And we are totally ignoring that the tools you use to create the software are also subject to these same conditions... and we have not mentioned the vagaries of hardware in the real world.

    Things can get better, but perfection is unreasonable.
  • Reply 3 of 3
    Oh, and things like spinning beach balls are usually the result of not making a threaded (or properly threaded) application, or waiting on locked resources.

    Often, small programs gain no benefit and add increased complexity to add threading. If there are no shared resources, there is often little advantage (except on huge calculations that can be done in parallel), and if there are shared resources, all the interactions have to be checked for to prevent other oddities.

    And if a program is threaded, locks on resources (often a semaphore or mutex) will still cause the beach ball. A thread that needs data back from the hard disk while it's spinning up has to wait. Code waiting to access a common block of memory still has to wait. Sure, with lots of extra effort, there are ways to avoid spin-locks, but if most of the time it's not a concern, it's not worth the effort.

    So everything comes down to deadlines, and how much design/effort the developer puts into the program. When bosses want results and deadlines are looming, the user pays for it. When hobbyists want to finish the project, boring code like error handling and critical-section tuning gets put off and off and never completed.
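The "waiting on locked resources" point in the last reply can be shown in two lines of Python. With no timeout, the second `acquire()` below would block indefinitely - that wait *is* the beach ball; with a timeout, the caller can give up and report "busy" instead of freezing.

```python
import threading

lock = threading.Lock()
lock.acquire()  # pretend another thread is holding the shared resource

# acquire() with no timeout would wait here forever - the beach ball.
# A timeout lets the caller give up and stay responsive.
got_it = lock.acquire(timeout=0.05)
# got_it is False: report "resource busy" to the user instead of hanging.
```

As the reply notes, writing this kind of give-up-and-report path everywhere is exactly the "lots of extra effort" that deadlines squeeze out.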