Recently I was reminded of what a problem error management poses and how much more expensive it is when it is poorly done or not done at all. I have been setting up a new piece of software but had some difficulty in getting one part of it to work. (The vendor and support organization should remain anonymous.) The operation I was attempting would fail but there were no clearly identifiable postings to the error log. And what events did seem coincident made no sense.
Back when I worked on Digital VAX machines there was a joke that the way DEC field service would fix a flat tire would be to check the other three tires first. This support issue seemed to go the same way as I was passed from person to person, retreading the same ground over and over. And of course, the folks I was dealing with were a long way from both me and the guys who write the code. Eventually the problem just went away, leaving no clue as to what happened or changed to resolve the issue.
But this reminded me of how good the VMS error message convention was – DEC had designated a 32-bit number for reporting errors. This was divided up into three fields – a three-bit field for severity and two larger fields for facility and problem. Essentially, the error number told you who was complaining, what was being complained about and how bad it was. This concept seems to have gotten lost – current software uses numeric error numbers but only some of them are documented in publicly accessible form and one needs to know who was complaining to interpret the error number correctly. And then there is my favorite, ‘fatal error – fault bucket xxxxxxxx’, which has no online documentation at all.
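To make the idea concrete, here is a minimal Python sketch of how a code laid out this way can be taken apart. The field positions and widths (severity in bits 0-2, message number in bits 3-15, facility in bits 16-27) and the severity names are my recollection of the VMS layout, not something to rely on without checking the DEC documentation; the function names and the sample values are made up for illustration.

```python
from typing import NamedTuple

# Severity names for the low three bits (assumption: recalled from the
# VMS convention; check the official documentation before relying on them).
SEVERITY = {0: "warning", 1: "success", 2: "error", 3: "informational", 4: "severe"}


class Condition(NamedTuple):
    facility: int  # who is complaining
    message: int   # what is being complained about
    severity: str  # how bad it is


def decode(code: int) -> Condition:
    """Split a 32-bit error code into facility, message and severity."""
    code &= 0xFFFFFFFF
    severity = code & 0x7            # bits 0-2
    message = (code >> 3) & 0x1FFF   # bits 3-15 (assumed width)
    facility = (code >> 16) & 0xFFF  # bits 16-27 (assumed width)
    name = SEVERITY.get(severity, f"reserved({severity})")
    return Condition(facility, message, name)


def encode(facility: int, message: int, severity: int) -> int:
    """Pack the three fields back into one number (inverse of decode)."""
    return ((facility & 0xFFF) << 16) | ((message & 0x1FFF) << 3) | (severity & 0x7)


# Hypothetical values: facility 42 reporting problem 7 as an error.
code = encode(facility=42, message=7, severity=2)
print(hex(code), decode(code))
```

The point is that anyone holding the number can work out who raised it and how serious it is without first knowing which product it came from, and a support organization could index its documentation on the facility and message fields alone.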
Then there is the error log entry that displays a nice user-friendly link saying ‘click here to learn more about this error’ – which takes you to an error page when you click it, because there is no index entry for that error. As a result, I have learned to do a search on Google and not bother with the vendor site at all. Why bother? It never works anyhow.
And along this line there are the health monitoring messages that complain about the health monitoring system itself, especially when the machine is starting up and the delayed-start services haven’t started yet. After a while, as with the boy who cried wolf, one stops looking at the health monitoring system at all. It may be doing useful things, but since it seems more like a cranky hypochondriac aunt, no one wants to associate with it. Probably not the design intent.
Now in the computer world, all of these errors were created by developer-written code. So someone decided to report ‘C0000005’ for a particular type of error and someone told them it was ok. There may even be a last chance exception handler that reports something before the program drops back to OS command level (preferred) or to bare metal if the problem is really bad. But what seems to be missing is the administrative step to collect this information, provide some additional support comments and put it someplace searchable. So costs were saved on the development side, but more than made up for on the support side. I spent a couple of weeks on this problem before it just went away and the folks I was working with put in a good week on their own plus communication time. Surely this was more costly overall than decent documentation?
So what happens is that everyone tries out their own personal juju – are we current with patches? Is your network up? How different are the clocks on all your machines? And so forth – if we don’t know what the problem is or why the problem went away, then whatever we were doing, thinking or wearing may have been the reason. Let’s do it again…
Back when I was a systems developer we took turns handling support calls from our customers world-wide. This was referred to as our week in the barrel and while there we were expected to not get anything else done. Our projects all waited for us to climb out. So we had a good idea what the issues were and had access to the source code as well so we could trace out what the programs were doing. I don’t think those folks I worked with had the same luxury. And besides, there are so many layers of code in current programs that finding the root cause might be problematic. And furthermore, modern pipelined processors don’t report fundamental errors synchronously any more – so the current instruction may not have anything to do with the real problem. One can understand the reasons for using interpreters and runtime frameworks – just to get control back for error reporting even at the cost of a bit of performance.
But in a sense the externalizing of customer support has another effect – the results of poor coding, or more likely the collision of multiple pieces of good code that just don’t happen to work together, are handled by people well removed from the perpetrators of said code. Their experience is probably summarized in some tidy management report that may eventually make it back to the developers, but not necessarily. So not only are costs increased but the learning is diminished by the decoupling of support from development.
Then it struck me that there is a lot of this going on. Corporations and governments contract out their public-facing services and insulate the organization from the responses. You can rant or rave at your elected representative all you want, but if that communication gets handled by their press secretary and never reported upwards it has no effect. Maybe they gauge public response by the weight of the mail and not the content. Or hold public meetings where attendee questions get danced around and then ignored. Or send out a mailing with a question along the lines of ‘have you stopped beating your wife yet?’ Each choice is really the same, so only the form of listening to the public is followed. They don’t really want to know – it gets in the way of their plans and takes their minds off themselves.
A pity, this decoupling of action and support – so as actions at many levels get increasingly decoupled from perceptible reality, one wonders if this is how the Romans saw it towards the end?