People often like to quote from the Sherlock Holmes stories. It was possibly the most famous line of all, “…when you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth”, that made me feel that the attitude displayed by Conan Doyle’s detective is the one required of a good Software Maintenance Engineer.
In my experience, one of the hardest and most undervalued skills to master is analysing and resolving customer issues. Even the cleverest, most practised team of developers has issues, though the tolerance level of the domain may affect the stage of the development lifecycle at which they become apparent. It is fairly clear that the systems on an aeroplane need to be somewhat more robust than, say, your average computer game. However, the principles of resolving an issue apply at whatever stage the problem is identified. In fact, if you are adept at addressing these problems with the more limited resources that come with customer issues, failures identified during the development phases should be a much easier prospect, so it is never a redundant skill.
‘There is nothing like first-hand evidence.’ (A Study in Scarlet)
Being a skilled maintenance engineer starts well before the issue arrives at your desk. Initial development planning must consider how you will maintain your code. The chances are some other poor developer is going to have to fix it! Plan to provide as much evidence as possible to deduce the cause of a problem.
The quality of your logging is important, whether it be product-specific log files, the Windows Event Log, syslog, an audit table in your database, or some equivalent. You cannot assume you will be able to reproduce an issue, so you need to be able to rely on historical evidence.
Make it simple, make it succinct and make it clear. I have seen code with extremely complex logging mechanisms that allow logging levels to be controlled at an almost infinitesimally precise level. In my experience, if you think carefully about what you are logging, the common model of four levels (information, warning, error and debug) will suffice in most circumstances.
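As a rough sketch of the four-level model, here is how it might look with Python's standard logging module. The logger name, the messages and the format string are invented for illustration; your product's own logging framework will differ.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Create a logger that writes one consistently formatted line per entry."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    logger.propagate = False  # keep the example's output in our stream only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)-7s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Hypothetical usage: an in-memory stream stands in for a real log file.
buffer = io.StringIO()
log = build_logger("importer", buffer)

log.debug("row 17 parsed as %r", {"id": 17})          # hidden at INFO level
log.info("import started: %d rows", 1200)
log.warning("row 17 has an empty 'name' column; using default")
log.error("import failed: cannot reach database; check the connection string")

entries = buffer.getvalue().splitlines()
```

With the level left at information, the debug entry is suppressed and the other three are written. Note that the error message also tells the reader what to check, which is the kind of entry that can save a support call.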
Look at the log files whilst you are developing your code.
- Is the level appropriate?
- Is it accurate?
- Can it be more succinct without losing clarity?
- Can you tell the user how to resolve the issue, and if so do you?
- If you have multiple data sources are they structured consistently?
These may seem petty considerations, but when you are trawling through thousands of lines in a log file they become highly relevant. If you need to cross-reference data from two files, a consistent structure simplifies this. If you can avoid customers having to raise an issue in the first place, why not?
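To illustrate why a consistent structure helps with cross-referencing, here is a minimal sketch. It assumes both sources write one entry per line as an ISO 8601 UTC timestamp, a level and a message, each file already in chronological order. The sample entries and the names parse_entry and merged_timeline are hypothetical.

```python
import heapq


def parse_entry(line, source):
    """Split 'TIMESTAMP LEVEL MESSAGE' into a tuple that sorts by timestamp."""
    timestamp, level, message = line.split(" ", 2)
    return (timestamp, source, level, message)


def merged_timeline(sources):
    """Merge several chronologically ordered logs into one timeline."""
    parsed = [
        [parse_entry(line, name) for line in lines]
        for name, lines in sources.items()
    ]
    # heapq.merge relies on each input already being sorted, which holds
    # here because every log is assumed to be written in time order.
    return list(heapq.merge(*parsed))


# Invented sample data standing in for two real log files.
service_log = [
    "2024-01-05T10:00:01Z INFO request 42 received",
    "2024-01-05T10:00:03Z ERROR request 42 failed: timeout",
]
database_log = [
    "2024-01-05T10:00:02Z WARNING query for request 42 took 1900 ms",
]

timeline = merged_timeline({"service": service_log, "database": database_log})
for timestamp, source, level, message in timeline:
    print(timestamp, source, level, message)
```

Because both files share the same layout, one small parser serves both, and the slow query lands between the request and its timeout without any manual matching. If the two files used different timestamp formats, most of the effort would go into reconciling them.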
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” (The Adventure of the Copper Beeches)
You need the evidence to identify suspects. Your Support Team should have a standard set of items they automatically collect when a customer raises an issue. By the time you get your hands on the problem you should have details of the customer environment, including any bespoke tweaks that they have. You should have log files and dump files if appropriate, clear steps to reproduce, and the actual and expected outcomes; you may even want database backups. You can never have too much information, because you generally find that you can quickly filter out the irrelevant and focus on the valuable.
‘It is a capital mistake to theorise before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.’ (A Scandal in Bohemia)
You cannot come to an issue with a conclusion already made. You must analyse all the data, which means you need an excellent Support Team that will get it for you. Take time to look at the errors, the logs, the databases and the environment that the software is running in.
I have seen many situations where we have had two customer issues and the symptoms appeared to be the same, but after analysis it became clear that they had different causes.
Reproduce the error. The Support Team need to have the skills and information to be able to at least attempt to reproduce the issue in a simplified environment. This means the Development Team must have taken time to give them training on the products. Clearly they must be able to install, configure and run your products, but are there other technologies they need to be familiar with, such as SQL, AD or IIS? They need at least a basic understanding of the architecture and what each component is responsible for. If you have a third line support team that can do some basic debugging and database analysis, all the better, but you must give them the skills. Time taken to give this additional training will pay dividends by freeing up the development team, reducing the impact caused by context switching, changing development environments and diverting development away from newer projects.
‘I cannot live without brain-work. What else is there to live for?’ (The Sign of Four)
You have to be willing and able to dive wholesale into a new domain or unknown code, and to study. I have always enjoyed the challenge of this part of the job. Whether it be working on a wholly new product or in the realm of maintenance, this is a job where you can never rest on your laurels.
One of the first issues I addressed when I joined 1E over five years ago was a customer issue with IIS, a technology that, as a developer, I had not had to touch in my previous roles. Within two weeks I had gone from complete novice to resolving an issue in which configuration inheritance caused application pools to be recycled when they shouldn’t have been, and hence the user session state to be lost. Probably not the most complex of IIS issues for an expert, but I wasn’t one.
The challenge needs to push you on, direct your study and inform your investigation.
‘They say that genius is an infinite capacity for taking pains,’ he remarked with a smile. ‘It’s a very bad definition, but it does apply to detective work.’ (A Study in Scarlet)
Very few of us are geniuses, but as a developer trying to resolve a customer issue you cannot take short-cuts. As with Holmes’s magnifying glass you need to be able to get into the minutiae of the crime scene. This may be the analysis of hundreds of lines of code or investigating the contents of a database table with a million rows of data; you just have to do it. What you do need to do is make sure you know the basic techniques that can reduce this pain.
Basic tools and techniques, like setting breakpoints or viewing the content of the registers, are covered in a million articles and books. Any experienced developer should already have a toolbox full of these techniques, but what about the junior members of your team? Make sure your senior engineers work with them on real issues. Even the different kinds of breakpoints, and how to use them, may not be something they have been exposed to. Many products contain lines of code that might be iterated over thousands of times within seconds; try using a standard positional breakpoint to step through that and you will never find anything. Learn how to debug multithreaded applications, and know how to look at the call stack or check the contents of the registers.
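One kind of breakpoint that helps with heavily iterated code is the conditional breakpoint, which only stops when a condition you choose is true. The following is a minimal sketch in Python, not a recipe for any particular product: the function, the data and the condition are hypothetical, and the debugger hook is replaced with a simple recorder so that the example runs unattended.

```python
import sys

hits = []


def record_hit():
    """Stand-in for the debugger so the sketch can run non-interactively."""
    hits.append("stopped")


def total_order_value(quantities):
    total = 0
    for quantity in quantities:
        # Conditional breakpoint: stop only on the suspicious iteration,
        # not on each of the thousands of ordinary ones.
        if quantity < 0:
            breakpoint()
        total += quantity
    return total


# With the default hook, breakpoint() opens pdb at that line. The same
# effect is available inside pdb without editing the code, using the form
#   break FILE:LINE, quantity < 0
sys.breakpointhook = record_hit

quantities = [1] * 5000 + [-3] + [1] * 4999   # one bad value in 10,000
result = total_order_value(quantities)

sys.breakpointhook = sys.__breakpointhook__   # restore the real debugger
print(len(hits), result)
```

The loop runs ten thousand times, but the breakpoint fires once, on the one value worth looking at. Most debuggers offer an equivalent, along with hit counts and data breakpoints, and these are worth showing to junior engineers on a real issue.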
Sometimes you may have to work with optimised code, so make sure you understand how to do this. The disassembly window shows what is really being executed; the source code may not be an accurate description of the actual behaviour. Play with the various tools that can help you above and beyond the standard debuggers for your environment, such as Fiddler for examining HTTP traffic or HexEdit for examining binary files. Learn some simple techniques for reducing the amount of data you have to deal with, like binary splitting (the bisection algorithm).
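Binary splitting can be sketched in a few lines. The example below assumes that exactly one item in a batch causes the failure and that you have some repeatable check for whether a batch fails. The names find_failing_item and import_fails, and the sample rows, are hypothetical.

```python
def find_failing_item(items, batch_fails):
    """Return the index of the single item that makes batch_fails true.

    Assumes exactly one culprit, and that a batch fails if and only if
    it contains that culprit. Returns None if the whole batch passes.
    """
    if not batch_fails(items):
        return None
    low, high = 0, len(items)       # the culprit lies in items[low:high]
    while high - low > 1:
        mid = (low + high) // 2
        if batch_fails(items[low:mid]):
            high = mid              # culprit is in the first half
        else:
            low = mid               # otherwise it must be in the second
    return low


# Invented data: 10,000 rows, one of which contains an embedded NUL.
rows = [f"customer-{n}" for n in range(10000)]
rows[7321] = "customer-\x00"

checks = 0


def import_fails(batch):
    """Stand-in for re-running the real import against a subset of rows."""
    global checks
    checks += 1
    return any("\x00" in row for row in batch)


culprit = find_failing_item(rows, import_fails)
print(culprit, checks)
```

Each check halves the suspect range, so the number of checks grows with the logarithm of the number of rows rather than with the rows themselves. The same idea applies to configuration settings, lines of a script or a range of builds, provided the failure can be checked reliably.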
Make sure you are teaching the novices in your team. Like any knowledge silo, having a limited set of developers with these skills is only going to cause you problems down the line.
“…when you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth” (The Blanched Soldier)
One of my colleagues has recently been spending inordinate amounts of time examining a customer issue that caused a process to hang and never complete. He tried all the usual techniques to identify the cause. The culprit was eventually identified as a “quirk”, as my colleague described it, in SQL Server. (This is described in more detail in his blog “The Dark Side of SQL Server”.)
I myself have spent many days investigating an issue that was finally traced to a particular string value in one column of a single row of data being imported with thousands of other rows.
Other unexpected causes include a bug in a third-party library, where you had to call one particular method twice with the same parameters, with no other call in between, for it to commit the change you requested.
Let the information lead you to the answer, however obscure.