We had a very nasty issue at work yesterday (and into today). We had seen it once before, last year, and never figured out the cause. Well, it was back yesterday, and it reminded me of some of the important aspects of problem resolution, which is closely related to debugging skills.
The specifics of this issue are not really important, except that the system’s infrastructure has multiple physical tiers and that it is a vendor application hosted internally. Multi-tier applications have a higher level of complexity because there is way more code running than just the application itself (VIPs, routers, firewalls, communication stacks, different platforms, etc.). Vendor applications can just be a pain since you don’t know the internals of what is going on, which means you are sometimes making assumptions (aka educated guesses).
Below are a couple of my favorite principles when debugging an issue.
Write Everything Down
- This is not a time to test your powerful memory skills. You will get tired and you will forget. As you try things you will find things that work and things that don’t. If you are lucky you will find the fix quickly. If you are not (it took us 20+ hours this time) then you will have many things that did not work. These are very important, and you are probably going to have lots of them.
- You are going to have people rotating in and out of the virtual team, and they are going to be a distraction if you have to bring each of them up to speed on what you have already tried.
- You will resolve this issue at some point and want to restore some of the things you changed. Write down the current state of the system – “which knobs have been turned”.
- You may look back on the path you followed to resolution and be able to identify ways to improve the system overall. If you write this stuff down you will appreciate it a couple of days later when your life returns to normal.
- Names and phone numbers. If you have a diversified organization, as we do, you are going to need (and get) lots of people coming and going. Many of these folks will have knowledge of, or authorization to change, things that you do not. Once they are members of the virtual team you want to keep them, since they already have context, which in and of itself is valuable.
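To make the “which knobs have been turned” idea concrete, here is a rough sketch of an incident log in Python. It is only an illustration, not the tool we used: the class names, method names, and fields (`IncidentLog`, `turn_knob`, `knobs_to_restore`, and so on) are all made up for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    when: datetime
    who: str
    action: str
    result: str


@dataclass
class IncidentLog:
    """Hypothetical running log for one incident (illustrative names only)."""
    entries: list = field(default_factory=list)
    # setting name -> value it had before anyone touched it
    original_settings: dict = field(default_factory=dict)
    # setting name -> value it has right now
    current_settings: dict = field(default_factory=dict)

    def note(self, who, action, result=""):
        """Record anything that was tried, including things that did NOT work."""
        self.entries.append(LogEntry(datetime.now(timezone.utc), who, action, result))

    def turn_knob(self, who, setting, old_value, new_value):
        """Record a changed setting so it can be restored later."""
        # Keep only the first "old" value: that is the state to restore to.
        self.original_settings.setdefault(setting, old_value)
        self.current_settings[setting] = new_value
        self.note(who, f"changed {setting}: {old_value!r} -> {new_value!r}")

    def knobs_to_restore(self):
        """Return the settings that still differ from their original values."""
        return {
            name: original
            for name, original in self.original_settings.items()
            if self.current_settings[name] != original
        }
```

A whiteboard or a shared document does the same job. The point of the sketch is only that every change records who made it, what the value was before, and what it is now, so that restoring the system is a lookup and not a memory test.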
Change One Thing at a Time
Thankfully this is something that I learned very early on and have tried to live by. I use the word tried, because I have succumbed to temptation to do otherwise and often lived to regret it. My implementation of this principle is the following…
- Draw a conclusion about what you think is wrong. In other words, don’t go shooting in the dark. If you don’t know what to do next, then stop. It doesn’t mean that you won’t be doing something soon, but don’t go trying things without first knowing what you think the issue may be. At some point your conclusion will either be correct and you have “solved” the issue, or it won’t be and you have eliminated another thing that is NOT the issue. More on what does not work later.
- Evaluate your options for correcting. Write them down; you may want to try all of them. This is a good time for brainstorming. You may want to bring some others into the virtual team for a short time to help out here. Treat them as consultants (see roles and distractions below) and don’t let them linger too long unless they are able to fit in.
- Decide your approach for correction. One person owns the decision as to what the next course of action is (see roles). There are a million ways of coming to a decision that I will not go into. The key here is that you choose one and let everyone know what the decision is.
- Plan your implementation. This is not as heavy as it may sound. You don’t want to spend too much time here; not that it is a waste, but at some point in this discussion you will get to a point of diminishing returns.
- Identify what you believe the new outcome will be.
- How will you know if it worked?
- Do you know how to roll back the changes you made?
- How are you going to test your change?
- What can go wrong?
- What may other outcomes be and what do they tell you?
- These are all things to consider BEFORE you actually implement the change. I feel another whole blog topic coming just on this point. If you don’t understand why all of these are important things to consider, then I need way more space than I want to spend here to show you why…so I won’t. Trust me.
- Implement the change.
- Identify who is going to do what and make sure they are clear on what they are changing. Hopefully they are an expert and not learning as they go.
- Pair programming was never more helpful than now. Work together to ensure accuracy. You will be getting tired and mistakes will happen. Put everyone to use here and let them help by watching for gross errors. Don’t be afraid to show someone how this works; you don’t want the fireworks effect where every time someone types something the entire room gasps. This takes patience: it is hard to watch someone else type, and it is equally hard to be watched.
- Run your test. This is what we have been working for. Write down the result(s). Did something totally unexpected happen? What does this tell you? What conclusions can you draw?
- Repeat. Look at this iteration of conclusions, options, predictions and outcomes, and decide what you are going to do next. Do you start over at step 1? Or someplace in between here and there?
- Don’t forget to reset. You need to choose to restore/reset the environment. Record what you do and make sure everyone knows the current state.
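The loop above can be sketched as a small Python checklist. Again, this is just an illustration under my own assumptions: `ChangePlan`, `Investigation`, and their fields are hypothetical names, and the only rules it enforces are the ones described above (answer the planning questions first, and never have two changes in flight at once).

```python
from dataclasses import dataclass, fields


@dataclass
class ChangePlan:
    """Hypothetical record of one planned change (illustrative names only)."""
    hypothesis: str        # what you think is wrong
    change: str            # the ONE thing you are going to change
    expected_outcome: str  # what you believe the new outcome will be
    success_check: str     # how you will know if it worked
    rollback: str          # how to undo the change
    test: str              # how you are going to test the change
    risks: str = ""        # what can go wrong (optional in this sketch)

    def unanswered(self):
        """Return the names of required questions that are still blank."""
        required = [f.name for f in fields(self) if f.name != "risks"]
        return [name for name in required if not getattr(self, name).strip()]


class Investigation:
    """Allows only one change in flight at a time and keeps the history."""

    def __init__(self):
        self.in_flight = None
        self.history = []  # list of (plan, observed result, fixed?) tuples

    def start(self, plan):
        if self.in_flight is not None:
            raise RuntimeError("finish or roll back the current change first")
        missing = plan.unanswered()
        if missing:
            raise ValueError(f"plan is incomplete: {missing}")
        self.in_flight = plan

    def finish(self, observed, fixed):
        """Write down the result, even (especially) when it did not work."""
        if self.in_flight is None:
            raise RuntimeError("no change is in flight")
        plan = self.in_flight
        self.history.append((plan, observed, fixed))
        self.in_flight = None
        return plan
```

Nobody needs code to do this in a war room. The sketch just shows the shape of the discipline: a second change is refused until the first has a recorded outcome, and a plan without a rollback never gets started.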
Clear Roles and Responsibilities
This is important to consider, even if it feels kind of bland in some ways. Not to mention that it is probably a whole entry in itself. A couple of things I wanted to record now are…
- Who is running the show? Make sure they can delegate. Make sure everyone respects the decision when it is made.
- Who is communicating? Boy this is a big topic in itself…
- What are the different audiences?
- Who makes the decision to communicate?
- Is this the same as, or related to, escalation?
- Frequency? Email? Phone? etc.
- Blah, blah, blah…
- Who is a spectator? Make sure they know they are.
A good example of poor role definition happened to us during this most recent incident. The system came back up and an excited member of the team sent an email to the entire customer base saying that the system was available. Whoa! Yes, the system did come back up, but it was not ready for the business to start using it yet. We still had not assessed why the system came back up and whether we thought our success was going to last. What a pain if the system had failed a couple of minutes later. Also, the system was still in a debug mode. We had lots of logs turned on and test settings configured that needed to be changed in order to get the system back to its production state. Luckily the users figured out that something was not quite right and let us know. We recovered before anything really bad happened, but it could have gone horribly wrong.
Did I follow my own principles this time? I tried. But sometimes when you have lots of people involved it is just not possible. People get anxious and/or want to contribute. They have good intentions but in the end it muddies the whole thing.
In our case we had a couple of people off in the corner of the room trying things: one with elevated privileges, and the other with a little bit of knowledge but not a core member of the team. They started hacking around without anyone else knowing. They changed a bunch of things on a test server and then found that the production environment was back up. The likelihood that they actually fixed anything is very low, but now we cannot be sure, since we don’t know what they did or the state of the environment before they did it. Chaos. Now we are left with a nagging question. This has left me with a couple of new principles. They are not very well thought out at this point, but I wanted to get them down now before I forgot.
Distractions come in many forms and they can slow you down or just plain hurt.
- Don’t have anyone involved who does not need to be. Excitement tends to draw crowds, so you need to know when to put up the yellow tape. I don’t want to be militant about this, because there are some people out there who are comfortable being in a peripheral role (see above) and know when to contribute and when to stay out of the way.
- Get to a war room or isolated area that makes all the other principles easier.
- We have several big rooms with 80″ smart boards/displays and lots of whiteboard space, which can aid in the documentation.
- They also have table-mounted speakers and lots of ceiling speakers for good audio. You will likely have a distributed team, and communication with all of them is going to be hard enough; forget it if you cannot hear one another.
- Getting away from the crowds can keep the crowds away.
- Don’t forget the creature comforts: food, drink, restrooms. These are obvious things that the team will need during an incident, but they can also be distractions. If the restrooms are way far away, then it just hurts. If people are hungry they can be distracted. You also don’t want everyone fending for themselves if you don’t have to. I kept bringing in food for the team.
- Get sleep when you need it. No heroes. If you are getting punchy then you are probably going to become a distraction for the entire team. There are all kinds of studies out there that relate being tired to being drunk. Don’t debug drunk. You will swerve over the yellow line.
Understand Vendors in Scope
Make sure you understand the vendor products or services in your application before you have an issue. What support arrangement do you have with them? Is it 24×7? How do you reach them? Make sure the contact information is current. What is their engagement / escalation model? Do they know your environment? If not, how are you going to educate them? Are you sure that the sharing technology (webex, etc.) they use is compatible inside your firewall? Are you current, and do they support the version you are on?
More to say here, but I am running out of steam. I may revisit this at a later time.
Deep breath. I am down here at the bottom of this long entry and liking the brain dump. Not sure how coherent it all is, but it feels pretty good. Let me know if you find any of this helpful.
As I was wrapping this up I found this interesting article that I thought was worth linking to here. I am constantly amazed at how many topics there are “out there”.