Evidence-based Troubleshooting: Observations on Problem-solving from Economic Psychology

My latest read, The Undoing Project: A Friendship That Changed Our Minds, explores the friendship of Daniel Kahneman and Amos Tversky, two psychologists whose study of irrational human economic choices led to a Nobel Prize in Economics. While the book discusses their friendship in detail, it also highlights their work, which has influenced a wide array of disciplines including medicine, professional athletics, and military recruitment.

I was particularly struck by their work on decision making. Through several studies, Kahneman and Tversky found that human judgement is skewed by how questions are asked, how information is presented, and how closely a scenario aligns with a person's existing biases. For example, they discovered that subjects given 5 seconds to estimate the product of 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 consistently guessed much lower than those asked to estimate the product of 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1, although the actual product (40,320) is exactly the same.

The implications of their work struck a chord.

Those of us who work in technology pride ourselves on our rationalism. We deal in binary math, algorithms, and logic. Yet we often fail to step back and see ourselves as mere mortals. We acknowledge that some disciplines suffer from irrationalism, but not ours. We are thinkers! Kahneman and Tversky, who studied doctors, graduate students, and economists, documented the irrationality of human decision making. Even as technologists, we’re not immune to the human condition.

***

It doesn’t take long for an IT professional to find themselves on a bridge line with several dozen people working to recover from a production outage. People ask questions and present theories, and there are usually more questioners than problem-solvers on the call. What can we learn from the discoveries of Tversky and Kahneman in these situations? What strategies can we follow to solve the problem and get off that bridge line faster? Here are a few of my takeaways after reading about their work.

Understand the problem. Skip this step at your peril. Ask: What is the problem? When did it start? What systems are affected? Has a trustworthy IT professional verified the problem? Ask questions to help determine scope: Is the problem local, regional, national, or global? Can the problem be replicated? Ask questions that have measurable answers.
Gather real data. Don’t spend hours on an outage without doing your research. We don’t always have perfect metrics, but there are several places we can look to help us determine what’s wrong:
- Log data: Gather log data, but unless the error messages explicitly point to the issue at hand, treat them as just one data point.
- Testing tools: For network engineers, these tools may be as simple as ping and traceroute. Record your results.
- Monitoring systems: NetFlow data, SNMP data, and alerting systems can be extremely helpful in deconstructing the issue.
- Metrics and CLI output: Take a look at interface counters, show commands, and other device-level data.
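
To make "record your results" concrete, here is a minimal Python sketch that runs one diagnostic command and appends a timestamped record to a notes file. The function name record_probe, the file name outage-notes.jsonl, and the host 192.0.2.10 are all placeholders I made up for illustration, and the ping flag shown assumes Linux or macOS syntax.

```python
import json
import subprocess
from datetime import datetime, timezone


def record_probe(command, log_path=None, timeout=30):
    """Run one diagnostic command and return a timestamped record of what happened."""
    started = datetime.now(timezone.utc).isoformat()
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        returncode = done.returncode
        output = done.stdout + done.stderr
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # A missing or hung tool is still a result worth writing down.
        returncode = None
        output = f"{type(exc).__name__}: {exc}"
    record = {
        "started_utc": started,
        "command": list(command),
        "returncode": returncode,
        "output": output,
    }
    if log_path is not None:
        # One JSON object per line keeps the notes easy to search later.
        with open(log_path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
    return record


# Hypothetical usage (placeholder host, Linux/macOS flag syntax):
# record_probe(["ping", "-c", "4", "192.0.2.10"], log_path="outage-notes.jsonl")
# record_probe(["traceroute", "192.0.2.10"], log_path="outage-notes.jsonl")
```

The point is less the tooling than the habit: every test gets a timestamp and its raw output, so nobody on the bridge line has to rely on memory an hour later.
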
Know your normal. Often, troubleshooting is derailed for hours by chasing log messages or incidents that are completely normal. For example, know how many sites typically experience an outage in a normal day. Review historical data to see if network utilization is particularly high, low, or normal. Understand your environment well enough to know when something is normal and when it’s out of place. In Kahneman and Tversky’s vernacular, this relates to the base rate: how often something happens under ordinary conditions.
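
One rough way to check an observation against your normal is to compare it with historical values. The sketch below uses a simple z-score with an arbitrary threshold of 2.0; both are my assumptions for illustration, and real outage counts rarely follow a tidy distribution, so treat the result as a prompt to look closer rather than a verdict. The function name compare_to_baseline is hypothetical.

```python
import math
from statistics import mean, stdev


def compare_to_baseline(history, today, threshold=2.0):
    """Return (z_score, is_unusual) for today's value against historical values."""
    if len(history) < 2:
        raise ValueError("need at least two historical values to form a baseline")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any difference at all is a departure from normal.
        if today == baseline:
            return 0.0, False
        return math.copysign(float("inf"), today - baseline), True
    z_score = (today - baseline) / spread
    return z_score, abs(z_score) > threshold


# Hypothetical usage with made-up daily site-outage counts:
# compare_to_baseline([3, 4, 2, 5, 3, 4, 3], today=4)   # within the usual range
# compare_to_baseline([3, 4, 2, 5, 3, 4, 3], today=15)  # flagged as unusual
```
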
Trust but verify. (The Dr. House corollary would be “Everybody lies.”) If someone on the bridge line suggests a course of action, ask questions — especially if they’re very confident. Why do you think that? What led you to that conclusion? Does the data we’re gathering support your theory? How can we test your theory? Don’t tweak configurations based on guesses that aren’t supported by data. You may introduce more problems that will muddy the data you’re collecting.
Use existing models to conceptualize the problem. The OSI model is particularly helpful here. For example, if traceroute consistently works end to end, you can be fairly confident layer 3 connectivity is solid and the problem is higher up in the stack. You can direct application support to investigate the application, and you can check appliances such as WAN optimizers, proxies, or SSL decryption devices that primarily work at the application layer.
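
The layered approach can be sketched as an ordered list of checks, run from the bottom of the stack upward, that reports the lowest layer that fails. This is only one way to structure it. The names lowest_failing_layer and tcp_port_open are hypothetical, and the hosts and ports in the usage comment are placeholders.

```python
import socket


def lowest_failing_layer(checks):
    """Run (layer_name, check) pairs in order; return the first layer that fails, or None."""
    for layer_name, check in checks:
        try:
            passed = bool(check())
        except OSError:
            # Socket and resolver errors are subclasses of OSError.
            passed = False
        if not passed:
            return layer_name
    return None


def tcp_port_open(host, port, timeout=3.0):
    """Layer 4 check: can we complete a TCP handshake with host:port?"""
    with socket.create_connection((host, port), timeout=timeout):
        return True


# Hypothetical usage (placeholder host and port, checks listed bottom-up):
# lowest_failing_layer([
#     ("layer 3: name resolves", lambda: socket.gethostbyname("app.example.com")),
#     ("layer 4: TCP 443 open", lambda: tcp_port_open("app.example.com", 443)),
# ])
```

If every check passes, the function returns None, which matches the reasoning above: the lower layers look healthy, so attention moves up the stack to the application and the appliances in front of it.
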
Avoid blame and emotionalism. I was on a call recently where the first words I heard a contributor say were, “We haven’t changed anything in months so it can’t possibly be <our problem>!” It’s fair and reasonable to mention that you have not made any changes as a data point for discovery. But it’s not helpful to deflect questions to avoid blame. The problem is what the problem is. Deflecting blame only slows troubleshooting. Eventually, the team will determine the problem and restore service (unless the issue spontaneously resolves leaving everyone scratching their heads). If the problem is not in your area of responsibility, the facts will prove it. If the problem is within your area of responsibility, you will only delay efforts to restore service by deflecting and covering your tracks. You cannot will the problem to be outside of your area of responsibility. Don’t hide information. Don’t try to shape the narrative. You’ll increase resolution time, make the call drag on, and gather a few enemies in the process.

A few additional thoughts:

Ask what changed, but don’t stay there too long. Outages are often caused by a change in state. But typically, if a change in state is known, the change will be reversed and operations will return to normal quickly. I’ve seen troubleshooting devolve into a drawn-out whodunit before the problem is solved. If you use rational and systematic troubleshooting techniques, you will find the problem. Don’t get mired in trying to solve the wrong problem.
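
When you do have a known-good copy of a configuration, answering "what changed" can take seconds instead of a debate. Here is a minimal sketch using Python's standard difflib module. It assumes you already have both versions as text, and the function name config_changes and the sample configuration lines are made up for illustration.

```python
import difflib


def config_changes(known_good, current):
    """Return unified-diff lines between a known-good config and the current one."""
    return list(
        difflib.unified_diff(
            known_good.splitlines(),
            current.splitlines(),
            fromfile="known-good",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical usage with made-up configuration text:
# before = "interface eth0\nmtu 1500\nno shutdown"
# after = "interface eth0\nmtu 9000\nno shutdown"
# for line in config_changes(before, after):
#     print(line)
```

An empty result means the two versions are identical, which is itself a data point: time to stop asking who changed what and go back to systematic troubleshooting.
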
Be wary of your initial judgement. Just because you had an outage last month that looked similar doesn’t mean this outage is the same scenario. Often, we will form an initial judgement of the situation and cling to it too long — even when data contradicts that judgement.

Tomes can be written on the troubleshooting process. For me, The Undoing Project was a reminder that human reasoning has its limits and that we need tools and processes to keep our brains on track.
