Ever been in that situation? It’s a deadline, you are under the gun, things are going wrong. You are putting in the long hours, but no matter how hard you work things seem out of control. I’ve been in that situation a few times. I’ve worked in a couple of start-ups, and one in particular had a couple of occasions where it almost all went wrong. We pulled it in, but boy it was close.

So you think: we’ve done it before and we didn’t fail, we made that release or we went live with that customer… or maybe you didn’t… or maybe you did, but in the process you burnt the relationship with the client… or burnt yourself (or your team) out, or maybe you found yourself in firefighting mode for weeks post release, scrambling to keep things running behind the scenes.

The problem is it can be addictive. It can be a rush to pull that rabbit out of the hat, and it tends to propagate the Hero culture, which then feeds on itself. Eventually this will lead to an implosion where projects, and possibly the company, will fail.

How do we break that cycle, given that we are in the middle of it?

Context note: this is aimed more at small teams/startups. I am not going to go into how to avoid the situation in this post; rather, it is about dealing with the situation at hand.

First, find some calm.

Second, step back, go for a walk, go to a cafe, take a nap - do what you need to do in order to get some effective head space.

Third, you need to plan, and I don’t mean a detailed plan, I mean just enough of a plan to keep you on the right track, a meta-plan if you like, something to guide you.

So you are now calm, you can judge risk more clearly, you can make decisions, you will make fewer mistakes.

If you are still doing features I recommend setting up a really simple “Kanban” board. Something like: backlog, dev, test, deploy, accept.

Put up a few of the most important/riskiest tasks (negotiate, argue, decide amongst the team and the product owner) on the left, and pull them across one at a time, right through to the end, before starting the next. Don’t be tempted to work on too many at once; more team members than cards in flight is a good rule of thumb. When that bunch is done, put up the next few important ones and repeat. Keep the number of cards on the board small, don’t overwhelm it or yourselves. Don’t get hung up on the column names and the card content/structure - just do the minimum to keep a tab on what you are doing, and adjust as appropriate.

It’s likely that you are close to release, and perhaps you’re stuck in a nasty bug/quick fix cycle. “This change will fix it, ah no, damn, ah, try this, and this…”. Time to break the cycle. This sort of thing will destroy the confidence of your client and your credibility, and damage the likelihood of future work.

So how do you break out of this? Don’t guess. Be data driven: work from facts and not supposition, and that means work from test results. Ideally you’ll be using TDD or BDD style methods, but if you aren’t you can still leverage some of those ideas to help break the cycle. Reproduce failures and, if possible, work out how to automate that test - even a simple bash script using curl and grep with a simple assertion (exit 1) is better than nothing, and will help prevent that problem creeping back in later. You can take these tests and later build them into a proper BDD suite.
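A minimal sketch of that kind of check, written here in Python rather than bash and curl. The function names, the URL and the expected text are all placeholders invented for illustration; the only idea taken from above is to fetch the thing that was failing, look for the text that proves it works, and exit non-zero if it isn’t there.

```python
import sys
import urllib.request


def verdict(body: str, expected: str) -> int:
    """Return a shell-style exit code: 0 if the expected text is present, else 1."""
    return 0 if expected in body else 1


def check_url(url: str, expected: str, timeout: float = 10.0) -> int:
    """Fetch url and check the response body for the expected text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and HTTPError are both OSError subclasses
        print(f"FAIL: could not fetch {url}: {exc}")
        return 1
    code = verdict(body, expected)
    print("PASS" if code == 0 else f"FAIL: {expected!r} not found in response")
    return code


# Hypothetical usage: python check_bug.py <url> <expected text>
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(check_url(sys.argv[1], sys.argv[2]))
```

Because the script exits with 0 or 1, it can be dropped into whatever you already run before a release, and later replaced by a proper test in your suite.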

Take small steps: when confidence is low, move slow. By taking small steps you can confidently move forward, as you take less risk and it’s easier to roll back to a known, or at least better known, state.

“But I don’t have an environment to test in” … this is a huge risk. Usually it is actually the case that the live environment is very different to the test one. Time to change that; a few hours (even days) of work will make all the difference. “We don’t have the hardware” … sorry, that excuse is running out. AWS and other cloud providers allow you to put together a lot of hardware quickly, at pretty reasonable cost, and turn it off when done - a few hundred $/£/€ may be worth it if you can solve the problems effectively.

Remember it’s not easy. If it was, someone would have already done it, so looking for silver bullets is counterproductive. Whilst you can get away with cutting some corners (with acceptable consequences, usually in the form of technical debt) there are some you can’t - sometimes you just have to accept spending that time and money to make sure you can test what you are doing adequately. Keep note of those decisions where you have incurred technical debt, so it can be addressed later.

The important thing is to maintain perspective. Yes, things are bad, but it’s possible to bring things under control, and when things are in control you can make better decisions and clear choices on trade-offs.

Setting up a ‘war room’ is sometimes a good idea, as long as it promotes that calm control and doesn’t become a yelling chaos factory. Protect your space, and make it clear that you don’t want those who will disturb the group’s balance invading that space. Agree to provide status updates on your terms - ideally your simple Kanban board will give enough of that status for you!

Protect your relationship with the client/customer - keep a clear consistent communication channel, one person to inform, update and ask questions. I sometimes call this an Anchor, like a news Anchor they are there to keep consistency and to keep it (the channel) together under pressure. They can also help prevent the ‘quick fix/release/break’ cycle by keeping check on the team output. They should be professional, keep emotion aside (including not reacting to emotion from the other side) and try to build a collaborative approach.

“It’s not our fault it’s broken … It works for us” … this is a very common attitude. However, whilst at the time you may think you are right, all too often you turn out not to be. That is embarrassing and possibly worse. If it’s not your fault, be prepared to prove it. If you can’t prove it, and more importantly if you can’t prove it in the environment that your client is seeing it in, then accept it’s more likely your problem than theirs. The worst thing about this sort of response is that it breaks confidence and reduces trust, because once you have done it, it becomes all the more difficult to work closely with the client to solve the next problem.

No plan survives contact with the enemy, so be prepared to adapt as necessary. Periodically step back, breathe and review the board. Take a quick poll of the team, chat to the client and sanity check your priorities.

One boss early in my career gave me this advice, “In the next quarter they are not going to remember that you were two weeks late, but they are going to remember if you delivered them a steaming turd on time”.

Nowadays I view this advice in terms of the time/scope/cost triangle - they won’t remember if you were missing a couple of features, but they will if it didn’t work at all.

The goal is to move toward sustainable development and to not be in this situation to begin with. But things don’t always work out: you and your team may not have the experience or the available resources, or you may just be moving too fast so as not to miss the opportunity. Sometimes you find yourself in a firefight and you need some help with the hose, not a lecture on the dangers of playing with matches.

Importantly, when the dust settles a bit, make the time to look back: do a retrospective, try and work out where things went wrong in the first place and look to improve.

Something I have seen as a quite common approach in an organisation is to start a skunkworks project, to get something critical off the ground. I have been involved in several in the course of my career, with varying success.

Typically these initiatives are actually done for the wrong reasons, but that’s not what I am going to talk about here.

What I have noticed is that, more often than not, a skunkworks project is started; a small, skilled and motivated team is split off, isolated and allowed to focus on a specific issue. Now this (sometimes) produces fantastic results: the team is hugely productive and makes brilliant inroads into the issue. Their progress reports are astounding and it promises great things.

Then they reach that critical juncture and the team is brought back into the fold. Unfortunately, more often than not, things then start to unravel. The business views the final outcome as, if not an outright failure, then a partial one.

Looking at what has happened we can find a lot of potential ‘reasons’ for this: The skunkworks team just failed, they misreported their progress, they built false hope. The process of creating a skunkworks team created too much animosity, those left behind were resentful and they then either consciously or unconsciously sabotaged the final integration of the solution. The team produced something which whilst it solved the problem, couldn’t realistically be integrated into the ‘real’ system, it was either too advanced or couldn’t be understood by the mere mortals expected to continue or support the work.

Regardless of these and many other possible reasons I actually think there is something very different at work here – I think the perceptions of the above don’t take into account something fundamental.

When you split off your skunkworks team, you removed them from your ‘System’. You gave them freedom, they probably self-organised, they were smart and motivated. In their work they created their own little system in a bubble, one that suited their needs, and as a result their productivity exploded. They probably even ignored the remnants of the system they were supposed to use (timesheets… nah can’t be bothered, waste of time …) and got away with it, because they were the skunkworks team, they were special.

Then you dragged them out of that bubble.

You pulled them back into the ‘System’, and you crushed them. That bubble dissolved, productivity plummeted and problems rocketed, because you re-imposed those constraints which held them back in the first place.

Chances are some of that team, whom you pampered by allowing that microcosm, walked out the door: they saw what was possible and had it cruelly ripped out from under them. The ‘System’ defeated them.

You see the huge gains in productivity were not down to isolating them, giving them their own coffee machine or supplying them with pizza late into the evenings. Sure they loved that stuff, but what they really loved was the freedom – the new system they created specifically to meet their goals.

By System, I don’t mean process (Waterfall, Scrum, etc.), I don’t mean the Feng Shui of the office, I don’t mean free pizza - the system encompasses all of those things and much more. In turn the system produces a culture, and that culture plays a part in driving the behaviour of the people who work there.

It’s an extremely complex system with positive and negative feedback loops and a million variables. More people, more variables, more loops.

I don’t have a magical solution if you are considering, or already have, a skunkworks project; every situation is different. But I do suggest you step back and really think about why there needs to be one. I am betting that stepping back and thinking about the way you work, the system and its culture, will help you find alternative paths to solve your problems.

One of the questions I like to ask candidates during the interview process is, “What is the most memorable problem you have had to solve?”. Often I am disappointed in the answers I get, and that’s with candidates with 10+ years of experience who I would have expected to have come across some gnarly problem they had to solve.

Thinking about the question from my own experience I have several answers, but one in particular, now over 10 years old, I thought was worth documenting before it fades from my memory.

In the late 90s and early 00s I worked in a startup in the telephony space, or more correctly the computer telephony integration space. It was a niche that was growing fast and one we were doing very well in. We had secured a contract with DEC (just as they merged with Compaq, which has since merged with HP) to port our software onto DECUnix (later Tru64). We had been running on Intel-based systems for some time using SCOUnix, Linux and Solaris. I had joined the team just after the port had been done; we had a couple of systems live and a couple running in labs.

There were some concerns raised on performance during integration testing with a client, and we embarked on some benchmarking. We pitted a brand new DS-20 on loan from DEC against our latest and greatest PIII-500. In theory the DS-20 should have left the PIII in its dust, but it didn’t. It didn’t even come close to matching it; in some tests it came in at almost 50% slower.

There was much gnashing of teeth. We were relying on the higher performance of the DS-20 to drive our SS7/ISUP solution into new markets at a scale our competitors couldn’t match.

I was still pretty early in my career, wet behind the ears if you like. I set about looking at what changes had been made during the port, and after a few days could not find any significant ones. I ran micro-benchmarks on those which had been made and couldn’t find any issue. We tried tuning the DECUnix kernel, we spoke to Compaq and adjusted everything to what was considered the optimum for our situation. No real change.

Tempers were rising and the client was threatening to pull the plug. My boss and I were spending hours on end stepping through code, testing and tuning, all to no avail. Eventually our CTO rang Compaq and gave them a barrage about how ^&*# their Alphas were, and how we were so disappointed with their performance we were going to drop our port and withdraw from the contract. After some toing and froing Compaq flew an engineer out to take a look at our benchmarks and to see if they could help us out.

So the engineer arrived. I walked him through our tests, ran the benchmark and showed him the outputs. He looked over the tuning we’d done and agreed we’d done all the right things. I asked if he’d like to see the code. He said not yet, instead let’s profile it. Profile? I’d not really thought of doing that on the Alpha. We’d profiled on the Intel, OK, so how could we do that on DECUnix? He spent a few minutes showing me how to do it, I did a fresh build with profiling enabled, and then we re-ran the benchmark with profiling on. As expected it was slower still, but we generated a profile.

Right, let’s take a look.

The engineer looked at the results and within 30 seconds he said, ‘Any reason you are opening /dev/null 11 million times during that test?’… huh, WTF?!

I looked through the application code, no references to ‘/dev/null’ - it must be from somewhere else. Some find, strings and grep later we found the culprit, a shared library with not much in it, but it wasn’t a DEC library. A search through our CVS repository, and I found a small module written by my predecessor who left a few days after I started - just after he’d finished setting up the new server.

In that module was a single function snprintf … and a lot of comments abusing DEC for not including this function in their standard libraries……

And in that function was the reference to /dev/null.

He had implemented his own snprintf, and thought he’d done a smart job. He was using the fact that fprintf returns the length of what was output. So, in order to determine the length of the formatted string, he would open /dev/null, fprintf to it, get the length, close /dev/null and then use that length to determine if the result needed truncation before calling sprintf. Every single call paid for an open and a close of a device file, plus formatting the string twice. Oh crap.
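To make the cost concrete, here is a rough Python analogue of that approach next to the obvious alternative. This is not the original C module; the function names and the snprintf-style handling of the size are assumptions made purely for illustration.

```python
import os


def truncate_via_devnull(fmt, size, *args):
    """Mimic the approach in the story: measure by writing to the null device.

    size plays the role of snprintf's buffer size and is assumed to be >= 1.
    """
    with open(os.devnull, "w") as sink:   # one open and one close per call
        length = sink.write(fmt % args)   # write() returns the character count
    text = fmt % args                     # the string is formatted a second time
    return text[: size - 1] if length >= size else text


def truncate_direct(fmt, size, *args):
    """Format once and measure the result in memory."""
    text = fmt % args
    return text[: size - 1] if len(text) >= size else text


# Both give the same answer; only the amount of work per call differs.
print(truncate_via_devnull("%s-%d", 6, "abc", 42))
print(truncate_direct("%s-%d", 6, "abc", 42))
```

The results are identical, which is presumably why nobody noticed. The difference only shows up when the function is called millions of times on a hot path, and a profiler points straight at it.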

Some quick hacking, a fresh build, and bang our Alpha was now screaming along just under twice as fast as the Intel.

Red faced, I thanked the engineer, and slunk off to explain to the CTO where the problem was. A day or so later we shipped a new version to our client using the GNU library instead of our insane one.

Contracts saved, faces red and a lot learnt.

Recently I had a discussion with a colleague about his strong ‘fear of failure’. We spoke about it and hopefully I put his fears into context, and I will try and build on that here.

Now the Fear of Failure is important: it helps keep us motivated, it keeps us diligent and for some people it is truly what drives them. However, in our industry it also stifles growth and innovation, and promotes overly conservative behaviour.

I am quite lucky in my current role, I lead a team of engineers who have explicitly been given ‘The Freedom to Fail’ - it is actually in the team mission statement, how neat is that?

Now that doesn’t mean it’s OK to sit back, relax and just let projects crash on to the rocks. There are some projects that we should not ever fail on: those where a solution is self-evident, where we have more than adequate resources and where risks are contained.

But what it does mean is, we can be more experimental in our approach, we can truly try and learn whilst tackling the challenge, and we can try things that normally would be considered risky. We get a chance to learn and grow.

In an organisation that in many ways is very conservative, and in a current climate that promotes that conservatism, we break the mold.

Now that doesn’t mean we do things purely because we like the idea of them (well I lie, we occasionally do but as a learning exercise). But we question the traditional approaches, we ask ourselves if the “typical” solution is appropriate, if the corporate template is appropriate and seek out other approaches which may indeed prove to be a better fit.

Perhaps luckily our choices have proved successful more often than not. But things do go wrong, we on occasion do fail and wind up either having to restart, rework or abandon a piece of work.

It is not wrong to fail, however it is very very wrong to hide, deny or ignore failure. That failure may only be partial or minor, but it is still a failure and needs to be addressed.

So here is what is important…

Identify Failure

Have a way of identifying failure (and subsequently success). There are many useful techniques that can be applied here, including time boxing, acceptance testing and “done means done”.

Time boxing can be vitally important. Estimated work often overruns; it’s a fact of life. We are poor at estimating, and often something simple takes a lot longer than expected. Using a time-box you can identify that you have failed, and that you need help, need to change approach or need to give up.

Acceptance testing lets you prove that you haven’t failed. In many cases I take a pessimistic view of success: if it’s not passing all the tests, it’s a fail.

“Done means done” - all too often you can ask if feature X is finished and you’ll get the reply “oh yes that’s done but this part needs tweaking”. Sorry pal, if that part needs tweaking it’s not done. Now put that card back into “in progress” and finish it properly before declaring it done.

Understand the consequences of Failure

It is important to understand what the consequence of failure is at every point. There are varying magnitudes to this, both from a project planning “will we get this done on time” point of view and in terms of the real life consequences of the project failing after release.

It is important to know what can be done (if anything) to mitigate the failure. Can a different approach be taken, can another team undertake the work, can we just ditch it?

Structure what you do to ensure any failure occurs as early as possible in the process. Failing quickly increases the chances that we can mitigate the failure or if that is not possible then stop work before more resources are wasted.

Understand the Root Cause of the Failure

All too often the root cause of a failure is not fully understood. “We think it’s because the connection pool was used up and that caused …” - wait up here, stop, “We think?”

In some cases the root cause is obvious and at other times it requires investigation (root cause analysis). However, something that is often overlooked is what I consider “people problems”.

Dealing with people is hard, especially in an industry known for having anti-social tendencies. Quite often people without the correct skills, experience or support are given problems they are just not equipped to handle - and that’s a management failure.

Disseminate the Failure

This is the bit a lot of people shy away from. I have a great deal of respect for someone who can stand up in a meeting and say, “yeah I screwed that up, I need help to find a way to resolve it”.

In my experience the failure to discuss failure is worse than the original failure - it dooms others to repeat it. And covering up a failure is far, far worse than actually failing, because it perpetuates a lie that something works when it doesn’t.

Honesty is indeed the best policy. Of course I don’t mean that companies should publicly air their dirty laundry but rather within your team (and hopefully within your entire company) there should be open, honest and frank communication.

Final Whitterings

Does your team fear failure to the extent they don’t think outside what they think is expected of them? Worse yet do they fear it to the extent they will actively hide or deny it? Are you holding your team back through your own fears?

Building a sense of trust within a team is vital, and I think how you address the issue of failure is a vital part of that.

If you work in an environment where failure is expensive then try and create a space where it is acceptable. Code Dojos, Hack Days and similar events give people a chance to flex their mental muscles and their creativity, and let them grow. They give people that freedom to fail.

Can’t afford to fail? Can’t afford not to learn from failure.

Lately I’ve been doing some story generation for a few projects, and I’ve really been focusing on the idea of putting value first.

It’s a subtle, but powerful idea. I was introduced to it by Dan North during his tutorial on BDD at QCon London 2009 and it really clicked.

I’ve been doing quite a bit of process work involving Kanban, which really focuses on value - so it has a kind of synergy (pardon me while I wash my mouth out for using management speak). Liz Keogh also wrote a good article for InfoQ recently on Kanban and BDD.

So the format we’ve chosen is the same Liz describes in her article:

In order to … As a … I want …

And it’s working really well so far. Putting the value first makes more sense to our domain experts, who, being non-technical in most cases, typically got stuck on “As a” and “I want”. And we’ve found that we keep our scope better contained, as people stop thinking about “I want” and concentrate on the end value of “In order to”.

So if you are working on stories I would really encourage you to try this one simple change. Hopefully you’ll see the benefits.