The best estimates we have

It was a late winter afternoon. Most of the people had already left the office or were about to do so. I was heading to my lab after having a tasty late lunch in our cafeteria. (Ok, I should call it just the late lunch.) I was hoping to get an hour or two of quiet, solitary work time to finish one of my supreme visualisations. To my surprise, when I entered the lab I noticed that someone was in, sitting at my desk.

This was a man, around 50 years old. One could easily feel that he was a person of extraordinary intelligence and wisdom. And he reminded me of someone. After a moment of thought I was almost sure, so I said:

Me: You must be me, but from the future.

Stranger: Yes, you are right.

Me: Wow, I have so many questions…

Stranger: Be quiet and listen. I do not have much time here… or should I say – now. I came to warn you about a serious mistake that your company is about to make. In 2014 you started an important project. It was called “Project Wombat”.

Me: Yes, I know this project. It started about half a year ago. We are now about to provide estimates for the remaining work, before signing the long-term contract.

Stranger: This is why I came. The estimates you will provide will be terribly wrong. This will end the cooperation with the customer. What’s more, the entire company will feel hugely embarrassed. Many good people will leave and others will lose their spirit. In grief, many people will start to dress formally, stop making jokes and the company atmosphere will die. It will take many years until the company regains its strength.

Me: But we have our good way of estimating the remaining work in projects. We calculate an average of the story points delivered during the already finished iterations. Then we check how many story points there are still to deliver. We divide the two numbers and we know how many iterations we still need to finish.

Stranger: What? Are you a moron? You calculate an average and estimate one number? A man and a dog have 3 legs on average! One number… So tell me, what is the chance that you will hit exactly this estimate? What is the chance of hitting +10%? +20%? +50%? +200%? Is it probable? Is it impossible? By presenting one number you can do more harm to your customer than by telling them nothing. They might believe you and assume that the estimate is accurate.

Me: Well… When we need a range, we sometimes take two numbers: the number of story points delivered in the worst iteration and the number for the best iteration. We calculate two estimates of finishing: optimistic and pessimistic.

Stranger: Are you joking? And you call yourself a data scientist! So you take two completely unusual iterations and calculate two estimates that are almost impossible to be correct. You are modelling two situations where either everything goes ideally or everything goes terribly. It is not very useful, I must say. Maybe stop bothering with calculations and just say “we will finish between 0 and 1000 iterations”? You would give your customer similar value.

(I was starting to hate that guy. Probably something bad had happened to him sometime in the future and he had become bitter. I promised myself to watch out and not become like that.)

Me: So maybe you could just tell me what the correct estimates are?

Stranger: I can’t. It has something to do with preserving the time continuum or something. I cannot give you any precise information from the future. But I can motivate you to make better use of what you already have.

Me: Pity, I was hoping that you would tell me what company I should invest in.

Stranger: What do you mean? Ah, you are talking about the stock market. A few years ago it was finally classified as gambling and forbidden in Europe and most of the US states. You can still invest in a little stock market in Las Vegas, though… Oh, it seems that I need to go. Listen, I have one very important piece of personal information for you. Remember, under no circumstances can you…

…and he disappeared.

After the initial shock, I realised that he was right. We should be able to produce better estimates than we had been providing so far. To do so, we would have to get to a lower level of detail and think about single user stories, not whole iterations. And to think in terms of probability distributions, not single numbers. I started to broadly understand what needed to be done:

  1. We need to gather historical data about the team’s estimates and assess if they are good enough for drawing conclusions.
  2. We need to gather historical data about the amount of work the team was doing per iteration and check how variable it was.
  3. Finally, we need to use the historical data to estimate the effort needed to implement the remaining scope. And we need to visualise this properly.

Historical estimates

I was very lucky this time, as the person running Project Wombat had been gathering detailed project data from its beginning. Not only had he kept the initial estimates of all the user stories, he had also organised the project timesheet structure in such a way that it was possible to read how many hours of effort were spent on every single finished user story. The first data set we created was as follows:

Iteration   User story     Estimate in SP   Effort spent in hours
Sprint 1    User story 1   5                42
Sprint 1    User story 2   3                30
Sprint 2    User story 3   5                25
Sprint 2    User story 4   1                6

As you can see, for every finished user story we have gathered both the estimate in Story Points and the effort spent on implementing it (in hours, including bug fixing time). We also know in which sprint the story was done. This is very useful, as teams usually work much slower during the first few iterations (people learn the new domain, how to cooperate with new team members, etc.). As this initial lower efficiency should not recur in the following iterations of the project (unless the team composition changes significantly), the data from the initial iterations should be excluded from the data set. This was also the case in our project, so we excluded the user stories from the first three sprints.
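If you want to play with this step yourself, here is a minimal Python sketch of the data preparation. The file name and column names are made up for illustration – adjust them to your own export:

```python
import pandas as pd

# Hypothetical file and column names -- adjust to your own data export.
stories = pd.read_csv("wombat_stories.csv")
# expected columns: iteration, user_story, estimate_sp, effort_hours

# Drop the warm-up iterations (here: the first three sprints), as the
# team's slower initial pace should not bias the future estimate.
warmup = ["Sprint 1", "Sprint 2", "Sprint 3"]
stories = stories[~stories["iteration"].isin(warmup)]

# Bucket the historical efforts by story size; these buckets feed both
# the boxplot check below and the Monte Carlo simulation later on.
effort_buckets = {
    size: group["effort_hours"].tolist()
    for size, group in stories.groupby("estimate_sp")
}
```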

The next step was to assess whether our historical data could be used to draw sensible conclusions about the future. To do this, we created a visualisation of the data:

[Visualisation: effort distributions per story size – good data]

On the graph you can see seven boxplots, one for each possible size of user story (the vertical axis shows 1, 2, 3, 5, 8, 13 and 20 Story Points). On the horizontal axis we have worked hours. Each boxplot describes the distribution of the worked hours needed to finish stories of its size. E.g. the boxplot for the size of 3 SP shows that among all the finished stories of this size, the minimum effort needed was 5 worked hours, the maximum was 30 hours, the median was 12 hours, and the 1st and 3rd quartiles were 9 and 18 hours.
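A chart along these lines can be drawn with a few lines of matplotlib, working on the effort_buckets dictionary from the sketch above (this is only an illustrative sketch, not the exact chart from the project):

```python
import matplotlib.pyplot as plt

sizes = sorted(effort_buckets)                 # e.g. [1, 2, 3, 5, 8, 13, 20]
data = [effort_buckets[s] for s in sizes]      # one list of efforts per size

fig, ax = plt.subplots()
ax.boxplot(data, vert=False)                   # one horizontal boxplot per size
ax.set_yticks(range(1, len(sizes) + 1))
ax.set_yticklabels(sizes)
ax.set_xlabel("Effort spent (hours)")
ax.set_ylabel("Story size (SP)")
ax.set_title("Effort distribution per story size")
plt.show()
```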

The first insight is obvious. Story points do not translate to a single number of hours. In fact, the spread of the distributions is significant. Usually, the maximum value for a story size is much bigger than the minimum value for the next story size. This is a bit surprising. However, despite the wide spreads, the distributions above suggest that the historical data are good enough to be used for future estimations. To describe what would not be good enough, have a look at one more visualisation, created for a different project:

[Visualisation: effort distributions per story size – bad data, from a different project]

In this case the team did not have a sufficient understanding of the user stories when the estimates were created. The order of the distributions is wrong: most of the “fives” are smaller than most of the “threes”. The “eights” have an enormous spread. The ones and twos are very similar. You could not base any sensible prognosis on such data. The team should throw away all the high-level estimates of the remaining backlog and once again discuss and assign sizes to the stories. Proper data about the effort per size would then be available after a few sprints at the earliest.

Fortunately, the data in our project were good enough. (The width of the spreads could have at least two causes: the team could have had too little information about the scope, or they were not yet good at estimating in this project. Additional analysis could help to distinguish between the two, but that is a separate story.) The next step was to look at the other side of the coin – the team’s capacity.

Historical capacity consumption

To the huge surprise of every fresh project manager (including me at one time), software project teams do not only write code (!). They spend time on meetings, communication, estimates, deployments, spikes, playing table soccer and much more. To understand how much time is really spent on user story work, you must gather proper data (believe me, you will be surprised when you do).

Once again we were lucky. The Project Wombat PM had been gathering such information every sprint. The visualisation below shows the split of each sprint’s capacity across different categories.

[Visualisation: split of sprint capacity across work categories]

After the three initial iterations, the share of hours spent on user stories (Feature Work) became quite stable. The team was spending about 70% of their capacity on user story work (including fixing development bugs). This number became the basis for calculating the remaining project time. As we wanted to calculate how many sprints we needed to finish the project, we had to know how much capacity we would have in each coming sprint and what part of it we would spend on user story development. When preparing the list of future sprint capacities, together with the PM we took into consideration all planned holidays, long weekends, trainings, company events etc. and applied the 70% to the expected available hours. Finally, we got the following table:

Sprint   Expected hours of feature work
N + 1    550
N + 2    440
N + 3    560
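The arithmetic behind the table is simple; a sketch with hypothetical raw capacities (made-up numbers, chosen here so that the 70% share reproduces the table above) could look like this:

```python
# Hypothetical raw capacities per future sprint, in hours, after planned
# holidays, trainings, company events etc. have already been subtracted.
raw_capacity = {"N + 1": 786, "N + 2": 629, "N + 3": 800}

FEATURE_WORK_SHARE = 0.70  # share of capacity historically spent on user stories

expected_feature_hours = {
    sprint: round(hours * FEATURE_WORK_SHARE)
    for sprint, hours in raw_capacity.items()
}
print(expected_feature_hours)  # {'N + 1': 550, 'N + 2': 440, 'N + 3': 560}
```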

Simulation

Having the historical data prepared, we were ready to run the simulation. The remaining backlog was well understood; all the user stories were quite well broken down and had their sizes assigned (in Story Points). We gathered the following information about the remaining backlog:

Size (SP)   How many stories left
2           12
3           16
5           11

For the simulation we used a simple Monte Carlo method. We calculated 10 000 simulation cases using the following algorithm:

  1. For each story in the remaining backlog, randomly choose a historical effort (in hours) from the historical data for the proper story size bucket (e.g. for a 2 SP story we randomly choose an effort from the bucket of all finished 2 SP stories, for a 3 SP story – from the bucket of all finished 3 SP stories, etc.)
  2. Calculate the sum of the efforts chosen for all the remaining stories.
  3. Check how many sprints are needed to cover the sum of efforts with the expected feature work hours.
  4. Note the result.

(Such a simulation can be quickly implemented in any statistical package. What’s more, it can also be easily prepared in an Excel spreadsheet or any other calculation tool of your choice.)
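To make the algorithm concrete, here is a minimal Python sketch of it, built on the effort_buckets prepared earlier and the feature-work hours from the capacity table. A flat per-sprint figure is assumed for sprints beyond the explicitly planned ones – that part is my simplification, not project data:

```python
import random

N_CASES = 10_000

# Remaining backlog: story size in SP -> number of stories left.
backlog = {2: 12, 3: 16, 5: 11}

# Expected feature-work hours of the coming sprints; beyond the planned
# ones we assume a flat figure (an assumption -- adjust as needed).
planned_feature_hours = [550, 440, 560]
DEFAULT_FEATURE_HOURS = 550

def feature_hours(sprint_index):
    if sprint_index < len(planned_feature_hours):
        return planned_feature_hours[sprint_index]
    return DEFAULT_FEATURE_HOURS

results = []
for _ in range(N_CASES):
    # 1. + 2. Draw a historical effort for every remaining story and sum them up.
    total_effort = sum(
        random.choice(effort_buckets[size])
        for size, count in backlog.items()
        for _ in range(count)
    )
    # 3. Count how many sprints of feature work are needed to cover that effort.
    sprints, covered = 0, 0
    while covered < total_effort:
        covered += feature_hours(sprints)
        sprints += 1
    # 4. Note the result.
    results.append(sprints)
```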

In the end, we had the list of expected project lengths for 10 000 simulated cases, all based on the project’s historical data. The following visualisation shows the cumulative distribution function of the results:

[Visualisation: cumulative distribution function of the simulated project lengths]

On the vertical axis you can see the percentage of all the simulation cases. On the horizontal axis you can see how many sprints are needed to finish the remaining backlog. The visualisation should be read as follows: “If the vertical axis shows 33% and the horizontal axis shows 22 sprints, this means that in 33% of simulated cases the effort needed to finish the backlog was 22 sprints or less.”

As you can see, the chart allows us to make very quantifiable statements about the estimate. Not only can we tell that the median estimate is 23 sprints (which means that in 50% of cases the project will take less and in 50% of cases it will take more). We can also tell a lot about what is probable and what is not. We can say that there is a very small chance (less than 10%) of finishing the project in less than 20 sprints. On the other hand, we can say that there is a very high chance (more than 90%) of finishing the project in less than 26 sprints. This allows us to draw informed conclusions about the budget that must be secured for the project and about the contingencies that should be planned.
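These statements can be read straight off the simulated results; for example, with numpy on the results list from the sketch above:

```python
import numpy as np

outcomes = np.array(results)

print("P10:", np.percentile(outcomes, 10), "sprints")  # optimistic bound
print("P50:", np.percentile(outcomes, 50), "sprints")  # the median estimate
print("P90:", np.percentile(outcomes, 90), "sprints")  # high-confidence bound

# Probability of finishing within a given number of sprints:
print("P(<= 22 sprints):", (outcomes <= 22).mean())
```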

What if

Once again, the world had been saved. The conversation with the customer went marvellously. The estimates were understood and accepted. The team rushed into implementation.

However, I could not resist the feeling that we were lucky to have all that data. What if we hadn’t gathered them from the beginning? Could we still make better estimates without them? Unfortunately, the answer will have to wait for another post.

2 comments

  1. Great article!

    My 5 cents would be that, instead of having two variables, you end up with three.
    Statistically, the impact of capacity on velocity may be neglected if we are talking about a project of more than 10-ish sprints (if we have a sufficient track record and dismiss the extremes).

    Two points are worth bringing up though:
    1. the importance of backlog refinement (and breaking down larger USs)
    2. the futility of translating USs into hours, i.e. 10SP + 10SP != 20SP

    My conclusion would be that the more uniform (SP-wise) the backlog is, the better approximation one would get, no matter what method is used. The variance in the actual effort reported for similar (SP-wise, again) stories should dictate the frequency and depth of backlog refinement exercises (consider making this a KPI).

    We should also carefully consider giving up the velocity institution, as it may and does strongly support the team’s commitment.

    IMHO the method above would prove itself better in mid-term approximations, 3-5 sprints, great to validate and diagnose delivery issues.

    My personal favourite in terms of satisfying a client’s urge to manage the budget’s risk would be familiarising them with the MVP concept and the time-box model.

    Nevertheless, it was a pleasure to read!

    1. Thank you for all your suggestions! I will try to comment on each of them one by one.

      It is worth looking at capacity to make sure that we are not missing any unexpected factor. If the feature work capacity is stable, its impact is not that important. However, if the capacity has a great variance, we need to take this into consideration in our estimation calculation, as it will make our estimation range wider. Also, an increasing or decreasing trend in our feature work time should influence our calculation.

      I fully support the two points and your conclusion about backlog refinement.

      I also think that velocity is very important (e.g. for commitment). But we don’t have to use it for all purposes, like estimation. We can have both.

      The method will work if you have a well refined remaining backlog. Often, more than 3-5 sprints into the future, the stories are neither well understood nor estimated. In such cases the method will not help.

      Finally, MVP and timeboxing are great concepts. I also believe that most projects need a budget, not an estimate. But customers often ask for one anyway, and when they do, we should try to answer as well as we can.

      Thank you again for reading and commenting.
