The best estimates we have

It was a late winter afternoon. Most people had already left the office or were about to. I was heading to my lab after having a tasty late lunch in our cafeteria. (OK, I should call it just the late lunch.) I was hoping to get an hour or two of lone, quiet work time to finish one of my supreme visualisations. To my surprise, when I entered the lab I noticed that someone was in, sitting at my desk.

It was a man, around 50 years old. One could easily feel that he was a person of extraordinary intelligence and wisdom. And he reminded me of someone. After a moment of thought I was almost sure, so I said:

Me: You must be me, but from the future.

Stranger: Yes, you are right.

Me: Wow, I have so many questions…

Stranger: Be quiet and listen. I do not have much time here… or should I say – now. I came to warn you about a serious mistake that your company is about to make. In 2014 you started an important project. It was called “Project Wombat”.

Me: Yes, I know this project. It started about half a year ago. We are now about to provide estimates for the remaining work, before signing the long-term contract.

Stranger: This is why I came. The estimates you will provide will be terribly wrong. This will end the cooperation with the customer. What’s more, the entire company will feel hugely embarrassed. Many good people will leave and others will lose their spirit. In grief, many people will start to dress formally and stop making jokes, and the company atmosphere will die. It will take many years for the company to regain its strength.

Me: But we have our good way of estimating the remaining work in projects. We calculate an average of the story points delivered during the already finished iterations. Then we check how many story points there are still to deliver. We divide the two numbers and we know how many iterations we still need to finish.

Stranger: What? Are you a moron? You calculate an average and estimate one number? A man and a dog have 3 legs on average! One number… So tell me, what is the chance that you will hit exactly this estimate? What is the chance of hitting +10%? +20%? +50%? +200%? Is it probable? Is it impossible? By presenting one number you can do more harm to your customer than by telling them nothing. They might believe you and assume that the estimate is accurate.

Me: Well… When we need a range, we sometimes take two numbers: the number of story points delivered in the worst iteration and the number for the best iteration. We calculate two estimates of when we will finish: optimistic and pessimistic.

Stranger: Are you joking? And you call yourself a data scientist! So you take two completely unusual iterations and calculate two estimates that are almost impossible to be correct. You are modelling two situations where either everything goes ideally or everything goes terribly. It is not very useful, I must say. Maybe stop bothering with calculations and just say “we will finish between 0 and 1000 iterations”? You will give your customer a similar amount of value.

(I was starting to hate that guy. Probably sometime in the future something bad happened to him and he got bitter. I promised myself to watch out and not become like that.)

Me: So maybe you could just tell me what the correct estimates are?

Stranger: I can’t. It has something to do with preserving the time continuum or something. I cannot give you any precise information from the future. But I can motivate you to make better use of what you already have.

Me: Pity, I was hoping you would tell me which company I should invest in.

Stranger: What do you mean? Ah, you are talking about the stock market. A few years ago it was finally classified as gambling and forbidden in Europe and most of the US states. You can still invest in a little stock market in Las Vegas, though… Oh, it seems that I need to go. Listen, I have one very important piece of personal information for you. Remember, under no circumstances can you…

…and he disappeared.

After the initial shock, I realised that he was right. We should be able to produce better estimates than we had been providing so far. To do so, we would have to get to a lower level of detail and think about single user stories, not whole iterations. And we would have to think in terms of probability distributions, not single numbers. I started to broadly understand what needed to be done:

  1. We need to gather historical data about the team’s estimates and assess if they are good enough for drawing conclusions.
  2. We need to gather historical data about the amount of work the team was doing per iteration and check how variable it was.
  3. Finally, we need to use the historical data to estimate the effort needed to implement the remaining scope. And we need to visualise this properly.

Historical estimates

I was very lucky this time, as the person running Project Wombat had been gathering detailed project data from its beginning. Not only had he kept the initial estimates of all the user stories, he had also organised the project timesheet structure in such a way that it was possible to read how many hours of effort were spent on every single finished user story. The first data set we created was as follows:

Iteration | User story   | Estimate in SP | Effort spent in hours
Sprint 1  | User story 1 | 5              | 42
Sprint 1  | User story 2 | 3              | 30
Sprint 2  | User story 3 | 5              | 25
Sprint 2  | User story 4 | 1              | 6
As you can see, for every finished user story we gathered both the estimate in Story Points and the effort spent on implementing it (in hours, including bug fixing time). We also know in which sprint the story was done. This is very useful, as teams usually work much more slowly during the first few iterations (people learn the new domain, how to cooperate with new team members, etc.). As this initial lower efficiency should not happen in the following iterations of the project (unless the team composition changes significantly), the data from the initial iterations should be excluded from the data set. This was also the case in our project, so we excluded the user stories from the first three sprints.
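For illustration, here is a minimal sketch of this data preparation step in Python with pandas; the file name, the column names and the warm-up sprint list are assumptions made for the example, not our actual setup.

```python
import pandas as pd

# Hypothetical file and column names, chosen only for this example.
stories = pd.read_csv("finished_stories.csv")
# expected columns: iteration, story, estimate_sp, effort_hours

# Drop the warm-up iterations (here: the first three sprints), as the team's
# early pace is not representative of the rest of the project.
warmup = ["Sprint 1", "Sprint 2", "Sprint 3"]
history = stories[~stories["iteration"].isin(warmup)].copy()
```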

The next step was to assess if our historical data could be used to draw sensible conclusions about the future. To do this, we created a visualisation of the data:

V1 distribution good

On the graph you can see seven boxplots, one for each possible size of user story (the vertical axis shows: 1, 2, 3, 5, 8, 13 and 20 Story Points). On the horizontal axis we have worked hours. Each boxplot describes the distribution of the worked hours needed to finish stories of its size. E.g. the boxplot for the size 3 SP shows that among all the finished stories of this size, the minimum effort needed was 5 worked hours, the maximum was 30 hours, the median was 12 hours, and the 1st and 3rd quartiles were 9 and 18 hours.
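A chart like V1 can be drawn with a few lines of matplotlib. This is only a sketch, reusing the history data frame and the column names assumed in the snippet above:

```python
import matplotlib.pyplot as plt

# One box per story size: the distribution of hours spent on stories of that size.
sizes = sorted(history["estimate_sp"].unique())
hours_per_size = [history.loc[history["estimate_sp"] == s, "effort_hours"] for s in sizes]

fig, ax = plt.subplots()
ax.boxplot(hours_per_size, vert=False, labels=[f"{s} SP" for s in sizes])
ax.set_xlabel("Worked hours")
ax.set_ylabel("Story size")
plt.show()
```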

The first insight is obvious. Story points do not translate to a single number of hours. In fact, the spread of the distributions is significant. Usually, the maximum value for a story size is much bigger than the minimum value for the next story size. This is a bit surprising. However, despite the wide spreads, the distributions above suggest that the historical data are good enough to be used for future estimation. To describe what would not be good enough, have a look at one more visualisation, created for a different project:

V2 distribution bad

In this case the team did not have a sufficient understanding of the user stories when the estimates were created. The order of the distributions is wrong. Most of the “fives” are smaller than most of the “threes”. The “eights” have an enormous spread. Ones and twos are very similar. You could not base any sensible forecast on such data. The team should throw away all the high-level estimates of the remaining backlog and once again discuss and assign sizes to the stories. Proper data about the effort per size would be available after a few sprints at the earliest.

Fortunately, the data in our project were good enough. (The width of the spreads could have at least two causes: the team could have had too little information about the scope, or they were not yet good at estimating in this project. Additional analysis could help to distinguish between the two, but this is a separate story.) The next step was to look at the other side of the coin – the team’s capacity.

Historical capacity consumption

To the huge surprise of every fresh project manager (including me at one time), software project teams do not only write code (!). They spend time on meetings, communication, estimates, deployments, spikes, playing table soccer and much more. To understand how much time is really spent on user story work, you must gather proper data (believe me, you will be surprised when you do).

Once again we were lucky. Project Wombat’s PM had been gathering such information every sprint. The visualisation below shows how each sprint’s capacity was split across different categories.

V3 capacity

After the three initial iterations, the share of hours spent on user stories (Feature Work) became quite stable. The team was spending about 70% of their capacity on user story work (including development bug fixing). This number became the base for calculating the remaining project time. As we want to calculate how many sprints we need to finish the project, we have to know how much capacity we will have in each coming sprint and what part of it we will spend on user story development. When preparing the list of future sprint capacities, together with the PM we took into consideration all planned holidays, long weekends, trainings, company events, etc. and applied the 70% to the expected available hours. Finally, we got the following table:

Sprint | Expected hours of feature work
N + 1  | 550
N + 2  | 440
N + 3  | 560
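As a back-of-the-envelope check, the calculation behind the table is simply the gross capacity of each sprint (after holidays, trainings, etc.) multiplied by the historical feature-work share. The gross numbers below are illustrative, chosen only so that the result roughly reproduces the table:

```python
# Illustrative gross capacity per future sprint, after subtracting holidays,
# long weekends, trainings and company events (not the project's real figures).
gross_hours = {"N + 1": 786, "N + 2": 629, "N + 3": 800}

FEATURE_WORK_SHARE = 0.70  # historical share of capacity spent on user stories

feature_hours = {sprint: round(h * FEATURE_WORK_SHARE) for sprint, h in gross_hours.items()}
print(feature_hours)  # {'N + 1': 550, 'N + 2': 440, 'N + 3': 560}
```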

Simulation

Having the historical data prepared, we were ready to run the simulation. The remaining backlog was well understood: all the user stories were quite well broken down and had their sizes assigned (in Story Points). We gathered the following information about the remaining backlog:

Size (SP) | How many stories left
2         | 12
3         | 16
5         | 11

For the simulation we used the simple Monte Carlo method. We calculated 10 000 simulation cases using the following algorithm:

  1. For each story in the remaining backlog, randomly choose a historical effort (in hours) from the historical data for the proper story size bucket (e.g. for a 2 SP story we randomly choose an effort from the bucket of all finished 2 SP stories, for a 3 SP story – from the bucket of all finished 3 SP stories, etc.)
  2. Calculate the sum of the efforts chosen for all the remaining stories.
  3. Check how many sprints are needed to cover the sum of efforts with the expected feature work hours.
  4. Note the result.

(Such a simulation can be quickly implemented in any statistical package. What’s more, it can also be easily prepared in an Excel spreadsheet or any other calculation tool of your choice.)
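Here is a minimal sketch of such a simulation in Python. It assumes the history data frame from the earlier snippets and takes the backlog and the expected feature-work hours from the tables above; it is one possible implementation, not the exact script we used.

```python
import random

N_CASES = 10_000

# Historical efforts (hours) per story size, e.g. {2: [6, 9, ...], 3: [12, 25, ...], ...}
buckets = history.groupby("estimate_sp")["effort_hours"].apply(list).to_dict()

# Remaining backlog: story size -> number of stories left.
backlog = {2: 12, 3: 16, 5: 11}

# Expected feature-work hours per future sprint (N + 1, N + 2, ...); the last
# value is reused if a simulated case runs past the planned list.
sprint_hours = [550, 440, 560]

def sprints_needed(total_effort, capacities):
    """Count how many sprints it takes for the cumulative capacity to cover the effort."""
    covered, sprints = 0.0, 0
    while covered < total_effort:
        covered += capacities[min(sprints, len(capacities) - 1)]
        sprints += 1
    return sprints

results = []
for _ in range(N_CASES):
    # Steps 1-2: draw a historical effort for every remaining story and sum them.
    total = sum(random.choice(buckets[size])
                for size, count in backlog.items()
                for _ in range(count))
    # Steps 3-4: translate the total effort into a number of sprints and note it.
    results.append(sprints_needed(total, sprint_hours))
```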

In the end, we had the list of expected project lengths for 10 000 simulated cases, based on the project’s historical data. The following visualisation shows the cumulative distribution function for the results:

V4 simulation

On the vertical axis you can see the percentage of all the simulation cases. On the horizontal axis you can see how many sprints are needed to finish the remaining backlog. The visualisation should be read as follows: “If the vertical axis shows 33% and the horizontal axis shows 22 sprints, this means that in 33% of simulated cases the effort needed to finish the backlog was 22 sprints or less.”

As you can see, the chart allows us to make very quantifiable statements about the estimate. Not only can we tell that the median estimate is 23 sprints (which means that in 50% of cases the project will take less and in 50% of cases it will take more). We can also tell a lot about what is probable and what is not. We can say that there is a very small chance (less than 10%) of finishing the project in fewer than 20 sprints. On the other hand, we can say that there is a very high chance (more than 90%) of finishing the project in fewer than 26 sprints. This allows us to draw informed conclusions about the budget that must be secured for the project and about the contingencies that should be made.
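These numbers can be read straight off the simulation results. A sketch, reusing the results list from the simulation snippet above (the sorted results plotted against their cumulative share give the V4-style curve):

```python
import numpy as np

res = np.array(results)

median = np.percentile(res, 50)  # half of the cases finish within this many sprints
p10 = np.percentile(res, 10)     # only ~10% of cases finish this quickly or quicker
p90 = np.percentile(res, 90)     # ~90% of cases finish within this many sprints

print(f"median: {median:.0f}, 10th percentile: {p10:.0f}, 90th percentile: {p90:.0f}")

# The cumulative distribution chart is just the sorted results plotted against
# their cumulative share of all simulated cases.
xs = np.sort(res)
ys = np.arange(1, len(xs) + 1) / len(xs)
```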

What if

Once again, the world has been saved. The conversation with the customer went marvellously. The estimates were understood and accepted. The team rushed into implementation.

However, I could not resist the feeling that we had been lucky to have all that data. What if we hadn’t gathered it from the beginning? Could we still make good estimates without it? Unfortunately, the answer will have to wait for another post.

Brain aware stand-up

One of the perks of being a Data Scientist is that you are able to attend interesting conferences to look for new ideas and trends. I have just come back from ITxpo, Gartner’s major event for the IT world. The amount of stuff presented there was overwhelming. This time I would like to share with you just a few ideas from one particularly interesting session about the latest research on the human brain.

Brain at work

To put it simply, the human brain is not very good over long distances. The presented research shows that a regular, healthy adult can achieve around 3 hours of high-performance brain work per day. After using this time, the ability to focus on tasks drops, IQ decreases and you need more time to accomplish work items. I think all of us know the state when work just does not get done as quickly as it could.

Does “3 hours a day” sound bad? In reality it is even worse – there are many factors that can decrease even that, e.g.:

  • lack of sleep (how much sleep is not enough depends on the particular person, but a common “minimum level” is 6.5 hours),
  • lack of food or wrong food (one that increases your glucose level significantly),
  • stress, or a work environment that makes you feel threatened or disturbed (it was measured that one’s IQ drops by 10% for some time after each stressful situation – e.g. after a meeting with a manager, not even necessarily a negative one),
  • two days of very hard work (a few hours of overtime) in a row (also a 10% IQ drop was measured).

Luckily, there are also factors that can give you your 3 hours back:

  • physical exercise (e.g. 30 minute walk),
  • laugh (to relieve stress),
  • time to think.

Three things to start from

Based on the above, I want to propose three simple things that you can try yourself tomorrow.

  1. Move your stand-up / daily scrum / daily meeting to the end of the day.

Most teams nowadays run some kind of daily meeting in the morning to coordinate their work. Agile practices suggest this too. Unfortunately, for most people the high-performance hours occur in the morning. It would be best to spend this time solving the most difficult and complex tasks. Communication is not that demanding and can be done in the afternoon. There are also other reasons:

  • not all people start their work exactly at stand-up time; some come earlier and waste more high-performance time waiting, chatting or checking e-mail,
  • at the end of the day you should still be able to remember what you were doing during the day, so it is easier to properly choose your next actions.

For sure there are other advantages and disadvantages of afternoon stand-ups – please share your thoughts in the comments.

  2. If the weather allows, change your office meeting into a walking meeting.

This way you will not only regain some of your high-performance potential; all the participants will also reduce their stress levels a bit, so the meeting will be more effective. And if this is not possible (e.g. due to the arctic wind), you can always use the stairs instead of the elevator a few times a day.

  3. Shut down the evil candy machine.

Do not increase your glucose level by eating sweets. They will give you a short energy kick, but you will feel worse very quickly. Choose a piece of fruit or some vegetables instead. You can hand over all your chocolate bars to your local Operations Research lab. They will know how to cope with the threat.

Good Bye

I do not expect to survive long after this post. If I do not get lynched by the people defending their right to candy, I will surely get captured and tortured by the Agile Coaches for my heresy of moving the morning stand-up to the afternoon. At least I will have a chance for a little run while they are chasing me. So, in case you never hear from me again, Good Bye and Good Luck.

Team dilution (aka delusion)

We are sitting in an awkwardly quiet room. I know that I have just said something stupid. I know this from the look in his eyes. He is staring at me and thinking how to politely say that I’m obviously suffering from some kind of delusion. He knows that he must be polite, because this is a job interview and he is the one applying. So I try to think hard about what I’ve said, but I fail. And finally, he speaks: “Sir, you are trying to convince me to join your company: small, managed in a family way, non-corporate. But a nice lady from HR has just said that you have over 300 employees. That’s not a small company!” And I realise that he is right. In my mind, this is still a 30-person company fitting into one small open-space office. This has obviously changed. Yet, I feel that I am a bit right, too.

The growth

My company’s size has changed, but not in the way that I was afraid of. I’d had a chance to see a few big companies before I joined Objectivity. There was a lot of bureaucracy, politics, hierarchy and mediocrity. I never wanted to work in such a place again. I know that many of my colleagues thought this way, as we had come to Objectivity from bigger companies, looking for something different. Maybe this is why we were working hard not to lose the family-like, non-corporate style of the company. We were breaking hierarchical barriers, looking for the best ways to codify and share our culture, and even listed the “Seven deadly sins of corporations” to keep an eye on (there were 12 sins on the list, but 7 sounded better). Lately, we have even changed the way the company is organised, creating sub-companies that can work like small start-ups. As we have been fairly successful in our attempts, I keep thinking of my company as a small one, and I put myself in awkward positions during interviews.

But I don’t want to bore you to death by telling you what to do to reduce the impact of growth. I wanted to tell you about something much more interesting: how to measure the growth, so that you know how hard it will hit you (and where).

How to measure the growth

There are many HR measures that can help you look at company growth. You can look at turnover, team happiness, the percentage of hires from recommendations and many others. I want to show just two of them: the back-office size trend and team dilution.

The first one is fairly simple and shows what part of the organisation is not doing the core work (in my business: is not producing the software).

BackOffice v2

Many big organisations suffer from fast-growing administration. Management layers multiply and internal audit departments strengthen. The number of fancy but useless roles also increases (like these damn data scientists; they just walk around asking weird questions and showing their damn graphs). This simple trend allows you to keep an eye on this threat.

The second measure addresses a different problem, illustrated by the following conversation:

Person A: Hey, man. Do you know who I should tell that I have accidentally deleted one of the databases?

Person B: No, but you can always ask Bob. He is a veteran here – he joined the company more than 3 months ago.

How can a company keep its culture (and ways of doing things) when the people who are supposed to share it haven’t had enough time to absorb it themselves? When the company is growing, you will have new people in the team; it’s a fact. But you need to monitor whether this is getting out of proportion. And, as the problem will not be the same in all areas of the organisation, you had better know where it is worst.

Team dilution

This simple measure shows what percentage of employees in a given group joined the organisation less than 6 months ago. The measure is definitely not ideal. Time alone is not enough to absorb the culture; one also needs to get proper experience and to be open to it. Also, some people learn faster, others more slowly. People’s job histories differ, so their learning starting points differ. Additionally, you could argue whether 6 months is the proper time period. (In this case a truly scientific method was used. We wanted the threshold to be calculated in a rigorous and exact way, not just guessed. So we gathered the whole management in one room, everyone gave their expected number of months, and then we calculated the average in a rigorous and exact way.) Having said that, a simple measure is better than nothing and allows you to monitor the problem regularly without lengthy employee assessments. So we are using it in two perspectives:

a) per job role:

TD roles v2

Here you can see that the size of the problem differs depending on the role. (Also, what is worth noting – for Team Leaders we use a 12-month threshold instead of 6.) When the measure for one of the roles is going up, it is time to react: you can invest more in internal learning, you can ask HR to support leaders, you can codify more rules, etc. Anything that could mitigate the problem.

b) per project:

TD projects v2

Here we split each project’s effort into 3 groups: done by people who joined less than 6 months ago, less than 12 months ago, and earlier. Again, the distribution is never equal. You can react and support the endangered projects by adding veterans (at least part time), knowledge-sharing sessions with other teams, additional reviews, etc. It’s better than firefighting later.
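For the record, both views boil down to a couple of group-bys. Below is a sketch on a tiny, made-up employee extract; the column names, the dates and the use of headcount instead of logged effort hours (which the real per-project chart splits) are simplifications made for the example.

```python
import pandas as pd

TODAY = pd.Timestamp("2015-03-01")  # illustrative reference date for tenure

# Made-up extract: one row per employee with role, project and join date.
people = pd.DataFrame({
    "employee": ["A", "B", "C", "D", "E"],
    "role":     ["Developer", "Developer", "Tester", "Team Leader", "Developer"],
    "project":  ["Wombat", "Wombat", "Koala", "Wombat", "Koala"],
    "joined":   pd.to_datetime(["2014-12-01", "2013-05-10", "2015-01-15",
                                "2012-02-01", "2014-11-20"]),
})

months = (TODAY - people["joined"]).dt.days / 30.44  # approximate tenure in months

# a) Dilution per role: the share of people who joined less than 6 months ago.
people["new_joiner"] = months < 6
dilution_per_role = people.groupby("role")["new_joiner"].mean()

# b) Per project: split people into the three tenure buckets used on the chart.
people["tenure"] = pd.cut(months, bins=[0, 6, 12, float("inf")],
                          labels=["< 6 months", "6-12 months", "> 12 months"])
split_per_project = (people.groupby(["project", "tenure"], observed=False)
                           .size().unstack(fill_value=0))
```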

Other ideas

In the domain of HR there is a lot of space for thinking outside the box. If you have an idea on how to measure or assess the negative aspects of growth, could you share it in the comments? Help me keep my company “small”.

The Ultimate Data Question

It all starts very innocently. I have just finished a new visualisation presenting a historical view on some subject (let’s say, concerning the Catering Department). As with all good visualisations, it tells a story about what happened in the past (time spent on dishwashing is increasing rapidly) and allows one to draw conclusions about what could happen in the future (an occupational strike of the crew, involving the taking of hostages). I am excited, as the conclusions are surprising, revealing a problem or a threat. I invite the people that are interested in the subject; people that are responsible for the area in which the subject exists (the Head of Catering as well as a SWAT team representative) and people who could make a decision based on the new insight. I show the graphs, explain how to read them and where the data come from. Next, I tell them about my conclusions, trying to list all the assumptions that I have made. The audience take a moment to digest what they have just seen. I am extremely happy, expecting that in a moment a good, constructive discussion will start which will help the company to improve itself. And then, when change seems inevitable, the person responsible for the subject (Mr Stevens, Head of Catering) raises The Ultimate Data Question: “This is all nice, but are you absolutely sure about your data?” And, indeed, a discussion starts; however, it is not about improvement but about data quality. “Dear audience,” I think, “I thought that the quality of data is my part of the job – yours is to act on the conclusions!” So there I am, standing with a silly look on my face, wondering why it has turned out this way.

Why is the question asked?

There are legitimate reasons why the question is asked:

  • the person is very surprised by the conclusion and needs to be reassured before moving forward,
  • the person wants to know what probability there is that an incorrect conclusion has been drawn,
  • the person knows his business and thinks that if the conclusions were correct, they would have been discovered much earlier

Sometimes, there could also be a wrong reason:

  • the person wants to move the discussion from the inconvenient “what should be done with the conclusions” to the safe “how can the presenter prove that the data is correct”. Depending on the situation, the discussion can then turn into a very detailed review of the raw data and the meeting will finish without any action taken towards improvement.

Whatever the reason is, the question needs an answer.

What is the answer?

In most real-life situations the simple, short and true answer is “No”. Of course, there are areas where we can be sure, like when the data is recorded by a machine (e.g. the number of assembled parts or the number of bytes transferred); it is also sometimes possible to define the maximum error in the data, like when a physical process is measured and you know the accuracy of the measurement instruments. However, most of the interesting data (at least in my area) is produced either by people or in a process supported by people. In such cases, you may expect errors, and you really cannot be certain about the expected accuracy. As an example, let’s look at the reasons that could spoil the Catering Department’s data about the time spent on dishwashing (the list is definitely not complete):

  • people do not have ideal memories, so when they have to record information about the past, sometimes they are just guessing (“How much time have I spent yesterday on washing the dishes? Well, it could really be anything between 1 or 3 hours… let’s put in 2 hours”),
  • often people need to trick the faulty data gathering system (“I worked 5 hours but half an hour was a lunch break; there is no code for lunch in the system… so I will just add this to the smallest task, which is dishwashing”)
  • people do not always only work during work (“I worked 5 hours but I have spent half an hour talking with this nice lady from HR and I don’t want to reveal this; I will just add this to the smallest task – again, dishwashing”)
  • people are not always told why the data is important (“I will just log time anywhere, it’s quite probable that nobody looks at this anyway”)
  • sometimes people just do not give a shit (“I will just log time anywhere”)
  • people like to be funny (“I will report 6 hours on dishwashing, let’s see if someone notices…”)
  • people do make mistakes (2 and 3 are quite close on the keyboard)
  • people are not always trained properly (“It seems to be saving only when I put all the time on dishwashing… maybe this is the way it should be”)
  • people have different interpretations of one thing (“Does dishwashing include fetching dishes from the cafeteria or not?”)
  • people are over-worked (“Man, it’s the end of the month already and I haven’t filled in the time for a single working day… okay, don’t panic, how much time have I spent on dishwashing that Tuesday, 30 days ago?”)
  • people are “tired of these bloody data scientists that always want more and more data gathered and now I’m filling in these stupid spreadsheets instead of cooking penne all’arrabbiata. Curse them and their whole profession!”

Of course, most people will do their best to correctly record time or will raise a hand when facing troubles. What I want to say is that such errors can happen and you cannot really say you are absolutely sure the data is correct.

The problem with the simple, short and true answer is that it usually results in the rejection of all the conclusions – even if the impact of the questionable part of the data set is irrelevant to the conclusions. Why? Many of us feel uncomfortable in situations of uncertainty. The data is either ideal or has no value for us. What we do not always see is that rejecting the conclusions means keeping the status quo, which often is not supported by any current data (or could have been established a long time ago in different circumstances). The reason for this may be that it seems safer to act “as usual” than to risk a change influenced by uncertain data. Nevertheless, the truth is that a good opportunity for improvement could be lost this way.

What can you offer instead?

As you cannot say that the data is absolutely correct, and the simple, short and true answer is not helping either, is there a better line of discussion? There are a few possibilities of what you can add to the “No”:

  • If you have involved all the stakeholders in the data preparation and data review process, you can say that the data is indisputable.
  • You can describe what you have done to educate people involved in data gathering, and what has been done to make the data gathering tool as usable as possible.
  • You can describe how you have reviewed the data and how it was cleaned.
  • You can describe how you have identified and investigated outliers.
  • You can propose an experiment that will validate your conclusions.
  • You can propose an investment in a more automated way of measuring the data (if possible).
  • You can share your feeling of the expected error rate and its influence on the conclusions.
  • You can, in advance, state in your assumptions that the data is not ideally correct – this works better than stating it only when the question is asked.

Of course, in the end, the audience can still decide that the conclusions should not be accepted; that is their call to make, just as it is your role to honestly answer the Ultimate Data Question. What can help you is to be prepared for it, because it will inevitably be asked in the end.

Have a nice day… and don’t get captured in the cafeteria.

The Horror

This was a regular morning on a regular Wednesday. I was heading to my lab (which is really a standard office that I share with one very talkative guy… a “lab” sounds more professional, though), when I heard a heart-breaking cry from one of the rooms. When I looked inside I saw a terrible scene. One of my colleagues was shouting at his whiteboard, cursing it, its manufacturer, the markers, their manufacturers, and their entire families to the third generation. While doing this, he was furiously trying to wipe the whiteboard, without achieving any greater effect than stains on his shirt and hands. I realised that I had become a witness to the greatest drama that can happen in the office during daylight hours – permanent marker on a whiteboard.

The Horror

“The horror,” Kurtz said at the end of Heart of Darkness, “the horror”. I wonder what he would say if a whiteboard with a set of permanent markers were delivered to him in the jungle. That would be the true horror. I really cannot understand why they keep producing them. A whiteboard marker can be used both on a whiteboard and on paper. A permanent one you can use on paper, but it will screw you over when you use it on a whiteboard. If you want to make the world a little better place, I urge you to join me in my ITAATPMYF quest (Immediately Throw Away All The Permanent Markers You Find). Ufff…

Coming back to the story. When I saw the poor guy I expressed my condolences and invited him to join the ITAATPMYF quest. To my surprise, he said that he had not been using a permanent marker, but a whiteboard one. I had to check for myself, and indeed, only part of the ink was coming off when wiped. It’s true, the eraser was not a premium one, but it still should have worked! What the hell?

I realised that this was exactly the kind of case for the Operations Research department. I put on my white coat and focused on this as a matter of priority. After a short interrogation it was revealed that the whiteboard drawing had been made two weeks earlier, with a regular black Pentel marker. I started to wander around the office in search of old drawings. Whenever I spotted one, I erased it immediately (some people tried to stop me, but they clearly underestimated the power of a determined scientist). Some drawings were easy to erase, others didn’t come off at all. It was difficult to draw any conclusions, as people, instead of providing useful facts about which markers were used and the age of the drawings, were usually babbling something about their lost whiteboard notes. It was clear that I needed to run an independent experiment.

The Experiment

I gathered all the kinds of whiteboard markers that were available in our company. I assumed that the colour of a marker does not make any difference (which should be verified in separate research). I chose a low-end eraser. I prepared some typical whiteboard space, far from windows, air conditioners, etc. The experiment started with the following board:

start

I drew 5 lines, each using a different kind of marker: Pentel, Titanum, Edding, Staedtler and Pentel Maxiflo. Then I erased a part of each line after certain periods of time: two days, 1 week, 2 weeks and finally 3 weeks. I erased using moderate force, wiping only once.

In the beginning, all the lines were easy to erase. All the markers came off completely after two days. The same occurred after one week, maybe with a small shade of ink left for Pentel and Titanum. After two weeks, it started to get a bit more interesting:

mid

As you can see, despite wiping, both the Pentel and Titanum lines stayed on the whiteboard (with the Pentel line a bit paler). You can also see a light shade of the Pentel Maxiflo line. The Edding and Staedtler lines were fully removed. And here you can see the final result:

end

After three weeks Pentel Maxiflo was also difficult to wipe. Edding and Staedtler performed the best, with a small advantage going to Staedtler, as Edding left a minor trace of the last piece of the line.

Conclusions

As I am expecting to be sued by the marker manufacturers within the next 24 hours, I need to admit that the markers are not the only factor in the experiment. It could be that the results are valid only for my particular whiteboard or my particular eraser. The impact of colour would also have to be considered. However, as I chose the most common board and eraser in my company, my conclusion is that using red Staedtler markers would save us from quite a few office dramas in the future.

And remember, ITAATPMYF.

Trial and error – performance comparison

“I have invented a great visualisation that will help you compare the system’s performance between two configurations,” I said. And I was wrong three times in this single sentence. First of all, I had not invented it, but had probably seen it somewhere else. Secondly, my visualisation was definitely not great. And finally, it did not help much either. What is worse, I repeated this sentence a couple of times during my journey of trial and error, while presenting new versions of my visualisation to the team, ending up with something that, maybe, is just a bit closer to the truth. But let’s start from the beginning…

The beginning

I was informed that one of the teams was struggling with the problem of having too much data. They were trying to optimise the performance of a system that, when deployed, would be used by 1200 concurrent users. To decide whether a certain optimisation change made in the system resulted in a performance improvement, they had to easily compare two systems: before and after the change. To measure the system’s performance, they had prepared a set of performance test scenarios simulating the expected interactions of future users with the system. As a result, they got 23 test scenarios that could be run multiple times in parallel, simulating the future system load. During such a simulation (which usually took 3 hours of scenarios running repeatedly), they used the JMeter tool to collect the response times of all the users’ actions. And this is where they faced the problem. In the 23 test scenarios they had 278 different actions. Each action could be executed a couple of million times. A single test run was producing a great amount of data. Comparing two tests was difficult.

Thanks to my sixth sense I realised that this was exactly the kind of problem the Operations Research department could help solve. I put on my white coat and rushed to the upper floor. On the scene I met a few guys debating the performance runs that had just finished. After some introductions, the conversation started to follow this pattern:

First team member: “For most of the actions, you can see that their average response times in the system with the latest optimisation are shorter than the same averages in the system without the optimisation. So we can say that the optimisation improves the system performance.”

Me: “You are right.”

Second team member: “But if you take the longest actions, which are the ones that need improving the most, you can see that their average response time is worse with the latest optimisation implemented. So we can say that it does not improve the system performance.”

Me: “You are right.”

Third team member: “But we should focus on the actions from the most important test scenario. This will tell us if the optimisation is improving the system performance.”

Me: “You are right.”

All the team members: “But we all cannot be right together.”

Me: “You are right.”

After losing most of their respect I could, without further delay, focus on the task of creating the proper visualisation. (I lost the rest of their respect by suggesting that the best and easiest optimisation would be to prevent these damn users from logging in.) The ideal visualisation would help them to easily understand which actions gain performance, which lose it, and to what extent. And so the journey began.

Trial and error

My first attempt was pathetic. See for yourself:

Post_2_vis1

Each graph shows a comparison made for one optimisation change. On the horizontal axis we have the two system configurations being compared. On the vertical axis we have the average response time in seconds. Each action is represented by a line between two points. As there are multiple actions, you cannot see the comparison for each of them, as one hides another. But you can try to gain a high-level understanding of the comparison of the two runs – if most of the lines are rising, then the first run had better performance. To sum up – this was poor and I hated it. So I had to start from scratch AGAIN.

Post_2_vis2

The major concept of the new visualisation was to use a scatter plot, where each action is represented by a point. On the horizontal axis you have the action’s response time from the first run, on the vertical axis – from the second run. All the points above the 45-degree diagonal mean that the first run showed better performance, and vice versa. Additionally, you can see which actions – shorter or longer – are better in which run.

(One note about the statistics used. Instead of analysing the averages of the actions’ response times, we started to focus on the 90th centiles. This means that each action is represented by a time that is longer than 90% of all that action’s execution response times. The reason was to make sure that most of the users’ interactions with the system perform properly.)
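Computing that per-action statistic is a one-liner once the raw samples from a run are in a data frame; a sketch with assumed file and column names, not the team’s actual JMeter output format:

```python
import pandas as pd

# Raw samples from one test run: one row per executed action
# (file and column names are assumed for this example).
samples = pd.read_csv("run1_results.csv")  # columns: action, response_time_ms

# 90th percentile of the response time per action, instead of the plain average.
p90_run1 = samples.groupby("action")["response_time_ms"].quantile(0.90)
```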

The problem with this visualisation is that most of the points are gathered in a small area in the bottom left corner, with only the few longest actions in the top right corner. The second problem is that you cannot easily read how big the difference between the runs is for an action (in relative terms, e.g. in %). So, AGAIN.

Post_2_vis3

As you can see, the axes are now on a logarithmic scale, which makes better use of the chart area. The lines marking +20%, +50% and +100% now help to assess how big the differences between the runs are. The problem with this visualisation is that most of the chart area is still unused, as even a difference as big as 100% is still very close to the diagonal. AGAIN.

Post_2_vis4

The concept was to rotate the diagonal so that it becomes the horizontal axis. On it we have the action’s response time from the first run, still on a logarithmic scale. On the vertical axis we can see how different the response time in the second run was.

The vertical value is the relative change: (response time in the first run – response time in the second run) / response time in the second run. Actions that are faster in the second run (after applying the optimisation change) are above the horizontal axis, the slower ones are below. The whole chart area is used now, and it is much easier to read precise values. However, there is a problem with the vertical axis. It is not symmetrical. If the second run is faster, the points have values from 0 to +infinity. If it is slower, the points have values from 0 to –100%. The visualisation deceives the viewer in favour of the second run. AGAIN.

Post_2_vis5

This is my latest visualisation. The major change here is that the values on the vertical axis are symmetrical.

Post_2_equation2

Additionally, the actions from the most important test scenario were marked in red. As a result, we have a way to show which actions gain and which lose from applying an optimisation. We can also assess the size of the change. And hopefully, we can make better decisions about the system’s development.
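Since the exact formula did not survive into this page, here is a sketch of how such a chart could be built, assuming per-action 90th percentiles p90_run1 and p90_run2 computed as above, and using one possible symmetric definition of the change (the slower-to-faster ratio minus one, signed by which run won); the real formula may differ in detail.

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-action 90th-percentile response times from the two compared runs
# (pandas Series indexed by action name, as computed earlier).
t1, t2 = p90_run1.align(p90_run2, join="inner")

# Asymmetric version (the fourth chart): how much faster the second run is,
# relative to the second run. It ranges from -100% to +infinity.
asym = (t1 - t2) / t2 * 100

# One possible symmetric version: slower-to-faster ratio minus one, signed so
# that improvements in the second run are positive. A 2x speed-up and a 2x
# slow-down both land at +/-100%.
sym = np.where(t2 <= t1, t1 / t2 - 1, -(t2 / t1 - 1)) * 100

fig, ax = plt.subplots()
ax.scatter(t1, sym, s=10)
ax.set_xscale("log")          # first-run response time on a logarithmic scale
ax.axhline(0, linewidth=0.8)  # the rotated diagonal: no change between runs
ax.set_xlabel("90th percentile response time, first run")
ax.set_ylabel("Relative change in the second run (%)")
plt.show()
```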

The end?

This was my way of learning the lesson stated by Edward Tufte in the classic The Visual Display of Quantitative Information: the process of repeatedly reviewing a graphical design is crucial to achieving a satisfactory result. In this process, it is important to get constructive feedback after each iteration, which in my case was provided in strong words by the team members (and for which I am grateful). Thanks to this, by repeatedly applying major and minor changes, I was able to improve my visualisation a lot. That does not mean the visualisation has no drawbacks. First of all, the logarithmic scale is not very intuitive and can be deceptive. Secondly, we are only looking at the 90th centile, not the whole distribution. There are probably many other problems for which I haven’t received any feedback yet.

Dear Reader, I would also be grateful for your feedback. Don’t hesitate – leave a comment or send me a message.

I’m a Data Scientist

“So what do you do for a living nowadays? Experimenting on mice?” This accusation, stated during the Annual-Day-of-Misery (sometimes referred to as “aunt Suzan’s birthday family reunion”), brought home to me the fact that my profession is not well known. Yes, that’s right. I’m a data scientist. I’m working in the Operations Research department of a Polish/UK mid-size software house. You could say that it’s not that unusual, but on our side of the historical Iron Curtain it still is. Such departments are rare and such positions are uncommon. So, additionally motivated by the fresh label of a person who mistreats animals (as my family has only recently acknowledged my previous domain – software development: people fixing routers, configuring laptops and creating these funny web pages with these sweet little cats), I’ve decided to create a blog describing what we do here on a daily basis. The main goal is to create interest and drag more people into the field of OR. And the second goal is to get peer review from others who already are experts.

Why unusual

So why is it rare to see an OR department in Poland (and I think the situation is quite similar in the other “off-shoring / near-shoring” destinations like India, Romania, Ukraine, Brazil, etc.)? To simplify (outrageously), we have two kinds of companies here. The first group are locally owned: either privatised, huge nation-wide companies struggling not to collapse, or fairly young companies grown from family businesses. Usually, they do not have much experience in successfully investing in research. They focus on day-to-day operations, trying to survive and prosper in a difficult market, not thinking much about data research. The universities are not helping either, as they are not used to cooperating with business, nor is local business used to cooperating with the universities (and the paperwork at the universities is just horrifying). Only a small part of this first group of companies decides to create OR departments. This will surely get better in the future, as these companies mature.

The second group are foreign companies that have decided to open an office in Poland to benefit from the “off-shoring / near-shoring” model. Many of them already have good experience of investing in research. However, the model does not include moving the research departments away from their headquarters. Most of these companies just move the production positions (in the software domain this would be developers, testers and support engineers, together with the necessary line management). This means that in these companies, too, it is unusual to find an OR department in Poland.

Having said that, I was unbelievably lucky to join a company which is not afraid to locate any department here, breaking the established ways of doing business. And as a result, I’m a data scientist now.

Why the blog

As this is an unusual role around here, I thought that I could make it a little better understood (especially as there is huge hype around the role, with a trend to tie it exclusively to solving Big Data problems). Hopefully this will attract the attention of smart, data-literate, business-aware people who could one day decide to choose “data science” as their professional path. And this will surely result in (a) more companies bringing more interesting work up here and (b) more bright people to cooperate with.

To help achieve this, I will describe in my blog what problems we are trying to solve on a daily basis, broadly in the domains of data analysis, data visualisation, solving business problems using data-driven models, defining and gathering business process measures (ordinary and extraordinary), etc. I’m hoping to get a fair amount of constructive criticism from the blog’s readers. I will also treat the blog as a form of free therapy, so you are reading this at your own risk.

I plan to publish 1-2 posts monthly – make that 1 post monthly. I plan to answer all questions from readers within two business days at the latest. I plan not to think about the fact that 95% of blogs get abandoned. I hope you will enjoy it!

 


BTW. The company name is Objectivity. We develop and maintain software for UK and German customers. We have offices in Wroclaw/Poland and Coventry/UK. (We have a good laugh every time someone goes to the UK office, as there is a British idiom, “sent to Coventry”.) Our way of doing business is: being agile, hiring only professionals and being painfully honest with our customers and employees. It seems to be working well, so, against all odds, maybe there is some hope for this world after all.