Trial and error – performance comparison

“I have invented a great visualisation that will help you compare the system’s performance between two configurations,” I said. And I was wrong three times in this single sentence. First of all, I had not invented it; I had probably seen it somewhere else. Secondly, my visualisation was definitely not great. And finally, it was not helping much either. What is worse, I repeated this sentence a couple of times during my journey of trial and error, each time I presented a new version of my visualisation to the team. The result is something that, maybe, is just a bit closer to the truth. But let’s start from the beginning…

The beginning

I was informed that one of the teams was struggling with the problem of having too much data. They were trying to optimise the performance of a system that, when deployed, will be used by 1200 concurrent users. To decide whether a certain optimisation change resulted in a performance improvement, they needed an easy way to compare two systems: before and after the change. To measure the system performance, they had prepared a set of performance test scenarios that simulated the expected interactions of future users with the system. As a result, they had 23 test scenarios that could be run multiple times in parallel, simulating the future system load. During such a simulation (which usually took 3 hours of repeatedly running the scenarios), they used the JMeter tool to collect the response times of all the users’ actions. And this is where they faced the problem. The 23 test scenarios contained 278 different actions. Each action could be executed a couple of million times. A single test run produced a great amount of data, and comparing two test runs was difficult.

Thanks to my sixth sense, I realised that this was exactly the kind of problem that the Operations Research department could help to solve. I put on my white coat and rushed to the upper floor. On the scene I met two guys debating the performance runs that had just finished. After some introductions, the conversation began to follow this pattern:

First team member: “For most of the actions, you can see that their average response times in the system with the latest optimisation are shorter than the same averages in the system without the optimisation. So we can say that the optimisation improves the system performance.”

Me: “You are right.”

Second team member: “But if you take the longest actions, which are the ones that need improving the most, you can see that their average response time is worse with the latest optimisation implemented. So we can say that it does not improve the system performance.”

Me: “You are right.”

Third team member: “But we should focus on the actions from the most important test scenario. This will tell us if the optimisation is improving the system performance.”

Me: “You are right.”

All the team members: “But we cannot all be right at the same time.”

Me: “You are right.”

After losing most of their respect I could, without further delay, focus on the task of creating the proper visualisation. (I have lost the rest of their respect by suggesting that the best and the easiest optimisation would be to prevent these damn users from logging in.) The ideal visualisation would help them to easily understand which actions are gaining performance, which are losing and to what extent. And so the journey begins.

Trial and error

My first attempt was pathetic. See for yourself:


Each graph shows a comparison made for one optimisation change. On the horizontal axis we have the two system configurations that we are comparing. On the vertical axis we have the average response time in seconds. Each action is represented by a line between two points. As there are multiple actions, you cannot see the comparison for each of them, because one hides another. But you can try to gain a high-level understanding of how the two runs compare: if most of the lines are rising, then the first run had better performance. To sum up, this was poor and I hated it. So I had to start from scratch AGAIN.


The major concept of the new visualisation was to use a scatter plot, where each action would be represented by a point. On the horizontal axis you have the action’s response time from the first run, and on the vertical axis its response time from the second run. A point above the 45-degree diagonal means that the action took longer in the second run, so the first run showed better performance; a point below the diagonal means the opposite. Additionally, you can see which actions, shorter or longer, are better in which run.

(One note about the statistics used. Instead of analysing the averages of the actions’ response times, we started to focus on the 90th centiles. This means that each action is represented by the response time that 90% of its executions did not exceed. The reason was to make sure that most of the users’ interactions with the system perform properly.)
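To make the statistic concrete, here is a minimal sketch of computing the 90th centile per action. It assumes the samples are already available as (action label, response time) pairs and uses the nearest-rank definition of a centile; the function names and the sample data are made up for illustration, and other centile definitions (for example with interpolation) would give slightly different numbers.

```python
from collections import defaultdict


def percentile_nearest_rank(values, pct=90):
    """Smallest observed value that at least pct% of the values do not exceed.

    pct is an integer percentage; integer arithmetic avoids rounding surprises.
    """
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100), 1-based
    return ordered[max(rank, 1) - 1]


def p90_per_action(samples):
    """Group (label, response_time) pairs by label and take the 90th centile."""
    by_action = defaultdict(list)
    for label, elapsed in samples:
        by_action[label].append(elapsed)
    return {label: percentile_nearest_rank(times) for label, times in by_action.items()}


# Hypothetical data: ten "login" samples of 1..10 seconds, two "search" samples.
samples = [("login", t) for t in range(1, 11)] + [("search", 5.0), ("search", 7.0)]
p90 = p90_per_action(samples)
print(p90)
```

With a couple of million executions per action, a single slow outlier barely moves this value, while a general slowdown does.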

The problem with this visualisation is that most of the points are gathered in a small area in the bottom left corner, with only a few of the longest actions in the top right corner. The second problem is that you cannot easily read how big the relative difference (e.g. in %) between the runs is for an action. So, AGAIN.


As you can see, the axes now use a logarithmic scale, which makes better use of the chart area. The lines marking +20%, +50% and +100% now help to assess how big the differences between the runs are. The problem with this visualisation is that most of the chart area is still unused, as even a difference as big as 100% is still very close to the diagonal. AGAIN.
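Why the percentage lines hug the diagonal can be shown with a little arithmetic. On log-log axes the line y = (1 + p) · x becomes log y = log x + log(1 + p), that is, a line parallel to the diagonal at a constant offset. The sketch below computes that offset in decades (powers of ten); the helper name is made up for illustration.

```python
import math


def log_offset(percent_change):
    """Distance, in decades, between the diagonal and a +percent_change line
    when both axes use a base-10 logarithmic scale."""
    return math.log10(1 + percent_change / 100)


offsets = {p: log_offset(p) for p in (20, 50, 100)}
for p, off in offsets.items():
    print(f"+{p}% line sits {off:.3f} decades above the diagonal")
```

Even the +100% line is only about 0.3 of a decade away from the diagonal. If the axes span, say, three decades (an assumed range, from 0.1 s to 100 s), that is roughly a tenth of the chart height.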


The concept was to rotate the diagonal, so that it becomes the horizontal axis. There we have the action’s response time from the first run, still on a logarithmic scale. On the vertical axis we can see how different the response time in the second run was.

Actions that are faster in the second run (after application of the optimisation change) are above the horizontal axis, and the slower ones are below. The whole chart area is used now, and it is much easier to read precise values. However, there is a problem with the vertical axis. It is not symmetrical. If the second run is faster, the points have values from 0 to +infinity. If it is slower, the points have values from 0 to –100%. The visualisation deceives the viewer in favour of the second run. AGAIN.
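The asymmetry can be checked with a short calculation. As a sketch, assume the vertical value is the difference between the runs divided by the second run’s time, (t_first − t_second) / t_second. This formula is inferred from the ranges quoted above (0 to +infinity when the second run is faster, 0 to –100% when it is slower), and the function name is illustrative.

```python
def relative_gain(t_first, t_second):
    """Difference between the runs as a fraction of the second run's time.

    Positive when the second run is faster, negative when it is slower.
    Assumed formula, inferred from the value ranges described in the text.
    """
    return (t_first - t_second) / t_second


twice_as_fast = relative_gain(2.0, 1.0)  # second run halves the response time
twice_as_slow = relative_gain(1.0, 2.0)  # second run doubles the response time
print(f"twice as fast: {twice_as_fast:+.0%}, twice as slow: {twice_as_slow:+.0%}")
```

A change by the same factor of two lands at +100% in one direction and at only –50% in the other, so improvements look twice as large as equally sized regressions.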


This is my latest visualisation. The major change here is that the values on the vertical axis are symmetrical. Additionally, the actions from the most important test scenario are marked in red. As a result, we have a way to show which actions gain and which lose by applying an optimisation. We can also assess the size of the change. And hopefully, we can make better decisions about the system’s development.
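The exact formula behind the symmetric axis is not spelled out in the text here, so what follows is only a sketch of two common ways to get such symmetry, both of them assumptions rather than the definition used in the chart. The first takes the logarithm of the ratio of the two times; the second mirrors the relative difference so that a slowdown is measured against the first run. All names are illustrative.

```python
import math


def log_ratio(t_first, t_second):
    """Base-2 logarithm of the speed-up: +1 means twice as fast, -1 twice as slow."""
    return math.log2(t_first / t_second)


def mirrored_relative_gain(t_first, t_second):
    """Relative difference measured against the shorter of the two times.

    A speed-up by factor k gives +(k - 1); a slowdown by factor k gives -(k - 1).
    """
    if t_second <= t_first:
        return t_first / t_second - 1
    return -(t_second / t_first - 1)


for t_first, t_second in [(2.0, 1.0), (1.0, 2.0)]:
    print(t_first, t_second,
          log_ratio(t_first, t_second),
          mirrored_relative_gain(t_first, t_second))
```

With either measure, halving and doubling the response time end up at the same distance from the horizontal axis, on opposite sides.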

The end?

This was my way of learning the lesson stated by Edward Tufte in the classic The Visual Display of Quantitative Information: repeatedly reviewing and revising a graphical design is crucial to achieving a satisfactory result. In this process, it is important to get constructive feedback after each iteration, which in my case was provided in strong words by the team members (and for which I am grateful). Thanks to this, by repeatedly applying major and minor changes, I was able to improve my visualisation a lot. This does not mean that the visualisation has no drawbacks. First of all, the logarithmic scale is not very intuitive and could be deceiving. Secondly, we are only looking at the 90th centile, not the whole distribution. There are probably many other problems about which I haven’t had any feedback yet.

Dear Reader, I would also be grateful for your feedback. Don’t hesitate to leave a comment or send me a message.