It all starts very innocently. I have just finished a new visualisation presenting a historical view on some subject (let’s say, concerning the Catering Department). As with all good visualisations, it tells a story about what happened in the past (time spent on dishwashing is increasing rapidly) and allows one to draw conclusions about what could happen in the future (an occupational strike of the crew, involving the taking of hostages). I am excited, as the conclusions are surprising and reveal a problem or a threat.
I invite the people who are interested in the subject, the people who are responsible for the area in question (the Head of Catering as well as a SWAT team representative) and the people who could make a decision based on the new insight. I show the graphs, and explain how to read them and where the data comes from. Next, I tell them about my conclusions, trying to list all the assumptions that I have made. The audience takes a moment to digest what they have just seen. I am extremely happy, expecting that in a moment a good, constructive discussion will start which will help the company to improve itself.
And then, when change seems inevitable, the person responsible for the subject (Mr Stevens, Head of Catering) raises The Ultimate Data Question: “This is all nice, but are you absolutely sure about your data?” And, indeed, a discussion starts; however, it is not about improvement but about data quality. “Dear audience,” I think, “I thought that the quality of the data is my part of the job; yours is to act on the conclusions!” So there I am, standing with a silly look on my face, wondering why things have turned out this way.
Why is the question asked?
There are legitimate reasons why the question is asked:
- the person is very surprised by the conclusion and needs to be reassured before moving forward,
- the person wants to know what probability there is that an incorrect conclusion has been drawn,
- the person knows the business and thinks that if the conclusions were correct, they would have been discovered much earlier.
Sometimes, there could also be a wrong reason:
- the person wants to move the discussion from the inconvenient “what should be done with the conclusions” to the safe “how can the presenter prove that the data is correct”. Depending on the situation, the discussion can then turn into a very detailed review of the raw data, and the meeting will finish without any action taken to improve anything.
Whatever the reason is, the question needs an answer.
What is the answer?
In most real-life situations the simple, short and true answer is “No”. Of course, there are areas where we can be sure, like when the data is recorded by a machine (e.g. the number of assembled parts or the number of bytes transferred); it is also sometimes possible to define the maximum error in the data, like when a physical process is measured and you know the accuracy of the measurement instruments. However, most of the interesting data (at least in my area) is produced either by people or in a process supported by people. In such cases, you should expect errors, and you really cannot be certain how accurate the data is. As an example, let’s look at the reasons that could spoil the Catering Department data about the time spent on dishwashing (the list is definitely not complete):
- people do not have ideal memories, so when they have to record information about the past, sometimes they are just guessing (“How much time did I spend yesterday on washing the dishes? Well, it could really be anything between 1 and 3 hours… let’s put in 2 hours”),
- often people need to trick the faulty data gathering system (“I worked 5 hours but half an hour was a lunch break; there is no code for lunch in the system… so I will just add this to the smallest task, which is dishwashing”),
- people do not always spend all their working time working (“I worked 5 hours but I spent half an hour talking with this nice lady from HR and I don’t want to reveal this; I will just add this to the smallest task, which is, again, dishwashing”),
- people are not always told why the data is important (“I will just log time anywhere, it’s quite probable that nobody looks at this anyway”)
- sometimes people just do not give a shit (“I will just log time anywhere”)
- people like to be funny (“I will report 6 hours on dishwashing, let’s see if someone notices…”)
- people do make mistakes (2 and 3 are quite close on the keyboard)
- people are not always trained properly (“It seems to be saving only when I put all the time on dishwashing… maybe this is the way it should be”)
- people have different interpretations of one thing (“Does dishwashing include fetching dishes from the cafeteria or not?”)
- people are overworked (“Man, it’s the end of the month already and I haven’t filled in the time for a single working day… okay, don’t panic, how much time did I spend on dishwashing that Tuesday, 30 days ago?”)
- people are “tired of these bloody data scientists that always want more and more data gathered and now I’m filling in these stupid spreadsheets instead of cooking penne all’arrabbiata. Curse them and their whole profession!”
Of course, most people will do their best to record their time correctly, or will raise a hand when they run into trouble. What I want to say is that such errors can happen, and you cannot really say you are absolutely sure the data is correct.
The problem with the simple, short and true answer is that it usually results in the rejection of all the conclusions – even if the impact of the questionable part of the data set is irrelevant to the conclusions. Why? Many of us feel uncomfortable in situations of uncertainty. The data is either ideal or has no value for us. What we do not always see is that rejecting the conclusions means keeping the status quo, which often is not supported by any current data (or could have been established a long time ago in different circumstances). The reason for this may be that it seems safer to act “as usual” than to risk a change influenced by uncertain data. Nevertheless, the truth is that a good opportunity for improvement could be lost this way.
What can you offer instead?
As you cannot say that the data is absolutely correct, and the simple, short and true answer does not help either, is there a better line of discussion? There are a few things you can add to the “No”:
- If you have involved all the stakeholders in the data preparation and data review process, you can say that the data is indisputable.
- You can describe what you have done to educate people involved in data gathering, and what has been done to make the data gathering tool as usable as possible.
- You can describe how you have reviewed the data and how it was cleaned.
- You can describe how you have identified and investigated outliers.
- You can propose an experiment that will validate your conclusions.
- You can propose an investment in a more automated way of measuring the data (if possible).
- You can share your estimate of the expected error rate and its influence on the conclusions.
- You can, in advance, state in your assumptions that the data is not ideally correct; this works better than admitting it only when the question is asked.
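One way to back up a statement about the expected error rate and its influence on the conclusions is a small sensitivity check: distort the data many times with the kind of error you suspect and count how often the conclusion survives. The sketch below is only an illustration under my own assumptions. The monthly figures are made up, the function names are hypothetical, and the error model (every data point is off by up to half an hour in either direction, independently) is a guess, not a measurement.

```python
import random
from statistics import mean


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def share_of_rising_trends(values, max_error, trials=1000, seed=0):
    """Fraction of noisy copies of `values` whose trend is still rising.

    Each copy adds an independent uniform error in [-max_error, +max_error]
    to every data point. This error model is an assumption.
    """
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    rising = 0
    for _ in range(trials):
        noisy = [v + rng.uniform(-max_error, max_error) for v in values]
        if slope(noisy) > 0:
            rising += 1
    return rising / trials


# Made-up monthly averages of hours per day spent on dishwashing.
dishwashing_hours = [1.0, 1.2, 1.1, 1.5, 1.8, 2.1, 2.0, 2.6]

robustness = share_of_rising_trends(dishwashing_hours, max_error=0.5)
print(f"Trend still rising in {robustness:.0%} of the noisy scenarios")
```

If the trend keeps rising in nearly all of the distorted scenarios, you can tell the audience that the conclusion does not depend on the data being perfect. If it does not, that is worth knowing before Mr Stevens asks.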
Of course, in the end, the audience can still decide that the conclusions should not be accepted; it is their place to decide, just as your role is to answer the Ultimate Data Question honestly. What can help you is to be prepared for it, because sooner or later it will inevitably be asked.
Have a nice day… and don’t get captured in the cafeteria.