Monday, May 11, 2020

An Introduction to Data Analysis and Models


I graduated from Lakehead University in 1977, and ever since my career has focused on data collection, management, and analysis, with the last 25 or so years dealing primarily with computer modelling. As that topic has been much in the news for some time now, I thought it might be helpful to explain to the lay person some of its characteristics, as well as its strengths and weaknesses.
When dealing with digital data a primary factor can be summarised thusly: garbage in, garbage out. In other words, if you use poor data, do not expect anything but poor results. What makes data “poor”, you ask? Combining data of different quality is one example. So is using data from different sources, collected using different methods, or with inconsistent data coverage (data clustering). All of these introduce severe biases.
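To make the clustering point concrete, here is a small Python sketch using made-up readings from two hypothetical sites: when one location is sampled far more often than another, a naive average of all the readings gets dragged toward the heavily sampled spot.

```python
# Hypothetical readings from two sites; all numbers are invented for illustration.
from statistics import mean

readings = {
    "site_A": [4.1, 4.3, 4.2, 4.0, 4.2],  # heavily sampled location
    "site_B": [9.8],                      # sparsely sampled location
}

# Naive pooled mean: every reading counts equally, so site_A dominates.
pooled = mean(v for values in readings.values() for v in values)

# Declustered mean: average each site first, then average the site means,
# so each location carries equal weight regardless of how often it was sampled.
declustered = mean(mean(values) for values in readings.values())

print(f"pooled mean:      {pooled:.2f}")       # about 5.1, pulled toward site_A
print(f"declustered mean: {declustered:.2f}")  # about 7.0
```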

Once a sample has been collected it may be sent to a laboratory for analysis. Here we add another layer of biases that complicate the quality of the data set, as there are a few different methods available and all have strengths and weaknesses, especially when what we are looking for is present in very tiny quantities, such as gold. This introduces the concepts of “accuracy” and “precision”. The former is a measure of how close the results are to the actual value, and the latter of how repeatable they are. Ideally you want both, but that does not always happen.
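To illustrate the distinction, here is a small Python sketch with invented assay results for a sample whose true gold grade we pretend to know: one hypothetical method is precise but not accurate, the other roughly accurate but not precise.

```python
# Hypothetical assay results; the "true" grade and both methods are made up.
from statistics import mean, stdev

true_value = 1.00  # g/t, assumed for illustration

method_1 = [1.30, 1.31, 1.29, 1.30, 1.31]  # tightly repeatable, but offset
method_2 = [0.80, 1.25, 0.95, 1.10, 0.90]  # scattered, but centred on truth

for name, results in (("method 1", method_1), ("method 2", method_2)):
    bias = mean(results) - true_value   # accuracy: distance from the true value
    spread = stdev(results)             # precision: repeatability of the results
    print(f"{name}: bias {bias:+.2f} g/t, spread {spread:.2f} g/t")

# Method 1 is precise but not accurate; method 2 is roughly accurate but not
# precise. Ideally an assay method gives you both.
```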

Let us look at two real data sets: one is the average monthly temperature as reported by Hydro One and the other is the average monthly temperature as reported by Enbridge, for the same residence and for the same period, November through March this past winter: -6.8, -11.7, -13.4, -12.15, -6.45; and -2, -9, -9, -11, -6. The first thing to note is the “precision” of the two data sets; one is reported to one or two decimal places whereas the other is reported only as integers. The average of the first set is -10.10 and of the other -7.4. Both are supposedly for the same location and the same time, so why are they different? One bias has been introduced in that in both cases I divided the sum of each set by 5, the number of elements. But each month covered has a different number of days and we did not allow for that. We do not really know where the temperatures were read, but I am going to speculate that the “smart” meter on my house has a temperature sensor whereas the Enbridge readings are from their facility in Thunder Bay. Amazing, then, the difference 100 kilometres makes!
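Here is a short Python sketch of the day-weighting correction, using the Hydro One figures above; the month lengths assume the readings run from November 2019 through March 2020, so February has 29 days.

```python
# Day-weighted versus unweighted average of the Hydro One figures quoted above.
# The month lengths assume the readings cover November 2019 through March 2020.
monthly_temps = [-6.8, -11.7, -13.4, -12.15, -6.45]  # degrees C
days_in_month = [30, 31, 31, 29, 31]                 # Nov, Dec, Jan, Feb, Mar

# Unweighted mean: divide the sum by 5, as was done in the text.
unweighted = sum(monthly_temps) / len(monthly_temps)

# Day-weighted mean: weight each month by the number of days it covers.
weighted = (sum(t * d for t, d in zip(monthly_temps, days_in_month))
            / sum(days_in_month))

print(f"unweighted: {unweighted:.2f}")  # -10.10
print(f"weighted:   {weighted:.2f}")    # about -10.09 with these month lengths
```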

We should not assign a precision to the Enbridge data set that it does not have. If the source data is integers, then the average needs to be reported as an integer, -7. Similarly, for decimal data we should not report more decimal places than the input data has. We should not imply a precision that is not there! And we have just scratched the surface of the complexities of data that need to be addressed!
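A minimal sketch of that rule: the average of the Enbridge integers should come back out as an integer.

```python
# Report the average at the precision of the inputs: integer readings in,
# integer average out.
enbridge = [-2, -9, -9, -11, -6]         # reported as whole degrees
average = sum(enbridge) / len(enbridge)  # -7.4, more digits than the data merit

print(round(average))  # -7, matching the precision of the source data
```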

Now let us jump to computer modelling, the process that uses our input data. A computer model applies an algorithm, a set of mathematical equations, to process the data and generate an interpretation. The more complex the system being modelled, the more complex the equations. Many assumptions are made in the choice of equations. We process the data and then analyse the output. But is the output reliable? One way to verify it is to validate the model using a data set where we already know the results. If the model prediction matches reality, then we have confidence in the algorithm used. To illustrate, if we have a temperature prediction model, we would take a set of historical temperature input data and see what the model predicts for the present.
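As a toy illustration of that validation step (not any particular climate model), the sketch below fits a simple straight-line trend to made-up “historical” values, predicts the five most recent years that were held back, and measures how far the predictions land from what was actually “observed”.

```python
# A toy validation sketch: fit a simple linear trend to "historical" data,
# predict a held-out recent period, and compare with the known results.
# All numbers are invented for illustration.
from statistics import linear_regression, mean  # Python 3.10+

years = list(range(2000, 2020))
observed = [0.30 + 0.02 * (y - 2000) + ((-1) ** y) * 0.05 for y in years]

# Hold back the last five years as the "known result" to test against.
train_years, test_years = years[:-5], years[-5:]
train_obs, test_obs = observed[:-5], observed[-5:]

# "Model": a straight-line trend fitted only to the training period.
slope, intercept = linear_regression(train_years, train_obs)
predicted = [slope * y + intercept for y in test_years]

# Validation: how far off were the predictions for years we already know?
errors = [abs(p - o) for p, o in zip(predicted, test_obs)]
print(f"mean absolute error over held-out years: {mean(errors):.3f}")
```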

This is where scientific peer review comes in. If different researchers use the same algorithm but different data sets, all with known results, and they can duplicate those results, then we have confirmed the algorithm is reliable.

Unfortunately, in the real world we have “researchers” using bad data, such as mixing “proxy” temperature data with actual temperature readings. Then they do not adjust for existing biases within the data. Next, they use a computer model that has not been validated. This generates bad results: garbage out. To complicate things further, they report the results to a precision not present in the input data. These severely flawed results are then given to the media, who strip off any caveats that may have been included, thus turning a speculative statement into one of “fact”. Matters are complicated further when far too many people, driven by confirmation bias, use these results to prove their own flawed thesis, and then the politicians get involved, making things even worse. As the saying attributed to Churchill goes, “A lie gets halfway around the world before the truth has a chance to get its pants on”.

In conclusion, unless a computer model has been validated using real data and has shown it can predict real events, it is not to be trusted. Any “researcher” who is unwilling to share their algorithm and data with others so they can be tested cannot be trusted. Any politician who uses the results from an unvalidated model also cannot be trusted. If you see any of these three situations, do not believe what you are being told; to do otherwise is at your own peril.

