I graduated from Lakehead University in 1977 and ever since my career has focused on data collection, management, and analysis, with the last 25 or so years dealing primarily with computer modelling. As that topic has been much in the news for some time now, I thought it might be helpful to explain to the lay person some of its characteristics, strengths, and weaknesses.
When dealing with digital data a primary factor can be summarised thusly: garbage in, garbage out. In other words, if you use poor data, do not expect anything but poor results. What makes data “poor”, you ask? Combining data of different quality is one example. So is using data from different sources, collected by different methods, or with inconsistent coverage (data clustering). All of these introduce severe biases.
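To make the clustering point concrete, here is a minimal sketch with invented sample values: most of the samples come from one densely sampled zone, so a naive average lets that zone dominate, while a simple declustered average (average each zone first, then average the zone means) does not.

    # Illustration of clustering bias with invented numbers: most samples
    # come from zone A, so a naive average is dominated by that zone.
    samples = [
        ("A", 5.1), ("A", 5.3), ("A", 4.9), ("A", 5.2), ("A", 5.0),  # densely sampled zone
        ("B", 2.1),                                                  # sparsely sampled zone
        ("C", 1.8),                                                  # sparsely sampled zone
    ]

    # Naive average: every sample counts equally, so zone A dominates.
    naive = sum(value for _, value in samples) / len(samples)

    # Simple declustering: average within each zone first, then average the zone means.
    zones = {}
    for zone, value in samples:
        zones.setdefault(zone, []).append(value)
    declustered = sum(sum(v) / len(v) for v in zones.values()) / len(zones)

    print(f"naive mean:       {naive:.2f}")        # 4.20
    print(f"declustered mean: {declustered:.2f}")  # 3.00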
Once a sample has been collected it may be sent to a laboratory for analysis. Here we add another layer of biases that complicate the quality of the data set, because there are a few different methods available and all have strengths and weaknesses, especially when what we are looking for is present in very tiny quantities, such as gold. This introduces the concepts of “accuracy” and “precision”. The former is the measure of how close the results are to the actual value; the latter, how repeatable they are. Ideally you want both, but that does not always happen.
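To put those two terms in concrete form, here is a small sketch with invented assay readings from two hypothetical instruments, assuming we happen to know the true value: one instrument is repeatable but biased, the other is centred on the truth but noisy.

    from statistics import mean, stdev

    true_value = 2.50  # the actual concentration, known only in this toy example

    # Invented repeat measurements from two hypothetical instruments.
    precise_but_biased = [2.81, 2.79, 2.80, 2.82, 2.80]   # tight spread, wrong centre
    accurate_but_noisy = [2.10, 2.95, 2.45, 2.70, 2.30]   # centred near truth, wide spread

    for name, readings in [("precise but biased", precise_but_biased),
                           ("accurate but noisy", accurate_but_noisy)]:
        offset = mean(readings) - true_value   # accuracy: closeness to the actual value
        spread = stdev(readings)               # precision: how repeatable the readings are
        print(f"{name}: mean {mean(readings):.2f}, offset {offset:+.2f}, spread {spread:.2f}")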
Let us look at two real data sets: one is the average monthly temperature as reported by Hydro One and the other is the average monthly temperature as reported by Enbridge, for the same residence and for the same period of November through to March this past winter: -6.8, -11.7, -13.4, -12.15, -6.45; and -2, -9, -9, -11, -6. The first thing to note is the “precision” of the two data sets; one is reported to one or two decimal places whereas the other is reported only as integers. The average of the first set is -10.10 and of the second it is -7.4. Both are supposedly for the same location and the same period, so why are they different? One bias has been introduced: in both cases I divided the sum of each set by 5, the number of elements, yet each month covered has a different number of days and we did not allow for that. We also do not really know where the temperatures were read, but I am going to speculate that the “smart” meter on my house has a temperature sensor whereas the Enbridge readings are from their facility in Thunder Bay. Amazing, then, the temperature difference 100 kilometres makes!
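Using the numbers above, a short sketch shows the simple element-count average alongside a day-weighted average; the month lengths are my assumption (November through March with a 28-day February), since the bills do not state them.

    # Monthly average temperatures, November through March.
    hydro_one = [-6.8, -11.7, -13.4, -12.15, -6.45]
    enbridge  = [-2, -9, -9, -11, -6]
    days      = [30, 31, 31, 28, 31]   # Nov..Mar, assuming a 28-day February

    def simple_mean(temps):
        return sum(temps) / len(temps)

    def day_weighted_mean(temps, days):
        return sum(t * d for t, d in zip(temps, days)) / sum(days)

    print(f"Hydro One: simple {simple_mean(hydro_one):.2f}, "
          f"day-weighted {day_weighted_mean(hydro_one, days):.2f}")
    print(f"Enbridge:  simple {simple_mean(enbridge):.1f}, "
          f"day-weighted {day_weighted_mean(enbridge, days):.1f}")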
We should not assign a precision to the Enbridge data set that we do not have. If the source data is integers, then the average needs to be reported as an integer, -7. Similarly, for decimal data we should not report more decimal places than the input data carry. We should not imply a precision that is not there! And we have only scratched the surface of the complexities of data that need to be addressed!
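Here is that reporting rule as a sketch, applied to the integer Enbridge readings; the rounding is my simple reading of the rule, not a full significant-figures treatment.

    enbridge = [-2, -9, -9, -11, -6]            # integer inputs
    raw_average = sum(enbridge) / len(enbridge)

    # Report to the precision of the inputs: whole degrees, not tenths.
    reported = round(raw_average)

    print(f"raw average:      {raw_average}")   # -7.4 implies precision the data does not have
    print(f"reported average: {reported}")      # -7 matches the integer inputs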
Now let us jump to computer modelling, the process that uses our input data. A computer model applies an algorithm, a set of mathematical equations, to process the data and generate an interpretation. The more complex the system being modelled, the more complex the equations, and many assumptions are made in the choice of equations. We then process the data and analyse the output. But is the output reliable? One way to check is to validate the model against a data set for which we already know the results. If the model's prediction matches reality, then we have confidence in the algorithm used. To illustrate, if we have a temperature prediction model, we would feed it a set of historical temperature data and see how its prediction for the present compares with what actually happened.
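As a sketch of that validation idea, with invented temperatures and a deliberately simple linear-trend “model” (not anything a forecaster would actually use): fit the model on the older data, predict the most recent value we held back, and compare.

    # Toy validation of a model against a value we already know.
    years = [2018, 2019, 2020, 2021, 2022]
    temps = [-9.8, -10.2, -9.5, -10.0, -9.3]   # historical winter averages (invented)
    known_2023 = -9.2                          # observed value held back for checking (invented)

    # Fit a straight line, temp = a * year + b, by least squares done by hand.
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(temps) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, temps)) / \
        sum((x - mean_x) ** 2 for x in years)
    b = mean_y - a * mean_x

    prediction = a * 2023 + b
    print(f"predicted {prediction:.1f}, observed {known_2023:.1f}, "
          f"error {prediction - known_2023:+.1f}")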
This is where scientific peer review comes in. If different researchers use the same algorithm on different data sets, all with known results, and they can duplicate those results, then we have confirmed the algorithm is reliable.
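In code terms, replication amounts to re-running the same algorithm against independent data sets whose answers are already known and checking each one; the data sets, the placeholder model, and the tolerance below are purely illustrative.

    def validate(model, datasets, tolerance=0.5):
        """Run the same algorithm on independent data sets with known answers."""
        all_ok = True
        for name, inputs, known_answer in datasets:
            prediction = model(inputs)
            ok = abs(prediction - known_answer) <= tolerance
            all_ok = all_ok and ok
            print(f"{name}: predicted {prediction:.1f}, known {known_answer:.1f}, "
                  f"{'duplicated' if ok else 'NOT duplicated'}")
        return all_ok

    # Placeholder model (a simple mean) and invented data sets from two groups.
    simple_model = lambda readings: sum(readings) / len(readings)
    groups = [
        ("group 1", [-9.8, -10.2, -9.5], -9.8),
        ("group 2", [-11.0, -10.4, -10.8], -10.7),
    ]
    print("algorithm confirmed reliable:", validate(simple_model, groups))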
Unfortunately, in the real world we have “researchers” using bad data, such as mixing “proxy” temperature data with actual temperature readings. Then they do not adjust for existing biases within the data. Next, they use a computer model that has not been validated. This generates bad results: garbage out. To complicate things further, they report the results to a precision not present in the input data. These severely flawed results are then given to the media, who strip off any caveats that may have been included, thus turning a speculative statement into “fact”. The problem is compounded when far too many people, through confirmation bias, use these results to prove their own flawed thesis, and then the politicians get involved, making matters even worse. As Churchill said, “A lie gets halfway around the world before the truth has a chance to get its pants on”.
In conclusion, unless a computer model has been validated using real data and has shown it can predict real events, it is not to be trusted. Any “researcher” who is unwilling to share their algorithm and data so others can test them cannot be trusted. Any politician who uses the results of an unvalidated model likewise cannot be trusted. If you encounter any of these three situations, do not believe what you are being told; to do so is at your own peril.