Apparent Trends and Random Variations
Posted by softestpawn on July 28, 2009
Take a look at these graphs:
You can see trends and features: The noisy but steady drop in the first, a similar but slower rise in the second. The fourth has a slow rise followed by a sharp and maybe accelerating rise. And so on.
The thing is, these are randomly generated, using what are known as ‘random walks’. There are no underlying trends, no feature-causing events. They are just random numbers added together. And these charts have not been specially picked; they are simply six consecutive runs of the same random walk spreadsheet. Try it yourself:
Make Your Own Random Walk
Random walks are created when a value is changed by, rather than set to, a random amount at each step.
For example, get a willing volunteer to walk across a field tossing a coin; if it shows heads he takes a step to the left and one forward, and if it shows tails he takes a step to the right and one forwards. If he’s really helpful, carries a leaky paint pot, and walks across several fields you get something like those graphs above.
However, as it’s raining out and I don’t want to waste any paint, they were created in a very simple OpenOffice spreadsheet: put a starting value of 0 in cell A1, then enter the formula =A1+RAND()-0.5 into cell A2. This sets A2 to the value of A1 plus a random number between -0.5 and +0.5. Then copy it into, say, the 200 cells below. Each cell adds a new random number between -0.5 and +0.5 to the value of the cell above it.
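If you’d rather not open a spreadsheet, the same walk takes a few lines of Python (a sketch equivalent to the formula above, not the original spreadsheet):

```python
import random

def random_walk(steps=200):
    """Build a random walk: each value is the previous one
    plus a uniform random step between -0.5 and +0.5."""
    values = [0.0]
    for _ in range(steps):
        values.append(values[-1] + random.uniform(-0.5, 0.5))
    return values

walk = random_walk()
print(walk[:5])  # the first few points of one run
```

Run it a few times (or plot several runs) and you get charts much like the six above: no trend is built in, yet apparent trends appear.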
What is perhaps remarkable, if you think of random numbers as being, well, random, is the appearance of these long runs of apparently steady increase or decrease.
Smoothing and Trends
So those are examples of randomly generated graphs that at first sight look quite similar to many graphs that we get when measuring features of the environment.
For example, if we look at a buoy bobbing about on the sea, its height above the sea floor appears to change in an almost random, unpredictable way, as overlapping waves, boat wakes, splashes and wind push it about. These small changes are not random, and not noise: each measurement really is a measurement of height at a particular point, and so is signal; it tells us how high the buoy is. However they are chaotic, and so (for most practical purposes) not predictable.
The difficulty is in telling the difference between random systems and ones that do actually have underlying trends. Sometimes simply time will tell: long observations of those random walks above will give us fewer and fewer consistent patterns. Long observations of buoy heights give us predictable tides.
When we’re looking at ways in which complex systems work, we can sometimes find underlying causes by smoothing out the inconvenient small scale changes that confuse the larger patterns. We look to remove the ‘noise’ to reveal the underlying signal.
This requires long observations though, where the patterns can be consistently and reliably repeated, and it requires looking at the right scale in the data. If we smoothed our buoy height data over weeks, we might see seasonal patterns but would miss the tidal ones.
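A toy example in Python shows how picking the wrong smoothing scale erases a pattern entirely (the ‘tidal’ series here is made up for illustration):

```python
def moving_average(data, window):
    """Smooth a series by averaging each run of `window` consecutive points."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]

# A made-up height series with a repeating 12-step 'tidal' cycle:
hourly_heights = [((i % 12) - 6) ** 2 / 10.0 for i in range(240)]

# Averaging over exactly one cycle flattens the tide completely:
smoothed = moving_average(hourly_heights, 12)
print(max(smoothed) - min(smoothed))  # essentially zero: the tide has vanished
```

Smooth at a scale of hours and the cycle is obvious; smooth at the cycle length or longer and it is gone, which is the buoy problem in miniature.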
The simplest trend analysis is to see whether the data tends to go up or down overall. We can see overall movement in the first two graphs at the top; what about the last one? We can find out using a method called ‘linear regression’. How it works doesn’t really concern us here, but it gives us a straight line through the data that is, overall, as close as possible to every point in the data set. The slope of that line tells us how fast the values have been increasing or decreasing, overall.
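For the curious, the slope of that regression line can be computed directly (this is the standard least-squares formula, sketched in Python rather than any particular spreadsheet function):

```python
def trend_slope(ys):
    """Least-squares slope of ys plotted against their index 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

print(trend_slope([1, 2, 3, 4]))  # → 1.0 for a perfectly straight rise
```

A positive slope means the values have tended upwards over the record, a negative one downwards; it says nothing about why.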
Sometimes this doesn’t tell us anything very useful. We can see in that last graph a sudden drop at the end; is this merely a disturbance to the underlying trend, or part of the trend itself? Similarly the third graph, with its large trough in the middle, doesn’t lend itself well to a straight line.
In fact none of them do. The key thing to remember here is that there is no underlying trend in these graphs; they are merely random numbers added together.
Fooling the Eye
Being human, we tend to look for patterns and trends, and the way our minds are wired we’ll spot them too – even in random data. This is probably an evolutionary legacy of something excitingly dangerous, such as spotting predators, prey or mates in the dappled jungles of Africa. Whatever its origin, it can also lead us astray, to think we’ve found things we haven’t.
Take the second graph above: if we look at small-scale trends (trends over short timescales), the trend lines (yellow) are much steeper – up and down – than the overall one (blue):
If we look at longer scales, the trend lines (red here) gradually flatten out to become closer to the blue one:
Until we get to scales of the same order of magnitude as the whole graph, where the trend lines are very nearly the same as the overall blue one:
This can lead us to think that the longer trends are ‘better’. But if we have a look at where that graph fits into the much longer run that it was clipped from, we can see that even the overall blue line trend of the clip (above the red bar) has little to do with the bigger picture:
The apparent smoothed trends we see above are only features of the length of the graph. They tell us nothing about longer term trends (well, they can’t, there aren’t any…).
Scale and Granularity
The above are actually clips from runs of 10,000 points. If we look at these longer ones, we can see similar effects: at no point do we start getting an overall smoothing, because the more steps we have, the more likely we are to have long runs of apparently biased direction. Here are the first and third longer runs (the second is above):
(bear in mind the Y axis scales are different)
There is no ‘natural’ scale at which a noisy-looking system can be smoothed out. It is tempting to look at the data you have in front of you and fit a trend line, but without more knowledge of the system behind that data, the line says nothing about any underlying trend.
We need either long enough observations to establish a pattern, or enough knowledge about the mechanics of the thing being recorded that we can relate features and trends in the data set to known changes in those mechanics – or both.
For example, if someone’s body temperature is unusually high and increasing, we should worry. We should probably do more than just worry, but it’s not something to ignore on the grounds that it might be random; I don’t know how the body works in detail, but I do know how body temperature behaves. It has been observed so many times and for so long that the patterns are well established, even if a working knowledge of the mechanism is not.
So… Global Temperature…
So, yes, the next examples come from my favourite subject, Global Warming, because some people seem to have forgotten that it’s not enough to draw a straight line through data and infer things about the future from it:
(IPCC showing how trends were increasing in the run up to their 2007 report, Working Group 1, chapter 3)
The recent claims that temperatures ‘are’ decreasing are on similarly shaky ground:
We could even take the full dataset that we have, which for Hadley runs 1850-2009, and look at the apparent trend there (ignore the green patch and line):
But, again, that tells us nothing about future behaviour by itself.
Some of the more frothing deluders enthusiastic GW advocates* say that small changes are ‘noise’ over an underlying trend or signal. However there is very little noise in the records; the values in datasets like Hadley’s are pretty much all signal (ignoring for the moment systematic errors).
There’s a fundamental error in an approach that dismisses inconvenient short-term variations as ‘natural’ but does not understand the range of time scales that ‘natural’ is valid for; there is no reason to assume that longer-term variations are not also natural. The temptation is just to ‘smooth’ the data until it looks right to the eye, but that tells us interesting things about how the eye and brain interpret shapes and nothing about the data.
We really can’t say anything useful about temperature trends by just examining the recent record.
We need knowledge of how the climate works, usually captured as models. A lot of people are working hard to understand the climate from the various records of various measurables, but most work on some small aspect of it; few but the most enthusiastic deluders claim anyone has complete understanding. And the good quality data is fairly recent; there are huge systematic problems with it (such as surface station placement, urban heat island effects, and tidal station changes), and for anything more than a few decades back we tend to have to use proxies, which add another layer of systematic problems.
When we check the models, they need to be checked against features of the dataset, not against carefully selected subsets or just the overall trend. We’ve seen a steady rise in CO2; so why did temperature rise from around 1910 to around 1940 at the same rate as recently, when little man-made CO2 was present? That is: what was that natural variation, has it been included in our knowledge base, and has it been eliminated as a candidate for the recent rise? Do the models that ‘predict’ the 1980-2000 rise also ‘predict’ the earlier periods?
These models and their validation are key; it’s not sufficient to establish underlying trends by drawing a straight line through some data.
* I must remember not to use the same tone here as I would in forum arguments. Apologies to tamino who was not at all frothing below.