DATA MINING
Data Mining involves things like Artificial Intelligence, Neural Networks, Genetic Algorithms, and so on. Frankly I know practically nothing about how these systems work; but I'm trying to get a handle on their application to 'real' problems. In this article I will describe some explorations I did with 'Knowledge Miner' - a piece of software for the Macintosh.
Mining Time-Series Data
Suppose you throw a stone at 100 m/sec in a horizontal direction off a cliff. If air resistance is ignored, you can calculate its position and vertical velocity at any time using the equations:
x = vt, v = at, and y = y0 - (1/2)gt^2
Here we see the calculated results at one-second intervals from t = 0 to t = 10 seconds.
 Time    Xacc    Yacc   Xspeed   Yspeed      Xpos     Alt.
 0.00    0.00    9.80   100.00     0.00      0.00   500.00
 1.00    0.00    9.80   100.00     9.80    100.00   495.10
 2.00    0.00    9.80   100.00    19.60    200.00   480.40
 3.00    0.00    9.80   100.00    29.40    300.00   455.90
 4.00    0.00    9.80   100.00    39.20    400.00   421.60
 5.00    0.00    9.80   100.00    49.00    500.00   377.50
 6.00    0.00    9.80   100.00    58.80    600.00   323.60
 7.00    0.00    9.80   100.00    68.60    700.00   259.90
 8.00    0.00    9.80   100.00    78.40    800.00   186.40
 9.00    0.00    9.80   100.00    88.20    900.00   103.10
10.00    0.00    9.80   100.00    98.00   1000.00    10.00

Each of the columns is a time series data set.
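The table is easy to regenerate from the equations. The following is a minimal sketch, using the quadratic form y = y0 - (1/2)gt^2 that the Alt. column follows. The starting altitude of 500 m, the 100 m/sec horizontal speed and g = 9.8 m/sec^2 are read off the table; the function and constant names are illustrative only.

```python
# Constants read off the table above; the names are illustrative.
G = 9.8             # downward acceleration, m/sec^2 (Yacc column)
X_SPEED = 100.0     # horizontal launch speed, m/sec (Xspeed column)
START_ALT = 500.0   # altitude at t = 0, m (Alt. column)


def stone_state(t):
    """Return (y_speed, x_pos, altitude) at time t, ignoring air resistance."""
    y_speed = G * t                          # v = at
    x_pos = X_SPEED * t                      # x = vt
    altitude = START_ALT - 0.5 * G * t * t   # y = y0 - (1/2)gt^2
    return y_speed, x_pos, altitude


def trajectory_table(seconds=10):
    """Rows of (time, y_speed, x_pos, altitude) at one-second intervals."""
    return [(float(t),) + stone_state(float(t)) for t in range(seconds + 1)]


if __name__ == "__main__":
    for time, y_speed, x_pos, altitude in trajectory_table():
        print(f"{time:5.2f} {y_speed:7.2f} {x_pos:8.2f} {altitude:7.2f}")
```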
Now suppose these are observed results, and we do not know the equations which govern the behaviour of the falling stone. Can we find the equations? Can we, on the basis of the stone's position at the beginning of each of the first ten seconds, predict the stone's position at the beginning of the eleventh second?
This is not a tutorial on how to use Knowledge Miner. So I will just give you the results that were obtained when I tried it.
Knowledge Miner (KM) was provided with the first ten numbers in the last column (500.00, 495.10 ... 103.10) and asked to find the eleventh value (10.00). The number it came up with was 10.09. This was based on some sort of weighting of three functions similar to this one:
X1(t) = 1.97e+0*X1(t-1) - 9.66e-1*X1(t-2) - 1.33e+1
where X1(t-1) and X1(t-2) are the two previous values.
Now, it is immediately obvious that this function is linear, whereas the 'real one' is not. But on the other hand, the predicted value is pretty close to what 'it's supposed to be'.
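There is a reason a function of the two previous values can do so well here. For a quadratic sampled at equal intervals, the second difference y(t) - 2y(t-1) + y(t-2) is constant, so y(t) = 2y(t-1) - y(t-2) + constant holds exactly. The sketch below checks this on the altitude column and also evaluates the single function quoted above. It is not KM's algorithm, and the function names are illustrative.

```python
# Altitudes for t = 0..9 seconds, taken from the table above.
ALTITUDES = [500.00, 495.10, 480.40, 455.90, 421.60,
             377.50, 323.60, 259.90, 186.40, 103.10]


def second_differences(series):
    """y(t) - 2*y(t-1) + y(t-2) for every t that has two predecessors."""
    return [series[i] - 2 * series[i - 1] + series[i - 2]
            for i in range(2, len(series))]


def predict_by_recurrence(series):
    """Extrapolate one step, assuming the second difference stays constant."""
    diffs = second_differences(series)
    mean_diff = sum(diffs) / len(diffs)
    return 2 * series[-1] - series[-2] + mean_diff


def quoted_km_function(prev1, prev2):
    """The one function quoted in the text (KM weighted three such functions)."""
    return 1.97 * prev1 - 0.966 * prev2 - 13.3


if __name__ == "__main__":
    print(second_differences(ALTITUDES))       # each is about -9.8
    print(predict_by_recurrence(ALTITUDES))    # about 10.0
    print(quoted_km_function(ALTITUDES[-1], ALTITUDES[-2]))  # about 9.74
```

Seen this way, KM's coefficients of 1.97 and -0.966 are close to the exact 2 and -1, and its constant of -13.3 plays the role of the -9.8 second difference.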
The data table shown above is taken from the article on ballistics. This article also contains a much more realistic data set in which air resistance has been taken into account. Here it is:
TIME INCREMENT: 0.010 SECONDS
BALLISTIC CONSTANT: 1.000

 Time    Xacc    Yacc   Xspeed   Yspeed      Xpos     Alt.
 0.00   -3.24    9.80   100.00     0.00      0.00   500.00
 1.00   -2.53    9.65    97.43     9.73     98.68   495.12
 2.00   -2.45    9.49    94.94    19.30    194.83   480.60
 3.00   -2.37    9.30    92.53    28.69    288.53   456.59
 4.00   -2.30    9.10    90.20    37.90    379.87   423.29
 5.00   -2.23    8.88    87.93    46.89    468.90   380.88
 6.00   -2.16    8.50    85.74    55.56    555.71   329.63
 7.00   -2.10    8.27    83.61    63.95    640.36   269.87
 8.00   -2.03    8.04    81.54    72.11    722.91   201.84
 9.00   -1.97    7.81    79.54    80.04    803.42   125.77
10.00   -1.92    7.58    77.59    87.73    881.96    41.89

Given the first ten values in the last column, KM came up with 42.03 for the eleventh. Pretty close, I'd say. Considerably more sobering if you have a look at the complexity of the equations (given in the article on ballistics) that produced the actual numbers.
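For a rough sense of how hard this prediction is, one can compare KM with a hand-rolled baseline: assume the most recent second difference of the altitude column carries over for one more step. This is only a sketch of a naive guess, not KM's method, and the function name is illustrative.

```python
# Altitudes for t = 0..9 seconds from the air-resistance table above.
ALTITUDES_WITH_DRAG = [500.00, 495.12, 480.60, 456.59, 423.29,
                       380.88, 329.63, 269.87, 201.84, 125.77]
ACTUAL_AT_10 = 41.89   # tabulated value
KM_AT_10 = 42.03       # KM's prediction, as reported in the text


def extrapolate_last_second_difference(series):
    """Predict the next value, reusing the latest second difference."""
    last_diff = series[-1] - 2 * series[-2] + series[-3]
    return 2 * series[-1] - series[-2] + last_diff


if __name__ == "__main__":
    guess = extrapolate_last_second_difference(ALTITUDES_WITH_DRAG)
    print(f"baseline guess: {guess:.2f}, error {abs(guess - ACTUAL_AT_10):.2f}")
    print(f"KM prediction:  {KM_AT_10:.2f}, error {abs(KM_AT_10 - ACTUAL_AT_10):.2f}")
```

On these numbers the baseline lands at about 41.66, so KM's 42.03 is somewhat closer to the tabulated 41.89.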
Data mining involves looking for patterns in empirical data that can be used to make better-than-guesswork-predictions.
Leaving aside legitimate objections for the moment, let's apply this notion to some economic data. This table shows the change in the consumer price index for Singapore for the years 1974 to 1998. In 1974 prices rose by 22.3%; and so on.
Year  consumer price change
1974  22.3000000
1975  2.60000000
1976  -1.9000000
1977  3.20000000
1978  4.80000000
1979  4.00000000
1980  8.50000000
1981  8.20000000
1982  3.90000000
1983  1.20000000
1984  2.60000000
1985  0.50000000
1986  -1.4000000
1987  0.50000000
1988  1.50000000
1989  2.40000000
1990  3.40000000
1991  3.40000000
1992  2.30000000
1993  2.30000000
1994  3.10000000
1995  1.70000000
1996  1.40000000
1997  2.00000000
1998  -0.3000000

When KM was provided with the data for the years 1974 - 1997 and asked to predict the change in the consumer price index for the next three years, it came up with: 2.34463334, 2.34696221 and 2.21553325. This is not a good prediction, as the value for 1998 was -0.3.
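One simple way to test "better than guesswork" is to compare the model with a naive guess such as "next year will look like this year". The sketch below does that for 1998, using only figures quoted above. The persistence baseline is an assumption chosen for illustration, not something KM provides, and the names are illustrative.

```python
# Figures quoted in the text and table above.
CPI_CHANGE_1997 = 2.0            # last year KM was given
CPI_CHANGE_1998 = -0.3           # actual value KM was asked to predict
KM_PREDICTION_1998 = 2.34463334  # KM's first predicted value


def absolute_error(predicted, actual):
    """Size of the miss, in percentage points."""
    return abs(predicted - actual)


def persistence_forecast(last_value):
    """Naive guess: the next value equals the most recent observed value."""
    return last_value


if __name__ == "__main__":
    km_error = absolute_error(KM_PREDICTION_1998, CPI_CHANGE_1998)
    naive_error = absolute_error(persistence_forecast(CPI_CHANGE_1997),
                                 CPI_CHANGE_1998)
    print(f"KM error:    {km_error:.2f} percentage points")
    print(f"naive error: {naive_error:.2f} percentage points")
```

Here the naive guess misses by 2.3 points and KM by about 2.64, so on this one value KM did no better than guesswork.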
Another example. Here are the values for Singapore's Gross Domestic Product during this period:
Year  GDP
1974  19684.0000
1975  21014.5000
1976  21846.5999
1977  23416.9000
1978  25233.7000
1979  27400.5000
1980  29951.5999
1981  32855.6999
1982  36011.6999
1983  38487.5000
1984  41635.0999
1985  45096.9000
1986  44367.9000
1987  45386.4000
1988  49800.0000
1989  55594.0999
1990  60944.8000
1991  66413.8999
1992  71139.1000
1993  75810.8000
1994  85478.3000
1995  95208.8000
1996  102982.000
1997  110733.699
1998  120712.500

Given the values for 1974 - 1997, KM predicted 16748.9 and 125279.3 for 1998 and 1999, versus the real value for 1998 of 120712.5.
In contrast to the last example, this prediction is much more 'in the ballpark'. It is also much less impressive. If the data is graphed, the general upward trend is readily discernible and any intelligent guess would also be pretty close.
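To make the "intelligent guess" concrete, here is one possible version of it: assume GDP grows from 1997 to 1998 by the same ratio as it did from 1996 to 1997. This is a sketch of a guess a person might make by eye, not anything KM does, and the function names are illustrative.

```python
# GDP values from the table above.
GDP_1996 = 102982.0
GDP_1997 = 110733.699
ACTUAL_1998 = 120712.5


def guess_next_by_growth(previous, latest):
    """Assume next year's growth ratio equals the most recent one."""
    return latest * (latest / previous)


def relative_error(predicted, actual):
    """Miss as a fraction of the actual value."""
    return abs(predicted - actual) / actual


if __name__ == "__main__":
    guess = guess_next_by_growth(GDP_1996, GDP_1997)
    print(f"guess for 1998: {guess:.1f}")
    print(f"relative error: {relative_error(guess, ACTUAL_1998):.1%}")
```

This guess comes out at roughly 119069, within about 1.4% of the real 1998 value.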
It seems, then, that Data Mining in time series information - if it works at all - is only sometimes better than intelligent guesswork. These times are, as I understand it, when there are very large amounts of data to look at and when there is reason to suspect that there is an underlying pattern.
Making predictions on the basis of time-series models is the simplest application of data mining techniques. It gets quite a bit more involved than that. Next I'm going to investigate Input-Output modelling.
INPUT-OUTPUT MODELLING
Here is a set of artificial data, created using Excel, that is not time series related. The columns data a and data b are simply random numbers. Data c was calculated using the formula =2*(data a) + 3*(data b). In the case of the first line of numbers: (2*0.711327) + (3*0.153794) = 1.884. Etc.
 #  data a      data b      data c
 1  0.71132700  0.15379400  1.88403600
 2  0.62219935  0.83119106  3.73797189
 3  0.33872289  0.80881084  3.10387831
 4  0.54262732  0.35427095  2.14806749
 5  0.50631348  0.71599532  3.16061290
 6  0.00132503  0.22447315  0.67606951
 7  0.76211535  0.94620700  4.36285170
 8  0.91026206  0.89499186  4.50549970
 9  0.92640874  0.47156928  3.26752532
10  0.49323546  0.27673696  1.81668179
11  0.04501477  0.30142353  0.99430013
12  0.49180000  0.17909135  1.52087404
13  0.06747225  0.85629071  2.70381663
14  0.84239974  0.41916601  2.94229750
15  0.84067115  0.44264313  3.00927169
16  0.40952997  0.20516224  1.43454665
17  0.10966830  0.26365868  1.01031263
18  0.60323225  0.55522712  2.87214585
19  0.09686064  0.47962391  1.63259301
20  0.59777858  0.99479464  4.17994107
21  0.08944736  0.67384674  2.20043494
22  0.06014718  0.91676159  2.87057914
23  0.72694087  0.49757195  2.94659757
24  0.86540192  0.32359363  2.70158473
25  0.22437308  0.77932968  2.78673521
26  0.00814655  0.21861716  0.67214460
27  0.25049354  0.30840101  1.42619010
28  0.01760057  0.06654875  0.23484739
29  0.78657626  0.17680760  2.10357534
30  0.63881337  0.99740651  4.26984627
31  0.74067272  0.35810086  2.55564802
32  0.11989300  0.68048251  2.28123353
33  0.23006162  0.64650203  2.39962933
34  0.50777425  0.06225799  1.20232247
35  0.64703427  0.73486782  3.49867198
36  0.34814598  0.35296414  1.75518438
37  0.67217705  0.66214549  3.33079057
38  0.14218590  0.61907854  2.14160742
39  0.18164762  0.17261367  0.88113625
40  0.45016614  0.29302707  1.77941350
41  0.03019514  0.75783563  2.33389718
42  0.91506069  0.02236730  1.89722327
43  0.88055584  0.15027279  2.21193006
44  0.04039877  0.96769558  2.98388430
45  0.94965348  0.75820167  4.17391198
46  0.50993780  0.31050420  1.95138820
47  0.67304307  0.16730296  1.84799503
48  0.29373339  0.96699594  3.48845461
49  0.07845276  0.69584199  2.24443147
50  0.07548299  0.52973340  1.74016616
51  0.72301849  0.97594044

In the 51st row the two random inputs were provided, but the 'correct' output value of 4.37385831 was left out. KM was asked to provide the model equation and predict what the 51st value should be. It came up with 4.34791231 based on the equation:
= 1.97*(data a) + 2.96*(data b) + 0.0324
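For comparison, an ordinary least-squares fit of data c against data a and data b recovers the generating formula almost exactly on this noise-free data. The sketch below uses the first eight rows of the table and solves the normal equations by hand. It is a plain textbook fit, not a description of how KM builds its models, and all names are illustrative.

```python
# First eight rows of the table above: (data a, data b, data c).
ROWS = [
    (0.71132700, 0.15379400, 1.88403600),
    (0.62219935, 0.83119106, 3.73797189),
    (0.33872289, 0.80881084, 3.10387831),
    (0.54262732, 0.35427095, 2.14806749),
    (0.50631348, 0.71599532, 3.16061290),
    (0.00132503, 0.22447315, 0.67606951),
    (0.76211535, 0.94620700, 4.36285170),
    (0.91026206, 0.89499186, 4.50549970),
]
ROW_51_INPUTS = (0.72301849, 0.97594044)


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(rows):
    """Least-squares weights (w_a, w_b, w_0) for c = w_a*a + w_b*b + w_0."""
    design = [(a, b, 1.0) for a, b, _ in rows]
    targets = [c for _, _, c in rows]
    normal = [[sum(d[i] * d[j] for d in design) for j in range(3)]
              for i in range(3)]
    moment = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(3)]
    return solve_linear_system(normal, moment)


def predict(weights, a, b):
    """Apply the fitted weights to one pair of inputs."""
    return weights[0] * a + weights[1] * b + weights[2]


if __name__ == "__main__":
    weights = fit_linear_model(ROWS)
    print(weights)                            # close to [2, 3, 0]
    print(predict(weights, *ROW_51_INPUTS))   # close to 4.3739
```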
Sure, it's a simple case; but the result was pretty close! Let's try a hairier example.
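In this case data a, data b and data c all hold random numbers. The column headed data d is calculated using the formula: =23*(data a)-4.5*(data b)+(data a + data b). This is the form that reproduces the values in the table; it reduces to 24*(data a) - 3.5*(data b), so data c does not actually affect data d.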
 #  data a      data b      data c      data d
 1  0.36864296  0.65386625  0.40411185  6.55889922
 2  0.06061834  0.54399706  0.83180874  -0.4491497
 3  0.80648571  0.70749339  0.40493154  16.8794302
 4  0.50394854  0.48992599  0.04397313  10.3800240
 5  0.77446668  0.24861655  0.07142878  17.7170425
 6  0.87444864  0.17139003  0.71340502  20.3869022
 7  0.43278067  0.55030938  0.56201046  8.46065332
 8  0.79971223  0.18515499  0.71604982  18.5450511
 9  0.61849910  0.49100718  0.53662014  13.1254533
10  0.39283341  0.22820812  0.35768140  8.62927332
11  0.44329609  0.82225069  0.00034563  7.76122881
12  0.53533866  0.77234685  0.60572050  10.1449140
13  0.42973125  0.60189846  0.99236760  8.20690530
14  0.45612063  0.77200775  0.25352474  8.24486793
15  0.09944484  0.85909722  0.07780505  -0.6201641
16  0.40508627  0.56353591  0.33474803  7.74969469
17  0.69745747  0.94118228  0.77170042  13.4448414
18  0.56247703  0.29828192  0.08116068  12.4554621
19  0.63802394  0.24448065  0.29031742  14.4568924
20  0.25582291  0.64813418  0.41870592  3.87128023
21  0.53711130  0.18138763  0.32214463  12.2558144
22  0.61927711  0.13185221  0.99370445  14.4011680
23  0.13187562  0.36181360  0.38275221  1.89866734
24  0.58268236  0.73477809  0.22079027  11.4126533
25  0.46698298  0.45120940  0.59257884  9.62835870
26  0.53886887  0.44246637  0.92808073  11.3842205
27  0.67353019  0.95135621  0.89221016  12.8349779
28  0.48063655  0.54289506  0.60728367  9.63514451
29  0.98372972  0.42092103  0.34427530  22.1362897
30  0.07675168  0.98952878  0.33903848  -1.6213105
31  0.37343393  0.70591336  0.90826794  6.49171746
32  0.98644871  0.12413985  0.31071953  23.2402796
33  0.38883920  0.00106327  0.78785689  9.32841924
34  0.65369624  0.16205189  0.75383855  15.1215280
35  0.72295709  0.37290331  0.65972765  16.0458086
36  0.49470420  0.70129297  0.39658201  9.41837545
37  0.60957949  0.89147468  0.04329461  11.5097463
38  0.38411519  0.60662680  0.40765212  7.09557081
39  0.89310663  0.41158015  0.76275131  19.9940287
40  0.33997246  0.08086937  0.19193467  7.87629626
41  0.42944827  0.82276971  0.20175195  7.42706444
42  0.63269035  0.86322722  0.61726964  12.1632731
43  0.96582623  0.59070164  0.41647084  21.1123737
44  0.49212195  0.34099901  0.37149190  10.6174303
45  0.16256283  0.74084534  0.63330071  1.30854913
46  0.05344039  0.04936626  0.85757786  1.10978738
47  0.03735671  0.09155394  0.48347381  0.57612219
48  0.36258919  0.19973359  0.40761613  8.00307294
49  0.79488327  0.75988661  0.40937883  16.4175952
50  0.05777102  0.58054165  0.72077253  -0.6453912
51  0.71087000  0.66554973  0.91838774  14.7314558
52  0.57522492  0.49524214  0.69733116  12.0720505
53  0.98434503  0.46382326  0.70065546  22.0008992
54  0.41956168  0.72661341  0.34864766  7.52633344
55  0.28166374  0.43092937  0.28003185  5.25167700

KM was given this data, but with the last five entries in data d missing. It was asked to predict what they ought to be. It used this equation:
= -0.0936*(3.98*data a - 2) + 0.6609*(4*data a - 0.575*data - 1.72) + 1.03...
to predict the next five values to be:

14.7341613
12.0731391
22.0080223
7.52465867
5.24861860

Sure, the equation seems to be quite different from the one that generated the data, but you have to admit that the results are right on!
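How close "right on" is can be checked by setting KM's five predictions against the tabulated values for rows 51 to 55. The sketch below does that, and also recomputes data d from the inputs. Note that the bracketed term is written here as (data a + data b), because that is the form that reproduces the tabulated data d values to the printed precision. All names are illustrative.

```python
# Rows 51 to 55 from the table above: (data a, data b, tabulated data d).
HELD_OUT = [
    (0.71087000, 0.66554973, 14.7314558),
    (0.57522492, 0.49524214, 12.0720505),
    (0.98434503, 0.46382326, 22.0008992),
    (0.41956168, 0.72661341, 7.52633344),
    (0.28166374, 0.43092937, 5.25167700),
]
# KM's predictions for those rows, as listed in the text.
KM_PREDICTIONS = [14.7341613, 12.0731391, 22.0080223, 7.52465867, 5.24861860]


def generating_formula(a, b):
    """23*a - 4.5*b + (a + b): the form that matches the tabulated data d."""
    return 23 * a - 4.5 * b + (a + b)


def absolute_errors(predictions, rows):
    """Absolute difference between each prediction and the tabulated value."""
    return [abs(p - d) for p, (_, _, d) in zip(predictions, rows)]


if __name__ == "__main__":
    for (a, b, d), err in zip(HELD_OUT, absolute_errors(KM_PREDICTIONS, HELD_OUT)):
        print(f"table {d:11.7f}  formula {generating_formula(a, b):11.7f}  "
              f"KM error {err:.4f}")
```

The largest of KM's five errors is about 0.007, on values that range from roughly 5 to 22.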
Now, let's see what happens when we try this with some real data.
Year  consumer price change  chg Exchange  GDP         growth      lab male    lab female  pop
1974             22.3000000    2.31190000  19684.0000  11.3000000  57.7000000  78.4000000  2229.80000
1975             2.60000000    2.48950000  21014.5000  6.80000000  57.3000000  79.3000000  2262.60000
1976             -1.9000000    2.45550000  21846.5999  4.00000000  57.6000000  78.5000000  2293.30000
1977             3.20000000    2.33850000  23416.9000  7.20000000  58.5000000  78.7000000  2325.30000
1978             4.80000000    2.16350000  25233.7000  7.80000000  60.0000000  79.8000000  2353.60000
1979             4.00000000    2.15900000  27400.5000  8.60000000  61.4000000  80.7000000  2383.50000
1980             8.50000000    2.09350000  29951.5999  9.30000000  63.2000000  81.5000000  2413.90000
1981             8.20000000    2.04780000  32855.6999  9.70000000  63.0000000  81.1000000  2532.80000
1982             3.90000000    2.10850000  36011.6999  9.60000000  63.4000000  81.5000000  2646.50000
1983             1.20000000    2.12700000  38487.5000  6.90000000  63.8000000  81.6000000  2681.10000
1984             2.60000000    2.17800000  41635.0999  8.20000000  63.4000000  81.2000000  2732.20000
1985             0.50000000    2.10500000  45096.9000  8.30000000  62.2000000  79.9000000  2736.00000
1986             -1.4000000    2.17500000  44367.9000  -1.6000000  62.3000000  79.4000000  2733.40000
1987             0.50000000    1.99850000  45386.4000  2.30000000  62.7000000  78.6000000  2774.80000
1988             1.50000000    1.94620000  49800.0000  9.70000000  62.9000000  78.5000000  2846.10000
1989             2.40000000    1.89440000  55594.0999  11.6000000  63.1000000  78.6000000  2930.90000
1990             3.40000000    1.74450000  60944.8000  9.60000000  66.0000000  79.0000000  3016.40000
1991             3.40000000    1.63050000  66413.8999  9.00000000  64.8000000  79.8000000  3089.90000
1992             2.30000000    1.64490000  71139.1000  7.10000000  65.3000000  79.9000000  3178.00000
1993             2.30000000    1.60800000  75810.8000  6.60000000  64.5000000  79.1000000  3259.40000
1994             3.10000000    1.46070000  85478.3000  12.8000000  64.9000000  79.6000000  3363.50000
1995             1.70000000    1.41430000  95208.8000  8.40859127  64.4000000  78.4000000  3467.50000
1996             1.40000000    1.39980000  102982.000  8.05859279  64.6000000  78.7000000  3612.00000
1997             1.54120230    1.36214172  126128.327  9.05264472  64.5783233  75.6854782  3766.99609
1998             1.03943836    1.28638827  169097.592  8.68398666  65.0783767  79.8487014  3999.81274

11.4000000  8.20000000  -74.284721  4.70188522

This is a table of economic data, again based on Singapore. KM was given this table, but with the last four years of GDP growth missing. Its predictions based on the remaining information (12.8, 8.04, 8.05, 9.05, and 8.68) are to be compared with the numbers immediately below the table (11.4, 8.2, -74.28, and 4.70). We see that the first two are pretty good and that the other two are way off.
I guess the lesson to be learned is that we have to be very careful about making assumptions about which variables affect what we are looking for; and that predictions must be limited in time. But on the other hand, there may be some real possibilities here.
The exploration continues.