DATA MINING

 

Data Mining involves things like Artificial Intelligence, Neural Networks, Genetic Algorithms, and so on. Frankly I know practically nothing about how these systems work; but I'm trying to get a handle on their application to 'real' problems. In this article I will describe some explorations I did with 'Knowledge Miner' - a piece of software for the Macintosh.

Mining Time-Series Data

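Suppose you throw a stone at 100 m/sec in a horizontal direction off a 500 m cliff. If air resistance is ignored, you can calculate its horizontal position, vertical speed and altitude at any time using the equations:

x = vt,
vy = gt, and
y = yo - (1/2)gt^2

where v is the horizontal speed, vy the downward speed, yo the starting altitude and g = 9.8 m/sec^2.

Here we see the calculated results at one-second intervals from 0 to 10 seconds.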

 Time  Xacc  Yacc  Xspeed  Yspeed   Xpos    Alt.  
 0.00  0.00  9.80  100.00   0.00    0.00  500.00 
 1.00  0.00  9.80  100.00   9.80  100.00  495.10  
 2.00  0.00  9.80  100.00  19.60  200.00  480.40  
 3.00  0.00  9.80  100.00  29.40  300.00  455.90  
 4.00  0.00  9.80  100.00  39.20  400.00  421.60  
 5.00  0.00  9.80  100.00  49.00  500.00  377.50  
 6.00  0.00  9.80  100.00  58.80  600.00  323.60  
 7.00  0.00  9.80  100.00  68.60  700.00  259.90  
 8.00  0.00  9.80  100.00  78.40  800.00  186.40  
 9.00  0.00  9.80  100.00  88.20  900.00  103.10 
10.00  0.00  9.80  100.00  98.00  1000.00  10.00

Each of the columns is a time series data set.
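
For anyone who wants to regenerate the table, here is a minimal Python sketch. It assumes constant-acceleration kinematics with no air resistance, so the altitude column follows yo - (1/2)gt^2. The function and constant names are my own and are only for illustration.

```python
G = 9.8            # m/sec^2, the constant Yacc column
SPEED = 100.0      # m/sec, horizontal launch speed (Xspeed column)
START_ALT = 500.0  # m, altitude at t = 0


def stone_state(t):
    """Return (Xpos, Yspeed, Alt.) at time t, ignoring air resistance."""
    x_pos = SPEED * t
    y_speed = G * t
    altitude = START_ALT - 0.5 * G * t * t
    return x_pos, y_speed, altitude


# Print Time, Yspeed, Xpos and Alt. for t = 0 .. 10 seconds.
for second in range(11):
    x_pos, y_speed, altitude = stone_state(float(second))
    print(f"{second:5.2f} {y_speed:7.2f} {x_pos:8.2f} {altitude:7.2f}")
```

The printed rows should agree with the Yspeed, Xpos and Alt. columns above.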

Now suppose these are observed results, and we do not know the equations which govern the behaviour of the falling stone. Can we find the equations? Can we, on the basis of the stone's position at the beginning of each of the first ten seconds, predict the stone's position at the beginning of the eleventh second?

This is not a tutorial on how to use Knowledge Miner. So I will just give you the results that were obtained when I tried it.

Knowledge Miner (KM) was provided with the first ten numbers in the last column (500.00, 495.10 ... 103.10) and asked to find the eleventh value (10.00). The number it came up with was 10.09. This was based on some sort of weighting of three functions similar to this one:

X1(t) = 1.97*X1(t-1) - 0.966*X1(t-2) - 13.3

where X1(t-1) and X1(t-2) are the two previous values.

Now, it is immediately obvious that this function is linear, whereas the 'real one' is not. But on the other hand, the predicted value is pretty close to what 'it's supposed to be'.
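
There is a reason a function that is linear in the two previous values can do so well here. For any series that is quadratic in time and sampled at equal intervals, the second difference is constant, so y(t) = 2*y(t-1) - y(t-2) - g*dt^2 holds exactly. The coefficients KM found (1.97 and -0.966) are close to the 2 and -1 of that recurrence. The Python sketch below is my own check of this against the no-air-resistance table; it is not a description of how Knowledge Miner works internally.

```python
# First ten altitude values from the no-air-resistance table (t = 0 .. 9 s).
ALTITUDES = [500.00, 495.10, 480.40, 455.90, 421.60,
             377.50, 323.60, 259.90, 186.40, 103.10]
G = 9.8   # m/sec^2
DT = 1.0  # seconds between observations


def next_altitude(prev1, prev2, g=G, dt=DT):
    """Second-difference recurrence for a series that is quadratic in time."""
    return 2.0 * prev1 - prev2 - g * dt * dt


# The recurrence reproduces every tabulated value from its two predecessors.
for t in range(2, len(ALTITUDES)):
    rebuilt = next_altitude(ALTITUDES[t - 1], ALTITUDES[t - 2])
    assert abs(rebuilt - ALTITUDES[t]) < 1e-9

# Predict the eleventh value from the ninth and tenth.
predicted = next_altitude(ALTITUDES[-1], ALTITUDES[-2])
print(f"predicted eleventh altitude: {predicted:.2f}")
```

So a model that is linear in X1(t-1) and X1(t-2) is not at odds with a curve that is non-linear in time.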

The data table shown above is taken from the article on ballistics. This article also contains a much more realistic data set in which air resistance has been taken into account. Here it is:

TIME INCREMENT: 0.010 SECONDS
BALLISTIC CONSTANT: 1.000
 Time   Xacc   Yacc Xspeed Yspeed    Xpos    Alt.
 0.00  -3.24   9.80 100.00   0.00    0.00  500.00
 1.00  -2.53   9.65  97.43   9.73   98.68  495.12
 2.00  -2.45   9.49  94.94  19.30  194.83  480.60
 3.00  -2.37   9.30  92.53  28.69  288.53  456.59
 4.00  -2.30   9.10  90.20  37.90  379.87  423.29
 5.00  -2.23   8.88  87.93  46.89  468.90  380.88
 6.00  -2.16   8.50  85.74  55.56  555.71  329.63
 7.00  -2.10   8.27  83.61  63.95  640.36  269.87
 8.00  -2.03   8.04  81.54  72.11  722.91  201.84
 9.00  -1.97   7.81  79.54  80.04  803.42  125.77
10.00  -1.92   7.58  77.59  87.73  881.96   41.89

Given the first ten values in the last column, KM came up with 42.03 for the eleventh. Pretty close, I'd say, and considerably more sobering if you have a look at the complexity of the equations (given in the article on ballistics) that produced the actual numbers.

Data mining involves looking for patterns in empirical data that can be used to make better-than-guesswork predictions.

Leaving aside legitimate objections for the moment, let's apply this notion to some economic data. This table shows the change in the consumer price index for Singapore for the years 1974 to 1998. In 1974 prices rose by 22.3%; and so on.

    consumer price change	
1974	22.3000000
1975	2.60000000
1976	-1.9000000
1977	3.20000000
1978	4.80000000
1979	4.00000000
1980	8.50000000
1981	8.20000000
1982	3.90000000
1983	1.20000000
1984	2.60000000
1985	0.50000000
1986	-1.4000000
1987	0.50000000
1988	1.50000000
1989	2.40000000
1990	3.40000000
1991	3.40000000
1992	2.30000000
1993	2.30000000
1994	3.10000000
1995	1.70000000
1996	1.40000000
1997	2.00000000
1998	-0.3000000

When KM was provided with the data for the years 1974 - 1997 and asked to predict the change in the consumer price index for the next three years, it came up with 2.34463334, 2.34696221 and 2.21553325. This is not a good prediction, as the actual value for 1998 was -0.3.

Another example. Here are the values for Singapore's Gross Domestic Product during this period:

	    GDP
1974	19684.0000
1975	21014.5000
1976	21846.5999
1977	23416.9000
1978	25233.7000
1979	27400.5000
1980	29951.5999
1981	32855.6999
1982	36011.6999
1983	38487.5000
1984	41635.0999
1985	45096.9000
1986	44367.9000
1987	45386.4000
1988	49800.0000
1989	55594.0999
1990	60944.8000
1991	66413.8999
1992	71139.1000
1993	75810.8000
1994	85478.3000
1995	95208.8000
1996	102982.000
1997	110733.699
1998	120712.500

Given the values for 1974 - 1997, KM predicted 116748.9 and 125279.3 for 1998 and 1999, versus the real value for 1998 of 120712.5.

In contrast to the last example, this prediction is much more 'in the ballpark'. It is also much less impressive. If the data is graphed, the general upward trend is readily discernible and any intelligent guess would also be pretty close.
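
To put a number on the 'intelligent guess', here is a small Python sketch of one possible rule of thumb: assume 1998 grows by the same ratio as 1997 did over 1996. The rule and the function name are my own choice for illustration; other sensible guesses would give somewhat different figures.

```python
# GDP values taken from the table above.
GDP = {1996: 102982.0, 1997: 110733.699}
ACTUAL_1998 = 120712.5


def extrapolate_last_growth(series, year):
    """Guess the value for `year` by repeating the previous year's growth ratio."""
    ratio = series[year - 1] / series[year - 2]
    return series[year - 1] * ratio


guess = extrapolate_last_growth(GDP, 1998)
error_pct = 100.0 * abs(guess - ACTUAL_1998) / ACTUAL_1998
print(f"naive guess for 1998: {guess:.1f} (off by {error_pct:.1f}%)")
```

This naive guess comes out at roughly 119,000, within about 1.5% of the real 1998 value, which is the sense in which KM's result is unimpressive here.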

It seems, then, that Data Mining in time series information - if it works at all - is only sometimes better than intelligent guesswork. These times are, as I understand it, when there are very large amounts of data to look at and when there is reason to suspect that there is an underlying pattern.

Making predictions on the basis of time-series models is the simplest application of data mining techniques. It gets quite a bit more involved than that. Next I'm going to investigate Input-Output modelling.

INPUT-OUTPUT MODELLING

Here is a set of artificial data, created using Excel, that is not time-series related. The columns data a and data b are simply random numbers. Data c was calculated using the formula =2*(data a) + 3*(data b). In the case of the first line of numbers: (2*0.711327) + (3*0.153794) = 1.884. Etc.

#	data a	data b	data c
1.00000000	0.71132700	0.15379400	1.88403600
2.00000000	0.62219935	0.83119106	3.73797189
3.00000000	0.33872289	0.80881084	3.10387831
4.00000000	0.54262732	0.35427095	2.14806749
5.00000000	0.50631348	0.71599532	3.16061290
6.00000000	0.00132503	0.22447315	0.67606951
7.00000000	0.76211535	0.94620700	4.36285170
8.00000000	0.91026206	0.89499186	4.50549970
9.00000000	0.92640874	0.47156928	3.26752532
10.0000000	0.49323546	0.27673696	1.81668179
11.0000000	0.04501477	0.30142353	0.99430013
12.0000000	0.49180000	0.17909135	1.52087404
13.0000000	0.06747225	0.85629071	2.70381663
14.0000000	0.84239974	0.41916601	2.94229750
15.0000000	0.84067115	0.44264313	3.00927169
16.0000000	0.40952997	0.20516224	1.43454665
17.0000000	0.10966830	0.26365868	1.01031263
18.0000000	0.60323225	0.55522712	2.87214585
19.0000000	0.09686064	0.47962391	1.63259301
20.0000000	0.59777858	0.99479464	4.17994107
21.0000000	0.08944736	0.67384674	2.20043494
22.0000000	0.06014718	0.91676159	2.87057914
23.0000000	0.72694087	0.49757195	2.94659757
24.0000000	0.86540192	0.32359363	2.70158473
25.0000000	0.22437308	0.77932968	2.78673521
26.0000000	0.00814655	0.21861716	0.67214460
27.0000000	0.25049354	0.30840101	1.42619010
28.0000000	0.01760057	0.06654875	0.23484739
29.0000000	0.78657626	0.17680760	2.10357534
30.0000000	0.63881337	0.99740651	4.26984627
31.0000000	0.74067272	0.35810086	2.55564802
32.0000000	0.11989300	0.68048251	2.28123353
33.0000000	0.23006162	0.64650203	2.39962933
34.0000000	0.50777425	0.06225799	1.20232247
35.0000000	0.64703427	0.73486782	3.49867198
36.0000000	0.34814598	0.35296414	1.75518438
37.0000000	0.67217705	0.66214549	3.33079057
38.0000000	0.14218590	0.61907854	2.14160742
39.0000000	0.18164762	0.17261367	0.88113625
40.0000000	0.45016614	0.29302707	1.77941350
41.0000000	0.03019514	0.75783563	2.33389718
42.0000000	0.91506069	0.02236730	1.89722327
43.0000000	0.88055584	0.15027279	2.21193006
44.0000000	0.04039877	0.96769558	2.98388430
45.0000000	0.94965348	0.75820167	4.17391198
46.0000000	0.50993780	0.31050420	1.95138820
47.0000000	0.67304307	0.16730296	1.84799503
48.0000000	0.29373339	0.96699594	3.48845461
49.0000000	0.07845276	0.69584199	2.24443147
50.0000000	0.07548299	0.52973340	1.74016616
51.0000000	0.72301849	0.97594044	

In the 51st row the two random inputs were provided, but the 'correct' output value of 4.37385831 was left out. KM was asked to provide the model equation and predict what the 51st value should be. It came up with 4.34791231 based on the equation: = 1.97*(data a) + 2.96*(data b) + 0.0324
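
For comparison, an ordinary least-squares fit recovers the generating formula almost exactly. The Python sketch below is not KM's algorithm (which I can't describe); it is a plain normal-equations fit, and to keep it short it uses only the first eight rows of the table. The helper names are my own.

```python
# First eight rows of the table above: (data a, data b, data c).
ROWS = [
    (0.71132700, 0.15379400, 1.88403600),
    (0.62219935, 0.83119106, 3.73797189),
    (0.33872289, 0.80881084, 3.10387831),
    (0.54262732, 0.35427095, 2.14806749),
    (0.50631348, 0.71599532, 3.16061290),
    (0.00132503, 0.22447315, 0.67606951),
    (0.76211535, 0.94620700, 4.36285170),
    (0.91026206, 0.89499186, 4.50549970),
]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows):
    """Least-squares fit of out = w_a*a + w_b*b + const via the normal equations."""
    design = [[a, b, 1.0] for a, b, _ in rows]
    target = [out for _, _, out in rows]
    size = 3
    ata = [[sum(r[i] * r[j] for r in design) for j in range(size)]
           for i in range(size)]
    atb = [sum(r[i] * t for r, t in zip(design, target)) for i in range(size)]
    return solve(ata, atb)


w_a, w_b, const = fit_linear(ROWS)
row_51 = (0.72301849, 0.97594044)
prediction = w_a * row_51[0] + w_b * row_51[1] + const
print(f"data c = {w_a:.4f}*(data a) + {w_b:.4f}*(data b) + {const:.4f}")
print(f"row 51 prediction: {prediction:.4f}")
```

Because the data contains no noise, the fitted weights land very close to 2 and 3 with a constant near zero.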

Sure, it's a simple case; but the result was pretty close! Let's try a hairier example.

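In this case data a, data b and data c all hold random numbers. The column headed data d is calculated using the formula: =23*(data a)-4.5*(data b)+(data a + data b), which works out to 24*(data a) - 3.5*(data b). For the first line of numbers: (24*0.36864296) - (3.5*0.65386625) = 6.5589. Data c does not appear in the formula, so it serves only as a distraction.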

#	data a	data b	data c	data d
1	0.36864296	0.65386625	0.40411185	6.55889922
2	0.06061834	0.54399706	0.83180874	-0.4491497
3	0.80648571	0.70749339	0.40493154	16.8794302
4	0.50394854	0.48992599	0.04397313	10.3800240
5	0.77446668	0.24861655	0.07142878	17.7170425
6	0.87444864	0.17139003	0.71340502	20.3869022
7	0.43278067	0.55030938	0.56201046	8.46065332
8	0.79971223	0.18515499	0.71604982	18.5450511
9	0.61849910	0.49100718	0.53662014	13.1254533
10	0.39283341	0.22820812	0.35768140	8.62927332
11	0.44329609	0.82225069	0.00034563	7.76122881
12	0.53533866	0.77234685	0.60572050	10.1449140
13	0.42973125	0.60189846	0.99236760	8.20690530
14	0.45612063	0.77200775	0.25352474	8.24486793
15	0.09944484	0.85909722	0.07780505	-0.6201641
16	0.40508627	0.56353591	0.33474803	7.74969469
17	0.69745747	0.94118228	0.77170042	13.4448414
18	0.56247703	0.29828192	0.08116068	12.4554621
19	0.63802394	0.24448065	0.29031742	14.4568924
20	0.25582291	0.64813418	0.41870592	3.87128023
21	0.53711130	0.18138763	0.32214463	12.2558144
22	0.61927711	0.13185221	0.99370445	14.4011680
23	0.13187562	0.36181360	0.38275221	1.89866734
24	0.58268236	0.73477809	0.22079027	11.4126533
25	0.46698298	0.45120940	0.59257884	9.62835870
26	0.53886887	0.44246637	0.92808073	11.3842205
27	0.67353019	0.95135621	0.89221016	12.8349779
28	0.48063655	0.54289506	0.60728367	9.63514451
29	0.98372972	0.42092103	0.34427530	22.1362897
30	0.07675168	0.98952878	0.33903848	-1.6213105
31	0.37343393	0.70591336	0.90826794	6.49171746
32	0.98644871	0.12413985	0.31071953	23.2402796
33	0.38883920	0.00106327	0.78785689	9.32841924
34	0.65369624	0.16205189	0.75383855	15.1215280
35	0.72295709	0.37290331	0.65972765	16.0458086
36	0.49470420	0.70129297	0.39658201	9.41837545
37	0.60957949	0.89147468	0.04329461	11.5097463
38	0.38411519	0.60662680	0.40765212	7.09557081
39	0.89310663	0.41158015	0.76275131	19.9940287
40	0.33997246	0.08086937	0.19193467	7.87629626
41	0.42944827	0.82276971	0.20175195	7.42706444
42	0.63269035	0.86322722	0.61726964	12.1632731
43	0.96582623	0.59070164	0.41647084	21.1123737
44	0.49212195	0.34099901	0.37149190	10.6174303
45	0.16256283	0.74084534	0.63330071	1.30854913
46	0.05344039	0.04936626	0.85757786	1.10978738
47	0.03735671	0.09155394	0.48347381	0.57612219
48	0.36258919	0.19973359	0.40761613	8.00307294
49	0.79488327	0.75988661	0.40937883	16.4175952
50	0.05777102	0.58054165	0.72077253	-0.6453912
51	0.71087000	0.66554973	0.91838774	14.7314558
52	0.57522492	0.49524214	0.69733116	12.0720505
53	0.98434503	0.46382326	0.70065546	22.0008992
54	0.41956168	0.72661341	0.34864766	7.52633344
55	0.28166374	0.43092937	0.28003185	5.25167700

KM was given this data, but with the last five entries in data d missing. It was asked to predict what they ought to be. It used this equation:

=  -0.0936*(3.98*data a - 2) + 0.6609*(4*data a - 0.575*data - 1.72) + 1.03

... to predict the next five values to be:

14.7341613
12.0731391
22.0080223
7.52465867
5.24861860

Sure, the equation seems to be quite different from the one that generated the data, but you have to admit that the results are right on!
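
To see just how close, here is a short Python sketch that lines the five predictions up against the data d values for rows 51 to 55 of the table. The variable names are my own.

```python
# data d for rows 51 .. 55 of the table, and KM's predictions for them.
ACTUAL = [14.7314558, 12.0720505, 22.0008992, 7.52633344, 5.25167700]
PREDICTED = [14.7341613, 12.0731391, 22.0080223, 7.52465867, 5.24861860]

errors = [p - a for p, a in zip(PREDICTED, ACTUAL)]
max_abs_error = max(abs(e) for e in errors)

for row, (p, a, e) in enumerate(zip(PREDICTED, ACTUAL, errors), start=51):
    print(f"row {row}: predicted {p:.4f}, actual {a:.4f}, error {e:+.4f}")
print(f"largest absolute error: {max_abs_error:.4f}")
```

Every prediction is within 0.01 of the tabulated value.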

Now, let's see what happens when we try this with some real data.

year	consumer price chg	Exchange	GDP	growth	lab male	lab female	pop
1974	22.3000000	2.31190000	19684.0000	11.3000000	57.7000000	78.4000000	2229.80000
1975	2.60000000	2.48950000	21014.5000	6.80000000	57.3000000	79.3000000	2262.60000
1976	-1.9000000	2.45550000	21846.5999	4.00000000	57.6000000	78.5000000	2293.30000
1977	3.20000000	2.33850000	23416.9000	7.20000000	58.5000000	78.7000000	2325.30000
1978	4.80000000	2.16350000	25233.7000	7.80000000	60.0000000	79.8000000	2353.60000
1979	4.00000000	2.15900000	27400.5000	8.60000000	61.4000000	80.7000000	2383.50000
1980	8.50000000	2.09350000	29951.5999	9.30000000	63.2000000	81.5000000	2413.90000
1981	8.20000000	2.04780000	32855.6999	9.70000000	63.0000000	81.1000000	2532.80000
1982	3.90000000	2.10850000	36011.6999	9.60000000	63.4000000	81.5000000	2646.50000
1983	1.20000000	2.12700000	38487.5000	6.90000000	63.8000000	81.6000000	2681.10000
1984	2.60000000	2.17800000	41635.0999	8.20000000	63.4000000	81.2000000	2732.20000
1985	0.50000000	2.10500000	45096.9000	8.30000000	62.2000000	79.9000000	2736.00000
1986	-1.4000000	2.17500000	44367.9000	-1.6000000	62.3000000	79.4000000	2733.40000
1987	0.50000000	1.99850000	45386.4000	2.30000000	62.7000000	78.6000000	2774.80000
1988	1.50000000	1.94620000	49800.0000	9.70000000	62.9000000	78.5000000	2846.10000
1989	2.40000000	1.89440000	55594.0999	11.6000000	63.1000000	78.6000000	2930.90000
1990	3.40000000	1.74450000	60944.8000	9.60000000	66.0000000	79.0000000	3016.40000
1991	3.40000000	1.63050000	66413.8999	9.00000000	64.8000000	79.8000000	3089.90000
1992	2.30000000	1.64490000	71139.1000	7.10000000	65.3000000	79.9000000	3178.00000
1993	2.30000000	1.60800000	75810.8000	6.60000000	64.5000000	79.1000000	3259.40000
1994	3.10000000	1.46070000	85478.3000	12.8000000	64.9000000	79.6000000	3363.50000
1995	1.70000000	1.41430000	95208.8000	8.40859127	64.4000000	78.4000000	3467.50000
1996	1.40000000	1.39980000	102982.000	8.05859279	64.6000000	78.7000000	3612.00000
1997	1.54120230	1.36214172	126128.327	9.05264472	64.5783233	75.6854782	3766.99609
1998	1.03943836	1.28638827	169097.592	8.68398666	65.0783767	79.8487014	3999.81274
							
				11.4000000
				8.20000000
				-74.284721
				4.70188522

This is a table of economic data, again based on Singapore. KM was given this table, but with the last four years of GDP missing. Its predictions based on the remaining information (12.8, 8.04, 8.05, 9.05, and 8.68) are to be compared with the numbers immediately below the table (11.4, 8.2, -74.28, and 4.70). We see that the first two are pretty good and that the other two are way off.

I guess the lesson to be learned is that we have to be very careful about making assumptions about which variables affect what we are looking for; and that predictions must be limited in time. But on the other hand, there may be some real possibilities here.

The exploration continues.