
Science Tribune - Article - October 1997


Have you heard about data mining?

Christopher Le Bret

PMSI (a), 52 rue Mouffetard, F 75005 Paris, France; tel/fax : 33 - 01 45 35 87 99
E-mail : pmsi@altern.com.

Introduction by the editor

The vocation of statistics is to describe and predict. "Seven people out of ten enjoy watching comedies on television; if you are inactive or lazy, and with a sense of humour, it is highly likely that you watch comedies on television". Simple enough? Well, yes and no. Were the people in question sufficiently diverse, had no features been omitted in their description, was the concept of a comedy defined, what else do the people do that might be relevant? By increasing the amount of information and relying on databases from several sources, we can refine our statements so that they become more solid bases for making, for instance, predictions about behaviour patterns. The art of statistics is, in fact, the efficient handling of vast amounts of complex and heterogeneous data to make accurate and useful predictions. Increasingly sophisticated tools have been developed for this but, alas, they are not always user-friendly nor do they always provide clear, easy-to-read outputs.

Without a doubt, we are overwhelmed by more facts and figures than we can process in our daily personal and professional lives. There are the archives from the past, but also the reams of data constantly produced by our new wizard technologies. The human brain is one of the best organs for description and prediction but it cannot face this onslaught of information without resorting to selection. It selects on the basis of its past experience, often even when it resorts to the use of a computer. This is because there is an all-pervasive tendency to make hypotheses and ask guided questions rather than to try and find out - without any foregone notion - what is the crux of the matter. Through this permanent selection, we forget to ask other questions that might also be relevant.

Is this a disguised appeal for a return to the notions of data classification, taxonomy, phylogeny... of the 18th century? Computer algorithms derive from a form of mathematics relating to specific cases that was used in Mesopotamia well before the Greeks invented our modern mathematics based on general principles. So why not just go back a couple of centuries to refresh our concepts on how to handle observational data?

Well, indeed why not, if we can find suitable up-to-date methods! And this is where data mining (b) comes in. Data mining is based on traditional and more sophisticated prediction methods (e.g. neural networks, genetic algorithms) but places the tool in the hands of the person with a problem to solve, not the statistician. Let us unearth the rare minerals (crucial information) from the rocks and sludge (databases of heterogeneous and highly redundant information) that will make us prosper. For let us not delude ourselves, statistics is used to obtain power through information. Its earliest applications were for... taxation and military service!


Data mining versus statistics

Data mining is the descendant of - and, to some, the heir and successor to - statistics. Statistics and data mining have the same aim, which is to build compact and understandable models that incorporate the relationships ("dependencies") between the description of a situation and a result (or a judgement) relating to this description. The underlying assumption is that there is a relationship; in other words, the result, measurement or judgement we want to model is derived from some or all of the "descriptive variables" we have gathered. The main difference is that data mining techniques build the models automatically whereas classical statistics tools need to be wielded by a trained statistician with a good idea (a "dependency hypothesis") - or maybe a preconceived notion - of what to look for.

Data mining techniques offer enormous gains in terms of performance, user-friendliness and time-to-result. End-users who can build good statistical models without subcontracting to - or negotiating with - a statistician are freed from many technical and organisational constraints. Statistical modelling is, at last, out of the hands of the specialists. This turn of events is nothing new. The same happened with the telephone and the automobile: at first, they were handled only by trained operators or drivers but, with advances in technology, they ended up being used by the public at large, with little or no training and no specialist intervention.

There are already data mining success stories from areas as diverse as manufacturing process control, human resources, and food-service management.

Why data mining is a good idea

By building dependency hypotheses instead of merely verifying them, data mining techniques occasionally unearth gems such as the link between the fatal Reye's syndrome (c) and children taking aspirin or, on a totally different level, the correlation between supermarket purchases of diapers and beer on a Saturday. Specialists in the retail business with knowledge of their clientele's habits will recognize in this correlation the portrait of a young couple with small children shopping on Saturdays - the mother buys diapers while the father stocks up on beer for the Sunday football match. This second example is a good illustration of the need to know one's field of business even when using powerful tools. Although data mining does away with the statistician, it does not dispense with expert knowledge in the application domain. However, surprise correlations are not encountered all that often and the main 'selling point' of data mining remains its ease and speed of use.

Data mining techniques offer many other advantages. They can handle very large or small datasets with ease. They are not hampered by large numbers of predictive variables; this is extremely useful for "selecting variables", i.e., identifying those within a set that are most relevant.

Many kinds of statistical models (predictors)

Statistical models (also called predictors when used to make forecasts) that are produced by data mining techniques - like everything else produced by machines - must be inspected by people who can understand and check what has been produced. The predictor must therefore be easy to read and preferably in a form with which the user is already familiar.

Examples of predictors are decision/regression trees, decision rule bases ("expert systems"), scorecards and neural networks (d).

There is a tug-of-war between the readability and performance of predictors, as illustrated in Figure 1 below. The simpler a predictor's form, the easier it is to understand but the less likely it is to account for subtle or highly varied ("non-linear") dependencies.

Fig. 1. Relationship between model readability and model performance

Decision trees and rule bases are highly descriptive and easy to understand but only handle 'hard' comparisons with Yes or No outcomes. They lack predictive finesse.

Scorecards, using either fixed weightings or logistic functions, are a bit better but, because they are additive-only (they score each variable separately, then sum up the separate "weightings"), cannot model interactions between variables (e.g. risk increases as a function of age for homeowners; it decreases as a function of age for tenants).
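The additive-only limitation is easy to see in a minimal scorecard sketch (the variables, thresholds and point values below are invented for the illustration):

```python
# A minimal additive scorecard sketch. Each variable is scored on its own;
# the per-variable points are then simply summed.
SCORECARD = {
    "age": lambda v: 10 if v >= 40 else 0,
    "income": lambda v: 20 if v >= 30000 else 5,
    "homeowner": lambda v: 15 if v else 0,
}

def score(applicant):
    # Purely additive: the points given for "age" are the same whether the
    # applicant is a homeowner or a tenant, so an interaction such as
    # "risk rises with age for homeowners but falls with age for tenants"
    # cannot be expressed in this form.
    return sum(rule(applicant[name]) for name, rule in SCORECARD.items())

print(score({"age": 45, "income": 35000, "homeowner": True}))  # 45
```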

Neural networks (d) are the kings of the castle on the statistical modelling mound. They gracefully handle missing values as well as very noisy data but just cannot be meaningfully inspected. Inspecting one is as senseless as trying to find thoughts in a slice of dissected brain, or the picture in the interference pattern of a hologram. A neural net is extremely predictive but not descriptive at all.

Users cannot access the neural net's "reasoning" but can inspect and display the predictions it makes. With knowledge of their field and the help of powerful display tools (e.g. SAXVIEW (a)), users can fathom the explanations for the decision. Moreover, as more and more problems are solved successfully, they begin to place trust in the tool and consider that the improved prediction quality more than makes up for the partial loss in clarity.

None of these predictors are new. Classification and regression trees (CART) were used by social scientists in the sixties; rule bases were made popular by the vogue of "expert systems" in the eighties; and scorecards have been familiar to bankers for decades. Even neural networks have been with us since the forties but it took the fabulous improvements in computers for them to become practical and to bring them into the limelight.

How to build predictors

Although the predictors are not new, the techniques for building them are. They are no longer built by direct computation from the data as in the past but by "adaptive" methods borrowed from the realm of "artificial intelligence". There are two main techniques:

- supervised learning, i.e., starting from a model built with random values and gradually improving its fit to the data. The most common learning method is "backpropagation".

- the evolutionary technique or "genetic programming", i.e., starting from a population of thousands of random models and having them "evolve" in a competitive, Darwinian way. The process of generating useful knowledge from raw data is called induction and belongs to the realm of data mining.
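The first technique - a supervised-learning loop driven by backpropagation - can be sketched on the classic XOR toy problem. The network size, learning rate and iteration count below are arbitrary choices for the sketch:

```python
import numpy as np

# Minimal backpropagation sketch: a one-hidden-layer network learning XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4))   # the model starts with random values...
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # forward pass: compute the model's current predictions
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the prediction error towards the inputs
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # ...and the fit to the data is gradually improved
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round())  # predictions for the four XOR cases after training
```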

Supervised learning is traditionally - though not necessarily - applied only to neural networks (the "backpropagation" technique can in fact be applied to a large variety of objects), whereas genetic algorithms (e) are used to build rule bases, scorecards, and decision/regression trees.

This ability to "learn" any relationship, however complex, has been established with rigour ("Kolmogorov's Theorem"). Neural networks are bona fide "universal approximators" not a programming hack's wild dream. "Backpropagation" always leads to the "right" model (it is said to "converge"), in other words, the statistical model that is the "best" of its type for the amount and quality of data available is guaranteed. What the mathematicians do not specify is the time needed to reach this "best" solution. This time depends on the quality of the data mining software tools which vary widely in performance.

Genetic algorithms (e) do not calculate a solution to a problem but simply foster and select the best solutions emerging at random. Users do not have to know how the problem is solved; they only have to assess the quality of a solution. To solve a problem with a genetic algorithm, all they have to do is write a "fitness function", nothing else! In time, the process mathematically converges to yield the best possible solution but, as for neural networks, "in time" arrives sooner with skilfully written software. Genetic algorithms can be used to produce a variety of objects (e.g. statistical predictors) as long as their quality (or "fitness") can be evaluated. For classifiers (see definition in (d)), fitness is just the rate of correctly predicted memberships on validation data.
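To show how little the user has to write, here is such a fitness function for a classifier - simply the rate of correct predictions on validation data. The candidate classifier and the toy validation set are invented for the sketch:

```python
# The only problem-specific piece a genetic algorithm needs: a fitness function.
def fitness(classifier, validation_set):
    # fraction of validation examples whose membership is predicted correctly
    correct = sum(1 for features, label in validation_set
                  if classifier(features) == label)
    return correct / len(validation_set)

# A trivial candidate classifier evaluated on invented validation data.
candidate = lambda features: features["income"] > 20000
data = [({"income": 30000}, True), ({"income": 10000}, False),
        ({"income": 25000}, True), ({"income": 15000}, True)]
print(fitness(candidate, data))  # 0.75
```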

Both the supervised learning and evolutionary techniques evaluate the importance of each variable for the prediction ("relevance"). Once the most predictive variables have been identified, one can, if one so wishes, address the simplified problem with more customary methods.

Supervised learning (neural networks) and genetic algorithms are techniques that are only becoming popular now, for two reasons. First, they need considerable processing power. A learning process that takes 30 minutes on a modern PC would, in 1982, have required 320 hours' worth of an IBM-XT's processing power (13 days and nights!). Even the inventors of backpropagation, back in the '70s, would have needed all of NASA's resources to build the credit scoring predictors on the market today. Second, with the rampant computerisation of society, we have more and more records of all kinds to work with. The problem is no longer how to create data but how to sift through too much of it, i.e. how to "data mine".

Text mining

The above techniques are designed for the numerical processing of quantitative data and for extracting "higher-order" knowledge from raw data. Still more pressing is the problem of automatically extracting meaning from texts. One need look no further for proof than the wealth - or is it glut? - of responses spewed out by the Internet's search engines.

Literal searching showed its limits long ago when faced with a multitude of problems: misspellings, synonymy, multiple meanings, and so on. There can be no truly intelligent query without at least some beyond-the-syntax understanding on the part of the computer. However, with the computer power now available, one finds that rather inelegant, brute-force methods - precomputing massive lexicons for instant comparison, etc. - can nevertheless do the job.
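A toy sketch of such brute-force matching against a lexicon (the lexicon is invented): Python's standard difflib catches a misspelling by raw string similarity, with no understanding involved:

```python
import difflib

# Brute-force lexicon matching: compare a (misspelled) query word against a
# precomputed lexicon and return the closest entry by similarity ratio.
lexicon = ["statistics", "prediction", "classification", "regression"]
print(difflib.get_close_matches("statistcs", lexicon, n=1))  # ['statistics']
```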

The choice here is between simple word-counting, on the one hand, and providing a computer with common sense or "world knowledge", on the other. Computational linguists are hard at work on the knowledge problem and probably will continue working hard for quite a while yet !

There are a few techniques lying between the very simple and the very complex, some of which have been pioneered by PMSI (a). "Similarity searches", "context matching" or "conceptual clustering" can already reduce the amount of unwanted information by a few orders of magnitude but, overall, this remains a very open problem.
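A minimal sketch of the idea behind a "similarity search" (the documents and query below are invented): texts are reduced to word counts and ranked by cosine similarity, so related documents surface even without a literal phrase match.

```python
from collections import Counter
import math

# Compare two bags of words by cosine similarity.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["neural networks learn from examples",
        "genetic algorithms evolve random solutions",
        "networks of neurons learn behaviour models"]
query = Counter("how do neural networks learn".split())

# Rank documents by similarity to the query's vocabulary.
ranked = sorted(docs, key=lambda d: cosine(query, Counter(d.split())),
                reverse=True)
print(ranked[0])  # neural networks learn from examples
```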


The advent of data mining is only the latest step in the extension of quantitative, "scientific" methods. Data mining enables every non-statistician - that is, virtually all of us - to study, understand and improve our operational or decisional processes, whether in science, business or society.

For the first time, thanks to the increased power of computers, the skill of the statistics artisan is being replaced by massive computational methods that provide as good or better results in far less time and without the need for any specialist knowledge in statistics. Moreover, the end-user is on safe ground because these "newfangled" tools and techniques are either not really new, or mathematically confirmed, or both.

Data mining and its relatives (text and image mining) are probably the most useful ways to take advantage of the large-scale processing power available on many desktop computers and, definitely, the most promising and exciting research field in advanced informatics.


(a) PMSI is a software company that provides custom-made tools for building several predictors such as rule bases (Galvano Rule Bases), scorecards (Galvano Score), decision/regression trees (Galvano Decision Trees), neural networks (GalvaNeurones). It also provides display tools (SAXVIEW) and software for text mining.

(b) Data mining or database mining

(c) Reye's syndrome: A rare disorder occurring in childhood characterized by the symptoms of encephalitis combined with evidence of liver failure.

(d) Neural networks are a statistical analysis tool for building behaviour models from a collection of examples of this behaviour (defined by a series of numeric or textual "descriptive variables"). The neural net, which is ignorant at the start, becomes, through a "learning" process, a model of the dependencies between the descriptive variables and the behaviour to be explained. The model is built automatically and straightforwardly from the data without the intervention of a skilled, and often costly, technician.

For instance, let us consider data on people who have taken out loans in the past: age, income, married or single, etc., on the one hand, and payment delays or problems, on the other. On the basis of these data, we can obtain a model that contains the relationship (if any) between each of the chosen variables and the outcome of the loan. We can then ask the model for a prediction regarding data relating to new customers. This kind of statistical model (or predictor) is called a classifier.
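The loan example can be sketched with a deliberately simple classifier - a one-nearest-neighbour rule on invented past-borrower data standing in for the neural net:

```python
# Toy classifier in the spirit of the loan example (all data invented):
# predict a loan outcome from the most similar past borrower.
past = [  # ((age, income in kF), outcome of the loan)
    ((25, 15), "problem"), ((45, 60), "repaid"),
    ((30, 20), "problem"), ((50, 80), "repaid"),
]

def predict(age, income):
    # classify a new applicant by the closest past borrower
    dist = lambda b: (b[0][0] - age) ** 2 + (b[0][1] - income) ** 2
    return min(past, key=dist)[1]

print(predict(48, 70))  # repaid
```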

In each case, the neural net is built specifically to fit available data. (Of course, thorough statistical checks are made to avoid the evil effects of "overfitting".) The model does not blindly repeat the past nor does it fetch a prebuilt model that fits more or less well from a library. If there is a relationship between the descriptive variables (borrower description, previous values of a share, measurements, operation point, etc) and the outcome (loan outcome, share's next price, nature of a fault, process control variables, etc), there is a neural network that embodies it.

The neural net is robust and insensitive to a small proportion of erroneous or missing values. The examples in which these values occur are just ignored because they are inconsistent with the other examples.

(e) The genetic algorithm is a rather brutish approach that provides solutions to problems for which there is no well-defined solving method or for which, if one exists, applying it would take far too long. Most interesting open problems are of this kind. These problems are often characterised by multiple, complex, even contradictory constraints that have to be satisfied at the same time. Examples are crew and team planning, delivery itineraries, finding the most beneficial locations for stores or warehouses, building statistical models, etc. Another example is finding the shortest path that links a number of cities. The only exact solution to this question is to try all the answers and compare them, but this would take geological time!

The genetic algorithm works by creating many random solutions to the problem. Being random, these starting solutions are not very good: schedules overlap and itineraries do not pass through all the locations required. This population of starting solutions is then subjected to a quasi "evolution of species".

The solutions are coded the only way the computer knows how, that is, as a series of zeroes and ones. The evolutionary process consists in treating these zeroes and ones as genetic "chromosomes" which, like their real-life biological counterparts, "mate" by hybridisation, with an occasional spontaneous mutation thrown in here and there. The "offspring" will include some solutions that are better than the starting random solutions. The best offspring are then added to the population ("survival of the fittest"); the others are left aside. By repeating this crossover-cum-selection process, each time with the best elements, populations of increasingly good solutions are automatically generated by a form of "selective pressure".

The evolution and selection process is problem independent; only the fitness function and the one that decodes the chromosomes into a readable form are problem specific. The approach thus relies on a problem-independent "engine", requiring little problem-related work.
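The whole process fits in a short, problem-independent engine. In this sketch, the population size, mutation rate and the toy "count the ones" fitness are arbitrary choices for illustration:

```python
import random

# A minimal problem-independent genetic "engine": bit-string chromosomes
# mate by single-point crossover, mutate occasionally, and survive
# according to a user-supplied fitness function.
def evolve(fitness, n_bits=16, pop_size=50, generations=100):
    random.seed(1)  # fixed seed so the sketch is reproducible
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # survival of the fittest
        parents = pop[:pop_size // 2]              # the best half survives
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)      # single-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.05:             # occasional mutation
                i = random.randrange(n_bits)
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Only the fitness function is problem specific; "count the ones" is a toy one.
best = evolve(fitness=sum)
print(best)
```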

The main differences between an ad hoc and a genetic algorithm-based approach are listed in the table below:

                         Ad hoc approach*              Genetic approach
Approach                 Analytical, specific          General
Speed                    Generally good                Average or low
Performance              Depends on solution           Good to excellent
Problem understanding    Necessary                     Not necessary
Human work               Few mins to several theses    A few days
Intermediary steps       Are not solutions             Are solutions**

* Most interesting problems have no usable mathematical expression, or are non-computable, or "NP-complete" (too many solutions to try them all).
** In the ad hoc approach, one must wait until the end of computation to obtain a solution; in the genetic approach, the solving process can be interrupted at any time to give possible solutions.


Goldberg DE. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA, 1989.

Rudolph G. Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks 5(1), 96-101, 1994.