No Watermelons Allowed: Data mining ramblings

Sunday, June 06, 2004

Data mining ramblings

Rob expresses concern about potential govt abuse of information here. I agree in principle, but then I followed the link. The concern is about what is called "data mining". Now watch as I attempt to do some justice to such a vast and interesting field in a blog length post.

One source defines data mining as "the process of finding new and potentially useful knowledge from data". That's not too specific - let me try another angle. Historically we've gathered data in response to a specific need. Nowadays we have scads of data collected for innumerable purposes and we're attempting to learn from it in ways not anticipated when it was collected. IOW we're reusing others' data. Doing so effectively often requires a different way of thinking with new tools and methodologies, and it's new enough that definitions as vague as the above will have to do for now.

You can find any number of books attempting to describe data mining. Let me assure you that in many cases it is a lot like trying to draw conclusions about individual cattle by examining products from McDonald's. You can't. The best you can do is create a list of cattle like the ones you're eating.

The scope of application is almost limitless, but let's take it down to earth with an example of an application in a marketing context. XCo wanted to cross-market various products and they were looking for the most effective ways of identifying prospects. They went through millions of records for people who had bought various products and augmented that with information available from other sources. Then analyses of the sales and demographic information permitted assigning various coordinates to the customers.

Concocting these coordinates well is something of a black art performed by experienced modelers. Consider "cluster analysis", for instance. You'll start with some customer information. You'll use it to construct some orthogonal "dimensions", conceptually not unlike the rectangular (Cartesian) coordinates from your calculus class. This is done by powerful computers (big Unix boxes usually, but sometimes mainframes or others) using sophisticated software like SAS.

The "value" of any of these dimensions might be a very complex function of the customer data. Here is an example which is totally artificial but which gives a bit of the flavor: arbitrary coordinate value = .003*(owns a house) + .044*(reads Wall Street Journal) - .34*(plays the horses) + .28*(went on a cruise in the last 5 years) + .00082*(household income) + .9*(number of cars owned). The form of these equations varies with the data available and the type of modeling chosen by the modeler. Incidentally, there is no "right" answer for the models, and different modelers would come up with different models - what's important is how well they predict consumer behavior.
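For the programmers in the audience, here's a minimal sketch of that artificial equation in Python. The weights are the made-up ones from the paragraph above; the field names and the sample customer are my own hypothetical inventions, and yes/no items are assumed to be coded as 1 or 0.

```python
# Weights are the artificial ones from the example equation.
# The field names are hypothetical; yes/no attributes are coded as 1 or 0.
WEIGHTS = {
    "owns_house": 0.003,
    "reads_wsj": 0.044,
    "plays_horses": -0.34,
    "cruise_last_5_years": 0.28,
    "household_income": 0.00082,
    "cars_owned": 0.9,
}


def coordinate_value(customer):
    """Weighted sum of customer attributes; missing attributes count as 0."""
    return sum(weight * customer.get(name, 0) for name, weight in WEIGHTS.items())


# A made-up customer, purely for illustration.
example_customer = {
    "owns_house": 1,
    "reads_wsj": 1,
    "plays_horses": 0,
    "cruise_last_5_years": 1,
    "household_income": 50000,
    "cars_owned": 2,
}

score = coordinate_value(example_customer)
print(round(score, 3))
```

Notice that income swamps everything else in this toy version simply because its raw numbers are so much bigger. A real modeler would typically rescale the inputs first, which is part of the black art.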

Alright, now you've come up with several dimensions like the one above and you "plot" (conceptually) where your customers fall in the "space". If your data was relevant and you've developed a good set of dimensions, hopefully you'll find that your customers are "clustered" around certain sets of coordinates. (You'd like to see "clouds" to study instead of a uniform fog, so to speak.)
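To make the "clouds" idea concrete, here's a rough sketch using k-means, which is just one common clustering technique and not necessarily what any given modeler (or SAS procedure) would use. The points, the starting centers, and the function name are all hypothetical, and a real job would involve far more customers and dimensions.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: recompute each center as the mean of its members.
        new_centers = []
        for i, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep a center that attracted no points where it was.
                new_centers.append(old_center)
        centers = new_centers
    return labels, centers


# Six made-up customers in a two-dimensional "space": two obvious clouds.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

With data this tidy the two groups fall out immediately. With a uniform fog the algorithm would still hand back "clusters", which is why the modeler has to judge whether they mean anything.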

A lot of effort has gone into studying such marketing clusters. For an example of some identified by a major consumer research organization, look here. IMO it is clear that these different groups of people exist in a meaningful sense and will respond differently to marketing initiatives. In particular, some will be more profitable than others for a given firm, so effective marketers will take these factors into account. But those are generic classifications, and the more specific classifications that might arise from a well-crafted model are more useful for a particular firm.

For instance, suppose you're interested in cross-marketing product A with product B. Generic models cannot have taken your sales for these products by customer into account if only because you wouldn't release such detailed information. But with your own model, you'd look at the coordinates for the customers which have bought A and B. Then you'd look at the ones who had bought only one of the two to see which ones were closest to the ones who had bought both. Then you'd concentrate on those people to try to sell them the product they didn't already have.
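Here's a small sketch of that cross-marketing step, assuming you already have coordinates for each customer. The customer names, coordinates, and purchase records are invented, and "closest" is taken to mean straight-line distance to the nearest customer who bought both products, which is only one of several reasonable choices.

```python
import math

# Hypothetical customers: model coordinates plus the products each has bought.
customers = {
    "ann": {"coords": (1.0, 1.0), "bought": {"A", "B"}},
    "bob": {"coords": (1.5, 1.0), "bought": {"A"}},
    "cat": {"coords": (9.0, 9.0), "bought": {"B"}},
    "dan": {"coords": (1.0, 2.0), "bought": {"A", "B"}},
    "eve": {"coords": (2.0, 2.0), "bought": {"B"}},
}


def cross_sell_prospects(customers, products=frozenset({"A", "B"})):
    """Rank single-product customers by distance to the nearest customer
    who bought both products. Assumes at least one such customer exists."""
    both = [c["coords"] for c in customers.values() if products <= c["bought"]]
    prospects = []
    for name, c in customers.items():
        owned = c["bought"] & products
        if len(owned) == 1:
            distance = min(math.dist(c["coords"], b) for b in both)
            missing = next(iter(products - owned))
            prospects.append((distance, name, missing))
    # Smallest distance first: these are the people to contact first.
    return sorted(prospects)


ranked = cross_sell_prospects(customers)
for distance, name, missing in ranked:
    print(f"{name}: offer product {missing} (distance {distance:.2f})")
```

In this toy data bob and eve sit near the customers who bought both, so they head the list, while cat is far away and would probably be skipped.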

The final deliverable in this example might be a list of names of existing customers who were most likely to respond to your initiative based on their past purchase activity and other characteristics. The results are transmitted to the sales force/direct mail/telemarketers/whatever for further action. They report on their results, and this information is used to refine future models.

(Don't let the above suggest that applications of data mining are limited to marketing. These are limited primarily by imagination and the availability of data. I'm particularly interested in those related to genetics, medicine, and bioinformatics.)

So what did we get for our trouble? Marketing costs are lower and the customers most likely to be interested are the ones contacted. Consumers get fewer solicitations, and the ones they get are more likely to be interesting. Everybody wins. Sheesh, is this something to fuss about?

Maybe not, in the commercial arena. But what about putting such information in govt hands, outside of models which do not consider personal information, such as the Consumer Price Index or perhaps the distribution of variations of the human genome?

Hmm. Ask yourself this question - will the lack of good information prevent the govt from creating models? No, of course not. They'll just make lousier models, with more "misses" and "false positives" than they would have otherwise, and the resulting policies will be less useful and more expensive.

One can argue whether the govt is justified in creating and using such models with the intent of applying the results to individuals. If a marketer uses them, about the worst outcome is a telemarketer phone call, while upsides might include special interest rate offers and other deals that might be interesting. But IMO the govt's powers and its obligations to treat all citizens equally under the law add complications that render almost anything it does with such modeling inherently suspect.

Do I have a conflict of interest? Yep - that's why I know about stuff like this. I'm currently in the DC area, so govt work in this area potentially represents money in my pocket. Anyway, take it for what you paid for it...
