O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Calculating Entropy for Data Miners
Pages: 1, 2, 3, 4, 5

Information Gain

What I have called the mutual information score, Han and Kamber refer to as the Gain score. While the formulas for computing this score appear different, the mutual information score (I(age; buys_computer) = 0.246) is the same value as their gain score (Gain(age) = 0.246).

In the datamining literature, the term gain often appears instead of mutual information, probably because the standard formulas differ and the gain score is sometimes "corrected" to make it more useful for datamining. Be careful, however, in correcting the computation too much, or else you will lose the poweful consistency properties of the information theory calculus.

The term gain is more suggestive of why you might want to compute the mutual information score in a datamining context, and I will use it from here on in. According Han and Kamber, "Gain(A) is the expected reduction in entropy cause by knowing the value of attribute A." Below is some code that computes the gain score for each of the columns in the AllElectronics.Customers table to determine which column most reduces the entropy in the buys_computer column.

<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Example of how to compute gain scores for several columns.
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();

$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConsequent('buys_computer');

$columns = array('age','income','student','credit_rating');

foreach($columns as $column) {
  $ce->setAntecedent($column);
  $ce->analyze();
  $gain = $e->bits - $ce->bits;
  echo "gain($column): $gain<br />";
}
?>

The setAntecedent() and setConsequent() methods in the ConditionalEntropy.php class exist to handle situations where you want to compute the conditional entropy score for a target column in terms of several other columns. The script outputs the following set of gain scores:

gain(age): 0.246749819774
gain(income): 0.029222565659
gain(student): 0.151835501362
gain(credit_rating): 0.0481270304083

The report shows that the age column reduces the entropy in the buys_computer column the most. You might consider developing a script that ranks the gains from highest to lowest for use in constructing decision tree classifiers, discussed in the next section.

Datamining Applications

Three common datamining applications of entropic analysis are data description, classifier construction, and data discretization. You can easily incorporate the entropy scores calculated previously into web-based tools that report the entropy, joint entropy, conditional entropy, and mutual information/gain scores between the various columns in the database. Use this information much as you would use means, standard deviations, and correlation coefficients to provide a useful descriptive summary of your data sets. Dorian Pyle's book Data Preparation for Data Mining provides examples of such entropic analysis reports along with a useful discussion of how to use such reported entropy information used in the early "data survey" stage of data mining.

Related Reading

MySQL Cookbook
By Paul DuBois

The previous article also developed several such reports; however, they reported only univariate entropy scores. With the new entropy classes and sample code discussed in this article, you should be able to develop a variety of new entropy-based reports to characterize your data columns.

A more advanced and very common application of entropic analysis to data mining is in the construction of classifiers in the form of decision trees. A decision tree might be useful for determining whether an All Electronics customer is likely to buy a computer. This decision tree would involve first asking the age of the customer. If the age were younger than 30 years old, it might then ask whether the customer is a student. After that, it might ask his income range. Eventually it would reach a terminal node in the decision tree that classifies the instance as to whether the customer is likely to buy a computer.

It is beyond the scope of this article to go too deeply into the theory of constructing decision trees, except to note that constructing such trees uses the recursive computation of gain scores extensively. The age attribute forms the root of the decision tree because it has the highest information gain score. It then takes the partitioned customers (in the ranges of <30, 31..40, >40) and computes the gain score again using the remaining attributes. The remaining attribute producing the highest gain score becomes the second-level branch in the decision tree. This process repeats until there is very little gain to be had by adding more attribute tests to the decision tree. See Quinlan's book and the CART literature for more details on how to use gain scores to construct decision trees.

In addition to data description and classifier construction, you can also use entropic analysis to discretize your data. For example, if customer ages were recorded in years, you might want to find appropriate bins to use by first ordering the age data from smallest to largest, selecting the number of bins and sizes to use, and then computing the entropy score for the data using this number and size of bins. Then compare this with other bin numbers and sizes to find the binning strategy that produces the lowest entropy score. A more sophisticated and useful form of discretization would incorporate information about the target class into the selection of bins to use--perhaps selecting bins so as to minimize a joint entropy, conditional entropy, or mutual information score.

In summary, you can use entropic analysis in datamining to prepare your data (converting continuous quantities into discrete quantities), to describe your data (producing entropic analysis reports), and to make inferences about new sample data (applying your buys_computer decision tree to a new customer). You can apply entropic analysis to many aspects and stages of the datamining process.

Consistency Proof

There are many ways to structure the code base of an information theory package. In this series I have proposed a class structure that reserves a class file for each basic logical element of information theory: Entropy.php, JointEntropy.php, and ConditionalEntropy.php.

By preserving the logical structure of the area in the code base class structure, it's possible to construct proofs that the logical house is in order:


<?php
/**
* @package IT
*/
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";
include_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Proof that H(Y,X) = H(Y|X) + H(X)
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
echo "H(age) = ". $e->bits ."<br />";

$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer|age) = ". $ce->bits ."<br />";

$joint_entropy = $ce->bits + $e->bits;
echo "H(buys_computer|age) + H(age) =
$joint_entropy<br />";

$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns(array("age","buys_computer"));
$je->analyze();
echo "H(age; buys_computer) = ". $je->bits ."<br />";

echo "Therefore, H(buys_computer;age) = H(buys_computer|age) + H(age) <br />";
echo "Or, more generally, H(Y;X) = H(Y|X) + H(X)";
?>

Point your browser at the consistency_proof.php script to see the following result:

H(age) = 1.57740628285
H(buys_computer|age) = 0.693536138896
H(buys_computer|age) + H(age) = 2.27094242175
H(age, buys_computer) = 2.27094242175
Therefore, H(buys_computer, age) = H(buys_computer|age) + H(age) 
Or, more generally, H(Y,X) = H(Y|X) + H(X)

In addition to using the IT Package to engage in datamining, you might also want to use it to explore and confirm the mathematical structure of information theory.

Future Technical Work

I spent quite a bit of time designing the set of column setting methods used by the IT Package classes:

  • Entropy::setColumn
  • JointEntropy::setColumns
  • ConditionalEntropy::setAntecedent
  • ConditionalEntropy::setConsequent
  • ConditionalEntropy::setConditional

I designed them with a view toward their use in multicolumn contexts.

For example, when computing the joint entropy between columns, you might specify more than two columns to use:

$je->setColumns("age,buys_computer,student");

Similarly, when computing a conditional entropy score, you might specify more than one column in the antecedent or consequent part of the conditional expression:

$ce->setConditional('buys_computer|age,student');

Future work on the IT Package might involve improving the joint and conditional entropy classes so that they can evaluate multicolumn queries in a useful manner. One way to evaluate a three-column mutual information query is by arithmetic over the outcomes generated by entropy, joint entropy, and conditional entropy objects.

I(A;B,C) = {H(A) - H(A|B)} + {H(A|B) - H(A|B,C)}

The last term in the three-column mutual information query requires implementing a conditional entropy object with three-column support. There are likely more efficient ways to compute queries with four or more columns.

Another useful class to add to the package is one that computes a relative entropy score (perhaps RelativeEntropy.php), which is useful for model-fitting applications. The relative entropy score is also known as the Kullback-Leibler distance. The formula for computing this score looks like:

D(p || q) = Σi:n p(xi) * log (p(xi)/q(xi))

Finally, I would like to extend these information theory concepts to handling continuous random variables. This involves incorporating a probability distributions library into the calculations. Because I have already coded a PDL Package that supplies such a library, this is a feasible technical direction to move in as well.

Resources

  • Download the code here. PHPmath.com will host further updates to the IT Package.
  • C. E. Shannon (1948). A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656.
  • Jiawei Han and Micheline Kamber (2000). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufman.
  • Dorian Pyle (1999). Data Preparation for Data Mining. San Francisco: Morgan Kaufman.
  • Roberto Togneri and Christopher J deSilva (2003). Fundamentals of Information Theory and Coding Design, Chapman & Hall/CRC.
  • Thomas M. Cover and Joy A. Thomas (1991). Elements of Information Theory. John Wiley and Sons.
  • Paul DuBois (2002). MySQL Cookbook. O'Reilly.
  • J. Ross Quinlan (1993). C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann.

Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.


Return to ONLamp.com.


What connections have you found?
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment


Tagged Articles

Post to del.icio.us

This article has been tagged:

datamining

Articles that share the tag datamining:

Data Mining Email (10 tags)

Massive Data Aggregation with Perl (9 tags)

Top Ten Data Crunching Tips and Tricks (8 tags)

Calculating Entropy for Data Mining (5 tags)

Calculating Entropy for Data Miners (3 tags)

View All

php

Articles that share the tag php:

Understanding MVC in PHP (477 tags)

The PHP Scalability Myth (123 tags)

The Dynamic Duo of PEAR::DB and Smarty (53 tags)

PHP Form Handling (43 tags)

Very Dynamic Web Interfaces (39 tags)

View All

software

Articles that share the tag software:

What Is Web 2.0 (185 tags)

Rolling with Ruby on Rails (97 tags)

How Does Open Source Software Stack Up on the Mac? (79 tags)

Calculating the True Price of Software (68 tags)

Delve into DEVONthink (30 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com