O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Calculating Entropy for Data Miners

by Paul Meagher
03/24/2005

Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package:

  • Entropy.php
  • JointEntropy.php
  • ConditionalEntropy.php

Calculating Entropy for Datamining discussed the Entropy.php class, which provides web database programmers with quick access to an entropic analysis of a single database column. This article discusses the concepts of joint and conditional entropy through developing the corresponding classes--JointEntropy.php, ConditionalEntropy--and illustrating how these classes allow web database programmers quick access to an entropic analysis of the relationship between two or more database columns considered at a time.

The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.

AllElectronics Database

In the acclaimed book Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber illustrate many datamining concepts using the fictitious AllElectronics database. A version of the AllElectronics database also appeared in Ross Quinlan's datamining research. This database ably illustrates the information theoretic concepts that follow. Here is a Customers table derived from that database:

id age income student credit_rating buys_computer
1 <=30 high no fair no
2 <=30 high no excellent no
3 31..40 high no fair yes
4 >40 medium no fair yes
5 >40 low yes fair yes
6 >40 low yes excellent no
7 31..40 low yes excellent yes
8 <=30 medium no fair no
9 <=30 low yes fair yes
10 >40 medium yes fair yes
11 <=30 medium yes excellent yes
12 31..40 medium no excellent yes
13 31..40 high yes fair yes
14 >40 medium no excellent no

I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.

Joint Entropy

The formula for computing the joint entropy between two database columns looks like this:

H(X,Y) = -Σi:nΣj:n p(xi,yj) * log(p(xi,yj))

This formula is similar in many respects to the formula for computing the entropy of a single database column.

H(X) = -Σi:n p(xi) * log(p(xi))

The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.

To see what a joint probability distribution looks like, use the JointEntropy.php class from the information theory package (IT Package) that the last article began to assemble. Below is a PHP script that applies the JointEntropy.php class to two columns (age, buys_computer) in the AllElectronics.Customers table. It outputs a joint frequency table, a joint probability table, and the summary joint entropy score.

<?php
/**
* @package IT
*/

require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";

/**
* Example of how to use JointEntropy class.   
*/

$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>

 <i>Joint frequency table.</i>

<?php
$je->showJointFrequency("%d");
?>

<br />

<i>Joint probability table.</i>

<?php
$je->showJointProbability("%.5f");
?>
<br />

<i>Joint entropy in bits is <?php
printf("%01.3f", $je->bits) ?>.</i>

The output produced by this script looks like:

  buys_computer  
  no yes Σi+
age <=30 3 2 5
31..40 0 4 4
>40 2 3 5
  Σ+j 5 9 14

Joint frequency table

  buys_computer  
  no yes Σi+
age <=30 0.21429 0.14286 0.35714
31..40 0.00000 0.28571 0.28571
>40 0.14286 0.21429 0.35714
  Σ+j 0.35714 0.64286 1

Joint probability table

Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:

H(X,Y) = -1 * [ (0.21429 * log(0.21429)) + 
                (0.14286 * log(0.14286)) + 
                (0.28571 * log(0.28571)) + 
                (0.14286 * log(0.14286)) + 
                (0.21429 * log(0.21429)) ]

H(X,Y) = -1 * [ -0.476226947429 + 
                -0.401050703151 + 
                -0.516387120588 + 
                -0.476226947429 + 
                -0.401050703151 ]

H(X,Y) = -1 * -2.270942421748

H(X,Y) = 2.271

Joint entropy in bits is 2.271.

The calculation of the joint entropy score is the same as the calculation of the univariate entropy score. In both cases, it takes a vector of probabilities summing to 1 and applies the Shannon formula to the probability vector. The interpretation of what the score means is a bit different insofar as what the probabilities encode is not a univariate distribution of signals. That is, P(buys_computer=no), P(buys_computer=yes), but a bivariate distribution of signals; for example, P(age=<30, buys_computer=no), P(age=<30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions you need to ask, on average, to identify a signal coming from this bivariate distribution of signals.

Pages: 1, 2, 3, 4, 5

Next Pagearrow




Tagged Articles

Post to del.icio.us

This article has been tagged:

datamining

Articles that share the tag datamining:

Data Mining Email (10 tags)

Massive Data Aggregation with Perl (9 tags)

Top Ten Data Crunching Tips and Tricks (8 tags)

Calculating Entropy for Data Mining (5 tags)

Calculating Entropy for Data Miners (3 tags)

View All

php

Articles that share the tag php:

Understanding MVC in PHP (477 tags)

The PHP Scalability Myth (123 tags)

The Dynamic Duo of PEAR::DB and Smarty (53 tags)

PHP Form Handling (43 tags)

Very Dynamic Web Interfaces (39 tags)

View All

software

Articles that share the tag software:

What Is Web 2.0 (185 tags)

Rolling with Ruby on Rails (97 tags)

How Does Open Source Software Stack Up on the Mac? (79 tags)

Calculating the True Price of Software (68 tags)

Delve into DEVONthink (30 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com