O'Reilly Databases

Calculating Entropy for Data Miners

Independence

Beyond telling you the minimum number of questions needed to identify a signal drawn from a bivariate distribution of signals, the joint entropy score has another use. You can compare the joint entropy score H(X,Y) with the sum of the marginal entropies H(X) + H(Y) from the probability distribution table to gauge the degree of independence between two random variables; for example, age and buys_computer. If H(X,Y) is approximately equal to H(X) + H(Y), you can conclude that the two variables are independent of each other: knowing the value of one variable tells you nothing about the value of the other. Here's a script to see whether age and buys_computer are independent.

<?php
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";

// Marginal entropy of the age column, in bits.
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
$age = $e->bits;

// Reuse the same object for the buys_computer column.
$e->setColumn("buys_computer");
$e->analyze();
$buy = $e->bits;

echo "H(age) = $age<br />";
echo "H(buys_computer) = $buy<br />";
echo "H(age) + H(buys_computer)= ".($age + $buy)."<br />";

// Joint entropy of the two columns taken together.
$je = new JointEntropy;
$je->setTable('Customers');
$je->setColumns(array('age','buys_computer'));
$je->analyze();

echo "H(age, buys_computer) = ". $je->bits;
?>

The output of this script is:

H(age) = 1.57740628285
H(buys_computer) = 0.940285958671
H(age) + H(buys_computer)= 2.51769224152
H(age, buys_computer) = 2.27094242175

From these results, you conclude that H(age, buys_computer) < H(age) + H(buys_computer), which means that age and buys_computer are not totally independent variables (although the dependence does not seem too strong either). In general:

H(X,Y) <= H(X) + H(Y) with equality only if X and Y are independent.
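
The comparison can also be reproduced without the database-backed classes. The following is a minimal sketch, not the article's library code: the helper name entropyBits and the cell counts are assumptions, with the counts chosen so that dividing them by 14 customers reproduces the joint probabilities for age and buys_computer.

```php
<?php
/**
 * Shannon entropy, in bits, of a list of probabilities.
 * Hypothetical helper; zero-probability entries contribute nothing.
 */
function entropyBits(array $probs): float
{
    $h = 0.0;
    foreach ($probs as $p) {
        if ($p > 0) {
            $h -= $p * log($p, 2);
        }
    }
    return $h;
}

// Assumed counts: rows are age levels, columns are buys_computer values.
$counts = [
    '<=30'   => ['no' => 3, 'yes' => 2],
    '31..40' => ['no' => 0, 'yes' => 4],
    '>40'    => ['no' => 2, 'yes' => 3],
];

$total = 0;
foreach ($counts as $row) {
    $total += array_sum($row);
}

$joint   = [];  // P(age, buys_computer), one entry per cell
$rowMarg = [];  // P(age)
$colMarg = [];  // P(buys_computer)
foreach ($counts as $age => $row) {
    $rowMarg[$age] = array_sum($row) / $total;
    foreach ($row as $buy => $n) {
        $joint[] = $n / $total;
        $colMarg[$buy] = ($colMarg[$buy] ?? 0) + $n / $total;
    }
}

$hAge   = entropyBits($rowMarg);
$hBuy   = entropyBits($colMarg);
$hJoint = entropyBits($joint);
$gap    = $hAge + $hBuy - $hJoint;  // zero only when the variables are independent

printf("H(age) + H(buys_computer) = %.5f\n", $hAge + $hBuy);
printf("H(age, buys_computer)     = %.5f\n", $hJoint);
printf("gap                       = %.5f\n", $gap);
```

Under these assumed counts the sum of the marginal entropies and the joint entropy match the script output above (about 2.51769 and 2.27094 bits), leaving a gap of roughly 0.247 bits. How small a gap must be before you treat two variables as independent is a judgment call that this sketch does not settle.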

One of the most important reasons to care whether two variables are independent is that data reduction is a critical aspect of any data mining analysis. One important way to reduce the data to a manageable size is to eliminate variables that are independent of the output variable about which you want to reduce your uncertainty. If you find that a joint entropy score is nearly equal to the sum of the corresponding marginal entropies, you can conclude that the two variables are independent and should consider eliminating the independent variable from your analysis.

Conditional Probability

The next formula to discuss and implement is the conditional entropy formula. Before discussing this formula, however, you must first understand how to compute a conditional probability from a joint probability table. The general formula for computing a conditional probability looks like this:

P(Y=y | X=x) = P(Y=y, X=x) / P(X=x)

An instantiated formula for computing the conditional probability that customers will buy a computer given that they are in the <=30 age range looks like this:

P(buys_computer = yes | age = <=30) = P(buys_computer = yes, age = <=30) / P(age = <=30)

To compute P(buys_computer = yes, age = <=30), look up the joint probability cell with these specific row and column settings (that is, 0.14286). To compute P(age = <=30), look up the row marginal where age = <=30 (0.35714). In a nutshell, computing a conditional probability involves dividing a joint probability by a marginal probability (0.14286/0.35714 = 0.4).
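
The same lookup and division can be sketched in a few lines of PHP. This is an illustration only: conditionalProbability is a hypothetical helper, not a method of the article's classes, and the array simply copies the rounded joint probabilities for age and buys_computer.

```php
<?php
/**
 * P(Y = y | X = x) from a joint probability table indexed as $joint[x][y].
 * Hypothetical helper for illustration.
 */
function conditionalProbability(array $joint, string $x, string $y): float
{
    // The row marginal P(X = x) is the sum of the joint cells in row x.
    $marginal = array_sum($joint[$x]);
    if ($marginal <= 0) {
        throw new InvalidArgumentException("P(X = $x) is zero, so the conditional is undefined");
    }
    return $joint[$x][$y] / $marginal;
}

// Joint probabilities, rounded to five places: rows are age, columns are buys_computer.
$joint = [
    '<=30'   => ['no' => 0.21429, 'yes' => 0.14286],
    '31..40' => ['no' => 0.00000, 'yes' => 0.28571],
    '>40'    => ['no' => 0.14286, 'yes' => 0.21429],
];

// P(buys_computer = yes | age = <=30)
printf("%.2f\n", conditionalProbability($joint, '<=30', 'yes'));
```

Because the inputs are rounded, the result is only approximately 0.4; working from raw counts instead of rounded probabilities would avoid that small error.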

Conditional Entropy

The first step in computing the overall conditional entropy is to compute the specific conditional entropies using this formula:

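H(X | Y = y) = -Σx P(X = x | Y = y) * log(P(X = x | Y = y))

Here the sum runs over every value x of X, and log is the base-2 logarithm, so the result is in bits.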

Plug the specific conditional entropy formula H(X | Y = y) into the conditional entropy formula below to compute the amount of uncertainty remaining about X after Y has been observed:

H(X | Y) = Σy P(Y = y) * H(X | Y = y)

The specific conditional entropy formula computes the amount of uncertainty remaining about X after conditioning on one particular value y of Y. The conditional entropy is the probability-weighted average of those values: each specific conditional entropy is multiplied by the probability of its conditioning value, and the products are summed.
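
As a rough sketch of these two steps in plain PHP: the function names below are hypothetical, and the cell counts are an assumption chosen so that dividing them by 14 customers reproduces the joint probabilities for age and buys_computer. The second function forms the weighted sum of the specific conditional entropies, conditioning buys_computer on age.

```php
<?php
/**
 * Entropy, in bits, of one conditional distribution P(X | Y = y).
 * Hypothetical helper; zero-probability entries contribute nothing.
 */
function specificConditionalEntropy(array $condProbs): float
{
    $h = 0.0;
    foreach ($condProbs as $p) {
        if ($p > 0) {
            $h -= $p * log($p, 2);
        }
    }
    return $h;
}

/**
 * H(X | Y) from counts indexed as $counts[y][x]: the sum over y of
 * P(Y = y) times the specific conditional entropy H(X | Y = y).
 */
function conditionalEntropy(array $counts): float
{
    $total = 0;
    foreach ($counts as $row) {
        $total += array_sum($row);
    }

    $h = 0.0;
    foreach ($counts as $row) {
        $rowTotal = array_sum($row);
        if ($rowTotal == 0) {
            continue; // a level that never occurs has probability zero
        }
        $pY = $rowTotal / $total;
        // Conditional distribution of X within this level of Y.
        $cond = array_map(function ($n) use ($rowTotal) {
            return $n / $rowTotal;
        }, $row);
        $h += $pY * specificConditionalEntropy($cond);
    }
    return $h;
}

// Assumed counts: rows are age (the conditioning variable), columns are buys_computer.
$counts = [
    '<=30'   => ['no' => 3, 'yes' => 2],
    '31..40' => ['no' => 0, 'yes' => 4],
    '>40'    => ['no' => 2, 'yes' => 3],
];

printf("H(buys_computer | age = <=30) = %.5f\n", specificConditionalEntropy([0.6, 0.4]));
printf("H(buys_computer | age)        = %.5f\n", conditionalEntropy($counts));
```

Skipping zero-probability terms follows the usual convention that 0 * log(0) is treated as 0, which is why the 31..40 row contributes no uncertainty in this sketch.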

Calculating Conditional Entropy

Now on to some code for computing the conditional entropy. The code below conditions buying on age ($ce->setConditional('buys_computer|age')) and outputs a joint probability table and a conditional entropy table.

<?php
/**
* @package IT
*/
require_once "../config.php";

require_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Example of how to use the ConditionalEntropy class.   
*/
$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConditional('buys_computer|age');

$ce->analyze();
?>

<i>Joint probability table.</i>

<?php
$ce->showJointProbability("%.5f");
?>

<br />

<i>Conditional entropy table.</i>

<?php
$ce->showConditionalEntropy();

?>

This code outputs the following tables:

              buys_computer
age           no        yes       Σi+
<=30          0.21429   0.14286   0.35714
31..40        0.00000   0.28571   0.28571
>40           0.14286   0.21429   0.35714
Σ+j           0.35714   0.64286   1

Joint probability table

                           P(B | A = ai)
ai       P(A = ai)         no     yes    H(B | A = ai)     P(A = ai) * H(B | A = ai)
<=30     0.357142857143    0.6    0.4    0.970950594455    0.346768069448
31..40   0.285714285714    0      1      0                 0
>40      0.357142857143    0.4    0.6    0.970950594455    0.346768069448
Σi=1..3 P(A = ai) * H(B | A = ai)                          0.693536138896

Conditional entropy table

The first table appears again so that you can see how the second, third, and fourth columns in the second table were derived from it. The second column simply reproduces the row marginal from the first table. The third and fourth columns are conditional probabilities calculated by dividing a joint probability by a row marginal (for instance, 0.14286/0.35714 = 0.4). The fifth column calculates the specific conditional entropy for each age range. For example:

H(buys_computer | age = <=30) = -1 * [ 0.6 * log2(0.6) + 0.4 * log2(0.4) ] = 0.970950594455

The specific conditional entropy column can be worth examining in some detail, because a low value tells you that there is an uncertainty-reducing relationship at that level of the conditioning variable. In traditional statistical analysis, such fine-grained relationships may not be theoretically interesting; however, in data mining contexts you might find it interesting to know that customers in the 31..40 age range tend to purchase computers at your store. Of course, there is not enough data in this data set to draw any firm conclusions.

The specific entropy value in the fifth column is then multiplied by the corresponding probability in the second column to obtain the values in the sixth column. The values in the sixth column are summed to give the overall conditional entropy score reported in the bottom-right cell.

©2009, O'Reilly Media, Inc.