What is entropy and information gain

Question

I am reading this book  NLTK  and it is confusing   Entropy is defined as      Entropy is the sum of the probability of each label   times the log probability of that same label   How can I apply entropy and maximum entropy in terms of text mining   Can someone give me a easy  simple example  visual

User · Answer

Informally  entropy is availability of information or knowledge  Lack of information will leads to difficulties in prediction of future which is high entropy  next word prediction in text mining  and availability of information knowledge will help us more realistic prediction of future  low entropy    Relevant information of any type will reduce entropy and helps us predict more realistic future  that information can be word  meat  is present in sentence or word  meat  is not present  This is called Information Gain    Formally  entropy is lack of order of predicability

User · Answer

I assume entropy was mentioned in the context of building decision trees   To illustrate  imagine the task of learning to classify first-names into male female groups  That is given a list of names each labeled with either m or f  we want to learn a model that fits the data and can be used to predict the gender of a new unseen first-name   name       gender -----------------        Now we want to predict  Ashley        f              the gender of  Amro   my name  Brian         m Caroline      f David         m   First step is deciding what features of the data are relevant to the target class we want to predict  Some example features include  first last letter  length  number of vowels  does it end with a vowel  etc   So after feature extraction  our data looks like     name    ends-vowel  num-vowels   length   gender   ------------------------------------------------ Ashley        1         3           6        f Brian         0         2           5        m Caroline      1         4           8        f David         0         2           5        m   The goal is to build a decision tree  An example of a tree would be   length lt 7     num-vowels lt 3  male     num-vowels gt  3         ends-vowel 1  female         ends-vowel 0  male length gt  7     length 5  male   basically each node represent a test performed on a single attribute  and we go left or right depending on the result of the test  We keep traversing the tree until we reach a leaf node which contains the class prediction  m or f   So if we run the name Amro down this tree  we start by testing  is the length lt 7   and the answer is yes  so we go down that branch  Following the branch  the next test  is the number of vowels lt 3   again evaluates to true  This leads to a leaf node labeled m  and thus the prediction is male  which I happen to be  so the tree predicted the outcome correctly    The decision tree is built in a top-down fashion  but the question is how do you choose which attribute to split at each node  The answer is find the feature that best splits the target class into the purest possible children nodes  ie  nodes that don t contain a mix of both male and female  rather pure nodes with only one class    This measure of purity is called the information  It represents the expected amount of information that would be needed to specify whether a new instance  first-name  should be classified male or female  given the example that reached the node  We calculate it based on the number of male and female classes at the node   Entropy on the other hand is a measure of impurity  the opposite   It is defined for a binary class with values a b as   Entropy   - p a  log p a   - p b  log p b     This binary entropy function is depicted in the figure below  random variable can take one of two values   It reaches its maximum when the probability is p 1 2  meaning that p X a  0 5 or similarlyp X b  0 5 having a 50  50  chance of being either a or b  uncertainty is at a maximum   The entropy function is at zero minimum when probability is p 1 or p 0 with complete certainty  p X a  1 or p X a  0 respectively  latter implies p X b  1      Of course the definition of entropy can be generalized for a discrete random variable X with N outcomes  not just two       the log in the formula is usually taken as the logarithm to the base 2     Back to our task of name classification  lets look at an example  Imagine at some point during the process of constructing the tree  we were considering the following split        ends-vowel        9m 5f            lt --- the         notation represents the class                             distribution of instances that reached a node     1           0  -------     -------   3m 4f       6m 1f    As you can see  before the split we had 9 males and 5 females  i e  P m  9 14 and P f  5 14  According to the definition of entropy   Entropy before   -  5 14  log2 5 14  -  9 14  log2 9 14    0 9403   Next we compare it with the entropy computed after considering the split by looking at two child branches  In the left branch of ends-vowel 1  we have   Entropy left   -  3 7  log2 3 7  -  4 7  log2 4 7    0 9852   and the right branch of ends-vowel 0  we have   Entropy right   -  6 7  log2 6 7  -  1 7  log2 1 7    0 5917   We combine the left right entropies using the number of instances down each branch as weight factor  7 instances went left  and 7 instances went right   and get the final entropy after the split   Entropy after   7 14 Entropy left   7 14 Entropy right   0 7885   Now by comparing the entropy before and after the split  we obtain a measure of information gain  or how much information we gained by doing the split using that particular feature   Information Gain   Entropy before - Entropy after   0 1518   You can interpret the above calculation as following  by doing the split with the end-vowels feature  we were able to reduce uncertainty in the sub-tree prediction outcome by a small amount of 0 1518  measured in bits as units of information    At each node of the tree  this calculation is performed for every feature  and the feature with the largest information gain is chosen for the split in a greedy manner  thus favoring features that produce pure splits with low uncertainty entropy   This process is applied recursively from the root-node down  and stops when a leaf node contains instances all having the same class  no need to split it further    Note that I skipped over some details which are beyond the scope of this post  including how to handle numeric features  missing values  overfitting and pruning trees  etc

User · Answer

I really recommend you read about Information Theory  bayesian methods and MaxEnt  The place to start is this  freely available online  book by David Mackay   http   www inference phy cam ac uk mackay itila   Those inference methods are really far more general than just text mining and I can t really devise how one would learn how to apply this to NLP without learning some of the general basics contained in this book or other introductory books on Machine Learning and MaxEnt bayesian methods   The connection between entropy and probability theory to information processing and storing is really  really deep  To give a taste of it  there s a theorem due to Shannon that states that the maximum amount of information you can pass without error through a noisy communication channel is equal to the entropy of the noise process  There s also a theorem that connects how much you can compress a piece of data to occupy the minimum possible memory in your computer to the entropy of the process that generated the data   I don t think it s really necessary that you go learning about all those theorems on communication theory  but it s not possible to learn this without learning the basics about what is entropy  how it s calculated  what is it s relationship with information and inference  etc

User · Answer

As you are reading a book about NLTK it would be interesting you read about MaxEnt Classifier Module http   www nltk org api nltk classify html module-nltk classify maxent  For text mining classification the steps could be  pre-processing  tokenization  steaming  feature selection with Information Gain       transformation to numeric  frequency or TF-IDF   I think that this is the key step to understand when using text as input to a algorithm that only accept numeric  and then classify with MaxEnt  sure this is just an example

User · Answer

When I was implementing an algorithm to calculate the entropy of an image I found these links  see here and here   This is the pseudo-code I used  you ll need to adapt it to work with text rather than images but the principles should be the same     Loop over image array elements and count occurrences of each possible   pixel to pixel difference value  Store these values in prob array for j   0  ysize-1 do       for i   0  xsize-2 do begin        diff   array i 1 j  - array i j         if diff lt  array size 1  2 and diff gt - array size 1  2 then begin             prob array diff  array size-1  2    prob array diff  array size-1  2    1        endif      endfor    Convert values in prob array to probabilities and compute entropy n   total prob array   entrop   0 for i   0  array size-1 do begin     prob array i    prob array i  n        Base 2 log of x is Ln x  Ln 2   Take Ln of array element       here and divide final sum by Ln 2      if prob array i  ne 0 then begin         entrop   entrop - prob array i  alog prob array i       endif endfor  entrop   entrop alog 2    I got this code from somewhere  but I can t dig out the link

User · Answer

To begin with  it would be best to understand the measure of information    How do we measure the information   When something unlikely happens  we say it s a big news  Also  when we say something predictable  it s not really interesting  So to quantify this interesting-ness  the function should satisfy   if the probability of the event is 1  predictable   then the function gives 0 if the probability of the event is close to 0  then the function should give high number if probability 0 5 events happens it give one bit of information    One natural measure that satisfy the constraints is  I X    -log 2 p    where p is the probability of the event X  And the unit is in bit  the same bit computer uses  0 or 1   Example 1  Fair coin flip     How much information can we get from one coin flip   Answer   -log p    -log 1 2    1  bit   Example 2  If a meteor strikes the Earth tomorrow  p 2  -22  then we can get 22 bits of information   If the Sun rises tomorrow  p   1 then it is 0 bit of information   Entropy  So if we take expectation on the interesting-ness of an event Y  then it is the entropy  i e  entropy is an expected value of the interesting-ness of an event   H Y    E  I Y     More formally  the entropy is the expected number of bits of an event   Example  Y   1   an event X occurs with probability p  Y   0   an event X does not occur with probability 1-p  H Y    E I Y     p I Y  1     1-p  I Y  0          - p log p -  1-p  log  1-p    Log base 2 for all log

User · Answer

I can t give you graphics  but maybe I can give a clear explanation   Suppose we have an information channel  such as a light that flashes once every day either red or green  How much information does it convey  The first guess might be one bit per day  But what if we add blue  so that the sender has three options  We would like to have a measure of information that can handle things other than powers of two  but still be additive  the way that multiplying the number of possible messages by two adds one bit   We could do this by taking log2 number of possible messages   but it turns out there s a more general way   Suppose we re back to red green  but the red bulb has burned out  this is common knowledge  so that the lamp must always flash green  The channel is now useless  we know what the next flash will be so the flashes convey no information  no news  Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row  When the lamp flashes red  we know what the next flash will be  If you try to send a bit stream by this channel  you ll find that you must encode it with more flashes than you have bits  50  more  in fact   And if you want to describe a sequence of flashes  you can do so with fewer bits  The same applies if each flash is independent  context-free   but green flashes are more common than red  the more skewed the probability the fewer bits you need to describe the sequence  and the less information it contains  all the way to the all-green  bulb-burnt-out limit   It turns out there s a way to measure the amount of information in a signal  based on the the probabilities of the different symbols  If the probability of receiving symbol xi is pi  then consider the quantity   -log pi   The smaller pi  the larger this value  If xi becomes twice as unlikely  this value increases by a fixed amount  log 2    This should remind you of adding one bit to a message   If we don t know what the symbol will be  but we know the probabilities  then we can calculate the average of this value  how much we will get  by summing over the different possibilities    I   -  931 pi log pi    This is the information content in one flash    Red bulb burnt out  pred   0  pgreen 1  I   - 0   0     0 Red and green equiprobable  pred   1 2  pgreen   1 2  I   - 2   1 2   log 1 2     log 2  Three colors  equiprobable  pi 1 3  I   - 3   1 3   log 1 3     log 3  Green and red  green twice as likely  pred 1 3  pgreen 2 3  I   - 1 3 log 1 3    2 3 log 2 3     log 3  - 2 3 log 2    This is the information content  or entropy  of the message  It is maximal when the different symbols are equiprobable  If you re a physicist you use the natural log  if you re a computer scientist you use log2 and get bits

[math] What is "entropy and information gain"?

Informally

Formally

Examples related to math

Examples related to text

Examples related to computer-science

Examples related to nltk

Examples related to text-mining