[math] What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as:

Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy in terms of text mining? Can someone give me a easy, simple example (visual)?

This question is related to math text computer-science nltk text-mining

The answer is


I really recommend you read about Information Theory, bayesian methods and MaxEnt. The place to start is this (freely available online) book by David Mackay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are really far more general than just text mining and I can't really devise how one would learn how to apply this to NLP without learning some of the general basics contained in this book or other introductory books on Machine Learning and MaxEnt bayesian methods.

The connection between entropy and probability theory to information processing and storing is really, really deep. To give a taste of it, there's a theorem due to Shannon that states that the maximum amount of information you can pass without error through a noisy communication channel is equal to the entropy of the noise process. There's also a theorem that connects how much you can compress a piece of data to occupy the minimum possible memory in your computer to the entropy of the process that generated the data.

I don't think it's really necessary that you go learning about all those theorems on communication theory, but it's not possible to learn this without learning the basics about what is entropy, how it's calculated, what is it's relationship with information and inference, etc...


Informally

entropy is availability of information or knowledge, Lack of information will leads to difficulties in prediction of future which is high entropy (next word prediction in text mining) and availability of information/knowledge will help us more realistic prediction of future (low entropy).

Relevant information of any type will reduce entropy and helps us predict more realistic future, that information can be word "meat" is present in sentence or word "meat" is not present. This is called Information Gain


Formally

entropy is lack of order of predicability


I can't give you graphics, but maybe I can give a clear explanation.

Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two adds one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.

Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge) so that the lamp must always flash green. The channel is now useless, we know what the next flash will be so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream by this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). And if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free), but green flashes are more common than red: the more skewed the probability the fewer bits you need to describe the sequence, and the less information it contains, all the way to the all-green, bulb-burnt-out limit.

It turns out there's a way to measure the amount of information in a signal, based on the the probabilities of the different symbols. If the probability of receiving symbol xi is pi, then consider the quantity

-log pi

The smaller pi, the larger this value. If xi becomes twice as unlikely, this value increases by a fixed amount (log(2)). This should remind you of adding one bit to a message.

If we don't know what the symbol will be (but we know the probabilities) then we can calculate the average of this value, how much we will get, by summing over the different possibilities:

I = -Σ pi log(pi)

This is the information content in one flash.

Red bulb burnt out: pred = 0, pgreen=1, I = -(0 + 0)  = 0
Red and green equiprobable: pred = 1/2, pgreen = 1/2, I = -(2 * 1/2 * log(1/2)) = log(2)
Three colors, equiprobable: pi=1/3, I = -(3 * 1/3 * log(1/3)) = log(3)
Green and red, green twice as likely: pred=1/3, pgreen=2/3, I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)

This is the information content, or entropy, of the message. It is maximal when the different symbols are equiprobable. If you're a physicist you use the natural log, if you're a computer scientist you use log2 and get bits.


As you are reading a book about NLTK it would be interesting you read about MaxEnt Classifier Module http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent

For text mining classification the steps could be: pre-processing (tokenization, steaming, feature selection with Information Gain ...), transformation to numeric (frequency or TF-IDF) (I think that this is the key step to understand when using text as input to a algorithm that only accept numeric) and then classify with MaxEnt, sure this is just an example.


When I was implementing an algorithm to calculate the entropy of an image I found these links, see here and here.

This is the pseudo-code I used, you'll need to adapt it to work with text rather than images but the principles should be the same.

//Loop over image array elements and count occurrences of each possible
//pixel to pixel difference value. Store these values in prob_array
for j = 0, ysize-1 do $
    for i = 0, xsize-2 do begin
       diff = array(i+1,j) - array(i,j)
       if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
       endif
     endfor

//Convert values in prob_array to probabilities and compute entropy
n = total(prob_array)

entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n

    //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
    //here and divide final sum by Ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor

entrop = entrop/alog(2)

I got this code from somewhere, but I can't dig out the link.


To begin with, it would be best to understand the measure of information.

How do we measure the information?

When something unlikely happens, we say it's a big news. Also, when we say something predictable, it's not really interesting. So to quantify this interesting-ness, the function should satisfy

  • if the probability of the event is 1 (predictable), then the function gives 0
  • if the probability of the event is close to 0, then the function should give high number
  • if probability 0.5 events happens it give one bit of information.

One natural measure that satisfy the constraints is

I(X) = -log_2(p)

where p is the probability of the event X. And the unit is in bit, the same bit computer uses. 0 or 1.

Example 1

Fair coin flip :

How much information can we get from one coin flip?

Answer : -log(p) = -log(1/2) = 1 (bit)

Example 2

If a meteor strikes the Earth tomorrow, p=2^{-22} then we can get 22 bits of information.

If the Sun rises tomorrow, p ~ 1 then it is 0 bit of information.

Entropy

So if we take expectation on the interesting-ness of an event Y, then it is the entropy. i.e. entropy is an expected value of the interesting-ness of an event.

H(Y) = E[ I(Y)]

More formally, the entropy is the expected number of bits of an event.

Example

Y = 1 : an event X occurs with probability p

Y = 0 : an event X does not occur with probability 1-p

H(Y) = E[I(Y)] = p I(Y==1) + (1-p) I(Y==0) 
     = - p log p - (1-p) log (1-p)

Log base 2 for all log.


Examples related to math

How to do perspective fixing? How to pad a string with leading zeros in Python 3 How can I use "e" (Euler's number) and power operation in python 2.7 numpy max vs amax vs maximum Efficiently getting all divisors of a given number Using atan2 to find angle between two vectors How to calculate percentage when old value is ZERO Finding square root without using sqrt function? Exponentiation in Python - should I prefer ** operator instead of math.pow and math.sqrt? How do I get the total number of unique pairs of a set in the database?

Examples related to text

Difference between opening a file in binary vs text How do I center text vertically and horizontally in Flutter? How to `wget` a list of URLs in a text file? Convert txt to csv python script Reading local text file into a JavaScript array Python: How to increase/reduce the fontsize of x and y tick labels? How can I insert a line break into a <Text> component in React Native? How to split large text file in windows? Copy text from nano editor to shell Atom menu is missing. How do I re-enable

Examples related to computer-science

HTML5 Canvas background image What exactly does big ? notation represent? Fixed point vs Floating point number What are the differences between a program and an application? What do we mean by Byte array? How to determine the longest increasing subsequence using dynamic programming? What is "entropy and information gain"? What are the differences between NP, NP-Complete and NP-Hard? What is the difference between statically typed and dynamically typed languages? What is “2's Complement”?

Examples related to nltk

re.sub erroring with "Expected string or bytes-like object" how to check which version of nltk, scikit learn installed? Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) NLTK and Stopwords Fail #lookuperror Resource u'tokenizers/punkt/english.pickle' not found How do I download NLTK data? Stopword removal with NLTK n-grams in python, four, five, six grams? pip issue installing almost any library How to get rid of punctuation using NLTK tokenizer?

Examples related to text-mining

How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern? What is "entropy and information gain"?