To begin with, it would be best to understand how to measure information. How do we measure it? When something unlikely happens, we say it's big news. And when something is predictable, it's not really interesting. So to quantify this interesting-ness, we want a function that satisfies a few natural constraints; for example, a fair coin flip should give exactly one bit of information. One natural measure that satisfies the constraints is
I(X) = -log_2(p)
where p is the probability of the event X. The unit is the bit, the same bit a computer uses: 0 or 1.
Fair coin flip:
How much information can we get from one coin flip?
Answer: -log(p) = -log(1/2) = 1 bit
If a meteor strikes the Earth tomorrow with probability p = 2^{-22}, then we get 22 bits of information.
If the Sun rises tomorrow, p ~ 1, so we get 0 bits of information.
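As a quick sanity check, here is a minimal Python sketch that reproduces these three numbers; the helper name self_information is my own, not something defined above.

```python
import math

def self_information(p: float) -> float:
    """Self-information of an event with probability p: I = -log2(p), in bits."""
    return math.log2(1.0 / p)  # same as -log2(p), written to keep the sign clean

print(self_information(0.5))       # fair coin flip  -> 1.0 bit
print(self_information(2 ** -22))  # meteor strike   -> 22.0 bits
print(self_information(1.0))       # the Sun rising  -> 0.0 bits (certain event)
```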
So if we take the expectation of this interesting-ness over the outcomes of a random variable Y, we get the entropy. That is, entropy is the expected value of the interesting-ness of the outcomes:
H(Y) = E[I(Y)]
More formally, the entropy is the expected number of bits of information we get from observing Y.
Y = 1 : the event X occurs, with probability p
Y = 0 : the event X does not occur, with probability 1-p
H(Y) = E[I(Y)] = p I(Y=1) + (1-p) I(Y=0)
     = - p log p - (1-p) log (1-p)
All logs are base 2.
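Here is a minimal Python sketch of this binary entropy formula (the helper name binary_entropy is my own); it confirms that a fair coin carries the full 1 bit, while a nearly certain event carries almost none.

```python
import math

def binary_entropy(p: float) -> float:
    """H(Y) = -p*log2(p) - (1-p)*log2(1-p), in bits; 0*log(0) is treated as 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # fair coin        -> 1.0 bit (the maximum)
print(binary_entropy(0.99))  # nearly certain X -> ~0.08 bits
print(binary_entropy(1.0))   # X always occurs  -> 0.0 bits
```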