I can't give you graphics, but maybe I can give a clear explanation.
Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two adds one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.
Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge) so that the lamp must always flash green. The channel is now useless, we know what the next flash will be so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream by this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). And if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free), but green flashes are more common than red: the more skewed the probability the fewer bits you need to describe the sequence, and the less information it contains, all the way to the all-green, bulb-burnt-out limit.
It turns out there's a way to measure the amount of information in a signal, based on the the probabilities of the different symbols. If the probability of receiving symbol xi is pi, then consider the quantity
-log pi
The smaller pi, the larger this value. If xi becomes twice as unlikely, this value increases by a fixed amount (log(2)). This should remind you of adding one bit to a message.
If we don't know what the symbol will be (but we know the probabilities) then we can calculate the average of this value, how much we will get, by summing over the different possibilities:
I = -Σ pi log(pi)
This is the information content in one flash.
Red bulb burnt out: pred = 0, pgreen=1, I = -(0 + 0) = 0 Red and green equiprobable: pred = 1/2, pgreen = 1/2, I = -(2 * 1/2 * log(1/2)) = log(2) Three colors, equiprobable: pi=1/3, I = -(3 * 1/3 * log(1/3)) = log(3) Green and red, green twice as likely: pred=1/3, pgreen=2/3, I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)
This is the information content, or entropy, of the message. It is maximal when the different symbols are equiprobable. If you're a physicist you use the natural log, if you're a computer scientist you use log2 and get bits.