Fastest way to compute entropy in Python

Question

In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:

    def entropy(labels):
        """Computes entropy of 0-1 vector."""
        n_labels = len(labels)

        if n_labels <= 1:
            return 0

        counts = np.bincount(labels)
        probs = counts[np.nonzero(counts)] / n_labels
        n_classes = len(probs)

        if n_classes <= 1:
            return 0
        return - np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?

User · Answer

Uniformly distributed data (high entropy):

    s = range(0, 256)

Shannon entropy calculation step by step:

    import collections
    import math

    # calculate probability for each byte as number of occurrences / array length
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    # [0.00390625, 0.00390625, 0.00390625, ...]

    # calculate per-character entropy fractions
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    # [0.03125, 0.03125, 0.03125, ...]

    # sum fractions to obtain Shannon entropy
    entropy = sum(e_x)

    >>> entropy
    8.0

One-liner (assuming collections and math are imported):

    def H(s): return sum([-p_x * math.log(p_x, 2) for p_x in [n_x / len(s) for x, n_x in collections.Counter(s).items()]])

A proper function:

    import collections
    import math

    def H(s):
        probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
        e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
        return sum(e_x)

Test cases (English text taken from the CyberChef entropy estimator):

    >>> H(range(0, 256))
    8.0
    >>> H(range(0, 64))
    6.0
    >>> H(range(0, 128))
    7.0
    >>> H([0, 1])
    1.0
    >>> H('Standard English text usually falls somewhere between 3.5 and 5')
    4.228788210509104

User · Answer

My favorite function for entropy is the following:

    def entropy(labels):
        prob_dict = {x: labels.count(x) / len(labels) for x in labels}
        probs = np.array(list(prob_dict.values()))

        return - probs.dot(np.log2(probs))

I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. Will comment again if I find it.
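One possible way to shorten the conversion chain, offered as a sketch rather than the answerer's own follow-up: `collections.Counter` counts every label in a single pass (calling `labels.count(x)` for each element rescans the list every time), and `np.fromiter` can read the counts straight from the dict view. The name `entropy_counter` is made up for this illustration.

```python
from collections import Counter

import numpy as np


def entropy_counter(labels):
    """Shannon entropy in bits, counting each label once via Counter."""
    # np.fromiter consumes the dict view directly, so no list() is needed.
    counts = np.fromiter(Counter(labels).values(), dtype=float)
    probs = counts / counts.sum()
    return float(-probs.dot(np.log2(probs)))


balanced = entropy_counter([0, 0, 1, 1])
four_way = entropy_counter(list("aabbccdd"))
```

Because `Counter` only needs hashable items, this sketch also accepts strings or tuples as labels.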

User · Answer

With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:

    import pandas as pd
    import scipy.stats

    def ent(data):
        """Calculates entropy of the passed `pd.Series`."""
        p_data = data.value_counts()           # counts occurrence of each value
        entropy = scipy.stats.entropy(p_data)  # get entropy from counts
        return entropy

Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.

User · Answer

Here is my approach:

    labels = [0, 0, 1, 1]

    from collections import Counter
    from scipy import stats

    stats.entropy(list(Counter(labels).values()), base=2)

User · Answer

BiEntropy won't be the fastest way of computing entropy, but it is rigorous and builds upon Shannon entropy in a well-defined way. It has been tested in various fields, including image-related applications. It is implemented in Python on GitHub.

User · Answer

This method extends the other solutions by allowing for binning. For example, bins=None (default) won't bin x and will compute an empirical probability for each element of x, while bins=256 chunks x into 256 bins before computing the empirical probabilities.

    import numpy as np

    def entropy(x, bins=None):
        N = x.shape[0]
        if bins is None:
            counts = np.bincount(x)
        else:
            counts = np.histogram(x, bins=bins)[0]  # 0th idx is counts
        p = counts[np.nonzero(counts)] / N  # avoids log(0)
        H = -np.dot(p, np.log2(p))
        return H
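To illustrate how the bin count changes the result, here is a small self-contained sketch of the histogram path only. The helper name `binned_entropy` and the sample data are invented for this example; it is not the answerer's function.

```python
import numpy as np


def binned_entropy(x, bins):
    """Entropy in bits of x after histogramming into `bins` equal-width bins."""
    counts, _edges = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()  # drop empty bins to avoid log(0)
    return float(-np.dot(p, np.log2(p)))


x = np.array([0.0, 1.0, 2.0, 3.0])
coarse = binned_entropy(x, bins=2)  # two bins holding two samples each
fine = binned_entropy(x, bins=4)    # each sample falls in its own bin
```

With these four samples the coarse binning gives 1 bit and the fine binning gives 2 bits, so a binned entropy is only comparable across datasets when the same bins are used.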

User · Answer

    from collections import Counter
    from scipy import stats

    labels = [0.9, 0.09, 0.1]
    # note: keys() passes the distinct values themselves, not their counts
    stats.entropy(list(Counter(labels).keys()), base=2)

User · Answer

An answer that doesn't rely on numpy, either:

    import math
    from collections import Counter

    def eta(data, unit='natural'):
        base = {
            'shannon': 2.,
            'natural': math.exp(1),
            'hartley': 10.
        }

        if len(data) <= 1:
            return 0

        counts = Counter()

        for d in data:
            counts[d] += 1

        ent = 0

        probs = [float(c) / len(data) for c in counts.values()]
        for p in probs:
            if p > 0.:
                ent -= p * math.log(p, base[unit])

        return ent

This will accept any datatype you could throw at it:

    >>> eta(['mary', 'had', 'a', 'little', 'lamb'])
    1.6094379124341005

    >>> eta([c for c in "mary had a little lamb"])
    2.311097886212714

The answer provided by @Jarad suggested timings as well. To that end:

    import timeit
    import numpy as np  # only used to average the timings

    repeat_number = 1000000
    e = timeit.repeat(
        stmt='''eta(labels)''',
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
        repeat=3,
        number=repeat_number)

Timeit results (I believe this is ~4x faster than the best numpy approach):

    print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))

    Method: eta, Avg.: 10.461799
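The three units accepted by `eta` differ only by a constant factor, because changing the base of a logarithm rescales it. A minimal sketch of that conversion follows; `convert_entropy` is a hypothetical helper, not part of the answer.

```python
import math


def convert_entropy(value, from_base, to_base):
    """Rescale an entropy computed with log base `from_base` to base `to_base`."""
    return value * math.log(from_base) / math.log(to_base)


nats = math.log(5)  # entropy of five equally likely symbols, in nats
bits = convert_entropy(nats, math.e, 2)
hartleys = convert_entropy(nats, math.e, 10)
```

So an entropy can be computed once in any unit and converted afterwards instead of being recomputed per unit.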

User · Answer

Following the suggestion from unutbu I created a pure python implementation.

    from math import log

    import numpy as np

    def entropy2(labels):
        """Computes entropy of label distribution."""
        n_labels = len(labels)

        if n_labels <= 1:
            return 0

        counts = np.bincount(labels)
        probs = counts / n_labels
        n_classes = np.count_nonzero(probs)

        if n_classes <= 1:
            return 0

        ent = 0.

        # Compute standard entropy.
        for i in probs:
            if i > 0:  # bincount can report empty classes; skip them to avoid log(0)
                ent -= i * log(i, n_classes)  # math.log takes the base positionally

        return ent

The point I was missing was that labels is a large array, however probs is 3 or 4 elements long. Using pure python my application now is twice as fast.

User · Answer

@Sanjeet Gupta's answer is good but could be condensed. This question is specifically asking about the "Fastest" way but I only see times on one answer, so I'll post a comparison of using scipy and numpy to the original poster's entropy2 answer with slight alterations.

Four different approaches: scipy/numpy, numpy/math, pandas/numpy, numpy

    import numpy as np
    from scipy.stats import entropy
    from math import log, e
    import pandas as pd

    import timeit

    def entropy1(labels, base=None):
        value, counts = np.unique(labels, return_counts=True)
        return entropy(counts, base=base)

    def entropy2(labels, base=None):
        """Computes entropy of label distribution."""
        n_labels = len(labels)

        if n_labels <= 1:
            return 0

        value, counts = np.unique(labels, return_counts=True)
        probs = counts / n_labels
        n_classes = np.count_nonzero(probs)

        if n_classes <= 1:
            return 0

        ent = 0.

        # Compute entropy
        base = e if base is None else base
        for i in probs:
            ent -= i * log(i, base)

        return ent

    def entropy3(labels, base=None):
        vc = pd.Series(labels).value_counts(normalize=True, sort=False)
        base = e if base is None else base
        return -(vc * np.log(vc) / np.log(base)).sum()

    def entropy4(labels, base=None):
        value, counts = np.unique(labels, return_counts=True)
        norm_counts = counts / counts.sum()
        base = e if base is None else base
        return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()

Timeit operations:

    repeat_number = 1000000

    a = timeit.repeat(stmt='''entropy1(labels)''',
                      setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                      repeat=3, number=repeat_number)

    b = timeit.repeat(stmt='''entropy2(labels)''',
                      setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                      repeat=3, number=repeat_number)

    c = timeit.repeat(stmt='''entropy3(labels)''',
                      setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                      repeat=3, number=repeat_number)

    d = timeit.repeat(stmt='''entropy4(labels)''',
                      setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                      repeat=3, number=repeat_number)

Timeit results:

    # for loop to print out results of timeit
    for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a, b, c, d]):
        print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))

    Method: scipy/numpy, Avg.: 63.315312
    Method: numpy/math, Avg.: 49.256894
    Method: pandas/numpy, Avg.: 884.644023
    Method: numpy, Avg.: 60.026938

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric AND text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use-case for binning ints, but it won't work for text.
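The timings above were taken on a 12-element list, while the question mentions large label arrays, so the ranking may not carry over to other input sizes. Below is a rough sketch for re-running a comparison at different sizes. `bench` and `counter_entropy` are hypothetical names used only here, and `counter_entropy` is a pure-Python stand-in for whichever function is being measured.

```python
import math
import timeit
from collections import Counter


def counter_entropy(labels):
    """Shannon entropy in nats, pure Python (stand-in for the function under test)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def bench(fn, labels, number=1000, repeat=3):
    """Return the best observed time per call, in seconds, for fn(labels)."""
    timer = timeit.Timer(lambda: fn(labels))
    return min(timer.repeat(repeat=repeat, number=number)) / number


small = [1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5]
large = small * 1000  # same label proportions, 12,000 elements

t_small = bench(counter_entropy, small)
t_large = bench(counter_entropy, large, number=100)
```

Taking the minimum over repeats, as the `timeit` documentation suggests, reduces noise from other processes; absolute numbers will differ from machine to machine.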

User · Answer

    import math

    def entropy(base, prob_a, prob_b):
        base = 2  # note: overrides the base argument, so the result is always in bits
        x = prob_a
        y = prob_b
        expression = -((x * math.log(x, base)) + (y * math.log(y, base)))
        return [expression]
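For the 0-1 vectors in the question, the two probabilities can be read off the fraction of ones, so no counting of classes is needed. A minimal sketch under that assumption (the input contains only 0s and 1s); `binary_entropy` is a name chosen for this illustration.

```python
import math


def binary_entropy(labels):
    """Entropy in bits of a sequence that contains only 0s and 1s."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n       # fraction of ones
    if p == 0.0 or p == 1.0:  # a single class carries no uncertainty
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


even = binary_entropy([0, 1, 0, 1])
skewed = binary_entropy([0, 0, 0, 1])
```

The explicit check for p equal to 0 or 1 avoids the `ValueError` that `math.log2(0)` would raise. For a NumPy array, `labels.mean()` would give the same `p` without a Python-level loop.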

User · Answer

The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.

    import numpy as np

    def entropy(A, axis=None):
        """Computes the Shannon entropy of the elements of A. Assumes A is
        an array-like of nonnegative ints whose max value is approximately
        the number of unique values present.

        >>> a = [0, 1]
        >>> entropy(a)
        1.0
        >>> A = np.c_[a, a]
        >>> entropy(A)
        1.0
        >>> A                   # doctest: +NORMALIZE_WHITESPACE
        array([[0, 0], [1, 1]])
        >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
        array([1., 1.])
        >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
        array([[0.], [0.]])
        >>> entropy([0, 0, 0])
        0.0
        >>> entropy([])
        0.0
        >>> entropy([5])
        0.0
        """
        if A is None or len(A) < 2:
            return 0.

        A = np.asarray(A)

        if axis is None:
            A = A.flatten()
            counts = np.bincount(A)  # needs small, non-negative ints
            counts = counts[counts > 0]
            if len(counts) == 1:
                return 0.  # avoid returning -0.0 to prevent weird doctests
            probs = counts / float(A.size)
            # float() keeps the doctest output as a plain Python float
            return float(-np.sum(probs * np.log2(probs)))
        elif axis == 0:
            # build a list: in Python 3, map() returns an iterator
            entropies = [entropy(col) for col in A.T]
            return np.array(entropies)
        elif axis == 1:
            entropies = [entropy(row) for row in A]
            return np.array(entropies).reshape((-1, 1))
        else:
            raise ValueError("unsupported axis: {}".format(axis))
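An alternative way to get per-axis results, sketched here with `np.apply_along_axis` instead of the recursive calls above. The helper names `_entropy_1d` and `entropy_along` are invented for this example, and the output shape differs from the answer: both axes return a 1-D array rather than a column vector for axis 1.

```python
import numpy as np


def _entropy_1d(v):
    """Shannon entropy in bits of a 1-D array of labels."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def entropy_along(A, axis):
    """Apply the 1-D entropy to every slice of A taken along `axis`."""
    return np.apply_along_axis(_entropy_1d, axis, np.asarray(A))


A = np.array([[0, 0], [1, 1]])
per_column = entropy_along(A, axis=0)  # each column is [0, 1]
per_row = entropy_along(A, axis=1)     # each row holds a single value
```

Because the 1-D helper uses `np.unique` rather than `np.bincount`, this sketch is not limited to small non-negative integers, though it still calls a Python function once per slice.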
