How to implement the Softmax function in Python

Question

From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

$$S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$$

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I've tried the following:

    import numpy as np

    def softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

    scores = [3.0, 1.0, 0.2]
    print(softmax(scores))

which returns:

    [ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

    def softmax(x):
        """Compute softmax values for each sets of scores in x."""
        return np.exp(x) / np.sum(np.exp(x), axis=0)

which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?

User · Answer

    import tensorflow as tf
    import numpy as np

    def softmax(x):
        return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

    logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

    sess = tf.Session()
    print(softmax(logits))
    print(sess.run(tf.nn.softmax(logits)))
    sess.close()

User · Answer

I would say that while both are correct mathematically, implementation-wise the first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick which is essentially what you are doing.
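To make the instability concrete, here is a minimal sketch comparing the two approaches on deliberately large scores. The function names `softmax_naive` and `softmax_stable` and the input values are illustrative assumptions, not taken from the Stanford notes.

```python
import numpy as np

def softmax_naive(x):
    # Direct translation of the formula; np.exp overflows to inf
    # for float64 inputs above roughly 709.
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Hypothetical scores chosen to be large enough to overflow.
scores = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow / invalid-value warnings for the demonstration.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(scores)

stable = softmax_stable(scores)

print(naive)   # every entry is nan (inf / inf)
print(stable)  # finite probabilities that sum to 1
```

The stable result is the same as the softmax of `[0, 1, 2]`, because only the differences between scores matter.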

User · Answer

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function applying the softmax over any axis:

    def softmax(X, theta = 1.0, axis = None):
        """
        Compute the softmax of each element along an axis of X.

        Parameters
        ----------
        X: ND-Array. Probably should be floats.
        theta (optional): float parameter, used as a multiplier
            prior to exponentiation. Default = 1.0
        axis (optional): axis to compute values along. Default is the
            first non-singleton axis.

        Returns an array the same size as X. The result will sum to 1
        along the specified axis.
        """

        # make X at least 2d
        y = np.atleast_2d(X)

        # find axis
        if axis is None:
            axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

        # multiply y against the theta parameter,
        y = y * float(theta)

        # subtract the max for numerical stability
        y = y - np.expand_dims(np.max(y, axis = axis), axis)

        # exponentiate y
        y = np.exp(y)

        # take the sum along the specified axis
        ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

        # finally: divide elementwise
        p = y / ax_sum

        # flatten if X was 1D
        if len(X.shape) == 1: p = p.flatten()

        return p

Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.

User · Answer

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:

    def softmax(x, axis=-1):
        e_x = np.exp(x - np.max(x))  # same code
        return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

    logits = np.asarray([
        [-0.0052024,  -0.00770216,  0.01360943, -0.008921],  # 1
        [-0.0052024,  -0.00770216,  0.01360943, -0.008921]   # 2
    ])

    print(softmax(logits))

    #[[0.2492037  0.24858153 0.25393605 0.24827873]
    # [0.2492037  0.24858153 0.25393605 0.24827873]]

Ref: Tensorflow softmax

User · Answer

The softmax function is an activation function that turns numbers into probabilities which sum to one. It outputs a vector that represents the probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.

- The softmax function is used when we have multiple classes.
- It is useful for finding the class which has the maximum probability.
- The softmax function is ideally used in the output layer, where we are actually trying to obtain the probabilities that define the class of each input.
- Each output ranges from 0 to 1.

The softmax function turns logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is, in fact, a smooth ("soft") approximation of the arg max function. It does not return the largest value from the input. Instead, it concentrates most of the probability mass at the position of the largest value.

For example:

Before softmax

    X = [13, 31, 5]

After softmax

    array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])

Code:

    import numpy as np

    # your solution:

    def your_softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

    # correct solution:

    def softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)  # only difference
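As a quick check of the logits example, the following sketch computes the probabilities for `[2.0, 1.0, 0.1]` and compares the position of the largest probability with the position of the largest logit. It assumes a 1-D input; the variable names are illustrative.

```python
import numpy as np

def softmax(x):
    # Max-shifted softmax for a 1-D array of scores.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs.round(3))  # roughly [0.659 0.242 0.099]
print(probs.sum())     # sums to 1 up to floating-point rounding

# Softmax preserves the ordering of the scores, so the most probable
# class is at the same index as the largest logit.
same_winner = np.argmax(probs) == np.argmax(logits)
print(same_winner)
```

The rounded values `[0.7, 0.2, 0.1]` quoted in the text are a coarser rounding of the same result.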

User · Answer

I would like to add a little more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code in the other post, you will find it does not give you the right answer when the array is 2D or of higher dimensions.

Here are some suggestions:

1. To get the max, take it along the x-axis; you will get a 1D array.
2. Reshape your max array to match the original shape.
3. Use np.exp to get the exponential values.
4. Use np.sum along the axis.
5. Get the final results.

Follow these steps and you will get the correct answer through vectorization. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don't understand.
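One possible reading of the suggested steps, written out as a sketch rather than the poster's own code. It assumes a 2-D input where each row is one sample and the normalization runs along `axis=1`; the function name `softmax_rows` is hypothetical.

```python
import numpy as np

def softmax_rows(x):
    # Assumption: x is 2-D with shape (n_samples, n_classes).
    x = np.asarray(x, dtype=float)

    # Step 1: max along the row axis gives a 1-D array, one value per row.
    row_max = np.max(x, axis=1)

    # Step 2: reshape to a column so it broadcasts against x.
    row_max = row_max.reshape(-1, 1)

    # Step 3: exponentiate the shifted scores.
    e_x = np.exp(x - row_max)

    # Step 4: sum along the same axis, again reshaped to a column.
    row_sum = np.sum(e_x, axis=1).reshape(-1, 1)

    # Step 5: normalize each row.
    return e_x / row_sum

batch = np.array([[1, 2, 3, 6],
                  [2, 4, 5, 6]])
result = softmax_rows(batch)
print(result.sum(axis=1))  # each row sums to 1
```

If the samples are stored as columns instead, the same steps apply with `axis=0` and a reshape to a row.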

User · Answer

This also works with np.reshape.

    def softmax(scores):
        """
        Compute softmax scores given the raw output from the model

        :param scores: raw scores from the model (N, num_classes)
        :return:
            prob: softmax probabilities (N, num_classes)
        """
        prob = None

        # subtract the largest number
        # https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/
        exponential = np.exp(
            scores - np.max(scores, axis=1).reshape(-1, 1)
        )
        prob = exponential / exponential.sum(axis=1).reshape(-1, 1)

        return prob

User · Answer

In order to maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

    def softmax(x):
        if len(x.shape) > 1:
            # 2-D input: normalize each row
            tmp = np.max(x, axis=1)
            x -= tmp.reshape((x.shape[0], 1))
            x = np.exp(x)
            tmp = np.sum(x, axis=1)
            x /= tmp.reshape((x.shape[0], 1))
        else:
            # 1-D input: normalize the whole vector
            tmp = np.max(x)
            x -= tmp
            x = np.exp(x)
            tmp = np.sum(x)
            x /= tmp

        return x

User · Answer

This has already been answered in much detail above: the max is subtracted to avoid overflow. I am adding one more implementation here, in Python 3.

    import numpy as np

    def softmax(x):
        mx = np.amax(x, axis=1, keepdims=True)
        x_exp = np.exp(x - mx)
        x_sum = np.sum(x_exp, axis=1, keepdims=True)
        res = x_exp / x_sum
        return res

    x = np.array([[3, 2, 4], [4, 5, 6]])
    print(softmax(x))

User · Answer

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.

    import scipy.special as sc
    import numpy as np

    def softmax(x: np.ndarray) -> np.ndarray:
        return np.exp(x - sc.logsumexp(x))
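For readers without scipy at hand, the same log-space idea can be sketched with numpy alone. The helper below is an assumed hand-written log-sum-exp, not scipy's implementation, and the `axis` parameter is an addition for batched input.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # log softmax(x) = x - logsumexp(x), computed with the max shifted out
    # so that no intermediate exponential overflows.
    x = np.asarray(x, dtype=float)
    m = np.max(x, axis=axis, keepdims=True)
    lse = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return x - lse

def softmax(x, axis=-1):
    # Exponentiate only at the very end; all log-probabilities are <= 0.
    return np.exp(log_softmax(x, axis=axis))

# Hypothetical scores with very large positive and negative magnitudes.
x = np.array([[1000.0, 1001.0],
              [-1000.0, -1001.0]])
probs = softmax(x)
print(probs)
```

Both rows depend only on the difference of 1 between their two scores, so they contain the same pair of probabilities in opposite order.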

User · Answer

Here you can find out why they used - max.

From there:

"When you're writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

User · Answer

From a mathematical point of view, both sides are equal.

And you can easily prove this. Let's set m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} / e^{m}}{\sum_j e^{x_j} / e^{m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Notice that this works for any m, because for all (even complex) numbers e^m ≠ 0.

- From a computational complexity point of view they are also equivalent and both run in O(n) time, where n is the size of a vector.
- From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and even for pretty small values of x it will overflow. Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities, the other will overflow and return nan.
- Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0).
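The claim that the shift cancels for any m can also be checked numerically. This is a small sketch using the scores from the question; the particular shift values are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    # Plain softmax without any shift, safe here because the scores are small.
    e_x = np.exp(x)
    return e_x / e_x.sum()

x = np.array([3.0, 1.0, 0.2])
reference = softmax(x)

# Subtract several different constants m; the result should not change.
matches = []
for m in (np.max(x), 0.5, -7.0):
    shifted = softmax(x - m)
    matches.append(np.allclose(shifted, reference))
    print(m, matches[-1])
```

Choosing m = max(x) is simply the shift that guarantees no exponent is positive, which is why it is the one used in practice.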

User · Answer

The purpose of the softmax function is to preserve the ratio of the vectors as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistical)). This is because it preserves more information about the rate of change at the end-points and thus is more applicable to neural nets with 1-of-N Output Encoding (i.e. if we squashed the end-points it would be harder to differentiate the 1-of-N output class because we can't tell which one is the "biggest" or "smallest" because they got squished); also it makes the total output sum to 1, and the clear winner will be closer to 1 while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you do e^y exponents you may get a very high value that clips the float at the max value, leading to a tie, which is not the case in this example. This becomes a BIG problem if you subtract the max value to make a negative number, then you have a negative exponent that rapidly shrinks the values, altering the ratio, which is what occurred in the poster's question and yielded the incorrect answer.

The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is they calculate e^y_j TWICE!!! Here is the correct answer:

    def softmax(y):
        e_to_the_y_j = np.exp(y)
        return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

User · Answer

A more concise version is:

    def softmax(x):
        return np.exp(x) / np.exp(x).sum(axis=0)

User · Answer

This generalizes and assumes you are normalizing the trailing dimension.

    def softmax(x: np.ndarray) -> np.ndarray:
        e_x = np.exp(x - np.max(x, axis=-1)[..., None])
        e_y = e_x.sum(axis=-1)[..., None]
        return e_x / e_y

User · Answer

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:

Data preparation:

    import numpy as np

    np.random.seed(2019)

    batch_size = 1
    n_items = 3
    n_classes = 2
    logits_np = np.random.rand(batch_size, n_items, n_classes).astype(np.float32)
    print('logits_np.shape', logits_np.shape)
    print('logits_np:')
    print(logits_np)

Output:

    logits_np.shape (1, 3, 2)
    logits_np:
    [[[0.9034822  0.3930805 ]
      [0.62397    0.6378774 ]
      [0.88049906 0.299172  ]]]

Softmax using tensorflow:

    import tensorflow as tf

    logits_tf = tf.convert_to_tensor(logits_np, np.float32)
    scores_tf = tf.nn.softmax(logits_np, axis=-1)

    print('logits_tf.shape', logits_tf.shape)
    print('scores_tf.shape', scores_tf.shape)

    with tf.Session() as sess:
        scores_np = sess.run(scores_tf)

    print('scores_np.shape', scores_np.shape)
    print('scores_np:')
    print(scores_np)

    print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
    print('np.sum(scores_np, axis=-1):')
    print(np.sum(scores_np, axis=-1))

Output:

    logits_tf.shape (1, 3, 2)
    scores_tf.shape (1, 3, 2)
    scores_np.shape (1, 3, 2)
    scores_np:
    [[[0.62490064 0.37509936]
      [0.4965232  0.5034768 ]
      [0.64137274 0.3586273 ]]]
    np.sum(scores_np, axis=-1).shape (1, 3)
    np.sum(scores_np, axis=-1):
    [[1. 1. 1.]]

Softmax using scipy:

    from scipy.special import softmax

    scores_np = softmax(logits_np, axis=-1)

    print('scores_np.shape', scores_np.shape)
    print('scores_np:')
    print(scores_np)

    print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
    print('np.sum(scores_np, axis=-1):')
    print(np.sum(scores_np, axis=-1))

Output:

    scores_np.shape (1, 3, 2)
    scores_np:
    [[[0.62490064 0.37509936]
      [0.4965232  0.5034768 ]
      [0.6413727  0.35862732]]]
    np.sum(scores_np, axis=-1).shape (1, 3)
    np.sum(scores_np, axis=-1):
    [[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy):

    def softmax(X, theta = 1.0, axis = None):
        """
        Compute the softmax of each element along an axis of X.

        Parameters
        ----------
        X: ND-Array. Probably should be floats.
        theta (optional): float parameter, used as a multiplier
            prior to exponentiation. Default = 1.0
        axis (optional): axis to compute values along. Default is the
            first non-singleton axis.

        Returns an array the same size as X. The result will sum to 1
        along the specified axis.
        """

        # make X at least 2d
        y = np.atleast_2d(X)

        # find axis
        if axis is None:
            axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

        # multiply y against the theta parameter,
        y = y * float(theta)

        # subtract the max for numerical stability
        y = y - np.expand_dims(np.max(y, axis = axis), axis)

        # exponentiate y
        y = np.exp(y)

        # take the sum along the specified axis
        ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

        # finally: divide elementwise
        p = y / ax_sum

        # flatten if X was 1D
        if len(X.shape) == 1: p = p.flatten()

        return p

    scores_np = softmax(logits_np, axis=-1)

    print('scores_np.shape', scores_np.shape)
    print('scores_np:')
    print(scores_np)

    print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
    print('np.sum(scores_np, axis=-1):')
    print(np.sum(scores_np, axis=-1))

Output:

    scores_np.shape (1, 3, 2)
    scores_np:
    [[[0.62490064 0.37509936]
      [0.49652317 0.5034768 ]
      [0.64137274 0.3586273 ]]]
    np.sum(scores_np, axis=-1).shape (1, 3)
    np.sum(scores_np, axis=-1):
    [[1. 1. 1.]]

User · Answer

sklearn also offers an implementation of softmax:

    from sklearn.utils.extmath import softmax
    import numpy as np

    x = np.array([[0.50839931, 0.49767588, 0.51260159]])
    softmax(x)

    # output
    array([[0.3340521, 0.33048906, 0.33545884]])

User · Answer

The goal was to achieve similar results using Numpy and Tensorflow. The only change from the original answer is the axis parameter for the np.sum API.

Initial approach: axis=0. This, however, does not provide the intended results when the input has N dimensions.

Modified approach: axis=len(e_x.shape)-1. Always sum on the last dimension. This provides similar results to tensorflow's softmax function.

    def softmax_fn(input_array):
        """
        | **@author**: Prathyush SP
        |
        | Calculate Softmax for a given array
        :param input_array: Input Array
        :return: Softmax Score
        """
        e_x = np.exp(input_array - np.max(input_array))
        # keepdims=True so the sum broadcasts against e_x for any shape
        return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)

User · Answer

Based on all the responses and the CS231n notes, allow me to summarise:

    def softmax(x, axis):
        x -= np.max(x, axis=axis, keepdims=True)
        return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

Usage:

    x = np.array([[1, 0, 2, -1],
                  [2, 4, 6, 8],
                  [3, 2, 1, 0]])
    softmax(x, axis=1).round(2)

Output:

    array([[0.24, 0.09, 0.64, 0.03],
           [0.  , 0.02, 0.12, 0.86],
           [0.64, 0.24, 0.09, 0.03]])

User · Answer

Everybody seems to post their solution, so I'll post mine:

    def softmax(x):
        e_x = np.exp(x.T - np.max(x, axis=-1))
        return (e_x / e_x.sum(axis=0)).T

I get exactly the same results as with the one imported from sklearn:

    from sklearn.utils.extmath import softmax

User · Answer

Well... much confusion here, both in the question and in the answers...

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had also tried the 2-D score array in the example provided with the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument:

    import numpy as np

    # your solution:
    def your_softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

    # correct solution:
    def softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)  # only difference

As I said, for a 1-D score array, the results are indeed identical:

    scores = [3.0, 1.0, 0.2]
    print(your_softmax(scores))
    # [ 0.8360188   0.11314284  0.05083836]
    print(softmax(scores))
    # [ 0.8360188   0.11314284  0.05083836]
    your_softmax(scores) == softmax(scores)
    # array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

    scores2D = np.array([[1, 2, 3, 6],
                         [2, 4, 5, 6],
                         [3, 8, 7, 6]])

    print(your_softmax(scores2D))
    # [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
    #  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
    #  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

    print(softmax(scores2D))
    # [[ 0.09003057  0.00242826  0.01587624  0.33333333]
    #  [ 0.24472847  0.01794253  0.11731043  0.33333333]
    #  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

The results are different - the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually for an implementation detail - the axis argument. According to the numpy.sum documentation:

"The default, axis=None, will sum all of the elements of the input array"

while here we want to sum row-wise, hence axis=0. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case...

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some other answers here).
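To see the effect of the axis argument in isolation, here is a small sketch with an explicit `axis` parameter and `keepdims=True`. The parameterized signature is an assumption for illustration and differs from the two functions discussed in this answer; the input is the quiz's 2-D example array.

```python
import numpy as np

scores2d = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]], dtype=float)

def softmax(x, axis):
    # keepdims=True keeps the reduced axis as size 1, so the max and the
    # sum broadcast back against x whichever axis is chosen.
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

per_column = softmax(scores2d, axis=0)  # each column is one set of scores
per_row = softmax(scores2d, axis=1)     # each row is one set of scores

print(per_column.sum(axis=0))  # every column sums to 1
print(per_row.sum(axis=1))     # every row sums to 1
```

Which axis is right depends only on how the data is laid out: the quiz stores one sample per column, while batched network outputs usually store one sample per row.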

User · Answer

So, this is really a comment on desertnaut's answer, but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that once he takes a 1-dimensional input and then he takes a 2-dimensional input. Let me show this to you.

    import numpy as np

    # your solution:
    def your_softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

    # desertnaut solution (copied from his answer):
    def desertnaut_softmax(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)  # only difference

    # my (correct) solution:
    def softmax(z):
        assert len(z.shape) == 2
        s = np.max(z, axis=1)
        s = s[:, np.newaxis]  # necessary step to do broadcasting
        e_x = np.exp(z - s)
        div = np.sum(e_x, axis=1)
        div = div[:, np.newaxis]  # ditto
        return e_x / div

Let's take desertnaut's example:

    x1 = np.array([[1, 2, 3, 6]])  # notice that we put the data into 2 dimensions(!)

This is the output:

    your_softmax(x1)
    array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

    desertnaut_softmax(x1)
    array([[ 1.,  1.,  1.,  1.]])

    softmax(x1)
    array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

You can see that desertnaut's version would fail in this situation. (It would not if the input was just one-dimensional like np.array([1, 2, 3, 6]).)

Let's now use 3 samples, since that's the reason why we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.

    x2 = np.array([[1, 2, 3, 6],   # sample 1
                   [2, 4, 5, 6],   # sample 2
                   [1, 2, 3, 6]])  # sample 1 again(!)

This input consists of a batch with 3 samples. But samples one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!

    your_softmax(x2)
    array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
           [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
           [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])

    desertnaut_softmax(x2)
    array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
           [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
           [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

    softmax(x2)
    array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
           [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
           [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that this is only the case with my solution.

    softmax(x1) == softmax(x2)[0]
    array([[ True,  True,  True,  True]], dtype=bool)

    softmax(x1) == softmax(x2)[2]
    array([[ True,  True,  True,  True]], dtype=bool)

Additionally, here are the results of TensorFlow's softmax implementation:

    import tensorflow as tf
    import numpy as np
    batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
    x = tf.placeholder(tf.float32, shape=[None, 4])
    y = tf.nn.softmax(x)
    init = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(y, feed_dict={x: batch})

And the result:

    array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
           [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
           [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

User · Answer

I was curious to see the performance difference between these:

    import numpy as np

    def softmax(x):
        """Compute softmax values for each sets of scores in x."""
        return np.exp(x) / np.sum(np.exp(x), axis=0)

    def softmaxv2(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

    def softmaxv3(x):
        """Compute softmax values for each sets of scores in x."""
        e_x = np.exp(x - np.max(x))
        return e_x / np.sum(e_x, axis=0)

    def softmaxv4(x):
        """Compute softmax values for each sets of scores in x."""
        return (np.exp(x - np.max(x))) / np.sum(np.exp(x - np.max(x)), axis=0)

    x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]

Using

    print("----- softmax")
    %timeit  a=softmax(x)
    print("----- softmaxv2")
    %timeit  a=softmaxv2(x)
    print("----- softmaxv3")
    %timeit  a=softmaxv2(x)
    print("----- softmaxv4")
    %timeit  a=softmaxv2(x)

Increasing the values inside x (+100 +200 +500...) I get consistently better results with the original numpy version (here is just one test):

    ----- softmax
    The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 17.8 µs per loop
    ----- softmaxv2
    The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 23 µs per loop
    ----- softmaxv3
    The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 23 µs per loop
    ----- softmaxv4
    10000 loops, best of 3: 23 µs per loop

Until.... the values inside x reach ~800, then I get

    ----- softmax
    /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
      after removing the cwd from sys.path.
    /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
      after removing the cwd from sys.path.
    The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 23.6 µs per loop
    ----- softmaxv2
    The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 22.8 µs per loop
    ----- softmaxv3
    The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 23.6 µs per loop
    ----- softmaxv4
    The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 22.7 µs per loop

As some said, your version is more numerically stable "for large numbers". For small numbers it could be the other way around.
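Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module. The function names, the repetition count, and the conversion of the list to a float array are assumptions for illustration; absolute timings will vary by machine and are not meant to reproduce the figures above.

```python
import timeit
import numpy as np

def softmax_plain(x):
    # No shift: exponentiates the raw scores twice, as in the quiz solution.
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmax_shifted(x):
    # Max-shifted variant: one extra pass to find the max, one exp call.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([10, 10, 18, 9, 15, 3, 1, 2, 1, 10, 10, 10, 8, 15], dtype=float)

n_calls = 2000  # hypothetical repetition count
for fn in (softmax_plain, softmax_shifted):
    seconds = timeit.timeit(lambda: fn(x), number=n_calls)
    print(f"{fn.__name__}: {seconds / n_calls * 1e6:.2f} µs per call")
```

For a vector this small, the difference is dominated by per-call overhead rather than arithmetic, so the measured gap should be read with caution.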

User · Answer

I would suggest this:

    def softmax(z):
        z_norm = np.exp(z - np.max(z, axis=0, keepdims=True))
        return np.divide(z_norm, np.sum(z_norm, axis=0, keepdims=True))

It will work for stochastic as well as batch input. For more detail see: https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d

User · Answer

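They're both correct, but yours is preferred from the point of view of numerical stability.

You start with

    e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c), we have

    = (e ^ x / e ^ max(x)) / sum(e ^ x / e ^ max(x))

    = e ^ x / sum(e ^ x)

since the constant factor e ^ max(x) can be pulled out of the sum and cancels with the one in the numerator. This is what the other answer says. You could replace max(x) with any variable and it would cancel out.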
