I'm trying to figure out the best way to get the largest value in a Spark dataframe column.
Consider the following example:
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
Which creates:
+---+---+
| A| B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:
# Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])
# Method 2: Use SQL
df.createOrReplaceTempView("df_table")
spark.sql("SELECT MAX(A) AS maxval FROM df_table").first().asDict()['maxval']
# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']
# Method 4: Convert to RDD
df.select("A").rdd.max()[0]
Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.
Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?
In PySpark you can do this:
max(df.select('ColumnName').rdd.flatMap(lambda x: x).collect())
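Note that this collects the entire column to the driver before taking the max, which can exhaust driver memory on large data. A minimal sketch of a variant that keeps the reduction on the executors, so only the single max value reaches the driver (assuming a numeric column named 'ColumnName'):
# Reduce on the executors; only one value is returned to the driver.
max_val = df.select('ColumnName').rdd.map(lambda row: row[0]).reduce(lambda a, b: a if a >= b else b)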
Remark: Spark is intended for distributed computing on big data. The example DataFrame here is very small, so the ordering of the methods on real-life data can differ from the ordering on this small example.
Slowest: Method 1, because .describe("A") calculates min, max, mean, stddev, and count (five calculations over the whole column).
Medium: Method 4, because the .rdd conversion (DataFrame to RDD) slows down the process.
Faster: Method 3 ~ Method 2 ~ Method 5, because the logic is very similar, so Spark's Catalyst optimizer follows very similar logic with a minimal number of operations (get the max of a particular column, collect a single-value DataFrame); (.asDict() adds a little extra time to methods 2 and 3 compared to 5).
import time
import numpy as np   # only needed for the bigger DataFrame below
import pandas as pd  # only needed for the bigger DataFrame below

time_dict = {}
dfff = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
# -- For a bigger/more realistic DataFrame, uncomment the following three lines
# lst = list(np.random.normal(0.0, 100.0, 100000))
# pdf = pd.DataFrame({'A': lst, 'B': lst, 'C': lst, 'D': lst})
# dfff = spark.createDataFrame(pdf)
tic1 = int(round(time.time() * 1000))
# Method 1: Use describe()
max_val = float(dfff.describe("A").filter("summary = 'max'").select("A").collect()[0].asDict()['A'])
tac1 = int(round(time.time() * 1000))
time_dict['m1']= tac1 - tic1
print (max_val)
tic2 = int(round(time.time() * 1000))
# Method 2: Use SQL
dfff.createOrReplaceTempView("df_table")
max_val = spark.sql("SELECT MAX(A) AS maxval FROM df_table").collect()[0].asDict()['maxval']
tac2 = int(round(time.time() * 1000))
time_dict['m2']= tac2 - tic2
print (max_val)
tic3 = int(round(time.time() * 1000))
# Method 3: Use groupby()
max_val = dfff.groupby().max('A').collect()[0].asDict()['max(A)']
tac3 = int(round(time.time() * 1000))
time_dict['m3']= tac3 - tic3
print (max_val)
tic4 = int(round(time.time() * 1000))
# Method 4: Convert to RDD
max_val = dfff.select("A").rdd.max()[0]
tac4 = int(round(time.time() * 1000))
time_dict['m4']= tac4 - tic4
print (max_val)
tic5 = int(round(time.time() * 1000))
# Method 5: Use agg()
max_val = dfff.agg({"A": "max"}).collect()[0][0]
tac5 = int(round(time.time() * 1000))
time_dict['m5']= tac5 - tic5
print (max_val)
print(time_dict)
Results on an edge node of a cluster, in milliseconds (ms):
small DF (ms) : {'m1': 7096, 'm2': 205, 'm3': 165, 'm4': 211, 'm5': 180}
bigger DF (ms): {'m1': 10260, 'm2': 452, 'm3': 465, 'm4': 916, 'm5': 373}
The example below shows how to get the max value of a Spark DataFrame column.
from pyspark.sql.functions import max
df = sql_context.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
+---+---+
| A| B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
result = df.select([max("A")])
result.show()
+------+
|max(A)|
+------+
| 3.0|
+------+
print(result.collect()[0]['max(A)'])
3.0
Similarly min, mean, etc. can be calculated as shown below:
from pyspark.sql.functions import mean, min, max
result = df.select([mean("A"), min("A"), max("A")])
result.show()
+------+------+------+
|avg(A)|min(A)|max(A)|
+------+------+------+
| 2.0| 1.0| 3.0|
+------+------+------+
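If you need predictable keys for dictionary-style access, you can rename the aggregate columns with alias (a small sketch; the alias calls are an addition, not part of the answer above):
from pyspark.sql.functions import mean, min, max
result = df.select(
    mean("A").alias("mean_A"),  # avg(A) -> mean_A
    min("A").alias("min_A"),
    max("A").alias("max_A"),
)
stats = result.first().asDict()  # {'mean_A': 2.0, 'min_A': 1.0, 'max_A': 3.0}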
The max value for a particular column of a DataFrame can be obtained with:
your_max_value = df.agg({"your-column": "max"}).collect()[0][0]
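The dictionary form also accepts several columns at once, one aggregate per column, computed in a single pass (a sketch using the example DataFrame's columns):
# Result columns are named after the aggregates: max(A), min(B).
row = df.agg({"A": "max", "B": "min"}).collect()[0]
max_A, min_B = row["max(A)"], row["min(B)"]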
This builds on another solution (by @satprem rath) already present in this thread.
To find the min value of age in the dataframe:
from pyspark.sql.functions import min
df.agg(min("age")).show()
+--------+
|min(age)|
+--------+
| 29|
+--------+
Edit: to add more context.
While the above method prints the result, I faced issues when assigning the result to a variable for later reuse. Hence, to get only the bare value assigned to a variable:
from pyspark.sql.functions import max, min
maxValueA = df.agg(max("A")).collect()[0][0]
maxValueB = df.agg(max("B")).collect()[0][0]
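Each agg(...).collect() above launches its own Spark job; if both values are needed, they can be computed in a single pass (a sketch; the alias calls are an addition, not part of the answer above):
from pyspark.sql.functions import max
row = df.agg(max("A").alias("maxA"), max("B").alias("maxB")).collect()[0]
maxValueA, maxValueB = row["maxA"], row["maxB"]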
First add the import line:
from pyspark.sql.functions import min, max
df.agg(min("age")).show()
+--------+
|min(age)|
+--------+
| 29|
+--------+
df.agg(max("age")).show()
+--------+
|max(age)|
+--------+
| 77|
+--------+
Here is a lazy way of doing this: just compute the table statistics:
df.write.mode("overwrite").saveAsTable("sampleStats")
Query = "ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS " + ','.join(df.columns)
spark.sql(Query)
df.describe('ColName').show()
or
spark.sql("SELECT * FROM sampleStats").describe('ColName').show()
or you can open a hive shell and
describe formatted sampleStats;
You will see the statistics in the table properties: min, max, distinct count, null count, etc.
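On Spark 2.3+, the column statistics recorded by ANALYZE TABLE can also be read back from Spark SQL itself (a sketch; DESCRIBE EXTENDED table column is the documented syntax, and the table/column names follow the example above):
# Shows min, max, num_nulls, distinct_count, etc. for column A.
spark.sql("DESCRIBE EXTENDED sampleStats A").show(truncate=False)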
In case someone wonders how to do it using Scala (with Spark 2.0+), here you go:
scala> df.createOrReplaceTempView("TEMP_DF")
scala> val myMax = spark.sql("SELECT MAX(x) as maxval FROM TEMP_DF").
collect()(0).getInt(0)
scala> print(myMax)
117
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF and .as[...]; spark is the active SparkSession
val testDataFrame = Seq(
(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)
).toDF("A", "B")
val (maxA, maxB) = testDataFrame.select(max("A"), max("B"))
.as[(Double, Double)]
.first()
println(maxA, maxB)
And the result is (3.0,6.0), the same as testDataFrame.agg(max($"A"), max($"B")).collect()(0). However, testDataFrame.agg(max($"A"), max($"B")).collect()(0) returns a Row, which prints as [3.0,6.0].
Another way of doing it (with import pyspark.sql.functions as f):
df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX
On my data, I got these benchmarks:
df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX
CPU times: user 2.31 ms, sys: 3.31 ms, total: 5.62 ms
Wall time: 3.7 s
df.select("A").rdd.max()[0]
CPU times: user 23.2 ms, sys: 13.9 ms, total: 37.1 ms
Wall time: 10.3 s
df.agg({"A": "max"}).collect()[0][0]
CPU times: user 0 ns, sys: 4.77 ms, total: 4.77 ms
Wall time: 3.75 s
All of them give the same answer.
I believe the best solution is to use head().
Considering your example:
+---+---+
| A| B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
Using the agg and max functions we can get the value as follows:
from pyspark.sql.functions import max
df.agg(max(df.A)).head()[0]
This will return:
3.0
Make sure you have the correct import:
from pyspark.sql.functions import max
The max function we use here is the PySpark SQL library function, not Python's default max function.
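To avoid shadowing Python's built-in max in the first place, a common convention is to import the functions module under an alias (a sketch of that convention, not part of the answer above):
from pyspark.sql import functions as F
# F.max is Spark's column aggregate; the built-in max is untouched.
df.agg(F.max(df.A)).head()[0]  # 3.0
max([1, 2, 3])                 # built-in max still works: 3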
To just get the value, use any of the following (shown here for min; max works the same way):
from pyspark.sql.functions import min, max
df1.agg(min("id")).collect()[0][0]
df1.agg(min("id")).head()[0]
df1.agg(min("id")).first()[0]