[apache-spark] get specific row from spark dataframe

Is there an alternative to df[100, c("column")] in Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R snippet above.

This question is related to apache-spark apache-spark-sql

The answer is


When you want to fetch the max value of a date column from a dataframe, just the bare value without the Row object wrapper, you can use the code below.

table = "mytable"

max_date = df.select(max('date_col')).first()[0]

2020-06-26
instead of Row(max(reference_week)=datetime.date(2020, 6, 26))
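For comparison, here is a minimal Scala sketch of the same idea; it assumes date_col is a DateType column on a dataframe df:

import org.apache.spark.sql.functions.max

// Aggregate, take the single resulting Row, and read the bare value out of it.
val maxDate = df.select(max("date_col")).first().getDate(0)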


Following is a Java-Spark way to do it: 1) add a sequentially incrementing id column, 2) select the row by its id, 3) drop the column.

import static org.apache.spark.sql.functions.*;
..

// 1) Tag each row with a monotonically increasing id.
ds = ds.withColumn("rownum", monotonically_increasing_id());
// 2) Keep the row whose id is 99 (the 100th row, since ids start at 0).
ds = ds.filter(col("rownum").equalTo(99));
// 3) Drop the helper column.
ds = ds.drop("rownum");

N.B. monotonically_increasing_id starts from 0.
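The same idea in Scala might look like the sketch below. One caveat: the generated ids are only guaranteed to be increasing, not consecutive, so on data with more than one partition the 100th row will not necessarily receive id 99.

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Tag each row with an id, keep only the row you want, then drop the helper column.
val withId = ds.withColumn("rownum", monotonically_increasing_id())
val row100 = withId.filter(col("rownum") === 99).drop("rownum")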


There is a Scala way (if you have enough memory on the working machine):

val arr = df.select("column").rdd.collect
println(arr(100))

If the dataframe schema is unknown and you know the actual type of the "column" field (for example, double), then you can get arr as follows:

val arr = df.select($"column".cast("Double")).as[Double].rdd.collect

You can simply do that with the single line of code below:

val arr = df.select("column").collect()(99)

This works for me in PySpark:

df.select("column").collect()[0][0]

This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less code:

val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")

val myRow7th = parquetFileDF.rdd.take(7).last
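As a small variant (a sketch, not tested here), take() is also available directly on the DataFrame, so the detour through the RDD API can be skipped:

val seventhRow = parquetFileDF.take(7).last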

In PySpark, if your dataset is small (it can fit into the driver's memory), you can do

df.collect()[n]

where df is the DataFrame object and n is the index of the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
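A rough Scala equivalent, assuming n is a 0-based index and myColumn holds strings:

val n = 100
// Bring the data to the driver and index into the resulting Array[Row].
val row = df.collect()(n)
// Read a field by name (or by position with row.get(i)).
val value = row.getAs[String]("myColumn")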


The getrows() function below should get the specific rows you want.

For completeness, I have written down the full code in order to reproduce the output.

# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()

# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])

# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()

# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
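For reference, a rough Scala sketch of the same zipWithIndex approach, assuming a comparable DataFrame df:

// Pair each Row with its 0-based index, keep the wanted indices, drop the index again.
val rownums = Set(0L, 2L)
val selected = df.rdd
  .zipWithIndex()
  .filter { case (_, idx) => rownums.contains(idx) }
  .map { case (row, _) => row }
  .collect()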