[apache-spark] How to delete columns in pyspark dataframe

>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]

The joined DataFrame has two id: bigint columns and I want to drop one of them. How can I do that?

This question is related to apache-spark, apache-spark-sql and pyspark

The answer is


There are two ways to do this:

1: Keep only the columns you need:

drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])  

2: Use drop(), which is the more elegant way:

df = df.drop("col_name")

You should avoid the collect() variants, because collect() sends the complete dataset to the driver, which requires a lot of memory and computing effort.
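As a minimal sketch of what that means in practice (df and "col_name" are placeholders, not from the question):

# Hypothetical DataFrame df; "col_name" is a placeholder column name.
trimmed = df.drop("col_name")   # drop() is a lazy transformation, the data stays distributed
# Only call collect() if you really need every row on the driver:
# rows = trimmed.collect()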


You could either explicitly name the columns you want to keep, like so:

keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]

Or, in a more general approach, you can include all columns except a specific one via a list comprehension, for example (excluding the id column from b):

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']

Finally you make a selection on your join result:

d = a.join(b, a.id==b.id, 'outer').select(*keep)
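Putting it together, here is a rough, self-contained sketch of that approach; the SparkSession setup and the sample rows are invented for illustration and are not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's a and b DataFrames (values are made up).
a = spark.createDataFrame([(1, "2457000", 10), (2, "2457001", 20)],
                          ["id", "julian_date", "user_id"])
b = spark.createDataFrame([(1, 100, 3)],
                          ["id", "quan_created_money", "quan_created_cnt"])

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
d = a.join(b, a.id == b.id, 'outer').select(*keep)
d.printSchema()  # only one id column remains in the schema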

You can delete a column like this:

df.drop("column Name").columns

In your case:

df.drop("id").columns

If you want to drop more than one column, you can do:

dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")

Maybe a little bit off topic, but here is the solution using Scala. Make an Array of column names from your oldDataFrame and delete the columns that you want to drop ("colExclude"). Then pass the Array[Column] to select and unpack it.

import org.apache.spark.sql.{Column, DataFrame}

val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
                                               .map(x => oldDataFrame.col(x))
val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)

Consider two DataFrames:

>>> aDF.show()
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

and

>>> bDF.show()
+---+----+
| id|datB|
+---+----+
|  2|  b2|
|  3|  b3|
|  4|  b4|
+---+----+

To accomplish what you are looking for, there are 2 ways:

1. Use a different join condition. Instead of writing aDF.id == bDF.id

aDF.join(bDF, aDF.id == bDF.id, "outer")

Write this:

aDF.join(bDF, "id", "outer").show()
+---+----+----+
| id|datA|datB|
+---+----+----+
|  1|  a1|null|
|  3|  a3|  b3|
|  2|  a2|  b2|
|  4|null|  b4|
+---+----+----+

This will automatically get rid of the extra id column, so no separate dropping step is needed.

2. Use aliasing: note that with this approach you lose the id values for rows that exist only in bDF (the surviving a.id column is null for them).

>>> from pyspark.sql.functions import col
>>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()

+----+----+----+
|  id|datA|datB|
+----+----+----+
|   1|  a1|null|
|   3|  a3|  b3|
|   2|  a2|  b2|
|null|null|  b4|
+----+----+----+

Adding to @Patrick's answer, you can use the following to drop multiple columns:

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)

Reading the Spark documentation, I found an easier solution.

Since Spark 1.4 there is a drop(col) function that can be used on a PySpark DataFrame.

You can use it in two ways:

  1. df.drop('age').collect()
  2. df.drop(df.age).collect()

Pyspark Documentation - Drop


An easy way to do this is to use select and the fact that you can get a list of all columns of the DataFrame df with df.columns:

drop_list = ['a column', 'another column', ...]

df.select([column for column in df.columns if column not in drop_list])
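Note that select returns a new DataFrame rather than modifying df in place, so reassign the result if you want to keep working with the reduced set of columns (a small sketch reusing the drop_list above):

# select() does not mutate df; reassign to keep the reduced DataFrame.
df = df.select([column for column in df.columns if column not in drop_list])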
