subsetting a Python DataFrame

Question

I am transitioning from R to Python and have just begun using pandas. I have some R code that subsets nicely:

    k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))

Now I want to do something similar in Python. This is what I have so far:

    import pandas as pd
    data = pd.read_csv("../data/monthly_prod_sales.csv")

    # First, index the dataset by Product and get everything that matches a given p.id and time.
    data.set_index('Product')
    k = data.ix[[p.id, 'Time']]

    # Then, index this subset with Time and do more subsetting...

I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and then do the subsetting. Perhaps there is a one-liner that will accomplish all this:

    k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

Thanks.

User · Answer

I've found that you can apply any subset condition to a given column by wrapping it in []. For instance, say you have a df with columns ['Product', 'Time', 'Year', 'Color'], and you want to include only products made before 2014. You could write:

    df[df['Year'] < 2014]

This returns all the rows where the condition holds. You can add further conditions by combining them with &, wrapping each one in parentheses:

    df[(df['Year'] < 2014) & (df['Color'] == 'Red')]

Then just choose the columns you want with a list of column names. For instance, the product and color for the df above:

    df[(df['Year'] < 2014) & (df['Color'] == 'Red')][['Product', 'Color']]
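The bracket filtering described above can be sketched end to end as follows. The sample rows are made up for illustration; only the column names come from the answer. This version builds one combined mask and passes it to .loc together with the column list, which is one reasonable way to do it, not the only one.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df = pd.DataFrame({
    "Product": ["A", "B", "C", "D"],
    "Time": pd.to_datetime(["2012-03-01", "2013-07-15", "2014-01-10", "2013-11-30"]),
    "Year": [2012, 2013, 2014, 2013],
    "Color": ["Red", "Blue", "Red", "Red"],
})

# One combined boolean mask; each comparison needs its own parentheses.
mask = (df["Year"] < 2014) & (df["Color"] == "Red")

# Rows are selected by the mask, columns by the list of labels.
result = df.loc[mask, ["Product", "Color"]]
print(result)
```

Selecting rows and columns in a single .loc call avoids building an intermediate DataFrame for each condition.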

User · Answer

Regarding some points mentioned in previous answers, and to improve readability:

"No need for data.loc or query, but I do think it is a bit long."

"The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators."

I like to write such expressions as follows: fewer brackets, faster to type, easier to read. Closer to R, too.

    # c is just a convenience for splitting a comma-separated list of column names
    c = lambda v: v.split(',')

    q_product = df.Product == p_id
    q_start = df.Time > start_time
    q_end = df.Time < end_time

    df.loc[q_product & q_start & q_end, c('Time,Product')]
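A side benefit of naming the masks, sketched below with made-up data, is that they can be reused and negated. The values of p_id, start_time and end_time, the Sales column, and the q_window name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data and filter values.
df = pd.DataFrame({
    "Product": ["p1", "p1", "p2", "p1"],
    "Time": pd.to_datetime(["2013-01-05", "2013-02-10", "2013-02-20", "2013-04-01"]),
    "Sales": [10, 20, 30, 40],
})
p_id = "p1"
start_time = pd.Timestamp("2013-01-01")
end_time = pd.Timestamp("2013-03-01")

# Named masks: each one is a boolean Series aligned with df's index.
q_product = df["Product"] == p_id
q_window = (df["Time"] > start_time) & (df["Time"] < end_time)

# Reuse the same masks for the rows inside and outside the time window.
inside = df.loc[q_product & q_window, ["Time", "Product"]]
outside = df.loc[q_product & ~q_window, ["Time", "Product"]]
print(inside)
print(outside)
```

Because each comparison is already evaluated when its mask is assigned, the final expression needs no extra parentheses.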

User · Answer

I'll assume that Time and Product are columns in a DataFrame, that df is an instance of DataFrame, and that the other variables are scalar values.

With plain indexing, you have to reference the DataFrame instance in every condition:

    k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]

The parentheses are necessary because of the precedence of the & operator relative to the comparison operators. The & operator is actually an overloaded bitwise operator, and bitwise operators bind more tightly than comparison operators, so without parentheses p_id & df.Time would be evaluated first.

Since pandas 0.13 there is also a DataFrame.query() method. It is very similar to R's subset, apart from the select argument. Local variables are referenced with an @ prefix, and the columns are selected after the query so that Month and Year are still available to it:

    df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]

Here's a simple example (with numpy imported as np and pandas as pd):

    In [9]: df = pd.DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': np.random.poisson(100, size=10)})

    In [10]: df
    Out[10]:
      gender  price
    0      m     89
    1      f    123
    2      f    100
    3      m    104
    4      m     98
    5      m    103
    6      f    100
    7      f    109
    8      f     95
    9      m     87

    In [11]: df.query('gender == "m" and price < 100')
    Out[11]:
      gender  price
    0      m     89
    4      m     98
    9      m     87

The final query that you're interested in can even take advantage of chained comparisons, like this:

    k1 = df.query('Product == @p_id and @start_time <= Time < @end_time')[['Time', 'Product']]
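To check that the two styles in this answer agree, here is a minimal sketch that runs both on the same made-up frame. The data and the values of p_id, start_time and end_time are assumptions; the @ prefix is how DataFrame.query() refers to local Python variables.

```python
import pandas as pd

# Hypothetical sample data and filter values.
df = pd.DataFrame({
    "Product": ["p1", "p1", "p2", "p1"],
    "Time": pd.to_datetime(["2013-01-05", "2013-02-10", "2013-02-20", "2013-04-01"]),
})
p_id = "p1"
start_time = pd.Timestamp("2013-01-05")
end_time = pd.Timestamp("2013-03-01")

# Boolean-mask version: every comparison is parenthesised because & binds
# more tightly than ==, >= and <.
k1_loc = df.loc[
    (df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time),
    ["Time", "Product"],
]

# query() version: columns are referenced by name, local variables with @,
# and the time window is written as a chained comparison.
k1_query = df.query("Product == @p_id and @start_time <= Time < @end_time")[
    ["Time", "Product"]
]

print(k1_loc.equals(k1_query))
```

Which form reads better is largely a matter of taste; the mask version keeps everything in ordinary Python, while the query string is closer to the R original.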

User · Answer

Just for someone looking for a solution more similar to R:

    df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]

No need for data.loc or query, but I do think it is a bit long.
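The first R call in the question filters on Month and Year, which the question says have to be extracted from the timestamp. Assuming Time is (or can be converted to) a datetime column, a sketch using the .dt accessor might look like this. The sample rows and the values of p_id, mn and yr are made up.

```python
import pandas as pd

# Hypothetical sample data; Time is parsed into datetime64 values.
data = pd.DataFrame({
    "Product": ["p1", "p1", "p1", "p2"],
    "Time": pd.to_datetime(["2013-01-15", "2013-06-20", "2014-02-01", "2013-03-03"]),
})
p_id, mn, yr = "p1", 6, 2013

# Derive month and year from the timestamp without adding columns to data.
month = data["Time"].dt.month
year = data["Time"].dt.year

# Same filter as the R call: Product == p.id & Month < mn & Year == yr.
k1 = data.loc[
    (data["Product"] == p_id) & (month < mn) & (year == yr),
    ["Time", "Product"],
]
print(k1)
```

If the timestamps are read from a CSV as strings, pd.to_datetime on the column is needed before .dt can be used.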

User · Answer

Creating an empty DataFrame with known column names:

    Names = ['Col1', 'ActivityID', 'TransactionID']
    df = pd.DataFrame(columns=Names)

Creating a DataFrame from a CSV file (use pd.read_csv; the pd.DataFrame constructor does not read files):

    df = pd.read_csv('file_name.csv')

Creating a dynamic filter to subset a DataFrame:

    i = 12
    df[df['ActivityID'] <= i]

Creating a dynamic filter that also selects only the required columns of the DataFrame:

    df[df['ActivityID'] == i][['TransactionID', 'ActivityID']]
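The steps above can be combined into one small runnable sketch. The CSV content is made up and read from an in-memory string so the example needs no file on disk, and rows_up_to is a hypothetical helper name, not something from pandas.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = """ActivityID,TransactionID,Col1
10,1001,a
12,1002,b
15,1003,c
"""
df = pd.read_csv(io.StringIO(csv_text))


def rows_up_to(frame, limit):
    """Return TransactionID and ActivityID for rows with ActivityID <= limit."""
    return frame.loc[frame["ActivityID"] <= limit, ["TransactionID", "ActivityID"]]


# The threshold is an ordinary Python variable, so the filter is dynamic.
i = 12
subset = rows_up_to(df, i)
print(subset)
```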
