[python] subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:

k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product))

Now, I want to do similar stuff in Python. this is what I have got so far:

import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")


#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
 data.set_index('Product')
 k = data.ix[[p.id, 'Time']]

# then, index this subset with Time and do more subsetting..

I am beginning to feel that I am doing this the wrong way. perhaps, there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:

k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

thanks.

This question is related to python pandas subset

The answer is


Just for someone looking for a solution more similar to R:

df[(df.Product == p_id) & (df.Time> start_time) & (df.Time < end_time)][['Time','Product']]

No need for data.loc or query, but I do think it is a bit long.


I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']

And let's say you want to include products made before 2014. You could write,

df[df['Year'] < 2014]

To return all the rows where this is the case. You can add different conditions.

df[df['Year'] < 2014][df['Color' == 'Red']

Then just choose the columns you want as directed above. For instance, the product color and key for the df above,

df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]

Creating an Empty Dataframe with known Column Name:

Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)

Creating a dataframe from csv:

df = pd.DataFrame('...../file_name.csv')

Creating a dynamic filter to subset a dtaframe:

i = 12
df[df['ActivitiID'] <= i]

Creating a dynamic filter to subset required columns of dtaframe

df[df['ActivityID'] == i][['TransactionID','ActivityID']]

Regarding some points mentioned in previous answers, and to improve readability:

No need for data.loc or query, but I do think it is a bit long.

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.

I like to write such expressions as follows - less brackets, faster to type, easier to read. Closer to R, too.

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time

df.loc[q_product & q_start & q_end, c('Time,Product')]

# c is just a convenience
c = lambda v: v.split(',') 

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to subset

how to remove multiple columns in r dataframe? Using grep to help subset a data frame in R Subset a dataframe by multiple factor levels creating a new list with subset of list using index in python subsetting a Python DataFrame Undefined columns selected when subsetting data frame How to select some rows with specific rownames from a dataframe? Subset data to contain only columns whose names match a condition find all subsets that sum to a particular value Subset and ggplot2