[python] Frequency table for a single variable

One last newbie pandas question for the day: How do I generate a table for a single Series?

For example:

my_series = pandas.Series([1,2,2,3,3,3])
pandas.magical_frequency_function( my_series )

>> {
     1 : 1,
     2 : 2, 
     3 : 3
   }

Lots of googling has led me to Series.describe() and pandas.crosstabs, but neither of these does quite what I need: one variable, counts by categories. Oh, and it'd be nice if it worked for different data types: strings, ints, etc.

This question is related to python statistics pandas frequency

The answer is


for frequency distribution of a variable with excessive values you can collapse down the values in classes,

Here I excessive values for employrate variable, and there's no meaning of it's frequency distribution with direct values_count(normalize=True)

                country  employrate alcconsumption
0           Afghanistan   55.700001            .03
1               Albania   11.000000           7.29
2               Algeria   11.000000            .69
3               Andorra         nan          10.17
4                Angola   75.699997           5.57
..                  ...         ...            ...
208             Vietnam   71.000000           3.91
209  West Bank and Gaza   32.000000               
210         Yemen, Rep.   39.000000             .2
211              Zambia   61.000000           3.56
212            Zimbabwe   66.800003           4.96

[213 rows x 3 columns]

frequency distribution with values_count(normalize=True) with no classification,length of result here is 139 (seems meaningless as a frequency distribution):

print(gm["employrate"].value_counts(sort=False,normalize=True))

50.500000   0.005618
61.500000   0.016854
46.000000   0.011236
64.500000   0.005618
63.500000   0.005618

58.599998   0.005618
63.799999   0.011236
63.200001   0.005618
65.599998   0.005618
68.300003   0.005618
Name: employrate, Length: 139, dtype: float64

putting classification we put all values with a certain range ie.

0-10 as 1,
11-20 as 2  
21-30 as 3, and so forth.
gm["employrate"]=gm["employrate"].str.strip().dropna()  
gm["employrate"]=pd.to_numeric(gm["employrate"])
gm['employrate'] = np.where(
   (gm['employrate'] <=10) & (gm['employrate'] > 0) , 1, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=20) & (gm['employrate'] > 10) , 1, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=30) & (gm['employrate'] > 20) , 2, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=40) & (gm['employrate'] > 30) , 3, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=50) & (gm['employrate'] > 40) , 4, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=60) & (gm['employrate'] > 50) , 5, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=70) & (gm['employrate'] > 60) , 6, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=80) & (gm['employrate'] > 70) , 7, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=90) & (gm['employrate'] > 80) , 8, gm['employrate']
   )
gm['employrate'] = np.where(
   (gm['employrate'] <=100) & (gm['employrate'] > 90) , 9, gm['employrate']
   )
print(gm["employrate"].value_counts(sort=False,normalize=True))

after classification we have a clear frequency distribution. here we can easily see, that 37.64% of countries have employ rate between 51-60% and 11.79% of countries have employ rate between 71-80%

5.000000   0.376404
7.000000   0.117978
4.000000   0.179775
6.000000   0.264045
8.000000   0.033708
3.000000   0.028090
Name: employrate, dtype: float64

You can use list comprehension on a dataframe to count frequencies of the columns as such

[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]

Breakdown:

my_series.select_dtypes(include=['O']) 

Selects just the categorical data

list(my_series.select_dtypes(include=['O']).columns) 

Turns the columns from above into a list

[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)] 

Iterates through the list above and applies value_counts() to each of the columns


The answer provided by @DSM is simple and straightforward, but I thought I'd add my own input to this question. If you look at the code for pandas.value_counts, you'll see that there is a lot going on.

If you need to calculate the frequency of many series, this could take a while. A faster implementation would be to use numpy.unique with return_counts = True

Here is an example:

import pandas as pd
import numpy as np

my_series = pd.Series([1,2,2,3,3,3])

print(my_series.value_counts())
3    3
2    2
1    1
dtype: int64

Notice here that the item returned is a pandas.Series

In comparison, numpy.unique returns a tuple with two items, the unique values and the counts.

vals, counts = np.unique(my_series, return_counts=True)
print(vals, counts)
[1 2 3] [1 2 3]

You can then combine these into a dictionary:

results = dict(zip(vals, counts))
print(results)
{1: 1, 2: 2, 3: 3}

And then into a pandas.Series

print(pd.Series(results))
1    1
2    2
3    3
dtype: int64

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to statistics

Function to calculate R2 (R-squared) in R pandas: find percentile stats of a given column What exactly does numpy.exp() do? Find p-value (significance) in scikit-learn LinearRegression How to plot ROC curve in Python Pandas - Compute z-score for all columns Calculating percentile of dataset column How to normalize an array in NumPy to a unit vector? How to find row number of a value in R code np.mean() vs np.average() in Python NumPy?

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to frequency

How to count how many values per level in a given factor? Relative frequencies / proportions with dplyr Count the frequency that a value occurs in a dataframe column Count frequency of words in a list and sort by frequency Frequency table for a single variable Getting the count of unique values in a column in bash How to find most common elements of a list? How to count the frequency of the elements in an unordered list? Item frequency count in Python