[python] Count the frequency that a value occurs in a dataframe column

I have a dataset

category
cat a
cat b
cat a

I'd like to be able to return something like (showing unique values and frequency)

category   freq 
cat a       2
cat b       1

Tags: python, pandas, frequency

Answers:


In pandas 0.18.1, groupby together with count does not give the frequency of unique values:

>>> df
   a
0  a
1  b
2  s
3  s
4  b
5  a
6  b

>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]

However, the unique values and their frequencies are easily determined using size:

>>> df.groupby('a').size()
a
a    2
b    3
s    2

With df.a.value_counts(), the counts are returned sorted in descending order (largest count first) by default.
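For the df above, that looks like the following (values tied on count may come back in either order):

>>> df.a.value_counts()
b    3
a    2
s    2
Name: a, dtype: int64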


I believe this should work fine for any DataFrame columns list.

def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = (col_name, len(x[col_name].unique()))
        column_list_df.append(y)
    return pd.DataFrame(column_list_df)

column_list(df).rename(columns={0: "Feature", 1: "Value_count"})

The function "column_list" checks the columns names and then checks the uniqueness of each column values.


Using a list comprehension and value_counts for multiple columns in a df

[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]

https://stackoverflow.com/a/28192263/786326
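A minimal sketch of how that comprehension behaves, with a made-up two-column frame (the name my_series comes from the original answer, even though it is a DataFrame):

>>> my_series = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': ['u', 'u', 'v']})
>>> counts = [my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
>>> counts[0]
x    2
y    1
Name: a, dtype: int64

It returns one Series of counts per object-dtype column.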


df.apply(pd.value_counts).fillna(0)

value_counts - returns an object containing counts of unique values

apply - counts the frequency in every column. If you set axis=1, you get the frequency in every row instead

fillna(0) - makes the output cleaner by changing NaN to 0
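A small worked sketch (the df here is made up for illustration; the counts come back as floats because NaNs were introduced before fillna ran, and the row order of the combined index may vary):

>>> df = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': ['y', 'y', 'z']})
>>> df.apply(pd.value_counts).fillna(0)
     a    b
x  2.0  0.0
y  1.0  2.0
z  0.0  1.0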


Your data:

category
cat a
cat b
cat a

solution:

df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()
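Applied to the question's data, this gives:

>>> df
  category  freq
0    cat a     2
1    cat b     1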

df.category.value_counts()

This single line of code will give you the output you want.

If your column name has spaces you can use

df['category'].value_counts()
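Either form, run on the question's data, returns:

>>> df['category'].value_counts()
cat a    2
cat b    1
Name: category, dtype: int64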

You can also do this with pandas by first casting your columns to the category dtype, e.g.

cats = ['client', 'hotel', 'currency', 'ota', 'user_country']

df[cats] = df[cats].astype('category')

and then calling describe:

df[cats].describe()

This will give you a nice table of value counts and a bit more :):

        client  hotel   currency  ota     user_country
count   852845  852845  852845    852845  852845
unique  2554    17477   132       14      219
top     2198    13202   USD       Hades   US
freq    102562  8847    516500    242734  340992

If you want to apply to all columns you can use:

df.apply(pd.value_counts)

This will apply a column-based aggregation function (in this case value_counts) to each of the columns.


If all the values in your DataFrame are of the same type, you can also set return_counts=True in numpy.unique().

index, counts = np.unique(df.values, return_counts=True)

(Despite the name, index here holds the unique values themselves.)

np.bincount() could be faster if your values are integers.
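A minimal sketch of both approaches on a small integer array:

import numpy as np

values = np.array([1, 1, 2, 3, 3, 3])

# General: unique values alongside their counts
uniques, counts = np.unique(values, return_counts=True)
# uniques -> array([1, 2, 3]), counts -> array([2, 1, 3])

# For small non-negative integers, bincount skips the sort:
# position i holds the count of value i
np.bincount(values)
# -> array([0, 2, 1, 3])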


Without any libraries, you could do this instead:

def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable

Example:

>>> to_frequency_table([1, 1, 1, 1, 2, 3, 4, 4])
{1: 4, 2: 1, 3: 1, 4: 2}

@metatoaster has already pointed this out. Go for Counter. It's blazing fast.

import pandas as pd
from collections import Counter
import timeit
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])

Timers

%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop

%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop

%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop

%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop

Cheers!


n_values = data.income.value_counts()

# First unique value count
n_at_most_50k = n_values[0]

# Second unique value count
n_greater_50k = n_values[1]

Output:

>>> n_values
<=50K    34014
>50K     11208
Name: income, dtype: int64

>>> n_greater_50k, n_at_most_50k
(11208, 34014)


How to count how many values per level in a given factor? Relative frequencies / proportions with dplyr Count the frequency that a value occurs in a dataframe column Count frequency of words in a list and sort by frequency Frequency table for a single variable Getting the count of unique values in a column in bash How to find most common elements of a list? How to count the frequency of the elements in an unordered list? Item frequency count in Python