I need to count unique ID
values in every domain
I have data
ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'
I try df.groupby(['domain', 'ID']).count()
But I want to get
domain, count
vk.com 3
twitter.com 2
facebook.com 1
google.com 1
This question is related to
python
pandas
group-by
unique
pandas-groupby
IIUC you want the number of different ID
for every domain
, then you can try this:
output = df.drop_duplicates()
output.groupby('domain').size()
output:
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
dtype: int64
You could also use value_counts
, which is slightly less efficient.But the best is Jezrael's answer using nunique
:
%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 µs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 µs per loop
df.domain.value_counts()
>>> df.domain.value_counts()
vk.com 5
twitter.com 2
google.com 1
facebook.com 1
Name: domain, dtype: int64
Generally to count distinct values in single column, you can use Series.value_counts
:
df.domain.value_counts()
#'vk.com' 5
#'twitter.com' 2
#'facebook.com' 1
#'google.com' 1
#Name: domain, dtype: int64
To see how many unique values in a column, use Series.nunique
:
df.domain.nunique()
# 4
To get all these distinct values, you can use unique
or drop_duplicates
, the slight difference between the two functions is that unique
return a numpy.array
while drop_duplicates
returns a pandas.Series
:
df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)
df.domain.drop_duplicates()
#0 'vk.com'
#2 'twitter.com'
#4 'facebook.com'
#6 'google.com'
#Name: domain, dtype: object
As for this specific problem, since you'd like to count distinct value with respect to another variable, besides groupby
method provided by other answers here, you can also simply drop duplicates firstly and then do value_counts()
:
import pandas as pd
df.drop_duplicates().domain.value_counts()
# 'vk.com' 3
# 'twitter.com' 2
# 'facebook.com' 1
# 'google.com' 1
# Name: domain, dtype: int64
Source: Stackoverflow.com