[python] How to calculate 1st and 3rd quartiles?

I have DataFrame:

    time_diff   avg_trips
0   0.450000    1.0
1   0.483333    1.0
2   0.500000    1.0
3   0.516667    1.0
4   0.533333    2.0

I want to get 1st quartile, 3rd quartile and median for the column time_diff. To obtain median, I use np.median(df["time_diff"].values).

How can I calculate quartiles?

This question is related to python python-2.7 pandas numpy

The answer is


you can use

df.describe()

which would show the information

df.describe()


I also faced a similar problem when trying to find a package that finds quartiles. That's not to say the others are wrong but to say this is how I personally would have defined quartiles. It is similar to Shikar's results with using mid-point but also works on lists that have an odd length. If the quartile position is between lengths, it will use the average of the neighbouring values. (i.e. position always treated as either the exact position or 0.5 of the position)

import math

def find_quartile_postions(size):
    if size == 1:
        # All quartiles are the first (only) element
        return 0, 0, 0
    elif size == 2:
        # Lower quartile is first element, Upper quartile is second element, Median is average
        # Set to 0.5, 0.5, 0.5 if you prefer all quartiles to be the mean value
        return 0, 0.5, 1
    else:
        # Lower quartile is element at 1/4th position, median at 1/2th, upper at 3/4
        # Quartiles can be between positions if size + 1 is not divisible by 4
        return (size + 1) / 4 - 1, (size + 1) / 2 - 1, 3 * (size + 1) / 4 - 1

def find_quartiles(num_array):
    size = len(num_array)
    
    if size == 0:
        quartiles = [0,0,0]
    else:
        sorted_array = sorted(num_array)
        lower_pos, median_pos, upper_pos = find_quartile_postions(size)

        # Floor so can work in arrays
        floored_lower_pos = math.floor(lower_pos)
        floored_median_pos = math.floor(median_pos)
        floored_upper_pos = math.floor(upper_pos)

        # If position is an integer, the quartile is the elem at position
        # else the quartile is the mean of the elem & the elem one position above
        lower_quartile = (sorted_array[floored_lower_pos]
                          if (lower_pos % 1 == 0)
                          else (sorted_array[floored_lower_pos] + sorted_array[floored_lower_pos + 1]) / 2
                         )

        median = (sorted_array[floored_median_pos]
                          if (median_pos % 1 == 0)
                          else (sorted_array[floored_median_pos] + sorted_array[floored_median_pos + 1]) / 2
                         )

        upper_quartile = (sorted_array[floored_upper_pos]
                          if (upper_pos % 1 == 0)
                          else (sorted_array[floored_upper_pos] + sorted_array[floored_upper_pos + 1]) / 2
                         )

        quartiles = [lower_quartile, median, upper_quartile]

    return quartiles

Using np.percentile.

q75, q25 = np.percentile(DataFrame, [75,25])
iqr = q75 - q25

Answer from How do you find the IQR in Numpy?


Coincidentally, this information is captured with the describe method:

df.time_diff.describe()

count    5.000000
mean     0.496667
std      0.032059
min      0.450000
25%      0.483333
50%      0.500000
75%      0.516667
max      0.533333
Name: time_diff, dtype: float64

np.percentile DOES NOT calculate the values of Q1, median, and Q3. Consider the sorted list below:

samples = [1, 1, 8, 12, 13, 13, 14, 16, 19, 22, 27, 28, 31]

running np.percentile(samples, [25, 50, 75]) returns the actual values from the list:

Out[1]: array([12., 14., 22.])

However, the quartiles are Q1=10.0, Median=14, Q3=24.5 (you can also use this link to find the quartiles and median online). One can use the below code to calculate the quartiles and median of a sorted list (because of sorting this approach requires O(nlogn) computations where n is the number of items). Moreover, finding quartiles and median can be done in O(n) computations using the Median of medians Selection algorithm (order statistics).

samples = sorted([28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13])

def find_median(sorted_list):
    indices = []

    list_size = len(sorted_list)
    median = 0

    if list_size % 2 == 0:
        indices.append(int(list_size / 2) - 1)  # -1 because index starts from 0
        indices.append(int(list_size / 2))

        median = (sorted_list[indices[0]] + sorted_list[indices[1]]) / 2
        pass
    else:
        indices.append(int(list_size / 2))

        median = sorted_list[indices[0]]
        pass

    return median, indices
    pass

median, median_indices = find_median(samples)
Q1, Q1_indices = find_median(samples[:median_indices[0]])
Q2, Q2_indices = find_median(samples[median_indices[-1] + 1:])

quartiles = [Q1, median, Q2]

print("(Q1, median, Q3): {}".format(quartiles))

You can use np.percentile to calculate quartiles (including the median):

>>> np.percentile(df.time_diff, 25)  # Q1
0.48333300000000001

>>> np.percentile(df.time_diff, 50)  # median
0.5

>>> np.percentile(df.time_diff, 75)  # Q3
0.51666699999999999

Or all at once:

>>> np.percentile(df.time_diff, [25, 50, 75])
array([ 0.483333,  0.5     ,  0.516667])

If you want to use raw python rather than numpy or panda, you can use the python stats module to find the median of the upper and lower half of the list:

    >>> import statistics as stat
    >>> def quartile(data):
            data.sort()               
            half_list = int(len(data)//2)
            upper_quartile = stat.median(data[-half_list]
            lower_quartile = stat.median(data[:half_list])
            print("Lower Quartile: "+str(lower_quartile))
            print("Upper Quartile: "+str(upper_quartile))
            print("Interquartile Range: "+str(upper_quartile-lower_quartile)

    >>> quartile(df.time_diff)

Line 1: import the statistics module under the alias "stat"

Line 2: define the quartile function

Line 3: sort the data into ascending order

Line 4: get the length of half of the list

Line 5: get the median of the lower half of the list

Line 6: get the median of the upper half of the list

Line 7: print the lower quartile

Line 8: print the upper quartile

Line 9: print the interquartile range

Line 10: run the quartile function for the time_diff column of the DataFrame


try that way:

dfo = sorted(df.time_diff)

n=len(dfo)

Q1=int((n+3)/4)  
Q3=int((3*n+1)/4)  


print("Q1 position: ", Q1, "Q1 position: " ,Q3)

print("Q1 value: ", dfo[Q1], "Q1 value: ", dfo[Q3])

In my efforts to learn object-oriented programming alongside learning statistics, I made this, maybe you'll find it useful:

samplesCourse = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44, 45, 47, 60]

class sampleSet:
    def __init__(self, sampleList):
        self.sampleList = sampleList
        self.interList = list(sampleList) # interList is sampleList alias; alias used to maintain integrity of original sampleList

    def find_median(self):
        self.median = 0

        if len(self.sampleList) % 2 == 0:
            # find median for even-numbered sample list length
            self.medL = self.interList[int(len(self.interList)/2)-1]
            self.medU = self.interList[int(len(self.interList)/2)]
            self.median = (self.medL + self.medU)/2

        else:
            # find median for odd-numbered sample list length
            self.median = self.interList[int((len(self.interList)-1)/2)]
        return self.median

    def find_1stQuartile(self, median):
        self.lower50List = []
        self.Q1 = 0

        # break out lower 50 percentile from sampleList
        if len(self.interList) % 2 == 0:
            self.lower50List = self.interList[:int(len(self.interList)/2)]
        else:
            # drop median to make list ready to divide into 50 percentiles
            self.interList.pop(interList.index(self.median))
            self.lower50List = self.interList[:int(len(self.interList)/2)]

        # find 1st quartile (median of lower 50 percentiles)
        if len(self.lower50List) % 2 == 0:
            self.Q1L = self.lower50List[int(len(self.lower50List)/2)-1]
            self.Q1U = self.lower50List[int(len(self.lower50List)/2)]
            self.Q1 = (self.Q1L + self.Q1U)/2

        else:
            self.Q1 = self.lower50List[int((len(self.lower50List)-1)/2)]

        return self.Q1

    def find_3rdQuartile(self, median):
        self.upper50List = []
        self.Q3 = 0

        # break out upper 50 percentile from sampleList
        if len(self.sampleList) % 2 == 0:
            self.upper50List = self.interList[int(len(self.interList)/2):]
        else:
            self.interList.pop(interList.index(self.median))
            self.upper50List = self.interList[int(len(self.interList)/2):]

        # find 3rd quartile (median of upper 50 percentiles)
        if len(self.upper50List) % 2 == 0:
            self.Q3L = self.upper50List[int(len(self.upper50List)/2)-1]
            self.Q3U = self.upper50List[int(len(self.upper50List)/2)]
            self.Q3 = (self.Q3L + self.Q3U)/2

        else:
            self.Q3 = self.upper50List[int((len(self.upper50List)-1)/2)]

        return self.Q3

    def find_InterQuartileRange(self, Q1, Q3):
        self.IQR = self.Q3 - self.Q1
        return self.IQR

    def find_UpperFence(self, Q3, IQR):
        self.fence = self.Q3 + 1.5 * self.IQR
        return self.fence

samples = sampleSet(samplesCourse)
median = samples.find_median()
firstQ = samples.find_1stQuartile(median)
thirdQ = samples.find_3rdQuartile(median)
iqr = samples.find_InterQuartileRange(firstQ, thirdQ)
fence = samples.find_UpperFence(thirdQ, iqr)

print("Median is: ", median)
print("1st quartile is: ", firstQ)
print("3rd quartile is: ", thirdQ)
print("IQR is: ", iqr)
print("Upper fence is: ", fence)

Building upon or rather correcting a bit on what Babak said....

np.percentile DOES VERY MUCH calculate the values of Q1, median, and Q3. Consider the sorted list below:

s1=[18,45,66,70,76,83,88,90,90,95,95,98]

running np.percentile(s1, [25, 50, 75]) returns the actual values from the list:

[69.  85.5  91.25]

However, the quartiles are Q1=68.0, Median=85.5, Q3=92.5, which is the correct thing to say

What we are missing here is the interpolation parameter of the np.percentile and related functions. By default the value of this argument is linear. This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i < j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.

Thus running np.percentile(s1, [25, 50, 75], interpolation='midpoint') returns the actual results for the list:

[68. 85.5 92.5]

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index