I have a dataframe of the following form (for example):
shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
3,FALSE,0,0,MX,
4,FALSE,22,0,MX,
5,FALSE,0,0,MX,
6,FALSE,0,0,MX,
7,FALSE,5,0,MX,
8,FALSE,0,0,MX,
9,FALSE,4,0,MX,
10,FALSE,2,0,MX,
11,FALSE,0,0,MX,
12,FALSE,13,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,
How can I use Pandas to calculate summary statistics of each column (column data types are variable, and some columns have no information)?
And then return a dataframe of the form:
columnname, max, min, median,
is_martian, NA, NA, FALSE
And so on.
Now there is the pandas_profiling package, which is a more complete alternative to df.describe().
If your pandas dataframe is df, the below will return a complete analysis including some warnings about missing values, skewness, etc. It presents histograms and correlation plots as well.
import pandas_profiling
pandas_profiling.ProfileReport(df)
See the example notebook detailing the usage.
To clarify one point in @EdChum's answer: per the documentation, you can include the object columns by using df.describe(include='all'). It won't provide many statistics for them, but it will provide a few pieces of information, including the count, the number of unique values, and the top value. This may be a new feature; I don't know, as I am a relatively new user.
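As a rough sketch of how df.describe(include='all') could be shaped into the layout asked for in the question (one row per column, with max, min and median), something like the following may work. The sample rows are copied from the question; the names csv_text and summary, and the choice to rename the 50% row to median, are my own assumptions rather than anything from the documentation.

```python
import io

import pandas as pd

# A few rows copied from the question; the last column is empty on purpose.
csv_text = """shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
4,FALSE,22,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,
"""
df = pd.read_csv(io.StringIO(csv_text))

# include="all" describes every column, whatever its dtype.
# Transposing gives one row per original column.
summary = df.describe(include="all").T

# The 50% percentile is the median; statistics that do not apply to a
# column's dtype (e.g. max of a boolean or string column) come back as NaN.
result = summary[["max", "min", "50%"]].rename(columns={"50%": "median"})
result.index.name = "columnname"
print(result)
```

For boolean and string columns, the count, unique, top and freq columns of summary carry the useful information instead, so you may want to keep those alongside max, min and median.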
Source: Stackoverflow.com