I have a data frame, diamonds, which is composed of variables such as carat, price, and color. I want to draw a scatter plot of price against carat for each color, so that each diamond color is drawn in a different plot color.
This is easy in R with ggplot:
ggplot(aes(x=carat, y=price, color=color), # by setting color=color, ggplot automatically draws each color differently
data=diamonds) + geom_point(stat='summary', fun.y=median)
How could this be done in Python using matplotlib?
PS: I know about auxiliary plotting packages such as seaborn and ggplot for Python, but I would prefer not to use them. I just want to find out whether it is possible to do the job using matplotlib alone. ;P
You can convert the categorical column into a numerical one and pass the resulting codes to the c parameter:
# Convert the column to the pandas categorical dtype (the dtype name is 'category')
cat_col = df['column_name'].astype('category')
# Get the integer code assigned to each category
cat_codes = cat_col.cat.codes
# Use the c parameter to color each point by its category code
plt.scatter(df['column1'], df['column2'], c=cat_codes)
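One thing the category-code approach leaves out is a legend that shows the original labels. Below is a minimal sketch of one way to add it, assuming matplotlib 3.1 or later (for `legend_elements`) and a small made-up frame standing in for diamonds; the column values and the `tab10` colormap are illustrative choices, not part of the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample standing in for the diamonds data (assumption)
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 0.4, 0.9, 1.2, 0.6],
    "price": [400, 900, 1800, 4000, 600, 3200, 5500, 1300],
    "color": ["D", "E", "F", "G", "D", "E", "F", "G"],
})

cat = df["color"].astype("category")
codes = cat.cat.codes  # integer code per row, in the order of cat.cat.categories

fig, ax = plt.subplots()
sc = ax.scatter(df["carat"], df["price"], c=codes, cmap="tab10")

# num=None asks for one legend handle per unique code, in ascending code order,
# which lines up with the order of the category labels
handles, _ = sc.legend_elements(num=None)
ax.legend(handles, list(cat.cat.categories), title="color")
ax.set_xlabel("carat")
ax.set_ylabel("price")
# plt.show()  # uncomment when running interactively
```

Because the codes are plain integers, the colors come from whichever colormap is passed as `cmap`; a qualitative map such as `tab10` keeps neighbouring categories visually distinct.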
I had the same question, and have spent all day trying out different packages.
I had originally used matplotlib, and was not happy with either mapping categories to predefined colors or grouping/aggregating and then iterating through the groups (and still having to map colors). I just felt it was a poor package implementation.
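For anyone who wants to stay with matplotlib alone, the group-then-iterate approach mentioned above could look roughly like this. It is only a sketch on made-up data: the median aggregation mirrors the stat='summary' call in the question, and it relies on matplotlib assigning the next color from its default cycle on each scatter call, so no manual color mapping is needed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample standing in for the diamonds data (assumption)
df = pd.DataFrame({
    "carat": [0.3, 0.3, 0.5, 0.5, 0.3, 0.3, 0.5, 0.5],
    "price": [400, 500, 900, 1100, 350, 450, 800, 1000],
    "color": ["D", "D", "D", "D", "E", "E", "E", "E"],
})

# Median price for every (color, carat) pair
medians = df.groupby(["color", "carat"], as_index=False)["price"].median()

fig, ax = plt.subplots()
# One scatter call per color; each call picks up the next color in the cycle
for name, group in medians.groupby("color"):
    ax.scatter(group["carat"], group["price"], label=name)

ax.legend(title="color")
ax.set_xlabel("carat")
ax.set_ylabel("price")
# plt.show()  # uncomment when running interactively
```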
Seaborn wouldn't work in my case, and Altair, at least in my setup, only worked inside a Jupyter Notebook.
The best solution for me was PlotNine, which "is an implementation of a grammar of graphics in Python, and based on ggplot2".
Below is the plotnine code to replicate your R example in Python:
from plotnine import *
from plotnine.data import diamonds
g = ggplot(diamonds, aes(x='carat', y='price', color='color')) + geom_point(stat='summary')
print(g)
So clean and simple :)
I usually do it using Seaborn, which is built on top of matplotlib:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width',
hue='species', data=iris);
Here's a succinct and generic solution that uses a seaborn color palette.
First find a color palette you like and optionally visualize it:
sns.palplot(sns.color_palette("Set2", 8))
Then you can use it with matplotlib like this:
# Unique category labels: 'D', 'F', 'G', ...
color_labels = df['color'].unique()
# List of RGB triplets
rgb_values = sns.color_palette("Set2", 8)
# Map label to RGB
color_map = dict(zip(color_labels, rgb_values))
# Finally use the mapped values
plt.scatter(df['carat'], df['price'], c=df['color'].map(color_map))
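The same label-to-RGB mapping can be built without seaborn, since matplotlib ships the Set2 palette as a colormap. The sketch below assumes made-up data and adds a legend through proxy `Line2D` handles, which is one possible way to do it rather than the only one. Note that `zip` silently drops any labels beyond the palette length (8 for Set2), so this only suits columns with at most 8 categories.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

# Made-up sample standing in for the diamonds data (assumption)
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 0.4, 0.9, 1.2, 0.6],
    "price": [400, 900, 1800, 4000, 600, 3200, 5500, 1300],
    "color": ["D", "E", "F", "G", "D", "E", "F", "G"],
})

color_labels = sorted(df["color"].unique())
rgb_values = plt.cm.Set2.colors  # tuple of 8 RGB triplets
color_map = dict(zip(color_labels, rgb_values))

fig, ax = plt.subplots()
# A plain list of RGB tuples, one per row
point_colors = df["color"].map(color_map).tolist()
ax.scatter(df["carat"], df["price"], c=point_colors)

# Proxy artists: one marker per label, used only to populate the legend
handles = [
    Line2D([], [], marker="o", linestyle="", color=rgb, label=label)
    for label, rgb in color_map.items()
]
ax.legend(handles=handles, title="color")
# plt.show()  # uncomment when running interactively
```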
With df.plot()
Normally, when quickly plotting a DataFrame, I use pd.DataFrame.plot(). This takes the index as the x value and the values as the y values, and plots each column separately in a different color.
A DataFrame in this form can be obtained by using set_index and unstack.
import matplotlib.pyplot as plt
import pandas as pd
carat = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
price = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
color = ['D', 'D', 'D', 'E', 'E', 'E', 'F', 'F', 'F', 'G', 'G', 'G']
df = pd.DataFrame(dict(carat=carat, price=price, color=color))
df.set_index(['color', 'carat']).unstack('color')['price'].plot(style='o')
plt.ylabel('price')
With this method you do not have to manually specify the colors.
This procedure may make more sense for other data series. In my case I have time series data, so the MultiIndex consists of datetimes and categories. It is also possible to use this approach with more than one column to color by, but the legend gets messy.
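One caveat: set_index followed by unstack raises an error when the same (color, carat) pair occurs more than once, which is likely in a real dataset like diamonds. A possible workaround, sketched here on made-up data, is pivot_table with an aggregation function; the median is chosen to match the summary in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample with repeated (color, carat) pairs (assumption)
df = pd.DataFrame({
    "carat": [5, 5, 10, 5, 10, 10],
    "price": [100, 300, 200, 400, 500, 700],
    "color": ["D", "D", "D", "E", "E", "E"],
})

# One column per color, one row per carat, median price in each cell
wide = df.pivot_table(index="carat", columns="color",
                      values="price", aggfunc="median")

ax = wide.plot(style="o")  # each column gets its own color automatically
ax.set_ylabel("price")
# plt.show()  # uncomment when running interactively
```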
Using Altair:
# Updated for the current Altair API; the sample data ships in the separate vega_datasets package
import altair as alt
from vega_datasets import data
df = data.iris()
alt.Chart(df).mark_point().encode(x='petalLength', y='sepalLength', color='species')
Here is a combination of markers and colors from a qualitative colormap in matplotlib:
import itertools
import numpy as np
from matplotlib import markers
import matplotlib.pyplot as plt
m_styles = markers.MarkerStyle.markers
N = 60
colormap = plt.cm.Dark2.colors # Qualitative colormap
for i, (marker, color) in zip(range(N), itertools.product(m_styles, colormap)):
    plt.scatter(*np.random.random(2), color=color, marker=marker, label=i)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., ncol=4);
Source: Stackoverflow.com