[r] Make Frequency Histogram for Factor Variables

I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue, but couldn't find a solution.

Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that contains the names of different types of animals. I created it like this:

animals <- c("cat", "dog",  "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")

I turn it into a factor for use with other vectors in my data frame:

animalFactor <- as.factor(animals)

I now want to create a histogram that shows the frequency of each level on the y-axis, the name of each level on the x-axis, and contains one bar per level. I attempt this code:

hist(table(animalFactor), freq=TRUE, xlab = levels(animalFactor), ylab = "Frequencies")

The output is absolutely nothing like I'd expect. Labeling problems aside, I can't seem to figure out how to create a simple frequency histogram by category.


The answer is


The reason you are getting the unexpected result is that hist(...) computes a distribution from a numeric vector. In your code, table(animalFactor) behaves like a numeric vector with three elements: 1, 3, 7. So hist(...) bins those three values and plots how often each occurs: one 1, one 3, and one 7. @Roland's solution is the simplest.
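To make the point above concrete, here is a minimal base-R sketch. It prints the counts that hist(...) was actually binning and then passes them to barplot(), which draws one bar per level. This is only one possible base-R approach and is not necessarily the solution referred to above; the axis labels are illustrative choices.

```r
animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")
animalFactor <- as.factor(animals)

# table() counts how many times each level occurs
counts <- table(animalFactor)
print(counts)

# barplot() draws one bar per level, with the bar height equal to the count
barplot(counts, xlab = "Animal", ylab = "Frequency")
```

Because counts keeps the level names, the x-axis is labeled automatically and no xlab = levels(...) workaround is needed.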

Here's a way to do this using ggplot:

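library(ggplot2)
ggp <- ggplot(data.frame(animals), aes(x = animals))
# counts (geom_bar() counts discrete values; recent ggplot2 versions reject geom_histogram() here)
ggp + geom_bar(fill = "lightgreen")
# proportion (after_stat() needs ggplot2 >= 3.3.0; older versions used ..count..)
ggp + geom_bar(fill = "lightblue", aes(y = after_stat(count / sum(count))))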

You would get precisely the same result using animalFactor instead of animals in the code above.


You could also use lattice::histogram()
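As a rough sketch of the lattice suggestion, assuming the animalFactor vector from the question: histogram() accepts a factor in a one-sided formula and draws one bar per level. It shows percentages by default, so type = "count" is passed here to get raw frequencies; the axis labels are illustrative.

```r
library(lattice)

animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")
animalFactor <- as.factor(animals)

# One bar per factor level; type = "count" shows frequencies instead of percentages
p <- histogram(~ animalFactor, type = "count", xlab = "Animal", ylab = "Frequency")
print(p)  # lattice plots must be printed explicitly when not at the console
```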


Country is a categorical variable and I want to see how many occurrences of each country exist in the data set. In other words, how many records/attendees are from each Country.

barplot(summary(df$Country))

A factor can be passed directly to the plot function, which draws a bar chart of the level counts.
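A small self-contained sketch of the barplot(summary(...)) approach follows. The data frame df and its Country values are made up for illustration. Note that summary() returns per-level counts only when the column is a factor; for a character column, table() is the safer choice.

```r
# Hypothetical data: one row per attendee
df <- data.frame(
  Country = factor(c("France", "Japan", "France", "Canada", "Japan", "France"))
)

# For a factor, summary() returns a named vector of counts per level
country_counts <- summary(df$Country)
print(country_counts)

barplot(country_counts, xlab = "Country", ylab = "Number of records")

# Equivalent, and also works if Country is a character column:
# barplot(table(df$Country))
```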

An answer to a similar question has been given here: https://stat.ethz.ch/pipermail/r-help/2010-December/261873.html

x <- sample(c("Richard", "Minnie", "Albert", "Helen", "Joe", "Kingston"),
            50, replace = TRUE)
x <- as.factor(x)
plot(x)

If you'd like to do this in ggplot, note that an API change to geom_histogram() makes it raise an error on discrete data: https://github.com/hadley/ggplot2/issues/1465

To get around this, use geom_bar():

animals <- c("cat", "dog",  "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")

library(ggplot2)
# counts
ggplot(data.frame(animals), aes(x=animals)) +
  geom_bar()


