[r] Splitting a continuous variable into equal sized groups

I need to split/divide up a continuous variable into 3 equal sized groups.

Example data frame:

das <- data.frame(anim = 1:15,
                  wt = c(181,179,180.5,201,201.5,245,246.4,
                         189.3,301,354,369,205,199,394,231.3))

After being cut up (according to the value of wt), I would need to have the 3 classes under the new variable wt2 like this:

> das 
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

This would be applied to a large data set.

This question is related to r variables split continuous

The answer is


equal_freq from funModeling takes a vector and the number of bins (based on equal frequency):

das <- data.frame(anim=1:15,
                  wt=c(181,179,180.5,201,201.5,245,246.4,
                       189.3,301,354,369,205,199,394,231.3))

das$wt_bin=funModeling::equal_freq(das$wt, 3)

table(das$wt_bin)

#[179,201) [201,246) [246,394] 
#        5         5         5 

You can also use the bin function with method = "content" from the OneR package for that:

library(OneR)
das$wt_2 <- as.numeric(bin(das$wt, nbins = 3, method = "content"))
das
##    anim    wt wt_2
## 1     1 181.0    1
## 2     2 179.0    1
## 3     3 180.5    1
## 4     4 201.0    2
## 5     5 201.5    2
## 6     6 245.0    2
## 7     7 246.4    3
## 8     8 189.3    1
## 9     9 301.0    3
## 10   10 354.0    3
## 11   11 369.0    3
## 12   12 205.0    2
## 13   13 199.0    1
## 14   14 394.0    3
## 15   15 231.3    2

Without any extra package, 3 being the number of groups:

> findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE)
 [1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2

You can speed up the quantile computation by using a representative sample of the values of interest. Double check the documentation of the FindInterval function.


Here's another solution using the bin_data() function from the mltools package.

library(mltools)

# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")

# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")

# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")

das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

Alternative without using cut2.

das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))

or

das$wt2 <- as.factor( cut(das$wt,3, labels=F))

As pointed out by @ben-bolker this splits into equal-widths rather occupancy. I think that using quantiles one can approximate equal-occupancy

x = rnorm(10)
x
 [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
 [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
bin = 3 #for 1/3 rd, 4 for 1/4, 100 for 1/100th etc
xx = cut(x, quantile(x, breaks=1/bin*c(1:bin)), labels=F, include.lowest=T)
table(xx)
1 2 3 4
3 2 2 3

cut, when not given explicit break points divides values into bins of same width, they won't contain an equal number of items in general:

x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5]    (5.5,10] 
#           4           1 

Hmisc::cut2 and ggplot2::cut_number use quantiles, which will usually create groups of same size (in term of number of elements) if the data is well spread and of decent size, it's not always the case however. mltools::bin_data can give different results but is also based on quantiles.

These functions don't always give neat results when the data contains a small number of distinct values :

x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                   23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

table(x)
# x
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 15  7 10  3  9  3  4  9  3  2 23  2  4  1  1  7 18 37  6  2   

table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20] 
#      44      44      70       8

table(ggplot2::cut_number(x, 4))
# [1,5]  (5,11] (11,18] (18,20] 
#    44      44      70       8

table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5)  [5, 11) [11, 18) [18, 20] 
#     35       30       56       45

This is not clear if the optimal solution has been found here.

What is the best binning approach is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.

The function smart_cut from (my) package cutr proposes such feature. It's computationally heavy though and should be reserved to cases where cut points and unique values are few (which happen to be usually the case where it matters).

# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6)  [6,12) [12,18) [18,20] 
# 44      44      33      45 

We see the groups are much better balanced.

"balanced" in the call can in fact be replaced by a custom function to optimize or restrict the bins as desired if the method based on variance isn't enough.


Or see cut_number from the ggplot2 package, e.g.

das$wt_2 <- as.numeric(cut_number(das$wt,3))

Note that cut(...,3) divides the range of the original data into three ranges of equal lengths; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number does by using quantile appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2() using the g= argument does split by quantiles, so is more or less equivalent to ggplot2::cut_number. I might have thought that something like cut_number would have made its way into dplyr by so far, but as far as I can tell it hasn't.


ntile from dplyr now does this but behaves weirdly with NA's.

I've used similar code in the following function that works in base R and does the equivalent of the cut2 solution above:

ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    return(d)
}

If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer above - use ggplot2::cut_number(). For sake of completion here are the 3 methods of converting continuous to categorical (binning).

  • cut_number(): Makes n groups with (approximately) equal numbers of observation
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups of width

My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to variables

When to create variables (memory management) How to print a Groovy variable in Jenkins? What does ${} (dollar sign and curly braces) mean in a string in Javascript? How to access global variables How to initialize a variable of date type in java? How to define a variable in a Dockerfile? Why does foo = filter(...) return a <filter object>, not a list? How can I pass variable to ansible playbook in the command line? How do I use this JavaScript variable in HTML? Static vs class functions/variables in Swift classes?

Examples related to split

Parameter "stratify" from method "train_test_split" (scikit Learn) Pandas split DataFrame by column value How to split large text file in windows? Attribute Error: 'list' object has no attribute 'split' Split function in oracle to comma separated values with automatic sequence How would I get everything before a : in a string Python Split String by delimiter position using oracle SQL JavaScript split String with white space Split a String into an array in Swift? Split pandas dataframe in two if it has more than 10 rows

Examples related to continuous

Splitting a continuous variable into equal sized groups