Splitting a continuous variable into equal sized groups

Question

I need to split/divide up a continuous variable into 3 equal sized groups.

Example data frame:

    das <- data.frame(anim = 1:15,
                      wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                             189.3, 301, 354, 369, 205, 199, 394, 231.3))

After being cut up (according to the value of wt), I would need to have the 3 classes under the new variable wt2 like this:

    > das
       anim    wt wt2
    1     1 181.0   1
    2     2 179.0   1
    3     3 180.5   1
    4     4 201.0   2
    5     5 201.5   2
    6     6 245.0   2
    7     7 246.4   3
    8     8 189.3   1
    9     9 301.0   3
    10   10 354.0   3
    11   11 369.0   3
    12   12 205.0   2
    13   13 199.0   1
    14   14 394.0   3
    15   15 231.3   2

This would be applied to a large data set.

User · Answer

Or see cut_number() from the ggplot2 package, e.g.

    das$wt_2 <- as.numeric(cut_number(das$wt, 3))

Note that cut(..., 3) divides the range of the original data into three ranges of equal length; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed. (You can replicate what cut_number() does by using quantile() appropriately, but it's a nice convenience function.)

On the other hand, Hmisc::cut2() with the g= argument does split by quantiles, so it is more or less equivalent to ggplot2::cut_number(). I might have thought that something like cut_number() would have made its way into dplyr by now, but as far as I can tell it hasn't.
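The remark about replicating cut_number() with quantile() can be sketched in base R as follows. This is a minimal sketch, not the ggplot2 implementation: equal_count_bins is a hypothetical helper name, and dropping duplicated breaks with unique() is an assumption about how to handle ties.

```r
# Example data from the question
wt <- c(181, 179, 180.5, 201, 201.5, 245, 246.4,
        189.3, 301, 354, 369, 205, 199, 394, 231.3)

# Hypothetical helper: bin x into n groups of (roughly) equal size
equal_count_bins <- function(x, n) {
  # n + 1 evenly spaced quantiles serve as the break points
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  # unique() drops repeated breaks, which heavily tied data can produce
  cut(x, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

wt_equal_count <- equal_count_bins(wt, 3)    # quantile-based breaks
wt_equal_width <- cut(wt, 3, labels = FALSE) # equal-width breaks

table(wt_equal_count)  # group sizes 5, 5, 5
table(wt_equal_width)  # group sizes 11, 1, 3
```

On the question's data the quantile-based version reproduces the requested wt2 column, while the equal-width version puts eleven of the fifteen animals into the first group.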

User · Answer

You can also use the bin function with method = "content" from the OneR package for that:

    library(OneR)
    das$wt_2 <- as.numeric(bin(das$wt, nbins = 3, method = "content"))
    das
    ##    anim    wt wt_2
    ## 1     1 181.0    1
    ## 2     2 179.0    1
    ## 3     3 180.5    1
    ## 4     4 201.0    2
    ## 5     5 201.5    2
    ## 6     6 245.0    2
    ## 7     7 246.4    3
    ## 8     8 189.3    1
    ## 9     9 301.0    3
    ## 10   10 354.0    3
    ## 11   11 369.0    3
    ## 12   12 205.0    2
    ## 13   13 199.0    1
    ## 14   14 394.0    3
    ## 15   15 231.3    2

User · Answer

ntile from dplyr now does this, but it behaves weirdly with NAs.

I've used similar code in the following function, which works in base R and does the equivalent of the cut2 solution above:

    ntile_ <- function(x, n) {
        # rank only the non-missing values
        b <- x[!is.na(x)]
        q <- floor((n * (rank(b, ties.method = "first") - 1) / length(b)) + 1)
        # put the group numbers back, leaving NA where x was NA
        d <- rep(NA, length(x))
        d[!is.na(x)] <- q
        return(d)
    }
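A short usage sketch shows how the NA handling plays out. The function is repeated from the answer so that the snippet runs on its own; the vector wt_na is made up for illustration.

```r
# Function from the answer, repeated so this snippet is self-contained
ntile_ <- function(x, n) {
  b <- x[!is.na(x)]
  q <- floor((n * (rank(b, ties.method = "first") - 1) / length(b)) + 1)
  d <- rep(NA, length(x))
  d[!is.na(x)] <- q
  d
}

# Illustrative vector with two missing weights (not from the original post)
wt_na <- c(181, NA, 180.5, 201, 201.5, 245, NA, 189.3, 301)

groups <- ntile_(wt_na, 3)
groups         # 1 NA 1 2 2 3 NA 1 3
table(groups)  # the 7 non-missing values are split 3 / 2 / 2
```

Missing values stay NA and do not count towards the group sizes, so the groups are balanced over the observed values only.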

User · Answer

Here's another solution using the bin_data() function from the mltools package.

    library(mltools)

    # Resulting bins have an equal number of observations in each group
    das[, "wt2"] <- bin_data(das$wt, bins = 3, binType = "quantile")

    # Resulting bins are equally spaced from min to max
    das[, "wt3"] <- bin_data(das$wt, bins = 3, binType = "explicit")

    # Or if you'd rather define the bins yourself
    das[, "wt4"] <- bin_data(das$wt, bins = c(-Inf, 250, 322, Inf), binType = "explicit")

    das
       anim    wt                                  wt2                                  wt3         wt4
    1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
    2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
    3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
    4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
    5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
    6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
    7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
    8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
    9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
    10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
    11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
    12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
    13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
    14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
    15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

User · Answer

equal_freq from funModeling takes a vector and the number of bins (based on equal frequency):

    das <- data.frame(anim = 1:15,
                      wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                             189.3, 301, 354, 369, 205, 199, 394, 231.3))

    das$wt_bin <- funModeling::equal_freq(das$wt, 3)

    table(das$wt_bin)

    # [179,201) [201,246) [246,394]
    #         5         5         5

User · Answer

If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer above: use ggplot2::cut_number(). For the sake of completeness, here are the 3 methods of converting continuous to categorical (binning):

cut_number(): makes n groups with (approximately) equal numbers of observations
cut_interval(): makes n groups with equal range
cut_width(): makes groups of a given width

My go-to is cut_number() because it uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

    library(tidyverse)

    skewed_tbl <- tibble(
        counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
                   rep(1:5, 5), rep(1:2, 10), rep(1, 20))
        ) %>%
        mutate(
            counts_cut_number   = cut_number(counts, n = 4),
            counts_cut_interval = cut_interval(counts, n = 4),
            counts_cut_width    = cut_width(counts, width = 25)
            )

    # Data
    skewed_tbl
    #> # A tibble: 265 x 4
    #>    counts counts_cut_number counts_cut_interval counts_cut_width
    #>     <dbl> <fct>             <fct>               <fct>
    #>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
    #> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
    #> # ... with 255 more rows

    summary(skewed_tbl$counts)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #>    1.00    3.00   13.00   25.75   42.00  100.00

    # Histogram showing skew
    skewed_tbl %>%
        ggplot(aes(counts)) +
        geom_histogram(bins = 30)

    # cut_number() evenly distributes observations into bins by quantile
    skewed_tbl %>%
        ggplot(aes(counts_cut_number)) +
        geom_bar()

    # cut_interval() evenly splits the interval across the range
    skewed_tbl %>%
        ggplot(aes(counts_cut_interval)) +
        geom_bar()

    # cut_width() uses width = 25 to create bins that are 25 wide
    skewed_tbl %>%
        ggplot(aes(counts_cut_width)) +
        geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)

User · Answer

Alternative without using cut2:

    das$wt2 <- as.factor(as.numeric(cut(das$wt, 3)))

or

    das$wt2 <- as.factor(cut(das$wt, 3, labels = FALSE))

As pointed out by @ben-bolker, this splits into equal widths rather than equal occupancy. I think that using quantiles one can approximate equal occupancy:

    x = rnorm(10)
    x
     [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
     [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
    bin = 3  # 3 for thirds, 4 for quarters, 100 for hundredths, etc.
    # quantile() takes its cut points via probs; it has no breaks argument,
    # so passing breaks= would silently fall back to the default quartiles
    xx = cut(x, quantile(x, probs = seq(0, 1, length.out = bin + 1)),
             labels = FALSE, include.lowest = TRUE)
    table(xx)
    xx
    1 2 3
    4 3 3

User · Answer

cut, when not given explicit break points, divides values into bins of the same width; in general these won't contain an equal number of items:

    x <- c(1:4, 10)
    lengths(split(x, cut(x, 2)))
    # (0.991,5.5]    (5.5,10]
    #           4           1

Hmisc::cut2 and ggplot2::cut_number use quantiles, which will usually create groups of the same size (in terms of number of elements) if the data is well spread and of decent size. That is not always the case, however. mltools::bin_data can give different results but is also based on quantiles.

These functions don't always give neat results when the data contains a small number of distinct values:

    x <- rep(c(1:20), c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                        23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

    table(x)
    # x
    #  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    # 15  7 10  3  9  3  4  9  3  2 23  2  4  1  1  7 18 37  6  2

    table(Hmisc::cut2(x, g = 4))
    # [ 1, 6) [ 6,12) [12,19) [19,20]
    #      44      44      70       8

    table(ggplot2::cut_number(x, 4))
    #   [1,5]  (5,11] (11,18] (18,20]
    #      44      44      70       8

    table(mltools::bin_data(x, bins = 4, binType = "quantile"))
    #   [1, 5)  [5, 11) [11, 18) [18, 20]
    #       35       30       56       45

It is not clear whether the optimal solution has been found here.

Which binning approach is best is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.

The function smart_cut from (my) package cutr offers such a feature. It is computationally heavy, though, and should be reserved for cases where cut points and unique values are few (which usually happen to be the cases where it matters).

    # devtools::install_github("moodymudskipper/cutr")
    table(cutr::smart_cut(x, list(4, "balanced"), "g"))
    #   [1,6)  [6,12) [12,18) [18,20]
    #      44      44      33      45

We see the groups are much better balanced.

"balanced" in the call can in fact be replaced by a custom function to optimize or restrict the bins as desired, if the method based on variance isn't enough.
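The "variance around the expected bin size" criterion can be made concrete with a few lines of base R. This is a sketch under assumptions, not the cutr implementation: bin_imbalance is a hypothetical helper, and it uses the sum of squared deviations from the ideal bin size, which is proportional to that variance for a fixed number of bins.

```r
x <- rep(1:20, c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                 23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

# Hypothetical helper: squared deviation of the bin counts from the ideal size
bin_imbalance <- function(x, breaks) {
  # left-closed bins [a, b), with the last bin also closed on the right
  counts <- as.vector(table(cut(x, breaks, right = FALSE, include.lowest = TRUE)))
  expected <- length(x) / length(counts)
  sum((counts - expected)^2)
}

# Breaks reported above for Hmisc::cut2(x, g = 4)
score_cut2 <- bin_imbalance(x, c(1, 6, 12, 19, 20))   # 1947

# Breaks reported above for cutr::smart_cut(x, list(4, "balanced"), "g")
score_smart <- bin_imbalance(x, c(1, 6, 12, 18, 20))  # 97
```

With 166 observations and 4 bins the ideal size is 41.5, so moving the third break from 19 to 18 cuts the score from 1947 to 97, which matches the visual impression that the second set of bins is better balanced.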

User · Answer

Try this:

    split(das, cut(das$anim, 3))

If you want to split based on the value of wt, then:

    library(Hmisc) # cut2
    split(das, cut2(das$wt, g = 3))

Anyway, you can do that by combining cut, cut2 and split.

UPDATED

If you want a group index as an additional column, then:

    das$group <- cut(das$anim, 3)

If the column should be an index like 1, 2, ..., then:

    das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

Try this:

    > das$wt2 <- as.numeric(cut2(das$wt, g = 3))
    > das
       anim    wt wt2
    1     1 181.0   1
    2     2 179.0   1
    3     3 180.5   1
    4     4 201.0   2
    5     5 201.5   2
    6     6 245.0   2
    7     7 246.4   3
    8     8 189.3   1
    9     9 301.0   3
    10   10 354.0   3
    11   11 369.0   3
    12   12 205.0   2
    13   13 199.0   1
    14   14 394.0   3
    15   15 231.3   2

User · Answer

Without any extra package, 3 being the number of groups:

    > findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE)
     [1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2

You can speed up the quantile computation by using a representative sample of the values of interest. Double check the documentation of the findInterval function.
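The suggestion to estimate the quantiles from a sample might look like the sketch below. The simulated vector, the subsample size of 10,000 and the widening of the outer breaks to -Inf/Inf are all assumptions made for illustration; the resulting groups are only approximately equal in size.

```r
set.seed(42)
x <- rexp(1e6)   # large skewed vector, simulated for illustration
n_groups <- 3

# Estimate the cut points from a random subsample instead of the full vector
sub <- sample(x, 1e4)
breaks <- unique(quantile(sub, seq(0, 1, length.out = n_groups + 1)))

# Widen the outer breaks so values outside the subsample's range still get a bin
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

group <- findInterval(x, breaks, rightmost.closed = TRUE)
shares <- as.vector(table(group)) / length(x)
shares           # each share should be close to 1/3
```

Without the widened outer breaks, findInterval() would return 0 for values below the smallest sampled value and n_groups + 1 for values above the largest, so those cases would need separate handling.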
