[r] How to sort a dataframe by multiple column(s)

I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2

This question is related to r sorting dataframe r-faq

The answer is


or you can use package doBy

library(doBy)
dd <- orderBy(~-z+b, data=dd)

Dirk's answer is great. It also highlights a key difference in the syntax used for indexing data.frames and data.tables:

## The data.frame way
dd[with(dd, order(-z, b)), ]

## The data.table way: (7 fewer characters, but that's not the important bit)
dd[order(-z, b)]

The difference between the two calls is small, but it can have important consequences. Especially if you write production code and/or are concerned with correctness in your research, it's best to avoid unnecessary repetition of variable names. data.table helps you do this.

Here's an example of how repetition of variable names might get you into trouble:

Let's change the context from Dirk's answer, and say this is part of a bigger project where there are a lot of object names and they are long and meaningful; instead of dd it's called quarterlyreport. It becomes :

quarterlyreport[with(quarterlyreport,order(-z,b)),]

Ok, fine. Nothing wrong with that. Next your boss asks you to include last quarter's report in the report. You go through your code, adding an object lastquarterlyreport in various places and somehow (how on earth?) you end up with this :

quarterlyreport[with(lastquarterlyreport,order(-z,b)),]

That isn't what you meant but you didn't spot it because you did it fast and it's nestled on a page of similar code. The code doesn't fall over (no warning and no error) because R thinks it is what you meant. You'd hope whoever reads your report spots it, but maybe they don't. If you work with programming languages a lot then this situation may be all to familiar. It was a "typo" you'll say. I'll fix the "typo" you'll say to your boss.

In data.table we're concerned about tiny details like this. So we've done something simple to avoid typing variable names twice. Something very simple. i is evaluated within the frame of dd already, automatically. You don't need with() at all.

Instead of

dd[with(dd, order(-z, b)), ]

it's just

dd[order(-z, b)]

And instead of

quarterlyreport[with(lastquarterlyreport,order(-z,b)),]

it's just

quarterlyreport[order(-z,b)]

It's a very small difference, but it might just save your neck one day. When weighing up the different answers to this question, consider counting the repetitions of variable names as one of your criteria in deciding. Some answers have quite a few repeats, others have none.


Another alternative, using the rgr package:

> library(rgr)
> gx.sort.df(dd, ~ -z+b)
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

The R package data.table provides both fast and memory efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There has been quite a lot of improvements and also a new function setorder() since then. From v1.9.5+, setorder() also works with data.frames.

First, we'll create a dataset big enough and benchmark the different methods mentioned from other answers and then list the features of data.table.

Data:

require(plyr)
require(doBy)
require(data.table)
require(dplyr)
require(taRifx)

set.seed(45L)
dat = data.frame(b = as.factor(sample(c("Hi", "Med", "Low"), 1e8, TRUE)),
                 x = sample(c("A", "D", "C"), 1e8, TRUE),
                 y = sample(100, 1e8, TRUE),
                 z = sample(5, 1e8, TRUE), 
                 stringsAsFactors = FALSE)

Benchmarks:

The timings reported are from running system.time(...) on these functions shown below. The timings are tabulated below (in the order of slowest to fastest).

orderBy( ~ -z + b, data = dat)     ## doBy
plyr::arrange(dat, desc(z), b)     ## plyr
arrange(dat, desc(z), b)           ## dplyr
sort(dat, f = ~ -z + b)            ## taRifx
dat[with(dat, order(-z, b)), ]     ## base R

# convert to data.table, by reference
setDT(dat)

dat[order(-z, b)]                  ## data.table, base R like syntax
setorder(dat, -z, b)               ## data.table, using setorder()
                                   ## setorder() now also works with data.frames 

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory   Memory used
# ------------------------------------------------------------
# doBy          orderBy      409.7        6.7 GB        4.7 GB
# taRifx           sort      400.8        6.7 GB        4.7 GB
# plyr          arrange      318.8        5.6 GB        3.6 GB 
# base R          order      299.0        5.6 GB        3.6 GB
# dplyr         arrange       62.7        4.2 GB        2.2 GB
# ------------------------------------------------------------
# data.table      order        6.2        4.2 GB        2.2 GB
# data.table   setorder        4.5        2.4 GB        0.4 GB
# ------------------------------------------------------------
  • data.table's DT[order(...)] syntax was ~10x faster than the fastest of other methods (dplyr), while consuming the same amount of memory as dplyr.

  • data.table's setorder() was ~14x faster than the fastest of other methods (dplyr), while taking just 0.4GB extra memory. dat is now in the order we require (as it is updated by reference).

data.table features:

Speed:

  • data.table's ordering is extremely fast because it implements radix ordering.

  • The syntax DT[order(...)] is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).

Memory:

  • Most of the times, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example:

    DF <- DF[order(...)]
    

    The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function setorder().

    setorder() reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.

Other features:

  1. It supports integer, logical, numeric, character and even bit64::integer64 types.

    Note that factor, Date, POSIXct etc.. classes are all integer/numeric types underneath with additional attributes and are therefore supported as well.

  2. In base R, we can not use - on a character vector to sort by that column in decreasing order. Instead we have to use -xtfrm(.).

    However, in data.table, we can just do, for example, dat[order(-x)] or setorder(dat, -x).


I learned about order with the following example which then confused me for a long time:

set.seed(1234)

ID        = 1:10
Age       = round(rnorm(10, 50, 1))
diag      = c("Depression", "Bipolar")
Diagnosis = sample(diag, 10, replace=TRUE)

data = data.frame(ID, Age, Diagnosis)

databyAge = data[order(Age),]
databyAge

The only reason this example works is because order is sorting by the vector Age, not by the column named Age in the data frame data.

To see this create an identical data frame using read.table with slightly different column names and without making use of any of the above vectors:

my.data <- read.table(text = '

  id age  diagnosis
   1  49 Depression
   2  50 Depression
   3  51 Depression
   4  48 Depression
   5  50 Depression
   6  51    Bipolar
   7  49    Bipolar
   8  49    Bipolar
   9  49    Bipolar
  10  49 Depression

', header = TRUE)

The above line structure for order no longer works because there is no vector named age:

databyage = my.data[order(age),]

The following line works because order sorts on the column age in my.data.

databyage = my.data[order(my.data$age),]

I thought this was worth posting given how confused I was by this example for so long. If this post is not deemed appropriate for the thread I can remove it.

EDIT: May 13, 2014

Below is a generalized way of sorting a data frame by every column without specifying column names. The code below shows how to sort from left to right or by right to left. This works if every column is numeric. I have not tried with a character column added.

I found the do.call code a month or two ago in an old post on a different site, but only after extensive and difficult searching. I am not sure I could relocate that post now. The present thread is the first hit for ordering a data.frame in R. So, I thought my expanded version of that original do.call code might be useful.

set.seed(1234)

v1  <- c(0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1)
v2  <- c(0,0,0,0, 1,1,1,1, 0,0,0,0, 1,1,1,1)
v3  <- c(0,0,1,1, 0,0,1,1, 0,0,1,1, 0,0,1,1)
v4  <- c(0,1,0,1, 0,1,0,1, 0,1,0,1, 0,1,0,1)

df.1 <- data.frame(v1, v2, v3, v4) 
df.1

rdf.1 <- df.1[sample(nrow(df.1), nrow(df.1), replace = FALSE),]
rdf.1

order.rdf.1 <- rdf.1[do.call(order, as.list(rdf.1)),]
order.rdf.1

order.rdf.2 <- rdf.1[do.call(order, rev(as.list(rdf.1))),]
order.rdf.2

rdf.3 <- data.frame(rdf.1$v2, rdf.1$v4, rdf.1$v3, rdf.1$v1) 
rdf.3

order.rdf.3 <- rdf.1[do.call(order, as.list(rdf.3)),]
order.rdf.3

Suppose you have a data.frame A and you want to sort it using column called x descending order. Call the sorted data.frame newdata

newdata <- A[order(-A$x),]

If you want ascending order then replace "-" with nothing. You can have something like

newdata <- A[order(-A$x, A$y, -A$z),]

where x and z are some columns in data.frame A. This means sort data.frame A by x descending, y ascending and z descending.


Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.

 dd <- dd[order(dd$b, decreasing = FALSE),]

Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.

dd <- dd[order(dd$z, decreasing = TRUE),]

This may not be the fastest, but it is certainly simple and reliable


For the sake of completeness: you can also use the sortByCol() function from the BBmisc package:

library(BBmisc)
sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE))
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Performance comparison:

library(microbenchmark)
microbenchmark(sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE)), times = 100000)
median 202.878

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=100000)
median 148.758

microbenchmark(dd[with(dd, order(-z, b)), ], times = 100000)
median 115.872

With this (very helpful) function by Kevin Wright, posted in the tips section of the R wiki, this is easily achieved.

sort(dd,by = ~ -z + b)
#     b x y z
# 4 Low C 9 2
# 2 Med D 3 1
# 1  Hi A 8 1
# 3  Hi A 9 1

The arrange() in dplyr is my favorite option. Use the pipe operator and go from least important to most important aspect

dd1 <- dd %>%
    arrange(z) %>%
    arrange(desc(x))

if SQL comes naturally to you, sqldf package handles ORDER BY as Codd intended.


I was struggling with the above solutions when I wanted to automate my ordering process for n columns, whose column names could be different each time. I found a super helpful function from the psych package to do this in a straightforward manner:

dfOrder(myDf, columnIndices)

where columnIndices are indices of one or more columns, in the order in which you want to sort them. More information here:

dfOrder function from 'psych' package


There are a lot of excellent answers here, but dplyr gives the only syntax that I can quickly and easily remember (and so now use very often):

library(dplyr)
# sort mtcars by mpg, ascending... use desc(mpg) for descending
arrange(mtcars, mpg)
# sort mtcars first by mpg, then by cyl, then by wt)
arrange(mtcars , mpg, cyl, wt)

For the OP's problem:

arrange(dd, desc(z),  b)

    b x y z
1 Low C 9 2
2 Med D 3 1
3  Hi A 8 1
4  Hi A 9 1

In response to a comment added in the OP for how to sort programmatically:

Using dplyr and data.table

library(dplyr)
library(data.table)

dplyr

Just use arrange_, which is the Standard Evaluation version for arrange.

df1 <- tbl_df(iris)
#using strings or formula
arrange_(df1, c('Petal.Length', 'Petal.Width'))
arrange_(df1, ~Petal.Length, ~Petal.Width)
    Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           4.6         3.6          1.0         0.2  setosa
2           4.3         3.0          1.1         0.1  setosa
3           5.8         4.0          1.2         0.2  setosa
4           5.0         3.2          1.2         0.2  setosa
5           4.7         3.2          1.3         0.2  setosa
6           5.4         3.9          1.3         0.4  setosa
7           5.5         3.5          1.3         0.2  setosa
8           4.4         3.0          1.3         0.2  setosa
9           5.0         3.5          1.3         0.3  setosa
10          4.5         2.3          1.3         0.3  setosa
..          ...         ...          ...         ...     ...


#Or using a variable
sortBy <- c('Petal.Length', 'Petal.Width')
arrange_(df1, .dots = sortBy)
    Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           4.6         3.6          1.0         0.2  setosa
2           4.3         3.0          1.1         0.1  setosa
3           5.8         4.0          1.2         0.2  setosa
4           5.0         3.2          1.2         0.2  setosa
5           4.7         3.2          1.3         0.2  setosa
6           5.5         3.5          1.3         0.2  setosa
7           4.4         3.0          1.3         0.2  setosa
8           4.4         3.2          1.3         0.2  setosa
9           5.0         3.5          1.3         0.3  setosa
10          4.5         2.3          1.3         0.3  setosa
..          ...         ...          ...         ...     ...

#Doing the same operation except sorting Petal.Length in descending order
sortByDesc <- c('desc(Petal.Length)', 'Petal.Width')
arrange_(df1, .dots = sortByDesc)

more info here: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html

It is better to use formula as it also captures the environment to evaluate an expression in

data.table

dt1 <- data.table(iris) #not really required, as you can work directly on your data.frame
sortBy <- c('Petal.Length', 'Petal.Width')
sortType <- c(-1, 1)
setorderv(dt1, sortBy, sortType)
dt1
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          7.7         2.6          6.9         2.3 virginica
  2:          7.7         2.8          6.7         2.0 virginica
  3:          7.7         3.8          6.7         2.2 virginica
  4:          7.6         3.0          6.6         2.1 virginica
  5:          7.9         3.8          6.4         2.0 virginica
 ---                                                            
146:          5.4         3.9          1.3         0.4    setosa
147:          5.8         4.0          1.2         0.2    setosa
148:          5.0         3.2          1.2         0.2    setosa
149:          4.3         3.0          1.1         0.1    setosa
150:          4.6         3.6          1.0         0.2    setosa

Dirk's answer is good but if you need the sort to persist you'll want to apply the sort back onto the name of that data frame. Using the example code:

dd <- dd[with(dd, order(-z, b)), ] 

Just for the sake of completeness, since not much has been said about sorting by column numbers... It can surely be argued that it is often not desirable (because the order of the columns could change, paving the way to errors), but in some specific situations (when for instance you need a quick job done and there is no such risk of columns changing orders), it might be the most sensible thing to do, especially when dealing with large numbers of columns.

In that case, do.call() comes to the rescue:

ind <- do.call(what = "order", args = iris[,c(5,1,2,3)])
iris[ind, ]

##        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##    14           4.3         3.0          1.1         0.1     setosa
##    9            4.4         2.9          1.4         0.2     setosa
##    39           4.4         3.0          1.3         0.2     setosa
##    43           4.4         3.2          1.3         0.2     setosa
##    42           4.5         2.3          1.3         0.3     setosa
##    4            4.6         3.1          1.5         0.2     setosa
##    48           4.6         3.2          1.4         0.2     setosa
##    7            4.6         3.4          1.4         0.3     setosa
##    (...)

Alternatively, using the package Deducer

library(Deducer)
dd<- sortData(dd,c("z","b"),increasing= c(FALSE,TRUE))

Your choices

  • order from base
  • arrange from dplyr
  • setorder and setorderv from data.table
  • arrange from plyr
  • sort from taRifx
  • orderBy from doBy
  • sortData from Deducer

Most of the time you should use the dplyr or data.table solutions, unless having no-dependencies is important, in which case use base::order.


I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here: Best way to create generic/method consistency for sort.data.frame?

Therefore, given the data.frame dd, you can sort as follows:

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(taRifx)
sort(dd, f= ~ -z + b )

If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: https://chat.stackoverflow.com/transcript/message/1094290#1094290


You can also use the arrange() function from plyr as Hadley pointed out in the above thread:

library(plyr)
arrange(dd,desc(z),b)

Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular loading the doBy package causes sort to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.

#Load each time
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(microbenchmark)

# Reload R between benchmarks
microbenchmark(dd[with(dd, order(-z, b)), ] ,
    dd[order(-dd$z, dd$b),],
    times=1000
)

Median times:

dd[with(dd, order(-z, b)), ] 778

dd[order(-dd$z, dd$b),] 788

library(taRifx)
microbenchmark(sort(dd, f= ~-z+b ),times=1000)

Median time: 1,567

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=1000)

Median time: 862

library(doBy)
microbenchmark(orderBy(~-z+b, data=dd),times=1000)

Median time: 1,694

Note that doBy takes a good bit of time to load the package.

library(Deducer)
microbenchmark(sortData(dd,c("z","b"),increasing= c(FALSE,TRUE)),times=1000)

Couldn't make Deducer load. Needs JGR console.

esort <- function(x, sortvar, ...) {
attach(x)
x <- x[with(x,order(sortvar,...)),]
return(x)
detach(x)
}

microbenchmark(esort(dd, -z, b),times=1000)

Doesn't appear to be compatible with microbenchmark due to the attach/detach.


m <- microbenchmark(
  arrange(dd,desc(z),b),
  sort(dd, f= ~-z+b ),
  dd[with(dd, order(-z, b)), ] ,
  dd[order(-dd$z, dd$b),],
  times=1000
  )

uq <- function(x) { fivenum(x)[4]}  
lq <- function(x) { fivenum(x)[2]}

y_min <- 0 # min(by(m$time,m$expr,lq))
y_max <- max(by(m$time,m$expr,uq)) * 1.05
  
p <- ggplot(m,aes(x=expr,y=time)) + coord_cartesian(ylim = c( y_min , y_max )) 
p + stat_summary(fun.y=median,fun.ymin = lq, fun.ymax = uq, aes(fill=expr))

microbenchmark plot

(lines extend from lower quartile to upper quartile, dot is the median)


Given these results and weighing simplicity vs. speed, I'd have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to sorting

Sort Array of object by object field in Angular 6 Sorting a list with stream.sorted() in Java How to sort dates from Oldest to Newest in Excel? how to sort pandas dataframe from one column Reverse a comparator in Java 8 Find the unique values in a column and then sort them pandas groupby sort within groups pandas groupby sort descending order Efficiently sorting a numpy array in descending order? Swift: Sort array of objects alphabetically

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc