[r] Understanding the order() function

I'm trying to understand how the order() function works. I was under the impression that it returned a permutation of indices, which when sorted, would sort the original vector.

For instance,

> a <- c(45,50,10,96)
> order(a)
[1] 3 1 2 4

I would have expected this to return c(2, 3, 1, 4), since the list sorted would be 10 45 50 96.

Can someone help me understand the return value of this function?


The answer is


To sort a 1D vector or a single column of data, just call the sort function and pass in your sequence.

On the other hand, the order function is needed to sort two-dimensional data, i.e., multiple columns of data collected in a matrix or dataframe.

    Stadium Home Week Qtr Away Off Def Result       Kicker Dist
751     Out  PHI   14   4  NYG PHI NYG   Good      D.Akers   50
491     Out   KC    9   1  OAK OAK  KC   Good S.Janikowski   32
702     Out  OAK   15   4  CLE CLE OAK   Good     P.Dawson   37
571     Out   NE    1   2  OAK OAK  NE Missed S.Janikowski   43
654     Out  NYG   11   2  PHI NYG PHI   Good      J.Feely   26
307     Out  DEN   14   2  BAL DEN BAL   Good       J.Elam   48
492     Out   KC   13   3  DEN  KC DEN   Good      L.Tynes   34
691     Out  NYJ   17   3  BUF NYJ BUF   Good     M.Nugent   25
164     Out  CHI   13   2   GB CHI  GB   Good      R.Gould   25
80      Out  BAL    1   2  IND IND BAL   Good M.Vanderjagt   20

Here is an excerpt of data for field goal attempts in the 2008 NFL season, a dataframe I've called 'fg'. Suppose that these 10 data points represent all of the field goals attempted in 2008. Further suppose you want to know the distance of the longest field goal attempted that year, who kicked it, and whether it was good or not. You also want to know the same for the second-longest, the third-longest, and so on, down to the shortest field goal attempt.

Well, you could just do this:

sort(fg$Dist, decreasing=TRUE)

which returns: 50 48 43 37 34 32 26 25 25 20

That is correct, but not very useful. It does tell us the distance of the longest field goal attempt, the second-longest, and so on down to the shortest, but that's all we know; e.g., we don't know who the kicker was, whether the attempt was successful, etc. What we need is the entire dataframe sorted on the "Dist" column (put another way, we want to sort all of the data rows on the single attribute Dist). That would look like this:

    Stadium Home Week Qtr Away Off Def Result       Kicker Dist
751     Out  PHI   14   4  NYG PHI NYG   Good      D.Akers   50
307     Out  DEN   14   2  BAL DEN BAL   Good       J.Elam   48
571     Out   NE    1   2  OAK OAK  NE Missed S.Janikowski   43
702     Out  OAK   15   4  CLE CLE OAK   Good     P.Dawson   37
492     Out   KC   13   3  DEN  KC DEN   Good      L.Tynes   34
491     Out   KC    9   1  OAK OAK  KC   Good S.Janikowski   32
654     Out  NYG   11   2  PHI NYG PHI   Good      J.Feely   26
691     Out  NYJ   17   3  BUF NYJ BUF   Good     M.Nugent   25
164     Out  CHI   13   2   GB CHI  GB   Good      R.Gould   25
80      Out  BAL    1   2  IND IND BAL   Good M.Vanderjagt   20

This is what order does. It is 'sort' for two-dimensional data; put another way, it returns a 1D integer index made up of the row numbers, such that re-arranging the rows according to that index gives a correct row-oriented sort on the column Dist.

Here's how it works. Above, sort was used to sort the Dist column; to sort the entire dataframe on the Dist column, we use 'order' exactly the same way as 'sort' is used above:

ndx = order(fg$Dist, decreasing=TRUE)

(I usually bind the array returned from 'order' to the variable 'ndx', which stands for 'index', because I am going to use it as an index array to sort.)

That was step 1; here's step 2:

'ndx', which is what 'order' returned, is then used as an index array to re-order the dataframe, 'fg':

fg_sorted = fg[ndx,]

fg_sorted is the re-ordered dataframe immediately above.

In sum, 'order' is used to create an index array (which specifies the sort order of the column you want sorted), and that index array is then used to re-order the dataframe (or matrix).
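
The two steps can be tried on a small stand-in for 'fg'. The sketch below is only an illustration: it rebuilds three columns (Kicker, Dist, Result) from the first five rows of the excerpt above, and the name fg_small is made up for this example.

```r
# A cut-down stand-in for 'fg' (hypothetical name fg_small), using the
# Kicker, Dist and Result values from the first five rows of the excerpt.
fg_small <- data.frame(
  Kicker = c("D.Akers", "S.Janikowski", "P.Dawson", "S.Janikowski", "J.Feely"),
  Dist   = c(50, 32, 37, 43, 26),
  Result = c("Good", "Good", "Good", "Missed", "Good")
)

# Step 1: order() returns row positions, longest attempt first.
ndx <- order(fg_small$Dist, decreasing = TRUE)

# Step 2: use those positions as a row index; every column moves together.
fg_small_sorted <- fg_small[ndx, ]
print(fg_small_sorted)
```

Because the whole row is selected by position, the kicker and result stay attached to each distance, which is exactly what a plain sort() on the Dist column cannot do.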


Running this little piece of code allowed me to understand the order function

x <- c(3, 22, 5, 1, 77)

cbind(
  index=seq_along(x),
  rank=rank(x),
  x, 
  order=order(x), 
  sort=sort(x)
)

     index rank  x order sort
[1,]     1    2  3     4    1
[2,]     2    4 22     1    3
[3,]     3    3  5     3    5
[4,]     4    1  1     2   22
[5,]     5    5 77     5   77

Reference: http://r.789695.n4.nabble.com/I-don-t-understand-the-order-function-td4664384.html


This could help you at some point.

a <- c(45,50,10,96)
a[order(a)]

What you get is

[1] 10 45 50 96

This code uses order(a) as an index into "a" itself, so it returns every element of "a", arranged from the lowest to the highest value.


rank() and order() are similar, but not the same.

set.seed(0)
x<-matrix(rnorm(10),1)

# one can compute from the other
rank(x)  == col(x)%*%diag(length(x))[order(x),]
order(x) == col(x)%*%diag(length(x))[rank(x),]
# rank can be used to sort
sort(x) == x%*%diag(length(x))[rank(x),]
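
The same relationships can also be checked without matrix algebra. This is a minimal sketch, assuming x has no ties: in that case order() and rank() are inverse permutations of each other. The variable names o, r, inv and s are chosen here for illustration.

```r
set.seed(0)
x <- rnorm(10)

o <- order(x)
r <- rank(x)

# Applying order() to the order vector recovers the ranks (no ties assumed).
all(order(o) == r)

# Building the inverse permutation by assignment gives the same result.
inv <- integer(length(x))
inv[o] <- seq_along(x)
all(inv == r)

# Ranks say where each value lands in the sorted vector.
s <- numeric(length(x))
s[r] <- x
all(s == sort(x))
```

Each of the three comparisons should come back TRUE for this x.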

(I thought it might be helpful to lay out the ideas very simply here to summarize the good material posted by @doug, & linked by @duffymo; +1 to each,btw.)

?order tells you which element of the original vector needs to be put first, second, etc., so as to sort the original vector, whereas ?rank tells you which element has the lowest, second lowest, etc., value. For example:

> a <- c(45, 50, 10, 96)
> order(a)  
[1] 3 1 2 4  
> rank(a)  
[1] 2 3 1 4  

So order(a) is saying, 'put the third element first when you sort... ', whereas rank(a) is saying, 'the first element is the second lowest... '. (Note that they both agree on which element is lowest, etc.; they just present the information differently.) Thus we see that we can use order() to sort, but we can't use rank() that way:

> a[order(a)]  
[1] 10 45 50 96  
> sort(a)  
[1] 10 45 50 96  
> a[rank(a)]  
[1] 50 10 45 96  

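In general, order() will not equal rank(). They do coincide when the vector has been sorted already (more precisely, whenever the ordering permutation is its own inverse, as in c(2, 1, 3)):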

> b <- sort(a)  
> order(b)==rank(b)  
[1] TRUE TRUE TRUE TRUE  

Also, since order() is (essentially) operating over ranks of the data, you could compose them without affecting the information, but the other way around does not recover the ranks (rank(order(a)) simply returns order(a) again):

> order(rank(a))==order(a)  
[1] TRUE TRUE TRUE TRUE  
> rank(order(a))==rank(a)  
[1] FALSE FALSE FALSE  TRUE  

In simple words, order() gives the locations of elements of increasing magnitude.

For example, order(c(10,20,30)) will give 1,2,3 and order(c(30,20,10)) will give 3,2,1.
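
One case the examples above do not cover is ties. A minimal sketch, using the two 25-yard attempts and the 20-yard attempt from the field goal excerpt (the vector names dist and kicker are made up here): with a single key, tied values keep their original relative order, and passing a second vector to order() breaks the ties.

```r
dist   <- c(25, 20, 25)
kicker <- c("R.Gould", "M.Vanderjagt", "M.Nugent")

# Single key: the two 25s stay in their original relative order.
by_dist <- order(dist)
print(by_dist)         # 2 1 3

# Second key: ties on dist are broken alphabetically by kicker.
by_both <- order(dist, kicker)
print(by_both)         # 2 3 1

print(kicker[by_both])
```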

