# Apply function to each column in a data frame observing each column's existing data type

56

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:

``````
apply(t, 2, max, na.rm = TRUE)
``````

It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as `" -99.5"`.

I then tried this:

``````
sapply(t, max, na.rm = TRUE)
``````

but it complains about max not meaningful for factors. (`lapply` is the same.) What is confusing me is that `apply` thought `max` was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.

BTW, I took a look at "Using sapply on vector of POSIXct", and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.

This question is tagged with `r` `apply` `sapply`

41

If it were an ordered factor, things would be different. That is not to say I like ordered factors (I don't), only that some relationships are defined for ordered factors that are not defined for plain factors. Factors are treated as ordinary categorical variables. What you are seeing from `apply` is the sort order of the labels as character strings, which is lexical order for your locale. If you want automatic coercion to numeric for every column, dates and factors and all, then try:

``````
sapply(df, function(x) max(as.numeric(x)))   # not generally a useful result
``````

Or, if you want to test for factors first and convert their labels to numbers before taking the max:

``````
sapply(df, function(x) if (is.factor(x)) {
  # convert the labels, not the internal integer codes
  max(as.numeric(as.character(x)))
} else {
  max(x)
})
``````

@Darren's comment does work better:

``````
sapply(df, function(x) max(as.character(x)))
``````

`max` does succeed with character vectors.
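To make the difference concrete, here is a minimal sketch. The vector `f` is made-up illustration data, not from the question. It should show `max` failing on a plain factor and succeeding once the labels are converted to character:

```r
# Hypothetical example column; not from the original question
f <- factor(c("apple", "zebra", "mango"))

# max() on an unordered factor raises an error, which is caught here
res_factor <- tryCatch(max(f), error = function(e) "error")

# Converting to character first compares the labels as strings
res_char <- max(as.character(f))

print(res_factor)
print(res_char)
```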

19

The reason that `max` works with `apply` is that `apply` is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. `sapply` is just a wrapper for `lapply`, so it is not surprising that both yield the same error.
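A small sketch of that coercion, using a made-up two-column data frame (`df_demo` is a hypothetical name for illustration). Calling `as.matrix` directly should show roughly what `apply` sees:

```r
# Hypothetical mixed-type data frame for illustration
df_demo <- data.frame(name = c("ant", "zebra"), value = c(-99.5, 3),
                      stringsAsFactors = TRUE)

# apply() converts a data frame with as.matrix() before doing anything else
m <- as.matrix(df_demo)
print(typeof(m))        # the whole matrix is one type
print(m[, "value"])     # the numbers have become strings

# Inside the data frame, each column still has its own class
col_types <- sapply(df_demo, class)
print(col_types)
```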

In versions of R before 4.0.0, the default behavior when you create a data frame is for character columns to be stored as factors (since R 4.0.0 the default is `stringsAsFactors = FALSE`). Unless you specify that a factor is ordered, operations like `max` and `min` are undefined, since R assumes you've created an unordered factor.

You can change this behavior with `options(stringsAsFactors = FALSE)`, which changes the default for the entire session, or by passing `stringsAsFactors = FALSE` in the `data.frame()` call itself. Note that the columns are then plain character vectors, so `min` and `max` simply use "alphabetical" ordering.
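As a rough illustration of the per-call form (the data frame `d_chr` and its values are invented for this sketch):

```r
# Hypothetical data; stringsAsFactors is set explicitly so the result
# does not depend on the default of the R version in use
d_chr <- data.frame(animal = c("zebra", "ant", "mole"),
                    weight = c(300, 0.01, 0.1),
                    stringsAsFactors = FALSE)

print(class(d_chr$animal))         # character, not factor

max_animal <- max(d_chr$animal)    # alphabetical maximum of the strings
max_weight <- max(d_chr$weight)    # ordinary numeric maximum
print(max_animal)
print(max_weight)
```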

Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
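For completeness, a sketch of what a manually ordered factor looks like. The `sizes` vector and its levels are assumptions made up for illustration:

```r
# Hypothetical sizes with an explicit level order
sizes <- factor(c("medium", "small", "large"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# max() is defined for ordered factors and follows the level order,
# not the alphabet (alphabetically, "small" would come last)
largest <- max(sizes)
print(as.character(largest))
```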

Regardless, `sapply` will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:

``````
# Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
                v3 = rnorm(10), v4 = LETTERS[1:10], stringsAsFactors = TRUE)

d[4,] <- NA