[r] how to realize countifs function (excel) in R

I have a dataset containing 100000 rows of data. I tried to do some countif operations in Excel, but it was prohibitively slow. So I am wondering if this kind of operation can be done in R? Basically, I want to do a count based on multiple conditions. For example, I can count on both occupation and sex

row sex occupation
  1   M    Student
  2   F    Analyst
  2   M    Analyst

This question is related to r

The answer is


Given a dataset

df <- data.frame( sex = c('M', 'M', 'F', 'F', 'M'), 
                  occupation = c('analyst', 'dentist', 'dentist', 'analyst', 'cook') )

you can subset rows

df[df$sex == 'M',] # To get all males
df[df$occupation == 'analyst',] # All analysts

etc.

If you want to get number of rows, just call the function nrow such as

nrow(df[df$sex == 'M',])

library(matrixStats)
> data <- rbind(c("M", "F", "M"), c("Student", "Analyst", "Analyst"))
> rowCounts(data, value = 'M') # output = 2 0
> rowCounts(data, value = 'F') # output = 1 0

Here an example with 100000 rows (occupations are set here from A to Z):

> a = data.frame(sex=sample(c("M", "F"), 100000, replace=T), occupation=sample(LETTERS, 100000, replace=T))
> sum(a$sex == "M" & a$occupation=="A")
[1] 1882

returns the number of males with occupation "A".

EDIT

As I understand from your comment, you want the counts of all possible combinations of sex and occupation. So first create a dataframe with all combinations:

combns = expand.grid(c("M", "F"), LETTERS)

and loop with apply to sum for your criteria and append the results to combns:

combns = cbind (combns, apply(combns, 1, function(x)sum(a$sex==x[1] & a$occupation==x[2])))
colnames(combns) = c("sex", "occupation", "count")

The first rows of your result look as follows:

  sex occupation count
1   M          A  1882
2   F          A  1869
3   M          B  1866
4   F          B  1904
5   M          C  1979
6   F          C  1910

Does this solve your problem?

OR:

Much easier solution suggested by thelatemai:

table(a$sex, a$occupation)


       A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
  F 1869 1904 1910 1907 1894 1940 1964 1907 1918 1892 1962 1933 1886 1960 1972
  M 1882 1866 1979 1904 1895 1845 1946 1905 1999 1994 1933 1950 1876 1856 1911

       P    Q    R    S    T    U    V    W    X    Y    Z
  F 1908 1907 1883 1888 1943 1922 2016 1962 1885 1898 1889
  M 1928 1938 1916 1927 1972 1965 1946 1903 1965 1974 1906

Table is the obvious choice, but it returns an object of class table which takes a few annoying steps to transform back into a data.frame So, if you're OK using dplyr, you use the command tally:

    library(dplyr)
    df = data.frame(sex=sample(c("M", "F"), 100000, replace=T), occupation=sample(c('Analyst', 'Student'), 100000, replace=T)
    df %>% group_by_all() %>% tally()


# A tibble: 4 x 3
# Groups:   sex [2]
  sex   occupation `n()`
  <fct> <fct>      <int>
1 F     Analyst    25105
2 F     Student    24933
3 M     Analyst    24769
4 M     Student    25193

Easy peasy. Your data frame will look like this:

df <- data.frame(sex=c('M','F','M'),
                 occupation=c('Student','Analyst','Analyst'))

You can then do the equivalent of a COUNTIF by first specifying the IF part, like so:

df$sex == 'M'

This will give you a boolean vector, i.e. a vector of TRUE and FALSE. What you want is to count the observations for which the condition is TRUE. Since in R TRUE and FALSE double as 1 and 0 you can simply sum() over the boolean vector. The equivalent of COUNTIF(sex='M') is therefore

sum(df$sex == 'M')

Should there be rows in which the sex is not specified the above will give back NA. In that case, if you just want to ignore the missing observations use

sum(df$sex == 'M', na.rm=TRUE)