I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame.So far I have tried these with no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
This question is related to
r
mydata$sCode
is a vector, it's why nrow output is NULL.mydata[mydata$sCode == 'CA',]
returns data.frame
where sCode == 'CA'
. sCode includes character. That's why sum
gives you the error.subset(mydata, sCode='CA', select=c(sCode))
, you should use sCode=='CA'
instead sCode='CA'
. Then subset returns you vector where sCode equals CA, so you should use
length(subset(na.omit(mydata), sCode='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
mydata$sCode == "CA"
will return a boolean array, with a TRUE
value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
sum(mydata$sCode == "CA")
, as suggested in the comments; because
TRUE
is interpreted as 1 and FALSE
as 0, this should return the
numer of TRUE
values in your vector.
length(which(mydata$sCode == "CA"))
; the which()
function
returns a vector of the indices where the condition is met, the
length of which is the count of "CA"
.
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which()
returns a vector identify each column where the condition is met (in this case, columns 1 and 2 of the dataframe). The length()
of this vector is the number of occurences.
With dplyr
package, Use
nrow(filter(mydata, sCode == "CA")),
All the solutions provided here gave me same error as multi-sam but that one worked.
sum
is used to add elements; nrow
is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length
is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
grep command can be used
CA = mydata[grep("CA", mydata$sCode, ]
nrow(CA)
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
Call nrow
passing as argument the name of the dataset:
nrow(dataset)
to get the number of observations the number of rows from your Dataset would be more valid:
nrow(dat[dat$sCode == "CA",])
Source: Stackoverflow.com