Standardize data columns in R

Question

I have a dataset called spam which contains 58 columns and approximately 3500 rows of data related to spam messages.

I plan on running some linear regression on this dataset in the future, but I'd like to do some pre-processing beforehand and standardize the columns to have zero mean and unit variance.

I've been told the best way to go about this is with R, so I'd like to ask: how can I achieve normalization with R? I've already got the data properly loaded and I'm just looking for some packages or methods to perform this task.

User · Answer

The normalize function from the BBmisc package was the right tool for me since it can deal with NA values.

Here is how to use it:

Given the following dataset,

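    library(data.table)
    library(BBmisc)   # provides normalize()
    library(dplyr)    # used for the hand-calculated comparison further down

    ASR_API     <- c("CV",  "F",    "IER",  "LS-c", "LS-o")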
    Human       <- c(NA,    5.8,    12.7,   NA, NA)
    Google      <- c(23.2,  24.2,   16.6,   12.1,   28.8)
    GoogleCloud <- c(23.3,  26.3,   18.3,   12.3,   27.3)
    IBM     <- c(21.8,  47.6,   24.0,   9.8,    25.3)
    Microsoft   <- c(29.1,  28.1,   23.1,   18.8,   35.9)
    Speechmatics    <- c(19.1,  38.4,   21.4,   7.3,    19.4)
    Wit_ai      <- c(35.6,  54.2,   37.4,   19.2,   41.7)
    dt     <- data.table(ASR_API,Human, Google, GoogleCloud, IBM, Microsoft, Speechmatics, Wit_ai)
> dt
   ASR_API Human Google GoogleCloud  IBM Microsoft Speechmatics Wit_ai
1:      CV    NA   23.2        23.3 21.8      29.1         19.1   35.6
2:       F   5.8   24.2        26.3 47.6      28.1         38.4   54.2
3:     IER  12.7   16.6        18.3 24.0      23.1         21.4   37.4
4:    LS-c    NA   12.1        12.3  9.8      18.8          7.3   19.2
5:    LS-o    NA   28.8        27.3 25.3      35.9         19.4   41.7

Normalized values can be obtained like this:

> dtn <- normalize(dt, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
> dtn
   ASR_API      Human     Google GoogleCloud         IBM  Microsoft Speechmatics      Wit_ai
1:      CV         NA  0.3361245   0.2893457 -0.28468670  0.3247336  -0.18127203 -0.16032655
2:       F -0.7071068  0.4875320   0.7715885  1.59862532  0.1700986   1.55068347  1.31594762
3:     IER  0.7071068 -0.6631646  -0.5143923 -0.12409420 -0.6030768   0.02512682 -0.01746131
4:    LS-c         NA -1.3444981  -1.4788780 -1.16064578 -1.2680075  -1.24018782 -1.46198764
5:    LS-o         NA  1.1840062   0.9323361 -0.02919864  1.3762521  -0.15435044  0.32382788

whereas a hand-calculated version without na.rm returns NA for every value in a column that contains NAs:

> dt %>% mutate(normalizedHuman = (Human - mean(Human))/sd(Human)) %>% 
+ mutate(normalizedGoogle = (Google - mean(Google))/sd(Google)) %>% 
+ mutate(normalizedGoogleCloud = (GoogleCloud - mean(GoogleCloud))/sd(GoogleCloud)) %>% 
+ mutate(normalizedIBM = (IBM - mean(IBM))/sd(IBM)) %>% 
+ mutate(normalizedMicrosoft = (Microsoft - mean(Microsoft))/sd(Microsoft)) %>% 
+ mutate(normalizedSpeechmatics = (Speechmatics - mean(Speechmatics))/sd(Speechmatics)) %>% 
+ mutate(normalizedWit_ai = (Wit_ai - mean(Wit_ai))/sd(Wit_ai))
  ASR_API Human Google GoogleCloud  IBM Microsoft Speechmatics Wit_ai normalizedHuman normalizedGoogle
1      CV    NA   23.2        23.3 21.8      29.1         19.1   35.6              NA        0.3361245
2       F   5.8   24.2        26.3 47.6      28.1         38.4   54.2              NA        0.4875320
3     IER  12.7   16.6        18.3 24.0      23.1         21.4   37.4              NA       -0.6631646
4    LS-c    NA   12.1        12.3  9.8      18.8          7.3   19.2              NA       -1.3444981
5    LS-o    NA   28.8        27.3 25.3      35.9         19.4   41.7              NA        1.1840062
  normalizedGoogleCloud normalizedIBM normalizedMicrosoft normalizedSpeechmatics normalizedWit_ai
1             0.2893457   -0.28468670           0.3247336            -0.18127203      -0.16032655
2             0.7715885    1.59862532           0.1700986             1.55068347       1.31594762
3            -0.5143923   -0.12409420          -0.6030768             0.02512682      -0.01746131
4            -1.4788780   -1.16064578          -1.2680075            -1.24018782      -1.46198764
5             0.9323361   -0.02919864           1.3762521            -0.15435044       0.32382788

(normalizedHuman becomes a column of NAs ...)

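Regarding the selection of specific columns for the calculation, a generic method like this one can be used (df_full and otherVarNotToBeUsed are placeholders for your own data frame and columns):

    # split off the columns that should not be normalized
    data_vars <- df_full %>% dplyr::select(-ASR_API, -otherVarNotToBeUsed)
    meta_vars <- df_full %>% dplyr::select(ASR_API, otherVarNotToBeUsed)
    # normalize the remaining columns, then put everything back together
    data_varsn <- normalize(data_vars, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
    dtn <- cbind(meta_vars, data_varsn)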
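For comparison, the NA-aware result for a single column can also be sketched in base R by passing na.rm = TRUE to mean and sd. The helper name standardize_na is made up for this illustration and is not part of BBmisc; the data are the Human and Google columns from above.

```r
# Hypothetical helper (not part of BBmisc): z-score that skips NAs
# when computing the mean and the standard deviation.
standardize_na <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Human  <- c(NA, 5.8, 12.7, NA, NA)
Google <- c(23.2, 24.2, 16.6, 12.1, 28.8)

human_z  <- standardize_na(Human)    # NA entries stay NA, the rest are scaled
google_z <- standardize_na(Google)   # no NAs, so this is the plain z-score

print(round(human_z, 7))
print(round(google_z, 7))
```

The two non-missing Human values come out as -0.7071068 and 0.7071068, matching the normalize output shown earlier.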

User · Answer

Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.

scale is limited by the fact that it scales all variables. The solution below allows you to scale only specific variable names while preserving the other variables unchanged (and the variable names could be dynamically generated):

    library(dplyr)

    set.seed(1234)
    dat <- data.frame(x = rnorm(10, 30, .2),
                      y = runif(10, 3, 5),
                      z = runif(10, 10, 20))
    dat

    dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
    dat2

which gives me this:

    > dat
              x        y        z
    1  29.75859 3.633225 14.56091
    2  30.05549 3.605387 12.65187
    3  30.21689 3.318092 13.04672
    4  29.53086 3.079992 15.07307
    5  30.08582 3.437599 11.81096
    6  30.10121 4.621197 17.59671
    7  29.88505 4.051395 12.01248
    8  29.89067 4.829316 12.58810
    9  29.88711 4.662690 19.92150
    10 29.82199 3.091541 18.07352

and

    > dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
    > dat2
              x          y           z
    1  29.75859 -0.3004815 -0.06016029
    2  30.05549 -0.3423437 -0.72529604
    3  30.21689 -0.7743696 -0.58772361
    4  29.53086 -1.1324181  0.11828039
    5  30.08582 -0.5946582 -1.01827752
    6  30.10121  1.1852038  0.99754666
    7  29.88505  0.3283513 -0.94806607
    8  29.89067  1.4981677 -0.74751378
    9  29.88711  1.2475998  1.80753470
    10 29.82199 -1.1150515  1.16367556

EDIT 1 (2016): Addressed Julian's comment: the output of scale is an Nx1 matrix, so ideally we should add as.vector to convert the matrix type back into a vector type. Thanks Julian!

EDIT 2 (2019): Quoting Duccio A.'s comment: For the latest dplyr (version 0.8) you need to change dplyr::funs with list, like dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars = c("y", "z"))

EDIT 3 (2020): Thanks to @mj_whales: the old solution is deprecated and now we need to use mutate_at.
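In dplyr 1.0.0 and later the scoped verbs such as mutate_at are superseded by across(). A minimal sketch of the same idea in that style, assuming dplyr >= 1.0.0 is installed; the variable name cols_to_scale is only illustrative:

```r
library(dplyr)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20))

# Names of the columns to standardize; this vector could be built dynamically.
cols_to_scale <- c("y", "z")

# scale() returns an N x 1 matrix, so as.vector() turns it back into a plain vector.
dat2 <- dat %>%
  mutate(across(all_of(cols_to_scale), ~ as.vector(scale(.x))))

print(colMeans(dat2))
```

Column x is left untouched, while y and z end up with mean 0 and standard deviation 1.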

User · Answer

The dplyr package has two functions that do this.

    > require(dplyr)
    > require(data.table)   # the examples below build data.tables

To mutate specific columns of a data table, you can use the function mutate_at(). To mutate all columns, you can use mutate_all.

The following is a brief example for using these functions to standardize data.

Mutate specific columns:

    dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
    dt = data.table(dt %>% mutate_at(vars("a", "c"), scale)) # can also index columns by number, e.g., vars(c(1,3))

    > apply(dt, 2, mean)
                a             b             c
     1.783137e-16  5.064855e-01 -5.245395e-17

    > apply(dt, 2, sd)
            a         b         c
    1.0000000 0.2906622 1.0000000

Mutate all columns:

    dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
    dt = data.table(dt %>% mutate_all(scale))

    > apply(dt, 2, mean)
                a             b             c
    -1.728266e-16  9.291994e-17  1.683551e-16

    > apply(dt, 2, sd)
    a b c
    1 1 1

User · Answer

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric, you can simply call the scale function on the data to do what you want.

    dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
    scaled.dat <- scale(dat)

    # check that we get mean of 0 and sd of 1
    colMeans(scaled.dat)  # faster version of apply(scaled.dat, 2, mean)
    apply(scaled.dat, 2, sd)

Using built-in functions is classy.
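Since the question mentions a later regression, it may help to know that scale stores the column means and standard deviations it used as the attributes "scaled:center" and "scaled:scale". The sketch below reuses them to standardize a second data set with the same parameters; the train and test data frames are invented for the illustration.

```r
set.seed(1)
train <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
test  <- data.frame(x = rnorm(5, 30, .2),  y = runif(5, 3, 5))

# scale() returns a matrix and records the parameters it used as attributes.
scaled_train <- scale(train)
centers <- attr(scaled_train, "scaled:center")   # column means of train
scales  <- attr(scaled_train, "scaled:scale")    # column sds of train

# Apply the training parameters to the new data instead of re-estimating them.
scaled_test <- scale(test, center = centers, scale = scales)

print(centers)
print(scales)
```

Here scaled_test will generally not have mean exactly 0 and sd exactly 1, because it is expressed in units of the training data.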

User · Answer

BBKim pretty much gave the best answer, but it can just be done shorter. I'm surprised no one came up with it yet.

    dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
    dat <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
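One thing to be aware of with apply is that it returns a matrix rather than a data frame. A small variation that keeps the data frame, sketched here with lapply and the dat[] assignment idiom from base R:

```r
set.seed(42)
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))

# apply() coerces the data frame to a matrix and returns a matrix.
as_matrix <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))

# Assigning into dat[] replaces the columns but keeps the data.frame class.
dat[] <- lapply(dat, function(x) (x - mean(x)) / sd(x))

print(class(as_matrix))
print(class(dat))
```

Both versions produce the same numbers; only the container differs.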

User · Answer

The caret package provides methods for preprocessing data (e.g. centering and scaling). You could also use the following code:

    library(caret)
    # Assuming goal class is column 10
    preObj <- preProcess(data[, -10], method = c("center", "scale"))
    newData <- predict(preObj, data[, -10])

More details: http://www.inside-r.org/node/86978

User · Answer

Again, even though this is an old question, it is very relevant! And I have found a simple way to normalise certain columns without the need of any packages:

    normFunc <- function(x){(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}

For example:

    x <- rnorm(10, 14, 2)
    y <- rnorm(10, 7, 3)
    z <- rnorm(10, 18, 5)
    df <- data.frame(x, y, z)

    df[2:3] <- apply(df[2:3], 2, normFunc)

You will see that the y and z columns have been normalised. No packages needed :-)

User · Answer

The collapse package provides a fast scaling function, fscale, implemented in C++ using Welford's online algorithm. In the benchmark below it is roughly an order of magnitude faster than scale:

    dat <- data.frame(x = rnorm(1e6, 30, .2),
                      y = runif(1e6, 3, 5),
                      z = runif(1e6, 10, 20))

    library(collapse)
    library(microbenchmark)
    microbenchmark(fscale(dat), scale(dat))

    Unit: milliseconds
            expr       min       lq      mean    median        uq      max neval cld
     fscale(dat)  27.86456  29.5864  38.96896  30.80421  43.79045 313.5729   100  a
      scale(dat) 357.07130 391.0914 489.93546 416.33626 625.38561 793.2243   100   b

Furthermore, fscale is an S3 generic for vectors, matrices and data frames, and also supports grouped and/or weighted scaling operations, as well as scaling to arbitrary means and standard deviations.
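To make the reference to Welford's online algorithm concrete, here is a plain-R sketch of the idea: the mean and the sum of squared deviations are updated in a single pass over the data. This is an illustration only, not the collapse implementation, and the function name welford_scale is invented.

```r
# Illustrative only: single-pass (Welford) mean and variance, then z-scores.
welford_scale <- function(x) {
  n <- 0        # number of values seen so far
  mean_x <- 0   # running mean
  m2 <- 0       # running sum of squared deviations from the mean
  for (value in x) {
    n <- n + 1
    delta <- value - mean_x
    mean_x <- mean_x + delta / n
    m2 <- m2 + delta * (value - mean_x)
  }
  sd_x <- sqrt(m2 / (n - 1))   # sample standard deviation (n - 1 denominator)
  (x - mean_x) / sd_x
}

set.seed(7)
x <- rnorm(100, 30, .2)
z <- welford_scale(x)
print(all.equal(z, as.vector(scale(x))))
```

A loop like this is slow in interpreted R; the point is only to show the update rule that a compiled implementation can run in one pass.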

User · Answer

Use the package "recommenderlab". Download and install the package. It has a built-in normalize command, which lets you choose between normalization methods, namely "center" or "Z-score". See the following example:

    library(recommenderlab)

    ## create a matrix with ratings
    m <- matrix(sample(c(NA, 0:5), 50, replace = TRUE, prob = c(.5, rep(.5/6, 6))),
                nrow = 5, ncol = 10,
                dimnames = list(users = paste('u', 1:5, sep = ''),
                                items = paste('i', 1:10, sep = '')))

    ## do normalization
    r <- as(m, "realRatingMatrix")
    # here, "center" is the default method
    r_n1 <- normalize(r)
    # here, "Z-score" is the method used
    r_n2 <- normalize(r, method = "Z-score")

    r
    r_n1
    r_n2

    ## show normalized data
    image(r, main = "Raw Data")
    image(r_n1, main = "Centered")
    image(r_n2, main = "Z-Score Normalization")

User · Answer

You can also easily normalize the data using the data.Normalization function in the clusterSim package. It provides different methods of data normalization.

    data.Normalization(x, type = "n0", normalization = "column")

Arguments

x: vector, matrix or dataset

type: type of normalization:
n0 - without normalization
n1 - standardization ((x-mean)/sd)
n2 - positional standardization ((x-median)/mad)
n3 - unitization ((x-mean)/range)
n3a - positional unitization ((x-median)/range)
n4 - unitization with zero minimum ((x-min)/range)
n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))
n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))
n6 - quotient transformation (x/sd)
n6a - positional quotient transformation (x/mad)
n7 - quotient transformation (x/range)
n8 - quotient transformation (x/max)
n9 - quotient transformation (x/mean)
n9a - positional quotient transformation (x/median)
n10 - quotient transformation (x/sum)
n11 - quotient transformation (x/sqrt(SSQ))
n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))
n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))
n13 - normalization with zero being the central point ((x-midrange)/(range/2))

normalization: "column" - normalization by variable, "row" - normalization by object
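To give a feel for what a few of these formulas do, here is a base-R sketch of n1, n4 and n5 written out by hand for a single vector. It does not call clusterSim, and it assumes that "range" in the list above means max(x) - min(x); treat it as an illustration of the formulas rather than a reproduction of the package's output.

```r
x <- c(2, 4, 6, 8, 10)

# n1: standardization, (x - mean) / sd
n1 <- (x - mean(x)) / sd(x)

# n4: unitization with zero minimum, (x - min) / range,
# assuming range = max(x) - min(x)
n4 <- (x - min(x)) / (max(x) - min(x))

# n5: normalization in range <-1,1>, (x - mean) / max(abs(x - mean))
n5 <- (x - mean(x)) / max(abs(x - mean(x)))

print(n4)
print(n5)
```

For this vector n4 runs from 0 to 1 in steps of 0.25 and n5 runs from -1 to 1 in steps of 0.5, while n1 is the usual z-score.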

User · Answer

With dplyr v0.7.4 all variables can be scaled by using mutate_all():

    library(dplyr)
    #>
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #>
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #>
    #>     intersect, setdiff, setequal, union
    library(tibble)

    set.seed(1234)
    dat <- tibble(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20))

    dat %>% mutate_all(scale)
    #> # A tibble: 10 x 3
    #>         x      y       z
    #>     <dbl>  <dbl>   <dbl>
    #>  1 -0.827 -0.300 -0.0602
    #>  2  0.663 -0.342 -0.725
    #>  3  1.47  -0.774 -0.588
    #>  4 -1.97  -1.13   0.118
    #>  5  0.816 -0.595 -1.02
    #>  6  0.893  1.19   0.998
    #>  7 -0.192  0.328 -0.948
    #>  8 -0.164  1.50  -0.748
    #>  9 -0.182  1.25   1.81
    #> 10 -0.509 -1.12   1.16

Specific variables can be excluded using mutate_at():

    dat %>% mutate_at(scale, .vars = vars(-x))
    #> # A tibble: 10 x 3
    #>        x      y       z
    #>    <dbl>  <dbl>   <dbl>
    #>  1  29.8 -0.300 -0.0602
    #>  2  30.1 -0.342 -0.725
    #>  3  30.2 -0.774 -0.588
    #>  4  29.5 -1.13   0.118
    #>  5  30.1 -0.595 -1.02
    #>  6  30.1  1.19   0.998
    #>  7  29.9  0.328 -0.948
    #>  8  29.9  1.50  -0.748
    #>  9  29.9  1.25   1.81
    #> 10  29.8 -1.12   1.16

Created on 2018-04-24 by the reprex package (v0.2.0).

User · Answer

scale can be used for both a full data frame and specific columns. For specific columns, the following code can be used:

    trainingSet[, 3:7] = scale(trainingSet[, 3:7]) # For columns 3 to 7
    trainingSet[, 8] = scale(trainingSet[, 8]) # For column 8

Full data frame:

    trainingSet <- scale(trainingSet)

User · Answer

When I used the solution stated by Dason, instead of getting a data frame as a result, I got a plain matrix of numbers (the scaled values of my df).

In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:

    df.scaled <- as.data.frame(scale(df))

I hope this will be useful for people having the same issue!

User · Answer

Before I happened to find this thread, I had the same problem. I had user-dependent column types, so I wrote a for loop going through them and getting the needed columns scaled. There are probably better ways to do it, but this solved the problem just fine:

    for (i in 1:length(colnames(df))) {
        if (class(df[, i]) == "numeric" || class(df[, i]) == "integer") {
            df[, i] <- as.vector(scale(df[, i]))
        }
    }

as.vector is a needed part, because it turned out scale returns an N x 1 matrix (with row names), which is usually not what you want to have in your data.frame.
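The same selection of numeric columns can be sketched without an explicit loop, using is.numeric (which is TRUE for both integer and double columns) together with lapply. The example data frame is invented for the illustration:

```r
df <- data.frame(id    = c("a", "b", "c", "d"),
                 count = c(1L, 5L, 9L, 13L),
                 score = c(2.5, 3.5, 1.0, 4.0),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric (integer or double).
is_num <- vapply(df, is.numeric, logical(1))

# Scale only those columns; as.vector() drops the N x 1 matrix shape.
df[is_num] <- lapply(df[is_num], function(col) as.vector(scale(col)))

print(is_num)
print(df)
```

The id column is left as it was, and count and score become plain numeric vectors with mean 0 and standard deviation 1.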

User · Answer

This is 3 years old. Still, I feel I have to add the following:

The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean = 0 and sd = 1.

For that, you don't need any package.

    zVar <- (myVar - mean(myVar)) / sd(myVar)

That's it.
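As a small worked example of that formula, take the made-up values 2, 4 and 6: the mean is 4 and the sample standard deviation (R's sd uses the n - 1 denominator) is 2, so the z-scores are -1, 0 and 1.

```r
myVar <- c(2, 4, 6)

# mean(myVar) is 4 and sd(myVar) is 2 (n - 1 denominator)
zVar <- (myVar - mean(myVar)) / sd(myVar)

print(zVar)   # -1 0 1
```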
