[r] Add an index (numeric ID) column to large data frame

I have a read large csv file into a data frame. Data in the csv file are from multiple web sites representing user information. For example here is the structure of the data frame.

user_id, number_of_logins, number_of_images, web
001, 34, 3, aa.com
002, 4, 4, aa.com
034, 3, 3, aa.com
001, 12, 4, bb.com
002, 1, 3, bb.com
034, 2, 2, cc.com

as you can see once I bring the data into the data frame user_id is no longer a unique id and this causes all the analysis. I am trying to add another columns prior to user_id which is something like "generated_uid" and pretty much use the index of the data.frame to be filled by that column. What's the best way to accomplish this.

This question is related to r dataframe

The answer is


Using alternative dplyr package:

library("dplyr") # or library("tidyverse")

df <- df %>% mutate(id = row_number())

Well, if I understand you correctly. You can do something like the following.

To show it, I first create a data.frame with your example

df <- 
scan(what = character(), sep = ",", text =
"001, 34, 3, aa.com
002, 4, 4, aa.com
034, 3, 3, aa.com
001, 12, 4, bb.com
002, 1, 3, bb.com
034, 2, 2, cc.com")

df <- as.data.frame(matrix(df, 6, 4, byrow = TRUE))
colnames(df) <- c("user_id", "number_of_logins", "number_of_images", "web")  

You can then run one of the following lines to add a column (at the end of the data.frame) with the row number as the generated user id. The second lines simply adds leading zeros.

df$generated_uid  <- 1:nrow(df)
df$generated_uid2 <- sprintf("%03d", 1:nrow(df))

If you absolutely want the generated user id to be the first column, you can add the column like so:

df <- cbind("generated_uid3" = sprintf("%03d", 1:nrow(df)), df)

or simply rearrage the columns.


If your data.frame is a data.table, you can use special symbol .I:

data[, ID := .I]