[r] How to specify names of columns for x and y when joining in dplyr?

I have two data frames that I want to join using dplyr. One is a data frame containing first names.

test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
                        stringsAsFactors = FALSE)

The other data frame contains a cleaned up version of the Kantrowitz names corpus, identifying gender. Here is a minimal example:

kantrowitz <- structure(list(name = c("john", "bill", "madison", "abby", "thomas"), gender = c("M", "either", "M", "either", "M")), .Names = c("name", "gender"), row.names = c(NA, 5L), class = c("tbl_df", "tbl", "data.frame"))

I essentially want to look up the gender of the name from the test_data table using the kantrowitz table. Because I'm going to abstract this into a function encode_gender, I won't know the name of the column in the data set that's going to be used, and so I can't guarantee that it will be name, as in kantrowitz$name.

In base R I would perform the merge this way:

merge(test_data, kantrowitz, by.x = "first_names", by.y = "name", all.x = TRUE)

That returns the correct output:

  first_name gender
1       abby either
2       bill either
3       john      M
4    madison      M
5        zzz   <NA>

But I want to do this in dplyr because I'm using that package for all my other data manipulation. The dplyr by option to the various *_join functions only lets me specify one column name, but I need to specify two. I'm looking for something like this:

library(dplyr)
# either
left_join(test_data, kantrowitz, by.x = "first_name", by.y = "name")
# or
left_join(test_data, kantrowitz, by = c("first_name", "name"))

What is the way to perform this kind of join using dplyr?

(Never mind that the Kantrowitz corpus is a bad way to identify gender. I'm working on a better implementation, but I want to get this working first.)

This question is related to r join left-join dplyr

The answer is


This feature has been added in dplyr v0.3. You can now pass a named character vector to the by argument in left_join (and other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:

left_join(test_data, kantrowitz, by = c("first_name" = "name"))

This is more a workaround than a real solution. You can create a new object test_data with another column name:

left_join("names<-"(test_data, "name"), kantrowitz, by = "name")

     name gender
1    john      M
2    bill either
3 madison      M
4    abby either
5     zzz   <NA>

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to join

Pandas Merging 101 pandas: merge (join) two data frames on multiple columns How to use the COLLATE in a JOIN in SQL Server? How to join multiple collections with $lookup in mongodb How to join on multiple columns in Pyspark? Pandas join issue: columns overlap but no suffix specified MySQL select rows where left join is null How to return rows from left table not found in right table? Why do multiple-table joins produce duplicate rows? pandas three-way joining multiple dataframes on columns

Examples related to left-join

MySQL select rows where left join is null How to return rows from left table not found in right table? How to specify names of columns for x and y when joining in dplyr? Select rows which are not present in other table SQL LEFT-JOIN on 2 fields for MySQL Combine two pandas Data Frames (join on a common column) SQL LEFT JOIN Subquery Alias How to use mysql JOIN without ON condition? LEFT JOIN only first row MySQL - sum column value(s) based on row from the same table

Examples related to dplyr

R dplyr: Drop multiple columns How to specify "does not contain" in dplyr filter Select first and last row from grouped data Error: could not find function "%>%" Sum across multiple columns with dplyr Removing NA observations with dplyr::filter() Changing factor levels with dplyr mutate Change value of variable with dplyr dplyr change many data types What does %>% function mean in R?