[sql] Difference between INNER JOIN and LEFT SEMI JOIN

What is the difference between an INNER JOIN and LEFT SEMI JOIN?

In the scenario below, why am I getting two different results?

The INNER JOIN result set is a lot larger. Can someone explain? I am trying to get the names within table_1 that only appear in table_2.

SELECT name
FROM table_1 a
    INNER JOIN table_2 b ON a.name=b.name

SELECT name
FROM table_1 a
    LEFT SEMI JOIN table_2 b ON (a.name=b.name)

This question is related to sql hql hive

The answer is


Suppose there are 2 tables TableA and TableB with only 2 columns (Id, Data) and following data:

TableA:

+----+---------+
| Id |  Data   |
+----+---------+
|  1 | DataA11 |
|  1 | DataA12 |
|  1 | DataA13 |
|  2 | DataA21 |
|  3 | DataA31 |
+----+---------+

TableB:

+----+---------+
| Id |  Data   |
+----+---------+
|  1 | DataB11 |
|  2 | DataB21 |
|  2 | DataB22 |
|  2 | DataB23 |
|  4 | DataB41 |
+----+---------+

Inner Join on column Id will return columns from both the tables and only the matching records:

.----.---------.----.---------.
| Id |  Data   | Id |  Data   |
:----+---------+----+---------:
|  1 | DataA11 |  1 | DataB11 |
:----+---------+----+---------:
|  1 | DataA12 |  1 | DataB11 |
:----+---------+----+---------:
|  1 | DataA13 |  1 | DataB11 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB21 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB22 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB23 |
'----'---------'----'---------'

Left Join (or Left Outer join) on column Id will return columns from both the tables and matching records with records from left table (Null values from right table):

.----.---------.----.---------.
| Id |  Data   | Id |  Data   |
:----+---------+----+---------:
|  1 | DataA11 |  1 | DataB11 |
:----+---------+----+---------:
|  1 | DataA12 |  1 | DataB11 |
:----+---------+----+---------:
|  1 | DataA13 |  1 | DataB11 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB21 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB22 |
:----+---------+----+---------:
|  2 | DataA21 |  2 | DataB23 |
:----+---------+----+---------:
|  3 | DataA31 |    |         |
'----'---------'----'---------'

Right Join (or Right Outer join) on column Id will return columns from both the tables and matching records with records from right table (Null values from left table):

+-----------------------------+
¦ Id ¦  Data   ¦ Id ¦  Data   ¦
+----+---------+----+---------¦
¦  1 ¦ DataA11 ¦  1 ¦ DataB11 ¦
¦  1 ¦ DataA12 ¦  1 ¦ DataB11 ¦
¦  1 ¦ DataA13 ¦  1 ¦ DataB11 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB21 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB22 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB23 ¦
¦    ¦         ¦  4 ¦ DataB41 ¦
+-----------------------------+

Full Outer Join on column Id will return columns from both the tables and matching records with records from left table (Null values from right table) and records from right table (Null values from left table):

+-----------------------------+
¦ Id ¦  Data   ¦ Id ¦  Data   ¦
¦----+---------+----+---------¦
¦  - ¦         ¦    ¦         ¦
¦  1 ¦ DataA11 ¦  1 ¦ DataB11 ¦
¦  1 ¦ DataA12 ¦  1 ¦ DataB11 ¦
¦  1 ¦ DataA13 ¦  1 ¦ DataB11 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB21 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB22 ¦
¦  2 ¦ DataA21 ¦  2 ¦ DataB23 ¦
¦  3 ¦ DataA31 ¦    ¦         ¦
¦    ¦         ¦  4 ¦ DataB41 ¦
+-----------------------------+

Left Semi Join on column Id will return columns only from left table and matching records only from left table:

+--------------+
¦ Id ¦  Data   ¦
+----+---------¦
¦  1 ¦ DataA11 ¦
¦  1 ¦ DataA12 ¦
¦  1 ¦ DataA13 ¦
¦  2 ¦ DataA21 ¦
+--------------+

Tried in Hive and got the below output

table1

1,wqe,chennai,india

2,stu,salem,india

3,mia,bangalore,india

4,yepie,newyork,USA

table2

1,wqe,chennai,india

2,stu,salem,india

3,mia,bangalore,india

5,chapie,Los angels,USA

Inner Join

SELECT * FROM table1 INNER JOIN table2 ON (table1.id = table2.id);

1 wqe chennai india 1 wqe chennai india

2 stu salem india 2 stu salem india

3 mia bangalore india 3 mia bangalore india

Left Join

SELECT * FROM table1 LEFT JOIN table2 ON (table1.id = table2.id);

1 wqe chennai india 1 wqe chennai india

2 stu salem india 2 stu salem india

3 mia bangalore india 3 mia bangalore india

4 yepie newyork USA NULL NULL NULL NULL

Left Semi Join

SELECT * FROM table1 LEFT SEMI JOIN table2 ON (table1.id = table2.id);

1 wqe chennai india

2 stu salem india

3 mia bangalore india

note: Only records in left table are displayed whereas for Left Join both the table records displayed


An INNER JOIN can return data from the columns from both tables, and can duplicate values of records on either side have more than one match. A LEFT SEMI JOIN can only return columns from the left-hand table, and yields one of each record from the left-hand table where there is one or more matches in the right-hand table (regardless of the number of matches). It's equivalent to (in standard SQL):

SELECT name
FROM table_1 a
WHERE EXISTS(
    SELECT * FROM table_2 b WHERE (a.name=b.name))

If there are multiple matching rows in the right-hand column, an INNER JOIN will return one row for each match on the right table, while a LEFT SEMI JOIN only returns the rows from the left table, regardless of the number of matching rows on the right side. That's why you're seeing a different number of rows in your result.

I am trying to get the names within table_1 that only appear in table_2.

Then a LEFT SEMI JOIN is the appropriate query to use.


Trying to depict with venn diagrams for better understanding..

Left Semi join : A semi join returns values from the left side of the relation that has a match with the right. It is also referred to as a left semi join.

enter image description here

Note : There is another thing called left anti join : An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.

Inner join : It selects rows that have matching values in both relations.

enter image description here


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to hql

Difference between INNER JOIN and LEFT SEMI JOIN [Ljava.lang.Object; cannot be cast to HQL Hibernate INNER JOIN What is the difference between JOIN and JOIN FETCH when using JPA and Hibernate IN-clause in HQL or Java Persistence Query Language ORDER BY using Criteria API How do you do a limit query in JPQL or HQL? Hibernate HQL Query : How to set a Collection as a named parameter of a Query? How do you create a Distinct query in HQL JPA and Hibernate - Criteria vs. JPQL or HQL

Examples related to hive

select rows in sql with latest date for each ID repeated multiple times PySpark: withColumn() with two conditions and three outcomes java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient Hive cast string to date dd-MM-yyyy How to save DataFrame directly to Hive? How to calculate Date difference in Hive Select top 2 rows in Hive Just get column names from hive table Create hive table using "as select" or "like" and also specify delimiter Hive Alter table change Column Name