[sql] What is the difference between a hash join and a merge join (Oracle RDBMS )?

A "sort merge" join is performed by sorting the two data sets to be joined according to the join keys and then merging them together. The merge is very cheap, but the sort can be prohibitively expensive especially if the sort spills to disk. The cost of the sort can be lowered if one of the data sets can be accessed in sorted order via an index, although accessing a high proportion of blocks of a table via an index scan can also be very expensive in comparison to a full table scan.

A hash join is performed by hashing one data set into memory based on join columns and reading the other one and probing the hash table for matches. The hash join is very low cost when the hash table can be held entirely in memory, with the total cost amounting to very little more than the cost of reading the data sets. The cost rises if the hash table has to be spilled to disk in a one-pass sort, and rises considerably for a multipass sort.

(In pre-10g, outer joins from a large to a small table were problematic performance-wise, as the optimiser could not resolve the need to access the smaller table first for a hash join, but the larger table first for an outer join. Consequently hash joins were not available in this situation).

The cost of a hash join can be reduced by partitioning both tables on the join key(s). This allows the optimiser to infer that rows from a partition in one table will only find a match in a particular partition of the other table, and for tables having n partitions the hash join is executed as n independent hash joins. This has the following effects:

  1. The size of each hash table is reduced, hence reducing the maximum amount of memory required and potentially removing the need for the operation to require temporary disk space.
  2. For parallel query operations the amount of inter-process messaging is vastly reduced, reducing CPU usage and improving performance, as each hash join can be performed by one pair of PQ processes.
  3. For non-parallel query operations the memory requirement is reduced by a factor of n, and the first rows are projected from the query earlier.

You should note that hash joins can only be used for equi-joins, but merge joins are more flexible.

In general, if you are joining large amounts of data in an equi-join then a hash join is going to be a better bet.

This topic is very well covered in the documentation.

http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/optimops.htm#i51523

12.1 docs: https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm

Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to oracle

concat yesterdays date with a specific time ORA-28001: The password has expired how to modify the size of a column How to create a blank/empty column with SELECT query in oracle? Find the number of employees in each department - SQL Oracle Query to display all tablespaces in a database and datafiles When or Why to use a "SET DEFINE OFF" in Oracle Database How to insert date values into table error: ORA-65096: invalid common user or role name in oracle In Oracle SQL: How do you insert the current date + time into a table?

Examples related to join

Pandas Merging 101 pandas: merge (join) two data frames on multiple columns How to use the COLLATE in a JOIN in SQL Server? How to join multiple collections with $lookup in mongodb How to join on multiple columns in Pyspark? Pandas join issue: columns overlap but no suffix specified MySQL select rows where left join is null How to return rows from left table not found in right table? Why do multiple-table joins produce duplicate rows? pandas three-way joining multiple dataframes on columns