[hadoop] When to use Hadoop, HBase, Hive and Pig?

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

There are four main modules in Hadoop.

  1. Hadoop Common: The common utilities that support the other Hadoop modules.

  2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

  3. Hadoop YARN: A framework for job scheduling and cluster resource management.

  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
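
The MapReduce model in the last item can be illustrated without a cluster. Below is a minimal single-process Python sketch of the map, shuffle and reduce phases for a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of any Hadoop API; in a real cluster the phases run in parallel across nodes and the input lives in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hive and Pig", "pig and HBase"]))
```

The point of the split is that `map_phase` and `reduce_phase` only ever see one record or one key at a time, which is what lets a framework distribute them.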

Before going further, let's note that there are three different types of data.

  • Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server etc.

  • Unstructured: Data does not have any structure and can take any form, e.g. web server logs, e-mail, images etc.

  • Semi-structured: Data is not strictly structured but has some structure, e.g. XML files.
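
To make the three categories concrete, here is a small Python sketch that reads one sample of each using only the standard library. The sample records and variable names are invented for illustration; the log line follows the categorization used above, where web server logs count as unstructured.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row follows the same schema.
structured = "id,name\n1,alice\n2,bob\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but fields may be missing or vary.
semi_structured = (
    "<users><user id='1'><name>alice</name></user><user id='2'/></users>"
)
users = [
    {"id": u.get("id"), "name": u.findtext("name")}  # name is None if absent
    for u in ET.fromstring(semi_structured).iter("user")
]

# Unstructured: free text; any structure has to be imposed by the reader.
unstructured = '127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /index.html" 200'
match = re.search(r'"(\w+) (\S+)"', unstructured)
request = match.groups() if match else None

if __name__ == "__main__":
    print(rows)
    print(users)
    print(request)
```

Note how the amount of parsing logic the reader has to supply grows from the first sample to the last.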

Depending on the type of data to be processed, we have to choose the right technology.

Some more projects that are part of the Hadoop ecosystem:

  • HBase™: A scalable, distributed database that supports structured data storage for large tables.

  • Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.

  • Pig™: A high-level data-flow language and execution framework for parallel computation.

A Hive vs Pig comparison can be found at this article and in my other post at this SE question.

HBase won't replace MapReduce. HBase is a scalable distributed database, and MapReduce is a programming model for distributed processing of data. MapReduce may act on data stored in HBase.

You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.

You can use Sqoop to import structured data from a traditional RDBMS (Oracle, SQL Server etc.) and process it with Hadoop MapReduce.

You can use Flume to collect unstructured data and process it with Hadoop MapReduce.

Have a look at: Hadoop Use Cases.

Hive should be used for analytical querying of data collected over a period of time, e.g. to calculate trends or summarize website logs, but it can't be used for real-time queries.

HBase is a fit for real-time querying of Big Data. Facebook has used it for messaging and real-time analytics.
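
The difference between Hive-style batch analytics and HBase-style real-time lookups can be sketched in plain Python. This is only an analogy under stated assumptions, not Hive or HBase code: the list `events` stands in for files that a batch query scans in full, the dictionary `by_row_key` stands in for a table addressed by row key, and all names and sample data are hypothetical.

```python
from collections import Counter

# Hypothetical page-view events collected over time.
events = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/cart"},
]


def page_view_summary(all_events):
    """Batch-style analytics: read every record and aggregate."""
    return Counter(e["page"] for e in all_events)


def build_row_key_index(all_events):
    """Key-value layout: the latest page per user, addressed by row key."""
    index = {}
    for e in all_events:
        index[e["user"]] = e["page"]  # later events overwrite earlier ones
    return index


by_row_key = build_row_key_index(events)

if __name__ == "__main__":
    print(page_view_summary(events))  # touches every event
    print(by_row_key["u1"])           # touches one key
```

The summary has to visit all records, so its cost grows with the data set; the keyed lookup does not, which is the property that makes a row-key store suitable for serving individual requests.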

Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. It is good for ad-hoc analysis.

Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
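
A dataflow in this sense is a chain of steps where each step consumes the output of the previous one. The following Python sketch mimics a load, filter, group and aggregate pipeline; it is not Pig Latin, and the function names and the `user,bytes` record layout are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def load(lines):
    """Load step: parse 'user,bytes' records into tuples."""
    for line in lines:
        user, size = line.split(",")
        yield user, int(size)


def keep_large(records, min_bytes):
    """Filter step: drop records smaller than min_bytes."""
    return (r for r in records if r[1] >= min_bytes)


def total_by_user(records):
    """Group and aggregate step: sum bytes per user."""
    ordered = sorted(records, key=itemgetter(0))  # groupby needs sorted input
    return {
        user: sum(size for _, size in group)
        for user, group in groupby(ordered, key=itemgetter(0))
    }


raw = ["u1,100", "u2,5", "u1,50", "u2,70"]
totals = total_by_user(keep_large(load(raw), min_bytes=50))

if __name__ == "__main__":
    print(totals)
```

Each step is independent of the others, so steps can be reordered, swapped or extended without rewriting the whole job.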
