When to use Hadoop, HBase, Hive and Pig?

Question

What are the benefits of using either Hadoop or HBase or Hive?

From my understanding, HBase avoids using map-reduce and has a column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

User · Answer

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

There are four main modules in Hadoop:

- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Before going further, let's note that we have three different types of data:

- Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server, etc.
- Unstructured: Data that does not have any structure and can be in any form: web server logs, e-mail, images, etc.
- Semi-structured: Data that is not strictly structured but has some structure, e.g. XML files.

Depending on the type of data to be processed, we have to choose the right technology.

Some more projects which are part of Hadoop:

- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
- Pig: A high-level data-flow language and execution framework for parallel computation.

A Hive vs. Pig comparison can be found at this article and in my other post at this SE question.

HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data. MapReduce may act on data stored in HBase during processing.

You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.

You can use Sqoop to import structured data from a traditional RDBMS (Oracle, SQL Server, etc.) and process it with Hadoop MapReduce.

You can use Flume for collecting unstructured data and process it with Hadoop MapReduce.

Have a look at: Hadoop Use Cases.

Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.

HBase fits real-time querying of Big Data. Facebook uses it for messaging and real-time analytics.

Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it into relational database systems. It is good for ad-hoc analysis.

Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.

User · Answer

Pig: better for handling files and cleaning data, for example removing null values, string handling, and dropping unnecessary values.

Hive: for querying the cleaned data.

User · Answer

Hadoop: Hadoop combines HDFS, the Hadoop Distributed File System, with the Map-Reduce computational processing model.

HBase: HBase is key-value storage, good for reading and writing in near real time.

Hive: Hive is used for data extraction from HDFS using SQL-like syntax. Hive uses the HQL language.

Pig: Pig is a data flow language for creating ETL. It's a scripting language.

User · Answer

Understanding in depth

Hadoop

Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.

Features of Hadoop

- It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
- It has a shared-nothing architecture.
- It replicates its data across multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
- Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
- It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for an RDBMS.
- It is not good when work cannot be parallelized or when there are dependencies within the data.
- It is not good for processing small files. It works best with huge data files and data sets.

Versions of Hadoop

There are two versions of Hadoop available: Hadoop 1.0 and Hadoop 2.0.

Hadoop 1.0

It has two main parts:

1. Data Storage Framework

It is a general-purpose file system called Hadoop Distributed File System (HDFS). HDFS is schema-less. It simply stores data files, and these data files can be in just about any format. The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much needed flexibility and agility without being overly worried by what it can implement.

2. Data Processing Framework

This is a simple functional programming model initially popularized by Google as MapReduce. It essentially uses two functions, MAP and REDUCE, to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data, which is another list of key-value pairs. The "Reducers" then act on this input to produce the output data. The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant and scalable way.

Limitations of Hadoop 1.0

- The first limitation was the requirement of MapReduce programming expertise.
- It supported only batch processing, which is suitable for tasks such as log analysis and large-scale data mining projects but pretty much unsuitable for other kinds of projects.
- One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or extract data from HDFS and process it outside of Hadoop. Neither option was viable, as both led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.

Hadoop 2.0

In Hadoop 2.0, HDFS continues to be the data storage framework. However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Any application capable of dividing itself into parallel tasks is supported by YARN. YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications. It works by having an Application Master in place of the Job Tracker, running applications on resources governed by the new Node Manager. The ApplicationMaster is able to run any application and not just MapReduce. This means it supports not only batch processing but also real-time processing. MapReduce is no longer the only data processing option.

Advantages of Hadoop

- It stores data in its native form. There is no structure imposed while keying in data or storing data. HDFS is schema-less. It is only later, when the data needs to be processed, that the structure is imposed on the raw data.
- It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
- It is resilient to failure. Hadoop is fault tolerant. It practices replication of data diligently, which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.
- It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, the processing is extremely fast in Hadoop owing to the "move code to data" paradigm.

Hadoop Ecosystem

Following are the components of the Hadoop ecosystem:

- HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
- HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.
- Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.
- Pig: It is an easy to understand data flow language. It helps with analysis of large datasets, which is quite the order with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
- ZooKeeper: It is a coordination service for distributed applications.
- Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
- Mahout: It is a scalable machine learning and data mining library.
- Chukwa: It is a data collection system for managing large distributed systems.
- Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
- Ambari: It is a web-based tool for provisioning, managing and monitoring Hadoop clusters.

Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.

Hive is not:

- A relational database
- A design for Online Transaction Processing (OLTP)
- A language for real-time queries and row-level updates

Features of Hive

- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides a SQL-type language for querying called HiveQL or HQL.
- It is familiar, fast, scalable and extensible.

Hive Architecture

The following components are contained in the Hive architecture:

- User Interface: Hive is a data warehouse infrastructure that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line and Hive HDInsight (on Windows Server).
- MetaStore: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
- HiveQL Process Engine: HiveQL is similar to SQL for querying the schema info in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.
- Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce would. It uses the flavor of MapReduce.
- HDFS or HBase: Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
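The Mapper and Reducer roles described in this answer can be sketched in plain Python. This is only a single-process illustration of the programming model, not Hadoop's API: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit one (word, 1) pair per word in an input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    # Each mapper call is independent of the others, which is what lets a
    # real cluster run them in parallel on different machines.
    intermediate = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


word_counts = run_job(["Hive and Pig", "Pig and HBase"])
print(word_counts)
```

Because neither function depends on shared state, the same two functions could be applied to many input splits at once, which is the property the answer refers to as working "in isolation".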

User · Answer

For a comparison between Hadoop vs. Cassandra/HBase, read this post.

Basically, HBase enables really fast reads and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages, etc. HBase is so fast that stacks have sometimes been developed by Facebook to use HBase as the data store for Hive itself.

Hive, by contrast, is more like a data warehousing solution. You can use a syntax similar to SQL to query Hive contents, which results in a MapReduce job. It is not ideal for fast, transactional systems.

User · Answer

I worked on a Lambda architecture processing real-time and batch loads. Real-time processing is needed where fast decisions have to be taken, for example when a fire alarm is sent by a sensor, or for fraud detection in banking transactions. Batch processing is needed to summarize data which can be fed into BI systems.

We used Hadoop ecosystem technologies for the above applications.

Real-Time Processing

- Apache Storm: stream data processing, rule application
- HBase: datastore for serving the real-time dashboard

Batch Processing

- Hadoop: crunching huge chunks of data, for a 360-degree overview or for adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive and Shark help in computing. This layer needs a scheduler, for which Oozie is a good option.

Event Handling Layer

- Apache Kafka was the first layer to consume high-velocity events from the sensors. Kafka serves both the real-time and batch analytics data flows through LinkedIn connectors.
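The split between the real-time and batch layers can be sketched as follows. This is a toy, single-process model under stated assumptions: the event records, the `BATCH_CUTOFF` value and all function names are hypothetical, and nothing here uses Kafka, Storm, HBase or Hadoop APIs.

```python
from collections import Counter

# Hypothetical event log; in the setup described above these events would
# arrive through Kafka.
EVENTS = [
    {"ts": 1, "sensor": "s1", "alarm": True},
    {"ts": 2, "sensor": "s2", "alarm": False},
    {"ts": 3, "sensor": "s1", "alarm": True},
    {"ts": 4, "sensor": "s2", "alarm": True},
]

BATCH_CUTOFF = 2  # assumed: the last batch run covered events with ts <= 2


def alarm_counts(events):
    """Count alarms per sensor; both layers compute the same kind of view."""
    return Counter(e["sensor"] for e in events if e["alarm"])


def batch_layer(events):
    # Periodic, high-latency recomputation over everything up to the cutoff.
    return alarm_counts(e for e in events if e["ts"] <= BATCH_CUTOFF)


def speed_layer(events):
    # Low-latency view over events the batch run has not seen yet.
    return alarm_counts(e for e in events if e["ts"] > BATCH_CUTOFF)


def serve(events):
    # A query is answered by merging the batch view with the real-time view.
    return batch_layer(events) + speed_layer(events)


merged = serve(EVENTS)
print(dict(merged))
```

The design point is that the batch view can be slow but complete, while the speed view only has to cover the gap since the last batch run.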

User · Answer

Let me try to answer in a few words.

Hadoop is an ecosystem which comprises all the other tools. So, you can't compare Hadoop, but you can compare MapReduce.

Here are my few cents:

- Hive: If your need is very SQL-ish, meaning your problem statement can be catered for by SQL, then the easiest thing to do would be to use Hive. The other case when you would use Hive is when you want a server to have a certain structure of data.
- Pig: If you are comfortable with Pig Latin and your need is more about data pipelines, or your data lacks structure, you could use Pig. Honestly, there is not much difference between Hive and Pig with respect to the use cases.
- MapReduce: If your problem cannot be solved by using SQL directly, you should first try to create a UDF for Hive or Pig. If the UDF does not solve the problem, then getting it done via MapReduce makes sense.

User · Answer

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as Java, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that doesn't make much sense. Using normal sequential programs would be highly inefficient when your data is too huge.

Coming back to the first part of your question, Hadoop is basically 2 things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like all other file systems, HDFS also provides us storage, but in a fault-tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.

Coming to Hive: it provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface which makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.

Pig, meanwhile, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has 2 parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier, because writing MapReduce is not always easy. In fact, in some cases it can really become a pain.

I had written an article on a short comparison of different tools of the Hadoop ecosystem some time ago. It's not an in-depth comparison, but a short intro to each of these tools which can help you to get started. (Just to add on to my answer. No self-promotion intended.)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH

User · Answer

1. We use Hadoop for storing large data (i.e. structured, unstructured and semi-structured data) in file formats like txt and csv.

2. If we want column-level updates in our data, we use HBase.

3. In the case of Hive, we store Big Data which is in a structured format, and in addition to that we provide analysis on that data.

4. Pig is a tool which uses the Pig Latin language to analyze data in any format (structured, semi-structured and unstructured).

User · Answer

The short answer to this question is:

Hadoop is a framework which provides a distributed file system and a programming model that allow us to store humongous amounts of data and process it in a distributed fashion, very efficiently and with much less processing time compared to traditional approaches.

(HDFS: Hadoop Distributed File System)
(MapReduce: programming model for distributed processing)

Hive is a query language which allows us to read/write data from the Hadoop distributed file system in a very popular SQL-like fashion. This made life easier for many people from a non-programming background, as they don't have to write MapReduce programs anymore, except for very complex scenarios which Hive does not support.

HBase is a columnar NoSQL database. The underlying storage layer for HBase is again HDFS. The most important use case for this database is being able to store billions of rows with millions of columns. The low-latency feature of HBase gives faster, random access to records over distributed data, which is a very important feature for complex projects like recommender engines. Also, its record-level versioning capability allows users to store transactional data very efficiently (this solves the problem of updating records that we have with HDFS and Hive).

Hope this is helpful to quickly understand the above 3 features.
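The record-level versioning mentioned above can be pictured as each cell keeping several timestamped values. The class below is a toy in-memory model written for illustration only; `VersionedTable`, its methods and the `max_versions` parameter are hypothetical names, not the HBase client API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> list of (timestamp, value) versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self.rows[row_key][column]
        versions.append((timestamp, value))
        # Keep the newest versions first and drop anything beyond the limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, column, versions=1):
        """Return up to `versions` newest values for one cell."""
        cell = self.rows.get(row_key, {}).get(column, [])
        return [value for _, value in cell[:versions]]


table = VersionedTable(max_versions=2)
table.put("user#42", "profile:city", "Pune", timestamp=100)
table.put("user#42", "profile:city", "Delhi", timestamp=200)
table.put("user#42", "profile:city", "Mumbai", timestamp=300)

latest = table.get("user#42", "profile:city")
history = table.get("user#42", "profile:city", versions=5)
print(latest, history)
```

An update is just a new version of the cell, so the files underneath never need to be rewritten in place; older versions are dropped once the configured limit is exceeded.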

User · Answer

First of all, we should be clear that Hadoop was created as a faster alternative to RDBMS: to process large amounts of data at a very fast rate, which earlier took a lot of time in RDBMS.

Now one should know two terms:

- Structured Data: This is the data that we used in traditional RDBMS and that is divided into well-defined structures.
- Unstructured Data: This is important to understand: about 80% of the world's data is unstructured or semi-structured. This is data which is in its raw form and cannot be processed using an RDBMS. Example: Facebook or Twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html)

So, a large amount of data was being generated in the last few years, and the data was mostly unstructured; that gave birth to Hadoop. It was mainly used for very large amounts of data that would take an unfeasible amount of time using RDBMS. It had many drawbacks, for instance that it could not be used for comparatively small data in real time, but they have managed to remove its drawbacks in the newer version.

Before going further, I would like to point out that a new Big Data tool is created when people see a fault in the previous tools. So, whichever tool you see has been created to overcome a problem of the previous tools.

Hadoop can be simply described as two things: MapReduce and HDFS. MapReduce is where the processing takes place, and HDFS is the storage layer where data is stored. This structure followed the WORM principle, i.e. write once, read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBase, a NoSQL product where we can make changes to the data even after writing it once.

But with time we saw that Hadoop had many faults, and for that we created different environments over the Hadoop structure. Pig and Hive are two popular examples.

Hive was created for people with a SQL background. The queries are written in a SQL-like language named HiveQL. Hive was developed to process completely structured data. It is not used for unstructured data.

Pig, on the other hand, has its own query language, i.e. Pig Latin. It can be used for both structured and unstructured data.

Moving to the difference, as to when to use Hive and when to use Pig, I don't think anyone other than the architect of Pig could say. Follow the link: https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html

User · Answer

I believe this thread hasn't done justice to HBase and Pig in particular. While I believe Hadoop is the choice of distributed, resilient file system for big-data lake implementations, the choice between HBase and Hive is particularly well-segregated.

That is, a lot of use cases have a particular requirement for SQL-like or NoSQL-like interfaces. With Phoenix on top of HBase, SQL-like capabilities are certainly achievable; however, the performance, third-party integrations and dashboard updates are rather painful experiences. Still, it's an excellent choice for databases requiring horizontal scaling.

Pig is particularly excellent for non-recursive batch-like computations or ETL pipelining (somewhere where it outperforms Spark by a comfortable distance). Also, its high-level dataflow implementation is an excellent choice for batch querying and scripting. The choice between Pig and Hive also pivots on the need for client-side or server-side scripting, the required file formats, etc. Pig supports the Avro file format, which is not true in the case of Hive. The choice of "procedural dataflow language" vs. "declarative dataflow language" is also a strong argument in the choice between Pig and Hive.
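The "procedural vs. declarative" distinction can be illustrated with the same aggregation written both ways. This is a sketch under stated assumptions: SQLite (from Python's standard library) stands in for a SQL-style engine such as Hive, a step-by-step Python pipeline stands in for a Pig-style dataflow, and the `hits` table and sample rows are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (page, HTTP status) records.
ROWS = [("home", 200), ("home", 500), ("cart", 200), ("cart", 200), ("home", 200)]


def declarative(rows):
    """State what is wanted in one SQL statement; the engine plans the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (page TEXT, status INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page, COUNT(*) FROM hits WHERE status = 200 "
        "GROUP BY page ORDER BY page"
    ).fetchall()
    conn.close()
    return result


def procedural(rows):
    """Spell out the pipeline, naming each intermediate relation in turn."""
    ok_hits = [row for row in rows if row[1] == 200]   # filter step
    grouped = defaultdict(list)                         # group step
    for page, status in ok_hits:
        grouped[page].append(status)
    counts = [(page, len(statuses)) for page, statuses in grouped.items()]
    return sorted(counts)                               # order step


print(declarative(ROWS))
print(procedural(ROWS))
```

Both functions return the same rows. The difference is who decides the order of operations: the query engine in the declarative version, the script author in the procedural one.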

User · Answer

Cleansing data in Pig is very easy. A suitable approach would be cleansing the data through Pig, then processing it through Hive, and later uploading it to HDFS.

User · Answer

Use of Hive, HBase and Pig, based on my real-world experience in different projects.

Hive is used mostly for:

- Analytics purposes, where you need to do analysis on historical data
- Generating business reports based on certain columns
- Efficiently managing the data together with metadata information
- Joining tables on certain frequently used columns by using the bucketing concept
- Efficient storing and querying using the partitioning concept
- Not useful for transaction/row-level operations like update, delete, etc.

Pig is mostly used for:

- Frequent data analysis on huge data
- Generating aggregated values/counts on huge data
- Generating enterprise-level key performance indicators very frequently

HBase is mostly used:

- For real-time processing of data
- For efficiently managing complex and nested schemas
- For real-time querying and faster results
- For easy scalability with columns
- Useful for transaction/row-level operations like update, delete, etc.
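The partitioning and bucketing concepts listed for Hive above can be pictured roughly as follows. This is a simplified sketch, not Hive's implementation: the `/warehouse` path, the `weblogs` table name, the `dt` partition column, `NUM_BUCKETS` and the use of CRC32 as the hash are all assumptions made for illustration (Hive applies its own hash function).

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count for the illustration


def partition_dir(table, dt):
    """Partitioning: each partition value maps to its own directory."""
    return f"/warehouse/{table}/dt={dt}"


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Bucketing: hash the bucketing column into a fixed number of files.

    CRC32 is only a stand-in for whatever hash the engine really uses.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_buckets


def dirs_to_scan(table, wanted_dates, all_dates):
    # Partition pruning: only directories for the requested dates are read.
    return [partition_dir(table, dt) for dt in all_dates if dt in wanted_dates]


all_dates = ["2015-01-01", "2015-01-02", "2015-01-03"]
scanned = dirs_to_scan("weblogs", {"2015-01-02"}, all_dates)
print(scanned)
print(bucket_for(12345) == bucket_for(12345))
```

Partitioning lets a query skip whole directories, while bucketing guarantees that equal keys from two tables land in the same bucket number, which is what makes joins on that column cheaper.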

User · Answer

I implemented a Hive data platform recently in my firm and can speak to it in the first person, since I was a one-man team.

Objective

- To have the daily web log files collected from 350+ servers queryable daily through some SQL-like language
- To replace daily aggregation data generated through MySQL with Hive
- To build custom reports through queries in Hive

Architecture Options

I benchmarked the following options:

- Hive + HDFS
- Hive + HBase: queries were too slow, so I dumped this option

Design

- Daily log files were transported to HDFS
- MR jobs parsed these log files and wrote output files to HDFS
- Create Hive tables with partitions and locations pointing to the HDFS locations
- Create Hive query scripts (call it HQL if you like, as distinct from SQL) that in turn ran MR jobs in the background and generated aggregation data
- Put all these steps into an Oozie workflow, scheduled with a daily Oozie coordinator

Summary

HBase is like a map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in HBase are between 1000000 and 2000000, that is not suitable for HBase alone.

If you have data that needs to be aggregated, rolled up or analyzed across rows, then consider Hive.

Hopefully this helps.

Hive actually rocks... I know, I have lived it for 12 months now... So does HBase...
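The "HBase is like a map" summary can be made concrete with a small sketch. A plain Python dict is used as a stand-in here, which is a simplification: it ignores that HBase keeps row keys sorted and is distributed. The sample keys and the function names are hypothetical.

```python
# Hypothetical key-value data standing in for a table keyed by integer.
store = {key: f"row-{key}" for key in (10, 1_000_500, 1_500_000, 2_500_000)}


def get_by_key(key):
    """Point lookup: a single probe, independent of how many rows exist."""
    return store.get(key)


def count_keys_between(low, high):
    """Aggregate question: in this toy model every key must be inspected."""
    return sum(1 for key in store if low <= key <= high)


print(get_by_key(1_500_000))
print(count_keys_between(1_000_000, 2_000_000))
```

The first function is the access pattern a key-value store is built for; the second is the kind of cross-row question the answer suggests handing to Hive instead.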

User · Answer

Consider that you work with an RDBMS and have to select what to use, full table scans or index access, but only one of them. If you select full table scans, use Hive. If index access, use HBase.
