Difference between Pig and Hive? Why have both?

Question

My background: 4 weeks old in the Hadoop world. I have dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and have read Google's paper on Map-Reduce and GFS (PDF link).

I understand that:

- Pig's language, Pig Latin, is a shift from the SQL-like declarative style of programming (it suits the way programmers think), while Hive's query language closely resembles SQL.
- Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive is closely coupled to Hadoop.
- Both Pig Latin and Hive commands compile to Map and Reduce jobs.

My question: what is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

User · Answer

Here are some additional links on when to use Pig or Hive:

http://aws.amazon.com/elasticmapreduce/faqs/#hive-8
http://www.larsgeorge.com/2009/10/hive-vs-pig.html

User · Answer

Pig is useful for ETL-type workloads, generally speaking: for example, a set of transformations you need to apply to your data every day.

Hive shines when you need to run ad hoc queries or just want to explore data. It can sometimes act as the interface to your visualisation layer (Tableau/Qlikview).

Both are essential and serve different purposes.

User · Answer

In simpler words, Pig is a high-level platform for creating MapReduce programs used with Hadoop. Using Pig scripts, we process large amounts of data into the desired format.

Once the processed data is obtained, it is kept in HDFS for later processing.

On top of the stored, processed data we apply Hive SQL commands to get the desired results; internally, these Hive SQL commands run MapReduce programs.

User · Answer

You can achieve similar results with Pig and Hive queries. The main difference lies in the approach to understanding, writing and creating queries.

Pig tends to create a flow of data: small steps, in each of which you do some processing. Hive gives you a SQL-like language to operate on your data, so the transition from an RDBMS is much easier. Pig can be easier for someone who has no earlier experience with SQL.

It is also worth noting that for Hive you can get a nice interface to work with this data (Beeswax for HUE, or the Hive web interface), and it also gives you a metastore for information about your data (schema, etc.), which is useful as a central source of information about your data.

I use both Hive and Pig for different queries: I use whichever one lets me write the query faster or more easily (I do it this way mostly for ad hoc queries), and they can use the same data as input. But currently I'm doing much of my work through Beeswax.
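As a rough illustration of the "flow of small steps" versus "one SQL-like statement" contrast described above, here is a minimal Python sketch. It is not Pig Latin or HiveQL: plain Python stands in for the dataflow style and the standard-library sqlite3 module stands in for a SQL engine. The sample data and the function names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor, page, seconds_on_page)
visits = [
    ("alice", "home", 12),
    ("bob", "home", 7),
    ("alice", "docs", 40),
    ("bob", "docs", 3),
    ("carol", "home", 25),
]

def dataflow_style(rows):
    """Pig-like approach: a chain of small, named steps."""
    long_visits = [r for r in rows if r[2] >= 10]           # step 1: filter
    by_visitor = defaultdict(list)                          # step 2: group
    for visitor, _page, seconds in long_visits:
        by_visitor[visitor].append(seconds)
    totals = {v: sum(s) for v, s in by_visitor.items()}     # step 3: aggregate
    return sorted(totals.items())                           # step 4: order

def declarative_style(rows):
    """Hive-like approach: define a table, then describe the result in one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT visitor, SUM(seconds) FROM visits "
        "WHERE seconds >= 10 GROUP BY visitor ORDER BY visitor"
    ).fetchall()
    conn.close()
    return result
```

Both functions produce the same totals; the difference is only in how the work is expressed, which is the point the answer makes.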

User · Answer

From the link: http://www.aptibook.com/discuss-technical?uid=tech-hive4&question=What-kind-of-datawarehouse-application-is-suitable-for-Hive

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where:

1. Relatively static data is analyzed,
2. Fast response times are not required, and
3. The data is not changing rapidly.

Hive doesn't provide crucial features required for OLTP (Online Transaction Processing). It's closer to being an OLAP (Online Analytic Processing) tool. So Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

User · Answer

I found this the most helpful (though it's a year old): http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

It specifically talks about Pig vs Hive and when and where they are employed at Yahoo. I found this very insightful. Some interesting notes:

On incremental changes/updates to data sets:

"Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case."

On using other tools via streaming:

"Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set."

On using Hive for data warehousing:

"In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.

The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC."
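The incremental-join idea from the first quote can be sketched in a few lines of Python. This is only an illustration of the principle, not Pig Latin, and it assumes that only one side of the join (the event data) gains new rows while the other side stays unchanged. The data and the join helper are hypothetical.

```python
def join(left, right):
    """Inner join of two lists of (key, value) pairs on key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]

# Hypothetical data: a stable lookup table and an event set that grows daily.
users = [(1, "alice"), (2, "bob")]
old_events = [(1, "login"), (2, "login")]
new_events = [(1, "purchase")]

previous_full_join = join(old_events, users)       # computed earlier and kept
incremental_join = join(new_events, users)         # only the new rows are joined
combined = previous_full_join + incremental_join   # union of the two results

# The expensive alternative: redo the full join over all events.
recomputed = join(old_events + new_events, users)
```

Under that assumption, combined and recomputed contain the same rows, but the incremental path only touches the new events.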

User · Answer

To give a very high-level overview of both, in short:

1. Pig is a relational algebra over Hadoop.
2. Hive is SQL over Hadoop (one level above Pig).

User · Answer

I believe that the real answer to your question is that they are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expand.

Paraphrased from the Hadoop O'Reilly book:

Pig: "a dataflow language and environment for exploring very large datasets"

Hive: "a distributed data warehouse"

User · Answer

Hive vs Pig:

Hive is a SQL interface that lets SQL-savvy users, or other tools such as Tableau, MicroStrategy, or any other tool or language that has a SQL interface, work with the data.

Pig is more like an ETL pipeline, with step-by-step commands such as declaring variables, looping, iterating, conditional statements, etc.

I prefer writing Pig scripts over HiveQL when I want to write complex step-by-step logic. When I am comfortable writing a single SQL statement to pull the data I want, I use Hive. For Hive you need to define the table before querying, as you do in an RDBMS.

The purposes of the two are different, but under the hood both do the same thing: convert to MapReduce programs. Also, the Apache open source community keeps adding more and more features to both projects.

User · Answer

Have a look at the Pig vs Hive comparison in a nutshell from a "dezyre" article.

Hive is better than Pig in: partitions, server, web interface & JDBC/ODBC support.

Some differences:

- Hive is best for structured data & Pig is best for semi-structured data.
- Hive is used for reporting & Pig for programming.
- Hive is used as a declarative SQL & Pig as a procedural language.
- Hive supports partitions & Pig does not.
- Hive can start an optional Thrift-based server & Pig cannot.
- Hive defines tables beforehand (schema) and stores schema information in a database & Pig doesn't have a dedicated metadata database.
- Hive does not support Avro but Pig does. EDIT: Hive supports Avro; specify the serde as org.apache.hadoop.hive.serde2.avro.
- Pig also supports the additional COGROUP feature for performing outer joins, but Hive does not. Both Hive & Pig can join, order & sort dynamically.
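To make the COGROUP point a little more concrete, here is a small Python sketch of what a cogroup-style operation produces. It only mimics the idea (grouping two relations by key into a pair of bags per key); it is not Pig's implementation, and the function name and sample data are invented for the example.

```python
from collections import defaultdict

def cogroup(left, right):
    """Group two relations of (key, value) pairs by key.

    Returns {key: (left_bag, right_bag)}. A bag is empty when that relation
    has no rows for the key, so unmatched keys from either side are kept,
    which is why the result can be used to build outer joins.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

# Hypothetical relations sharing a key column.
owners = [("alice", "cat"), ("bob", "dog")]
friends = [("alice", "carol"), ("dave", "erin")]
grouped = cogroup(owners, friends)
```

Here "bob" keeps an empty right-hand bag and "dave" an empty left-hand bag, which is the information an outer join needs.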

User · Answer

Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used in Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK, so the difference between the two projects is very clear.

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like MicroStrategy. Hive has an ODBC/JDBC driver (that's a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indexes, which should allow support for drill-down queries common in such environments.

Finally (this is not pertinent to the question directly), Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSes), and the HadoopDB project has used Hive to query a federated RDBMS tier.
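For a feel of what such a user-supplied transform script might look like in Python, here is a minimal sketch. It assumes rows arrive as tab-separated lines on standard input and leave the same way on standard output, which is the usual streaming convention; the column layout (user, url) and the function names are hypothetical. In Hive, a script like this would be referenced from a TRANSFORM ... USING clause.

```python
import io

def transform(lines):
    """Turn tab-separated (user, url) rows into (user, domain) rows."""
    for line in lines:
        user, url = line.rstrip("\n").split("\t")
        # Take the host part whether or not the URL carries a scheme.
        domain = url.split("/")[2] if "://" in url else url.split("/")[0]
        yield f"{user}\t{domain}"

def run(stdin, stdout):
    """Stream rows through transform(); a real script would pass sys.stdin and sys.stdout."""
    for out in transform(stdin):
        stdout.write(out + "\n")

# Small in-memory demonstration instead of a real cluster run.
sample_in = io.StringIO("alice\thttp://example.com/a\nbob\texample.org/x\n")
sample_out = io.StringIO()
run(sample_in, sample_out)
```

Because the script only reads lines and writes lines, it can be debugged locally on a small file before being run against a large data set.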

User · Answer

Pig eats anything! Meaning it can consume unstructured data. Hive requires a schema.

User · Answer

I found the link below useful for exploring how and when to use Hive and Pig:

http://www.hadoopwizard.com/when-to-use-pig-latin-versus-hive-sql/

User · Answer

Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.

Hive, which is RDBMS based, needs the data to be first imported (or loaded), and after that it can be worked upon. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and use Hive on each filled bucket, while using other buckets to keep storing the newly arriving data.

Pig also uses lazy evaluation. It allows greater ease of programming, and one can use it to analyze data in different ways with more freedom than in an SQL-like language like Hive. So if you really wanted to analyze matrices or patterns in some unstructured data you had, and wanted to do interesting calculations on them, with Pig you can go some fair distance, while with Hive you need something else to play with the results.

Pig is faster in the data import but slower in actual execution than an RDBMS-friendly language like Hive.

Pig is well suited to parallelization, and so it possibly has an edge for systems where the datasets are huge, i.e. in systems where you are concerned more about the throughput of your results than the latency (the time to get any particular datum of result).
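The lazy evaluation mentioned above can be pictured with a toy Python class: each step is only recorded, and nothing runs until the result is actually requested. This is a sketch of the general idea, not of Pig's planner; the class and method names are invented for illustration.

```python
class LazyRelation:
    """Records steps and runs them only when results are requested."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Only remember the step; do not touch the data yet.
        return LazyRelation(self._source, self._steps + (("filter", predicate),))

    def foreach(self, func):
        return LazyRelation(self._source, self._steps + (("foreach", func),))

    def dump(self):
        """Execute the whole recorded plan in one pass over the source."""
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

calls = []

def double(x):
    calls.append(x)  # track when the function really runs
    return x * 2

plan = LazyRelation(range(6)).filter(lambda x: x % 2 == 0).foreach(double)
ran_before_dump = list(calls)  # still empty: building the plan ran nothing
result = plan.dump()           # the work happens here
```

Deferring execution like this is what lets an engine look at the whole pipeline before deciding how to run it.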

User · Answer

Read about the difference between Pig and Hive at this link: http://www.aptibook.com/Articles/Pig-and-hive-advantages-disadvantages-features

All the aspects are covered. If you are unsure which to choose, that page is worth a look.

User · Answer

When we use Hadoop, it means we are trying to do huge data processing. The end goal of the data processing would be to generate content/reports out of it.

So it internally consists of 2 prime activities:

1. Loading/data processing
2. Generating content and using it for reporting, etc.

Loading/data processing: Pig is helpful here. It serves as an ETL tool (we can perform ETL operations using Pig scripts).

Once the result is processed, we can use Hive to generate reports based on it.

Hive: it is built on top of HDFS for warehouse processing. We can easily generate ad hoc reports using Hive from the processed content generated by Pig.

User · Answer

1. Pig Latin is dataflow style, which is more suitable for software engineers, while SQL is more suitable for analytics people who are used to SQL. For complex tasks in Hive you have to manually create temporary tables to store intermediate data, but that is not necessary in Pig.

2. Pig Latin is suitable for complicated data structures (like a small graph). There's a data structure in Pig called DataBag, which is a collection of Tuples. Sometimes you need to calculate metrics that involve multiple tuples (there's a hidden link between the tuples; in this case I would call it a graph). In this case, it is very easy to write a UDF to calculate the metrics that involve multiple tuples. Of course it could be done in Hive, but it is not as convenient as it is in Pig. Writing a UDF in Pig is much easier than in Hive, in my opinion.

3. Pig has no metadata support (or it is optional; in future it may integrate HCatalog). Hive has table metadata stored in a database.

4. You can debug a Pig script in a local environment, but it would be hard to do that for Hive. The reason is point 3: you need to set up the Hive metadata in your local environment, which is very time consuming.
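The "UDF over a bag of tuples" idea above can be imitated in plain Python: group the rows into per-key bags, then hand each whole bag to one function. This is only an analogy for how a Pig UDF receives a DataBag; the metric (time span between a user's first and last click) and all names are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def session_span(bag):
    """UDF-style function: receives a whole bag of (user, timestamp) tuples
    and returns one metric that depends on several tuples at once."""
    times = [t for _user, t in bag]
    return max(times) - min(times)

# Hypothetical click log: (user, timestamp)
clicks = [("alice", 10), ("bob", 5), ("alice", 35), ("bob", 6), ("alice", 20)]

# Build one bag per user (groupby needs its input sorted by the grouping key).
clicks_sorted = sorted(clicks, key=itemgetter(0))
bags = {user: list(rows) for user, rows in groupby(clicks_sorted, key=itemgetter(0))}

# Apply the function to each bag.
spans = {user: session_span(bag) for user, bag in bags.items()}
```

The function sees all of a user's tuples together, which is what makes metrics that link several rows easy to express.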

User · Answer

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when one would use SQL, as in Hive, rather than Pig. He makes a very convincing case for the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.

User · Answer

What Hive can do that is not possible in Pig:

Partitioning can be done using Hive but not in Pig (it is a way of bypassing the output).

What Pig can do that is not possible in Hive:

Positional referencing: even when you don't have field names, you can reference fields by position, like $0 for the first field, $1 for the second, and so on.

Another fundamental difference is that Pig doesn't need a schema to write the values, but Hive does need a schema.

You can connect to Hive from any external application using JDBC and others, but not to Pig.

Note: both run on top of HDFS (Hadoop Distributed File System), and the statements are converted to MapReduce programs.
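Two of the ideas above, partitioning and referring to fields by position rather than by name, can be shown together in a short Python sketch. It is an in-memory analogy only (a dict stands in for per-partition storage), and the helper names and sample rows are hypothetical.

```python
from collections import defaultdict

def write_partitioned(rows, key_index):
    """Split rows into partitions keyed by one field, referenced by position."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)

def scan(partitions, wanted_key):
    """Read only the matching partition; return its rows and how many rows were touched."""
    rows = partitions.get(wanted_key, [])
    return rows, len(rows)

# Hypothetical log rows with no field names: (day, user, count)
logs = [
    ("2020-01-01", "alice", 3),
    ("2020-01-01", "bob", 5),
    ("2020-01-02", "alice", 1),
]

by_day = write_partitioned(logs, key_index=0)  # position 0, in the spirit of $0
rows, touched = scan(by_day, "2020-01-02")     # other days are never read
```

A query restricted to one day only touches that day's partition, which is the benefit partitioning is meant to give.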
