How to Access Hive via Python

Question

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:

from hive_service import ThriftHive

Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")
    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals the method execute within the Client class.

How may I use Python to access Hive?

User · Answer

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get, yum, or whatever package manager your distribution uses. For Windows, there are some options on GNU.org where you can download a binary installer. On a Mac, SASL should be available if you've installed the Xcode developer tools (run xcode-select --install in Terminal).

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the Hive connection, there are a couple of ways to use it. You can query it directly:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or use the connection to make a Pandas DataFrame:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
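
For large result sets, fetchall() pulls every row into memory at once. Here is a minimal sketch of batched iteration instead, assuming only that the connection follows the Python DB-API 2.0 interface (PyHive's does). The helper name iter_rows is hypothetical, and an in-memory sqlite3 database stands in for Hive so the example runs without a cluster.

```python
import sqlite3
from contextlib import closing


def iter_rows(conn, sql, batch_size=1000):
    """Yield rows from any DB-API 2.0 connection in fixed-size batches."""
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            for row in batch:
                yield row


# Assumption: sqlite3 stands in for hive.Connection(...) purely so that
# this sketch is runnable without a Hive server.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE hive_table (cool_stuff INTEGER)")
demo_conn.executemany(
    "INSERT INTO hive_table VALUES (?)", [(i,) for i in range(5)]
)

rows = list(iter_rows(demo_conn, "SELECT cool_stuff FROM hive_table", batch_size=2))
print(rows)
```

With a real PyHive connection you would pass conn in place of demo_conn; the batch size is a tuning knob, not a recommended value.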

User · Answer

The Python program below should work to access Hive tables from Python:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
    print(output)
else:
    print("error")

User · Answer

Simplest method: using SQLAlchemy.

Requirements:

pip install pyhive

Code:

import pandas as pd
from sqlalchemy import create_engine

SECRET = {'username': 'lol', 'password': 'lol'}
user_name = SECRET.get('username')
passwd = SECRET.get('password')

host_server = 'x.x.x.x'
port = '10000'
database = 'default'
conn = f'hive://{user_name}:{passwd}@{host_server}:{port}/{database}'
engine = create_engine(conn, connect_args={'auth': 'LDAP'})

query = "select * from tablename limit 100"
data = pd.read_sql(query, con=engine)
print(data)
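
One thing to watch with a hive:// URL is that a password containing characters such as @ or / will break the URL unless it is percent-encoded. A small sketch of building the URL safely with the standard library follows; build_hive_url is a hypothetical helper name and the credentials are made up.

```python
from urllib.parse import quote_plus


def build_hive_url(user, password, host, port=10000, database="default"):
    """Build a hive:// SQLAlchemy URL, percent-encoding the credentials."""
    return "hive://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Made-up credentials; the "@" and "/" in the password must be escaped.
url = build_hive_url("lol", "p@ss/word", "x.x.x.x")
print(url)
```

The resulting string can be passed to create_engine in place of the f-string.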

User · Answer

Similar to @python-starter's solution, but the commands package is not available in Python 3.x. The alternative in Python 3.x is to use subprocess:

import subprocess

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
    print(output)
else:
    print("error")
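
If the query text ever comes from a variable, passing it through a shell string gets fragile because of quoting. A hedged sketch of the same idea with an argument list and subprocess.run (Python 3.7+ for capture_output) is below. The helper names are hypothetical, and the parsing assumes the CLI's default tab-separated output.

```python
import subprocess


def hive_cli_args(query, executable="hive"):
    """Return the argument list for a silent, one-shot Hive CLI query."""
    return [executable, "-S", "-e", query]


def parse_tab_separated(output):
    """Split CLI output into rows of string fields, skipping blank lines."""
    return [line.split("\t") for line in output.splitlines() if line.strip()]


def run_hive_query(query):
    """Run the query via the CLI; raises CalledProcessError on failure."""
    completed = subprocess.run(
        hive_cli_args(query), capture_output=True, text=True, check=True
    )
    return parse_tab_separated(completed.stdout)


# Parsing is demonstrated on a literal string, so no Hive install is needed.
parsed = parse_tab_separated("1\talice\n2\tbob\n")
print(parsed)
```

On a node with the hive CLI installed, run_hive_query("SELECT * FROM db_name.table_name LIMIT 1") would return the rows as lists of strings.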

User · Answer

This can be a quick hack to connect Hive and Python:

from pyhive import hive
cursor = hive.connect('YOUR_HOST_NAME').cursor()
# On Python 3.7+, async is a reserved word; PyHive names the argument async_.
cursor.execute('SELECT * FROM table_name LIMIT 5', async_=True)
print(cursor.fetchall())

Output: a list of tuples.

User · Answer

Here's a generic approach that works well for me because I keep connecting to several servers (SQL, Teradata, Hive, etc.) from Python. Hence, I use the pyodbc connector. Here are some basic steps to get going with pyodbc, in case you have never used it.

Prerequisite: you should have the relevant ODBC connection in your Windows setup before you follow the steps below. If you don't have one, set it up first.

STEP 1. Install pyodbc (the relevant driver can be downloaded from Microsoft's website):

pip install pyodbc

STEP 2. Import it in your Python script:

import pyodbc

STEP 3. Finally, give the connection details as follows. Note that the attributes in an ODBC connection string are separated by semicolons:

conn_hive = pyodbc.connect('DSN=YOUR_DSN_NAME;SERVER=YOUR_SERVER_NAME;UID=USER_ID;PWD=PSWD')

The best part of using pyodbc is that I have to import just one package to connect to almost any data source.
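
Since ODBC connection strings are just KEY=value pairs joined by semicolons, it can help to assemble them from named parts. A small sketch is below; odbc_connection_string is a hypothetical helper, the values are placeholders, and it does not handle values that themselves contain semicolons.

```python
def odbc_connection_string(**parts):
    """Join keyword/value pairs into 'KEY=value;KEY=value' form."""
    return ";".join("{}={}".format(key, value) for key, value in parts.items())


conn_str = odbc_connection_string(DSN="YOUR_DSN_NAME", UID="USER_ID", PWD="PSWD")
print(conn_str)

# With pyodbc installed and the DSN configured, the string would be used as:
#   conn_hive = pyodbc.connect(conn_str, autocommit=True)
# autocommit=True is an assumption here; it is often needed for drivers
# that do not support transactions.
```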

User · Answer

None of the answers demonstrate how to fetch and print the table headers. This is modified from the standard example for PyHive, which is widely used and actively maintained.

from pyhive import hive
cursor = hive.connect(host="localhost",
                      port=10000,
                      username="shadan",
                      auth="KERBEROS",
                      kerberos_service_name="hive"
                      ).cursor()
cursor.execute("SELECT * FROM my_dummy_table LIMIT 10")
# The first element of each description entry is the column name.
columnList = [desc[0] for desc in cursor.description]
headerStr = ",".join(columnList)
headerTuple = tuple(headerStr.split(","))
print(headerTuple)
print(cursor.fetchone())
print(cursor.fetchall())
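
Building on the header idea, here is a rough sketch that pairs each row with its column names. It assumes only the DB-API convention that the first element of each cursor.description entry is the column name. Hive may report names as table.column depending on server settings, so the sketch optionally strips that prefix. The helper names and the sample data are made up.

```python
def column_names(description, strip_table_prefix=True):
    """Extract column names from a DB-API cursor.description sequence."""
    names = [column[0] for column in description]
    if strip_table_prefix:
        # Keep only the part after the last dot, e.g. "t.id" -> "id".
        names = [name.rsplit(".", 1)[-1] for name in names]
    return names


def rows_to_dicts(description, rows):
    """Pair each row tuple with the (prefix-stripped) column names."""
    names = column_names(description)
    return [dict(zip(names, row)) for row in rows]


# Hand-written stand-ins for cursor.description and cursor.fetchall().
description = [
    ("my_dummy_table.id", "INT_TYPE", None, None, None, None, True),
    ("my_dummy_table.name", "STRING_TYPE", None, None, None, None, True),
]
rows = [(1, "a"), (2, "b")]
records = rows_to_dicts(description, rows)
print(records)
```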

User · Answer

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                        port=8081,
                        username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())

User · Answer

You could use the Python JayDeBeApi package to create a DB-API connection from the Hive or Impala JDBC driver, and then pass the connection to the pandas.read_sql function to return the data in a pandas DataFrame.

import jaydebeapi
import pandas as pd

# Driver directories from my environment; adjust to where your JDBC jars live.
hive_dir = '/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/'
impala_dir = '/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/'

# Supporting jars that ship alongside both drivers.
shared_jars = ['TCLIServiceClient.jar',
               'commons-codec-1.3.jar',
               'commons-logging-1.1.1.jar',
               'hive_metastore.jar',
               'hive_service.jar',
               'httpclient-4.1.3.jar',
               'httpcore-4.1.3.jar',
               'libfb303-0.9.0.jar',
               'libthrift-0.9.0.jar',
               'log4j-1.2.14.jar',
               'ql.jar',
               'slf4j-api-1.5.11.jar',
               'slf4j-log4j12-1.5.11.jar',
               'zookeeper-3.4.6.jar']

jars = ([hive_dir + 'HiveJDBC41.jar'] + [hive_dir + j for j in shared_jars] +
        [impala_dir + 'ImpalaJDBC41.jar'] + [impala_dir + j for j in shared_jars])

# Apparently the jar files need to be loaded on the first call for the Impala JDBC driver to work.
conn = jaydebeapi.connect('com.cloudera.hive.jdbc41.HS2Driver',
                          'jdbc:hive2://host:10000/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=hive;KrbRealm=xxx.COM',
                          jars=jars)

# The previous call has initialized the jar files; technically this call need not include them.
impala_conn = jaydebeapi.connect('com.cloudera.impala.jdbc41.Driver',
                                 'jdbc:impala://host:21050/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=impala;KrbRealm=xxx.COM',
                                 jars=jars)

df1 = pd.read_sql("SELECT * FROM tablename", conn)
df2 = pd.read_sql("SELECT * FROM tablename", impala_conn)

conn.close()
impala_conn.close()

User · Answer

I solved the same problem. Here is my environment: Linux, Python 3.6, PyHive. My solution is as follows:

from pyhive import hive
conn = hive.Connection(host='149.129.***.***', port=10000, username='*', database='*', password='*', auth='LDAP')

The key point is to pass both password and auth, and to set auth to 'LDAP'. Then it works well. If you have any questions, please let me know.
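
To avoid hard-coding the LDAP password in the script, one option is to read the connection settings from environment variables. This is only a sketch: the variable names (HIVE_HOST, HIVE_USER, and so on) and the helper hive_ldap_kwargs are made up for illustration, and a plain dict stands in for os.environ so it runs anywhere.

```python
import os


def hive_ldap_kwargs(env=None):
    """Collect hive.Connection keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env["HIVE_HOST"],
        "port": int(env.get("HIVE_PORT", "10000")),
        "username": env["HIVE_USER"],
        "password": env["HIVE_PASSWORD"],
        "database": env.get("HIVE_DATABASE", "default"),
        "auth": "LDAP",
    }


# A plain dict stands in for os.environ here (hypothetical values).
kwargs = hive_ldap_kwargs(
    {"HIVE_HOST": "hive.example.com", "HIVE_USER": "me", "HIVE_PASSWORD": "secret"}
)
print(kwargs["host"], kwargs["port"], kwargs["auth"])

# With PyHive installed, the dict would be used as:
#   conn = hive.Connection(**kwargs)
```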

User · Answer

Similar to eycheu's solution, but a little more detailed.

Here is an alternative solution specifically for hive2 that does not require PyHive or installing system-wide packages. I am working in a Linux environment where I do not have root access, so installing the SASL dependencies as mentioned in Tristin's post was not an option for me:

"If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution."

Specifically, this solution relies on the Python package JayDeBeApi. In my experience, installing this one extra package on top of a Python Anaconda 2.7 install was all I needed. The package relies on Java (JDK); I am assuming that is already set up.

Step 1: Install JayDeBeApi

pip install jaydebeapi

Step 2: Download the appropriate drivers for your environment

Download the JDBC driver jars that match your environment (for example, the jars required for an enterprise CDH environment, or the JDBC drivers for Apache Hive). Store all the .jar files in one directory. I will refer to this directory as /path/to/jar/files/.

Step 3: Identify your system's authentication mechanism

In the PyHive solutions listed, I've seen PLAIN as well as Kerberos given as the authentication mechanism. Note that your JDBC connection URL depends on the authentication mechanism you are using. I will explain the Kerberos solution, which does not pass a username/password.

Create a Kerberos ticket if one has not already been created:

kinit

Tickets can be viewed via klist.

You are now ready to make the connection via Python:

import jaydebeapi
import glob

# Creates a list of jar files in the /path/to/jar/files/ directory
jar_files = glob.glob('/path/to/jar/files/*.jar')

host = 'localhost'
port = '10000'
database = 'default'

# note: your driver will depend on your environment and the drivers you
# downloaded in step 2
# this is the driver for my environment (jdbc3, hive2, cloudera enterprise)
driver = 'com.cloudera.hive.jdbc3.HS2Driver'

conn_hive = jaydebeapi.connect(driver,
                               'jdbc:hive2://' + host + ':' + port + '/' + database + ';AuthMech=1;KrbHostFQDN=' + host + ';KrbServiceName=hive',
                               jars=jar_files)

If you only care about reading, you can read directly into a pandas DataFrame with ease, as in eycheu's solution:

import pandas as pd
df = pd.read_sql("select * from table", conn_hive)

Otherwise, here is a more versatile option:

cursor = conn_hive.cursor()
sql_expression = "select * from table"
cursor.execute(sql_expression)
results = cursor.fetchall()

As you can imagine, if you wanted to create a table, you would not need to fetch the results; you could submit a create table query instead.
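
Because the JDBC URL changes with the authentication mechanism, it may be easier to assemble it from its parts than to concatenate strings by hand. A minimal sketch follows; hive2_jdbc_url is a hypothetical helper, and the property names are simply the ones used in the URL above.

```python
def hive2_jdbc_url(host, port=10000, database="default", **properties):
    """Build 'jdbc:hive2://host:port/db;Key=value;...' from its parts."""
    url = "jdbc:hive2://{}:{}/{}".format(host, port, database)
    for key, value in properties.items():
        url += ";{}={}".format(key, value)
    return url


# Kerberos-style properties, mirroring the connection string in the answer.
url = hive2_jdbc_url(
    "localhost", AuthMech=1, KrbHostFQDN="localhost", KrbServiceName="hive"
)
print(url)
```

The string can then be passed as the second argument to jaydebeapi.connect.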

User · Answer

pyhs2 is no longer maintained. A better alternative is impyla.

Don't be confused that some of the examples below are about Impala; just change the port to 10000 (the default) for HiveServer2, and it will work the same way as in the Impala examples. The same protocol (Thrift) is used for both Impala and Hive.

https://github.com/cloudera/impyla

It has many more features than pyhs2; for example, it has Kerberos authentication, which is a must for us.

from impala.dbapi import connect
conn = connect(host='my.host.com', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()

# Alternatively, iterate over the cursor row by row
# (process is a placeholder for your own function).
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)

Cloudera is now putting more effort into the hs2client, https://github.com/cloudera/hs2client, which is a C/C++ HiveServer2/Impala client. It might be a better option if you push a lot of data to or from Python (it has a Python binding too: https://github.com/cloudera/hs2client/tree/master/python).

Some more information on impyla:

http://blog.cloudera.com/blog/2014/04/a-new-python-client-for-impala/
https://github.com/cloudera/impyla/blob/master/README.md

User · Answer

By using the Python Client Driver:

pip install pyhs2

Then:

import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

Refer to: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

User · Answer

You can use the hive library. For that, import the ThriftHive class with from hive import ThriftHive.

Try this example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
    client.execute("SELECT * FROM r")
    # Fetch rows one at a time until the result set is exhausted.
    while True:
        row = client.fetchOne()
        if row is None:
            break
        print(row)

    client.execute("SELECT * FROM r")
    print(client.fetchAll())
    transport.close()
except Thrift.TException as tx:
    print('%s' % (tx.message))

User · Answer

It is common practice to prohibit users from downloading and installing packages and libraries on cluster nodes. In that case, the solutions of @python-starter and @goks work perfectly if Hive runs on the same node. Otherwise, one can use beeline instead of the hive command-line tool.

# python 2
import commands

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = commands.getstatusoutput(cmd)

if status == 0:
    print(output)
else:
    print("error")

# python 3
import subprocess

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
    print(output)
else:
    print("error")
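
Beeline's default table-style output is awkward to parse. A hedged sketch of asking for tab-separated output and reading it with the csv module is below. It assumes beeline's --outputformat=tsv2 and --silent=true options and that a header row is printed first; the helper names and the sample output are made up.

```python
import csv
import io


def beeline_args(jdbc_url, query):
    """Argument list for a one-shot beeline query with TSV output."""
    return [
        "beeline", "-u", jdbc_url,
        "--silent=true", "--outputformat=tsv2",
        "-e", query,
    ]


def parse_tsv2(output):
    """Parse header-first, tab-separated text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(output), delimiter="\t")
    return [dict(row) for row in reader]


# Made-up sample standing in for beeline's stdout.
sample = "id\tname\n1\talice\n2\tbob\n"
parsed = parse_tsv2(sample)
print(parsed)
```

The argument list from beeline_args can be passed to subprocess.run(..., capture_output=True, text=True) on Python 3.7+, and the captured stdout handed to parse_tsv2.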

User · Answer

The examples above are a bit out of date. A newer example is here:

import pyhs2 as hive
import getpass
DEFAULT_DB = 'default'
DEFAULT_SERVER = '10.37.40.1'
DEFAULT_PORT = 10000
DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'

u = raw_input('Enter PAM username: ')
s = getpass.getpass()
connection = hive.connect(host=DEFAULT_SERVER, port=DEFAULT_PORT, authMechanism='LDAP', user=u + '@' + DEFAULT_DOMAIN, password=s)
statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
cur = connection.cursor()

cur.execute(statement)
df = cur.fetchall()

In addition to the standard Python program, a few libraries need to be installed to allow Python to build the connection to the Hadoop database:

1. Pyhs2, Python Hive Server 2 Client Driver
2. Sasl, Cyrus-SASL bindings for Python
3. Thrift, Python bindings for the Apache Thrift RPC system
4. PyHive, Python interface to Hive

Remember to make the script executable before running it:

chmod +x test_hive2.py
./test_hive2.py

Hope this helps.

Reference: https://sites.google.com/site/tingyusz/home/blogs/hiveinpython

User · Answer

I assume you are using HiveServer2, which is why the code doesn't work.

You can use pyhs2 to access your Hive correctly, with example code like this:

import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

Note that you may need to install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Hope this helps.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver
