In this blog I am going to walk through the Hive installation step by step, sort of a cookbook recipe.
Everything here runs on a single machine: the Hive server, the metastore, and the client in the form of the CLI. You can also write your own client that speaks Thrift, since the Hive server is accessible over TCP.
- Download the Hive tarball.
- Extract the tarball into your user's home directory, preferably the same user that runs Hadoop.
- Make a soft link:
~:ln -s hive-0.8.1/ hive
*Soft links work best when you want to upgrade: just change the link 😉
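When that day comes, the upgrade is a one-liner (a sketch; hive-0.9.0 stands in for whatever newer release you unpack alongside):
~:ln -sfn hive-0.9.0/ hive
The -f flag replaces the existing link and -n stops ln from descending into the directory the old link points to.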
Configuring Hive
Hive relies on the underlying Hadoop infrastructure; in this blog we use only a local instance. Hence it is important to install Hadoop first.
- Add HADOOP_HOME to your environment:
~:echo "export HADOOP_HOME=~/hadoop" >> ~/.bashrc
*You need write permission on the hadoop user's profile.
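Reload the profile and verify the variable took effect (this assumes Hadoop lives at ~/hadoop, as above):
~:source ~/.bashrc
~:echo $HADOOP_HOME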
- Make a copy of the file ~/hive/conf/hive-default.xml.template:
~:cp ~/hive/conf/hive-default.xml.template ~/hive/conf/hive-default.xml
The above is Hive's config file; for fine-tuning Hive we override its properties. For the full list of properties and their usage, please see the Hive configuration documentation.
- Add configuration properties such as the ones below to the end of the file:
<!-- Nishant -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <description>Hive fully supports local mode execution</description>
</property>
Metastore
Hive depends on metadata that maps Hive tables/partitions to the actual data on HDFS. This metadata lives in a persistent store, typically a database. I am using Derby here as it is lightweight; which deployment mode to use is a tradeoff the architect has to make. The metastore can be embedded, local, or remote, and the advantage of one over the other is driven by:
– Performance (for embedded)
– Redundancy (for remote)
– Other DB features
In this example we are going with a local Derby server (you can use embedded as well).
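For reference, the embedded alternative (Hive's out-of-the-box default) would look like this instead; note that embedded Derby allows only one session at a time:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>Embedded Derby metastore, single session only</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Embedded Derby driver</description>
</property>
For the local Derby network server, add the following: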
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/nishant/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
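Note that Hive ships with the embedded Derby driver only, so with ClientDriver you will likely need to copy derbyclient.jar from your Derby distribution into Hive's lib directory. Once the Derby network server is running (started further below), you can sanity-check the connection with Derby's ij tool; a sketch:
~:cp ~/derby/lib/derbyclient.jar ~/hive/lib/
~:~/derby/bin/ij
ij> connect 'jdbc:derby://localhost:1527/metastore_db;create=true';
ij> show tables;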
Web Interface
Hive provides a web interface (HWI) so you can browse schemas and run multiple queries. This comes in handy when multiple Hive clients are running queries against the Hive server.
<!-- Nishant -->
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.8.1.war</value>
  <description>This sets the path to the HWI war file, relative to ${HIVE_HOME}; match it to your Hive version.</description>
</property>
<property>
  <name>hive.hwi.listen.host</name>
  <value>127.0.0.1</value>
  <description>This is the host address the Hive Web Interface will listen on</description>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
  <description>This is the port the Hive Web Interface will listen on</description>
</property>
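With that configured, HWI runs as its own service:
~/hive: bin/hive --service hwi
Then point your browser at http://127.0.0.1:9999/hwi (the host and port from the properties above).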
- Starting the Hive server
~/hive: bin/hive --service hiveserver
*The Hive server listens on port 10000 by default.
That's it; the above will start the Hive server.
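If you want the server to outlive your terminal session, a minimal sketch is to background it and confirm the port is bound (netstat flags vary by platform):
~/hive: nohup bin/hive --service hiveserver > hiveserver.log 2>&1 &
~/hive: netstat -nlt | grep 10000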
- Client CLI
~/hive: bin/hive
The above starts the CLI, but to run queries or any other Hive operation, the metastore must be up.
Hive clients can be anything that speaks ODBC/JDBC to Hive, using the URL jdbc:hive://<HOST>:<PORT>/default.
- Starting the metastore
Since I use a local Derby server, I start the Derby network server:
~/derby/bin/startNetworkServer
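You can confirm the server came up with Derby's control script:
~:~/derby/bin/NetworkServerControl ping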
- Starting Hadoop
~/hadoop:bin/start-all.sh
Testing Hive
TBD
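In the meantime, here is a minimal smoke test (a sketch, borrowing the pokes table from Hive's Getting Started guide; kv1.txt ships in the Hive tarball):
~/hive: bin/hive -e "CREATE TABLE IF NOT EXISTS pokes (foo INT, bar STRING);"
~/hive: bin/hive -e "LOAD DATA LOCAL INPATH 'examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;"
~/hive: bin/hive -e "SELECT COUNT(*) FROM pokes;"
If the count comes back, the CLI, metastore, and Hadoop are all talking to each other.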
A typical Hive deployment might look like the figure below.
[Figure: Typical deployment for Hive]
Migration Issue
We faced an issue when migrating the underlying Hadoop and changing its configuration. We changed the Hadoop port from 60000 to 54310 (the default), because our HBase master was running on its default port 60000. After this change Hive failed, because it kept trying to hit the previous address, with the error below:
hive> select * from pokes;
OK
Failed with exception java.io.IOException:java.net.ConnectException: Call to localhost/127.0.0.1:60000 failed on connection exception: java.net.ConnectException: Connection refused
Time taken: 10.507 seconds
The reason is that the metastore records the full HDFS location of every table's data; when I looked at the extended description of the table I got something like:
hive> describe extended pokes;
OK
foo int
bar string
new_col int
Detailed Table Information Table(tableName:pokes, dbName:default, owner:sysadminn, createTime:1337064947, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col, type:int, comment:null)], location:hdfs://localhost:60000/user/hive/warehouse/pokes, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{last_modified_by=sysadminn, last_modified_time=1337065255, transient_lastDdlTime=1337065459}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
which still points to the old location. So I altered the table with the command below:
ALTER TABLE pokes SET LOCATION "hdfs://localhost:54310/user/hive/warehouse/pokes";
And thus the table got its metadata and Hadoop links back. It is important to note that the URL is the HDFS (NameNode) location and not the JobTracker; otherwise you'll get this error:
hive> select * from pokes;
OK
Failed with exception java.io.IOException:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown protocol to job tracker: org.apache.hadoop.hdfs.protocol.ClientProtocol
at org.apache.hadoop.mapred.JobTracker.getProtocolVersion(JobTracker.java:222)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
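To double-check which address to use, look up fs.default.name in your Hadoop config; the table location must use exactly that value:
~:grep -A 1 "fs.default.name" ~/hadoop/conf/core-site.xml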
The same applies equally to partitions: along with the table, we need to override the location information for each partition individually, as shown below (a scripted shortcut is sketched at the end of this post).
Altering Table
hive> describe extended invites;
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Table Information Table(tableName:invites, dbName:default, owner:sysadminn, createTime:1337065237, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment), FieldSchema(name:ds, type:string, comment:null)], location:hdfs://localhost:60000/user/hive/warehouse/invites, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, comment:null)], parameters:{last_modified_by=sysadminn, last_modified_time=1337065270, transient_lastDdlTime=1337065270}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
hive> ALTER TABLE invites SET LOCATION "hdfs://localhost:54310/user/hive/warehouse/invites";
hive> describe extended invites;
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Table Information Table(tableName:invites, dbName:default, owner:sysadminn, createTime:1337065237, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment), FieldSchema(name:ds, type:string, comment:null)], location:hdfs://localhost:54310/user/hive/warehouse/invites, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, comment:null)], parameters:{last_modified_by=sysadminn, last_modified_time=1337158863, transient_lastDdlTime=1337158863}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Listing partition information for the table
hive> SHOW PARTITIONS invites;
OK
ds=2008-08-08
ds=2008-08-15
Altering Partitions
hive> DESCRIBE EXTENDED invites PARTITION (ds='2008-08-08');
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Partition Information Partition(values:[2008-08-08], dbName:default, tableName:invites, createTime:1337065605, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment)], location:hdfs://localhost:60000/user/hive/warehouse/invites/ds=2008-08-08, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), parameters:{transient_lastDdlTime=1337065605})
hive> ALTER TABLE invites PARTITION (ds='2008-08-08') SET LOCATION "hdfs://localhost:54310/user/hive/warehouse/invites/ds=2008-08-08";
OK
hive> DESCRIBE EXTENDED invites PARTITION (ds='2008-08-08');
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Partition Information Partition(values:[2008-08-08], dbName:default, tableName:invites, createTime:1337065605, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment)], location:hdfs://localhost:54310/user/hive/warehouse/invites/ds=2008-08-08, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), parameters:{last_modified_by=sysadminn, last_modified_time=1337159527, transient_lastDdlTime=1337159527})
And for the other partition:
hive> DESCRIBE EXTENDED invites PARTITION (ds='2008-08-15');
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Partition Information Partition(values:[2008-08-15], dbName:default, tableName:invites, createTime:1337065580, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment)], location:hdfs://localhost:60000/user/hive/warehouse/invites/ds=2008-08-15, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), parameters:{transient_lastDdlTime=1337065580})
hive> ALTER TABLE invites PARTITION (ds='2008-08-15') SET LOCATION "hdfs://localhost:54310/user/hive/warehouse/invites/ds=2008-08-15";
OK
hive> DESCRIBE EXTENDED invites PARTITION (ds='2008-08-15');
OK
foo int
bar string
new_col2 int a comment
ds string
Detailed Partition Information Partition(values:[2008-08-15], dbName:default, tableName:invites, createTime:1337065580, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:foo, type:int, comment:null), FieldSchema(name:bar, type:string, comment:null), FieldSchema(name:new_col2, type:int, comment:a comment)], location:hdfs://localhost:54310/user/hive/warehouse/invites/ds=2008-08-15, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), parameters:{last_modified_by=sysadminn, last_modified_time=1337159560, transient_lastDdlTime=1337159560})
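Altering partitions one by one gets tedious on a real table. Here is a sketch that scripts it; it assumes single-key partitions, the new NameNode address localhost:54310, and the default warehouse path, so adjust all three to your setup:
~/hive: bin/hive -S -e "SHOW PARTITIONS invites;" | while read p; do
  echo "ALTER TABLE invites PARTITION (${p%%=*}='${p#*=}') SET LOCATION 'hdfs://localhost:54310/user/hive/warehouse/invites/$p';"
done > fix_partitions.hql
~/hive: bin/hive -f fix_partitions.hql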