Preface

This article follows the earlier tutorial, Installing Hadoop 2.7.2 in Pseudo-Distributed Mode on Ubuntu 16.04. It aims to cover installing Hive 1.2.1 on Ubuntu 16.04 as concisely as possible, and is intended for readers who already have some experience installing Hive or Hadoop.

The article first covers installing Hive in local mode, then in distributed mode.

About Hive

Hive is a data warehouse tool built on top of Hadoop. With Hive, data processing logic can be expressed directly in an SQL-like language, sparing developers from writing complex Java-based MapReduce programs for big data query and analysis. In other words, Hive abstracts MapReduce behind SQL-like statements: when a statement is executed, Hive translates it into MapReduce jobs and runs them.
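
For a concrete feel, here is a minimal HiveQL sketch of the kind of statement Hive compiles into a MapReduce job; the web_logs table is hypothetical, purely for illustration:

-- Hypothetical table: one row per HTTP request.
-- Hive compiles the GROUP BY into a map phase (emit each url)
-- and a reduce phase (count rows per url); no Java code required.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;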

Obviously, Hive depends on Hadoop. Moreover, unlike HBase, Hive must rely on HDFS and cannot use the local file system; Hive works on top of Hadoop's distributed storage (HDFS and HBase) and the MapReduce parallel computing framework.

Downloading and Initializing Hive

This article assumes Hadoop has already been installed as described in the previous post.

Download

Use the Hive mirror from CNNIC:

$ cd ~
$ wget http://mirrors.cnnic.cn/apache/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
$ tar -xzf apache-hive-1.2.1-bin.tar.gz
$ sudo mv apache-hive-1.2.1-bin /usr/local/hive
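
As an optional sanity check, list the installation directory; it should contain subdirectories such as bin, conf, lib, and examples:

$ ls /usr/local/hive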

Initializing Hive

Hadoop path

Since Hive depends on Hadoop, the Hadoop path must be configured. Following the previous post's installation, the path is /usr/local/hadoop.

Set HADOOP_HOME

$ cd /usr/local/hive
$ cp conf/hive-env.sh.template conf/hive-env.sh
$ nano conf/hive-env.sh

Add HADOOP_HOME to it (adjust the path to your own setup):

HADOOP_HOME=/usr/local/hadoop

Initialize the configuration

$ cp conf/hive-default.xml.template conf/hive-default.xml

All subsequent operations are performed under /usr/local/hive.
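
Optionally, and purely as a convenience not required by the rest of this guide, Hive can be added to the PATH so that hive is callable from any directory (append the two lines to ~/.bashrc to make them persistent):

$ export HIVE_HOME=/usr/local/hive
$ export PATH=$PATH:$HIVE_HOME/bin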

Local Mode

In local mode, Hive runs against the local machine's Hadoop environment; only HDFS is needed, with no need for services such as YARN.

Configuration

Create conf/hive-site.xml with the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>

This effectively overrides Hadoop's own configuration.
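
Once the Hive CLI is running (next section), the effective value can be verified with SET, which prints a property's current value; with this configuration it should report local:

hive> SET mapreduce.framework.name;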

Testing

Start Hive

HDFS must be started before launching Hive!

$ /usr/local/hadoop/sbin/start-dfs.sh

Then:

$ bin/hive

You should see:

hive>

Create a table

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

You should see:

OK
Time taken: 11.741 seconds

Show tables

SHOW TABLES;

You should see:

OK
invites
Time taken: 0.962 seconds, Fetched: 1 row(s)

Describe the table:

DESCRIBE invites;

You should see:

OK
foo                     int
bar                     string
ds                      string

# Partition Information
# col_name              data_type               comment

ds                      string
Time taken: 1.44 seconds, Fetched: 8 row(s)

Alter table columns

ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');

You should see, respectively:

OK
Time taken: 0.804 seconds

OK
Time taken: 0.577 seconds

Load data from files into the table

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

You should see, respectively:

Loading data to table default.invites partition (ds=2008-08-15)
Partition default.invites{ds=2008-08-15} stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]
OK
Time taken: 4.879 seconds

Loading data to table default.invites partition (ds=2008-08-08)
Partition default.invites{ds=2008-08-08} stats: [numFiles=1, numRows=0, totalSize=216, rawDataSize=0]
OK
Time taken: 1.607 seconds
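
To spot-check what was loaded, a small query against one partition keeps the output short (a quick sketch, not part of the original walkthrough):

SELECT * FROM invites WHERE ds='2008-08-15' LIMIT 5;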

SQL Queries

Direct query

SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

You should see:

...
285
35
227
395
244
Time taken: 3.851 seconds, Fetched: 500 row(s)

Write query results to HDFS

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

This runs as a MapReduce job; you should see:

Query ID = hadoop_20160527103747_2efd85ec-b858-4fcb-8a9a-df6aa90b4d7f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-05-27 10:37:54,169 Stage-1 map = 0%,  reduce = 0%
2016-05-27 10:37:55,222 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local718531458_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-37-47_317_3150862765548535991-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 11582 HDFS Write: 18798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 8.535 seconds
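
The exported rows can then be inspected straight from HDFS, e.g. from another terminal; note that INSERT OVERWRITE DIRECTORY separates columns with the \001 (^A) control character by default:

$ /usr/local/hadoop/bin/hdfs dfs -cat /tmp/hdfs_out/*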

Drop the table

DROP TABLE invites;

You should see:

OK
Time taken: 4.013 seconds

Exit Hive

quit;

If necessary, also stop HDFS:

$ /usr/local/hadoop/sbin/stop-dfs.sh

Distributed Mode

In distributed mode, Hive additionally requires services such as YARN.

Initialization

Configuration

You have two options.

One is simply to remove hive-site.xml (moving it aside as a backup works):

$ mv conf/hive-site.xml conf/hive-site.xml.bak

The other is to change the value of mapreduce.framework.name in hive-site.xml to yarn, i.e.:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>

Create directories

Hive stores its data in HDFS, so the directories must be created ahead of time. Under the default configuration:

$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /tmp
$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /tmp
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse
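
As an optional check that the directories exist with the expected group-write permission, list the HDFS root and /user/hive:

$ /usr/local/hadoop/bin/hdfs dfs -ls / /user/hive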

Testing

Start Hive

HDFS and YARN must be started before launching Hive:

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

Then:

$ bin/hive

You should see:

hive>

Test table operations

Table operations are tested exactly as in local mode; only the displayed results differ slightly.

During the "Write query results to HDFS" step, you should see:

Query ID = hadoop_20160527105353_49d6a7ec-5124-4e07-bd68-87e20bf87278
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1464317481996_0001, Tracking URL = http://localhost:8088/proxy/application_1464317481996_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1464317481996_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-05-27 10:55:02,780 Stage-1 map = 0%,  reduce = 0%
2016-05-27 10:55:28,781 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.04 sec
MapReduce Total cumulative CPU time: 7 seconds 40 msec
Ended Job = job_1464317481996_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-53-53_594_4595666558140512513-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 7.04 sec   HDFS Read: 9165 HDFS Write: 12791 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 40 msec
OK
Time taken: 98.675 seconds

This time the MapReduce job has an ID; visiting http://localhost:8088/ in a browser shows the job.
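
The finished job can also be listed from the command line, for example (run from another terminal, or after exiting Hive):

$ /usr/local/hadoop/bin/yarn application -list -appStates FINISHED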

Exit Hive

quit;

If necessary, also stop HDFS and YARN:

$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh

Starting and Stopping Hive

Start

After starting HDFS and YARN:

$ bin/hive

Stop

Before stopping HDFS and YARN:

quit;

Summary

From an installation and operations standpoint, one clear difference between Hive and HBase is that Hive has no daemon processes and needs no startup scripts. This is easy to understand: a Hive command execution always has a beginning and an end, so there is no long-running environment to maintain. Strictly speaking, then, "starting" and "stopping" Hive as described here means entering and exiting the Hive CLI (command line interface).
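
This start-and-end nature is easy to see with the CLI's -e option, which runs a single statement and exits, leaving nothing behind (a small illustration, assuming HDFS is running):

$ bin/hive -e 'SHOW TABLES;'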
