Hive Notes (Part 2)

Published: 2024-11-23. Author: 千家信息网 editor

Last updated by 千家信息网: 2024-11-23

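Table types in Hive

managed_table: managed table (also called a controlled or internal table)

The lifecycle of the data is tied to the table definition: when the table is dropped, its data is deleted along with it. Tables are created as managed tables by default.

You can view a table's details in the CLI with desc extended tableName, or look the table up in TBLS, one of the Hive metadata tables stored in MySQL.

external_table: external table

The lifecycle of the data is independent of the table definition: when the table is dropped, the underlying data stays where it is. The table is effectively just a reference to that data.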
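The difference between the two table types when a table is dropped can be sketched with a toy model in Python. This is not Hive code: the Table class, the drop_table function and the dictionary standing in for HDFS are all hypothetical names invented for illustration, and the paths are sample values.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    location: str
    external: bool = False


def drop_table(table, metastore, storage):
    """Drop a table: the definition always goes, the data only if managed."""
    del metastore[table.name]
    if not table.external:
        # Managed table: the data directory is deleted with the definition.
        storage.pop(table.location, None)


# Toy stand-in for HDFS: directory path -> file names (sample values).
storage = {
    "/user/hive/warehouse/t1": ["000000_0"],
    "/input/hive": ["hive-t6.txt"],
}
managed = Table("t1", "/user/hive/warehouse/t1")
external = Table("t6_external", "/input/hive", external=True)
metastore = {managed.name: managed, external.name: external}

drop_table(managed, metastore, storage)
drop_table(external, metastore, storage)
print(sorted(storage))  # only the external table's directory is left
```

Both definitions disappear from the toy metastore, but only the managed table's directory is removed from storage.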

Creating an external table:

    create external table t6_external(
        id int
    );

Adding data:

    alter table t6_external set location "/input/hive/hive-t6.txt";

You can also point the table at its data when creating it:

    create external table t6_external_1(
        id int
    ) location "/input/hive/hive-t6.txt";

This HQL fails with:

    MetaException(message:hdfs://ns1/input/hive/hive-t6.txt is not a directory or unable to create one

In other words, the location given at creation time is expected to be a directory, not a specific file:

    create external table t6_external_1(
        id int
    ) location "/input/hive/";

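With an external table, delete operations are not allowed, but data can be added, and doing so also changes the referenced text data in HDFS.

A quick guide to when external tables are the usual choice over managed ones:

Use an external table when data safety matters or when the data is shared across several departments. Also use one when integrating Hive with other frameworks such as HBase.

Managed and external tables can be converted into each other:

External to managed:

    alter table t6_external set tblproperties("EXTERNAL"="FALSE");

Managed to external:

    alter table t2 set tblproperties("EXTERNAL"="TRUE");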

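Persistent and temporary tables

All the tables above are persistent: their existence has nothing to do with the session.

Temporary tables:

A temporary table exists only within the session that created it. When the session is closed, all of the table's data, including its metadata, disappears. The table data is held temporarily in memory (in practice, once a temporary table is created, a /tmp/hive directory is created in HDFS to hold the temporary files, but the data is deleted as soon as the session exits), and nothing shows up in the metastore database. Temporary tables are typically used for short-lived storage or data exchange.

Characteristics of temporary tables:

    They cannot be partitioned tables.

Creating a temporary table is simple: it works like creating an external table, with the external keyword replaced by temporary.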

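Functional tables

Partitioned tables

Suppose the up_web_log table is laid out as follows:

    user/hive/warehouse/up_web_log/
        web_log_2017-03-09.log
        web_log_2017-03-10.log
        web_log_2017-03-11.log
        web_log_2017-03-12.log
        ....
        web_log_2018-03-12.log

The layout can be explained as follows:

The table stores web logs. In Hive, a table is a directory in HDFS. Web logs are saved and aggregated by day, so at the end of each day that day's log data is loaded into Hive, which is why there are multiple files under the up_web_log directory. To Hive, all of this log data belongs to the single table up_web_log, and the table obviously keeps growing over time.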

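The problem with this table:

It is one big table holding a large amount of data. To look at a single day, you first have to define a date column in the table (for example dt) and then filter on it in the query with where dt="2017-03-12". That returns the right result, but it has to read every file under the table, so a large amount of irrelevant data is loaded into memory and the HQL runs slowly.

So how can the table be optimized?

We can rely on a property of Hive tables: a table really manages a folder, meaning that a table maps to a directory on HDFS. We can therefore create one or more levels of sub-directories under the table's directory to split up the big table, naming each sub-folder after some identifying feature, for example datadate=2017-03-09... To query one day's data later, Hive only needs to locate that sub-folder and load everything under it, much like loading a table. It no longer loads all the data under the big table, only part of it, which reduces the amount of data in memory and makes the HQL run faster. This technique is called table partitioning, such a table is called a partitioned table, and each sub-folder is called a partition of the table.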

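A partition is made up as follows:

A partition has two parts, the partition column and a specific partition value, joined by "=". Within the table, the partition column is treated like an ordinary column, so to query the data in one partition you write:

    where datadate="2017-03-09"

The table's storage layout in HDFS is:

    user/hive/warehouse/up_web_log/
        /datadate=2017-03-09
            web_log_2017-03-09.log
        /datadate=2017-03-10
            web_log_2017-03-10.log
        /datadate=2017-03-11
            web_log_2017-03-11.log
        /datadate=2017-03-12
            web_log_2017-03-12.log
        ....
            web_log_2018-03-12.log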
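The effect of filtering on a partition column can be sketched as a lookup of one sub-directory instead of a scan of the whole table. The Python below is only a toy model of that idea, not how Hive is implemented; partition_dir and files_to_scan are hypothetical helper names, and the layout dictionary stands in for the HDFS directory tree shown above.

```python
def partition_dir(table_dir, column, value):
    """Build the sub-directory for one partition: <table_dir>/<column>=<value>."""
    return f"{table_dir}/{column}={value}"


def files_to_scan(layout, table_dir, column=None, value=None):
    """Return the files a query has to read.

    Without a partition filter every file under the table is read;
    with one, only the matching sub-directory is read.
    """
    if column is None:
        return sorted(name for files in layout.values() for name in files)
    return sorted(layout.get(partition_dir(table_dir, column, value), []))


table_dir = "user/hive/warehouse/up_web_log"
# Toy directory tree: partition directory -> files inside it.
layout = {
    partition_dir(table_dir, "datadate", day): [f"web_log_{day}.log"]
    for day in ("2017-03-09", "2017-03-10", "2017-03-11", "2017-03-12")
}

all_files = files_to_scan(layout, table_dir)
one_day = files_to_scan(layout, table_dir, "datadate", "2017-03-09")
print(len(all_files), one_day)
```

With the filter, only one of the four files is read; without it, all four are.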

Creating a partitioned table:

    create table t7_partition (
        id int
    ) partitioned by (dt date comment "date partition field");

    load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition;
    FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned

Data cannot be loaded directly into a partitioned table. Before loading, you must state which partition, that is which sub-folder, the data goes into.

DDL for partitioned tables:

Create a partition:

    alter table t7_partition add partition(dt="2017-03-10");

List the partitions:

    show partitions t7_partition;

Drop a partition:

    alter table t7_partition drop partition(dt="2017-03-10");

Adding data:

Add data to a specific partition:

    load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition partition (dt="2017-03-10");

This form creates the partition automatically.

Multiple partition columns:

Example: tracking a school's enrollment and employment figures for each year and each subject.

    create table t7_partition_1 (
        id int
    ) partitioned by (year int, school string);

Add data:

    load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition_1 partition(year=2015, school='python');
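With more than one partition column, each column contributes one directory level, in the order the columns were declared. A small Python sketch of how such a path is assembled follows; partition_path is a hypothetical helper, and the warehouse directory used here is an assumed sample value, since the notes do not say which database the table lives in.

```python
def partition_path(table_dir, partition_columns, spec):
    """Join one 'column=value' directory per partition column, in table order.

    spec maps column name -> value; every partition column must be present,
    which mirrors the "Need to specify partition columns" error above.
    """
    missing = [c for c in partition_columns if c not in spec]
    if missing:
        raise ValueError(f"missing partition columns: {missing}")
    parts = [f"{c}={spec[c]}" for c in partition_columns]
    return "/".join([table_dir, *parts])


# Assumed sample location for t7_partition_1.
table_dir = "/user/hive/warehouse/t7_partition_1"
path = partition_path(table_dir, ["year", "school"], {"school": "python", "year": 2015})
print(path)
```

The order of keys in spec does not matter in this sketch; the directory levels always follow the declared column order (year, then school).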

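Bucketed tables

The problem with partitioned tables:

Partitioning can still leave some partitions very large and others very small, so queries are unevenly loaded, which is not what we want. A technique is needed to spread the table's data out more evenly. It is called bucketing, and a table bucketed this way is called a bucketed table.

Creating a bucketed table:

    create table t8_bucket(
        id int
    ) clustered by(id) into 3 buckets;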

Adding data to a bucketed table:

Data can only be brought in by converting from another table. The load approach above cannot be used, because it does not split the data.

    insert into t8_bucket select * from t7_partition_1 where year=2016 and school="mysql";
    FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different 't8_bucket': Table insclause-0 has 1 columns, but query has 3 columns.

The bucketed table has only one column, while the partitioned table has three (id plus the two partition columns), so when importing with insert into, the number of columns on both sides must match.

    insert into t8_bucket select id from t7_partition_1 where year=2016 and school="mysql";

After adding the data, check the table's contents:

    > select * from t8_bucket;
    OK
    6
    3
    4
    1
    5
    2
    Time taken: 0.08 seconds, Fetched: 6 row(s)

The rows come back in a different order from the original because the data was split into 3 parts. The bucketing algorithm is a hash, as follows:

    6%3 = 0, 3%3 = 0: placed in bucket 1
    4%3 = 1, 1%3 = 1: placed in bucket 2
    5%3 = 2, 2%3 = 2: placed in bucket 3

The layout of table t8_bucket in HDFS:

    hive (mydb1)> dfs -ls /user/hive/warehouse/mydb1.db/t8_bucket;
    Found 3 items
    -rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000000_0
    -rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000001_0
    -rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000002_0

As shown, the data is stored in 3 separate files under the t8_bucket directory.
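The modulo arithmetic above can be reproduced in a few lines of Python. This is a sketch under one assumption taken from the notes' own calculation: for an integer column the hash is the value itself, so the bucket index is simply id % number_of_buckets. assign_buckets is a hypothetical name, and the order of rows inside a bucket here just follows the input order, which need not match what Hive writes.

```python
def assign_buckets(ids, num_buckets):
    """Group integer ids by id % num_buckets (bucket index starts at 0)."""
    buckets = {index: [] for index in range(num_buckets)}
    for value in ids:
        buckets[value % num_buckets].append(value)
    return buckets


buckets = assign_buckets([1, 2, 3, 4, 5, 6], 3)
for index, values in buckets.items():
    # Bucket index 0 corresponds to file 000000_0, index 1 to 000001_0, etc.
    print(index, values)
```

Ids 3 and 6 share a bucket, as do 1 and 4, and 2 and 5, which matches the grouping visible in the select output.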

Note: local mode has no effect when working with bucketed tables.

Loading and exporting data

Notation: [] ==> optional, <> ==> required

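Loading

load

    load data [local] inpath 'path' [overwrite] into table <tblName> [partition_spec];

    local:
        present ==> load the data from the local Linux file system
        absent  ==> load the data from HDFS, which amounts to a mv operation
                    ("absent" means the local keyword is not given, not that the file is missing locally)
    overwrite:
        present ==> replace the data already in the table
        absent  ==> append the new data to what is already there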
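The four combinations of local and overwrite can be modelled with a toy function. This is only an illustration of the semantics described above, not Hive behaviour in every detail: load_data is a hypothetical name, file systems are modelled as Python sets of paths, and a table is modelled as a list of file names.

```python
def load_data(source_fs, path, table_files, local=False, overwrite=False):
    """Toy model of load data; returns the table's new file list.

    local=True  -> the source file is copied and stays where it was
    local=False -> the source file is moved, so it leaves source_fs
    overwrite   -> existing table files are replaced instead of kept
    """
    if path not in source_fs:
        raise FileNotFoundError(path)
    if not local:
        source_fs.remove(path)  # HDFS source: behaves like mv
    name = path.rsplit("/", 1)[-1]
    return [name] if overwrite else table_files + [name]


linux_fs = {"/opt/data/hive/hive-t6.txt"}
hdfs = {"/input/hive/hive-t6.txt"}

# Append from the local file system: the local file is still there afterwards.
files = load_data(linux_fs, "/opt/data/hive/hive-t6.txt", ["old.txt"], local=True)
print(files, linux_fs)

# Overwrite from HDFS: the source file is moved and old table files are gone.
files = load_data(hdfs, "/input/hive/hive-t6.txt", files, overwrite=True)
print(files, hdfs)
```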

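Loading from another table

    insert <overwrite|into> [table] t_des select [...] from t_src [...];

    overwrite:
        present ==> replace the data already in the table
        absent  ==> append the new data to what is already there
    ==> the statement is converted into an MR job for execution

Points to watch: the columns of t_des must correspond one to one with the [...] column list in select [...] from t_src. When overwrite is used, the table keyword after it is mandatory, for example:

    insert overwrite table test select * from t8_bucket;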

Loading while creating a table

    create table t_des as select [...] from t_src [...];

This creates a table whose schema is the [...] column list of select [...] from t_src, e.g.

    create temporary table tmp as select distinct(id) from t8_bucket;

Loading with dynamic partitions

Quickly copying a table's schema:

    create table t_d_partition like t_partition_1;

    hive (default)> show partitions t_partition_1;
    OK
    partition
    year=2015/class=bigdata
    year=2015/class=linux
    year=2016/class=bigdata
    year=2016/class=linux

To import all the 2016 data into the corresponding partitions of t_d_partition (the static column year comes first, followed by the dynamic column class):

    insert into table t_d_partition partition(year=2016, class) select id, name, class from t_partition_1 where year=2016;

To import all the data in t_partition_1 into the corresponding partitions of t_d_partition:

    insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1;

(The message I got when running this:

    FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
)

Other issues:

Deleting data on HDFS does not delete the table structure. What show partitions t_d_partition; returns is read from the metastore, so if you delete the data on HDFS by hand, its metadata is still there.

    insert into t10_p_1 partition(year=2016, class) select * from t_partition_1;
    FAILED: SemanticException [Error 10094]: Line 1:30 Dynamic partition cannot be the parent of a static partition 'professional'

A dynamic partition cannot be the parent directory of a static partition.

hive.exec.dynamic.partition.mode needs to be set to nonstrict. A related setting:

    hive.exec.max.dynamic.partitions    1000    Maximum number of dynamic partitions allowed to be created in total.
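The two error messages quoted in this part (Error 10096 and Error 10094) correspond to two simple rules on the partition clause, which can be sketched as a small check in Python. This is a simplified model rather than Hive's actual validation code: check_partition_spec is a hypothetical name, the spec is assumed to be listed in the table's partition column order, and a value of None marks a dynamic partition column.

```python
def check_partition_spec(spec, strict=True):
    """Check an ordered list of (column, value) pairs; None means dynamic.

    Returns None when the spec is acceptable, otherwise a short reason.
    """
    seen_dynamic = False
    for column, value in spec:
        if value is None:
            seen_dynamic = True
        elif seen_dynamic:
            # A static column after a dynamic one would sit below it on disk.
            return "dynamic partition cannot be the parent of a static partition"
    if strict and spec and all(value is None for _, value in spec):
        return "strict mode requires at least one static partition column"
    return None


print(check_partition_spec([("year", "2016"), ("class", None)]))
print(check_partition_spec([("year", None), ("class", "linux")]))
print(check_partition_spec([("year", None), ("class", None)]))
print(check_partition_spec([("year", None), ("class", None)], strict=False))
```

The last call models what setting hive.exec.dynamic.partition.mode to nonstrict allows: a clause in which every partition column is dynamic.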

Using import to load data from HDFS:

    import table stu from '/data/stu';

In my tests so far this produces the following errors:

    hive (mydb1)> import table test from '/input/hive/';
    FAILED: SemanticException [Error 10027]: Invalid path
    hive (mydb1)> import table test from 'hdfs://input/hive/';
    FAILED: SemanticException [Error 10324]: Import Semantic Analyzer Error
    hive (mydb1)> import table test from 'hdfs://ns1/input/hive/';
    FAILED: SemanticException [Error 10027]: Invalid path

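Exporting

1. hadoop fs -cp src_uri dest_uri
   (or: hdfs dfs -cp src_uri dest_uri)

2. hive> export table tblName to 'hdfs_uri';
   The target HDFS directory must be empty; if it does not exist, it is created automatically. This method also exports the metadata.

3. insert overwrite [local] directory 'linux_fs_path' select ...from... where ...;
   Without local, the data is exported to HDFS; with local, it is exported to the Linux file system. In both cases, the directory is created automatically if it does not exist and overwritten if it does.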
