
Hive 1.2.0 Study Notes (Part 2): Hive Data Types

Published 2025-02-24 by the 千家信息网 editors

Lu Chunli's work notes. Who says programmers can't have artistic flair?



Create a mywork database in Hive; all later tests run in this database so we don't have to work in the default database every time.

hive> create database mywork;
OK
Time taken: 0.487 seconds
hive> show databases;
OK
default
mywork
Time taken: 0.614 seconds, Fetched: 2 row(s)
hive> use mywork;
OK
Time taken: 0.064 seconds
hive> create table student(id int, name string);
OK
Time taken: 0.519 seconds
hive>

Check how Hive stores this on HDFS:

[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:37 /user/hive/warehouse
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:36 /user/hive/warehouse/mywork.db
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:36 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$


Hive supports the following data types.

Primitive types:

TINYINT      1 byte
SMALLINT     2 bytes
INT          4 bytes
BIGINT       8 bytes
BOOLEAN      true/false
FLOAT        4 bytes
DOUBLE       8 bytes
STRING       character string
BINARY       (available since Hive 0.8.0)
TIMESTAMP    (available since Hive 0.8.0)
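These widths match the usual fixed-width machine types. As a quick cross-check (purely illustrative; Hive itself runs on the JVM), Python's struct module reports the same standard sizes:

```python
import struct

# Standard ('='-prefixed) sizes of fixed-width types, which line up
# with the byte widths Hive documents for its numeric primitives.
widths = {
    "TINYINT":  struct.calcsize("=b"),
    "SMALLINT": struct.calcsize("=h"),
    "INT":      struct.calcsize("=i"),
    "BIGINT":   struct.calcsize("=q"),
    "FLOAT":    struct.calcsize("=f"),
    "DOUBLE":   struct.calcsize("=d"),
}
for name, nbytes in widths.items():
    print(f"{name:8s} {nbytes} byte(s)")
```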

Complex types:

arrays:  ARRAY<data_type>                      ordered fields, all of the same type
maps:    MAP<primitive_type, data_type>        unordered key/value pairs
structs: STRUCT<col_name : data_type, ...>     a group of named fields
union:   UNIONTYPE<data_type, data_type, ...>

Note: an ARRAY value is accessed by subscript, e.g. arrays[0]; a MAP value is accessed by key, e.g. map["key_name"]; a STRUCT field is accessed with dot notation, e.g. structs.col_name.
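These three access patterns correspond to list indexing, dictionary lookup, and attribute access. A Python analogy using the employee columns defined below (the names mirror the table and are illustrative only, not Hive API):

```python
from collections import namedtuple

# Illustrative stand-ins for Hive's complex types.
Address = namedtuple("Address", ["province", "city", "street", "zip"])

subordinates = ["lisi", "wangwu"]                         # ARRAY: index access
deductions = {"ptax": 200.0, "pension": 200.0}            # MAP: key access
address = Address("shandong", "heze", "dingtao", 274106)  # STRUCT: dot access

# Mirrors: select subordinates[0], deductions['ptax'], address.province
print(subordinates[0], deductions["ptax"], address.province)
```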


Example: creating a table.

hive> create table employee (
    > eno int comment 'the no of employee',
    > ename string comment 'name of employee',
    > salary float comment 'salary of employee',
    > subordinates array<string> comment 'employees managed by current employee',
    > deductions map<string, float> comment 'deductions',
    > address struct<province:string, city:string, street:string, zip:int> comment 'address'
    > ) comment 'This is table of employee info';
OK
Time taken: 0.33 seconds
hive>

In Hive, one delimiter separates the columns and different delimiters separate the components inside complex types; each line of the file corresponds to one record.


Create a file named data_default.txt in the ${HIVE_HOME}/data directory, using the default delimiters.

Hive's default field delimiter is the ASCII control character \001; to state it explicitly in DDL, use fields terminated by '\001'. To produce test data, open the file in vi and press Ctrl+V followed by Ctrl+A to enter \001 (displayed as ^A). Likewise, \002 is entered with Ctrl+V then Ctrl+B, and so on.

The fields of the record are:

1000                                  employee number
zhangsan                              employee name
5000.0                                salary
lisi^Bwangwu                          subordinates
ptax^C200^Bpension^C200               payroll deductions (e.g. taxes)
shandong^Bheze^Bdingtao^B274106       home address (a struct only needs the values)

These six fields go on a single line, separated by ^A.
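Typing control characters by hand is error-prone; the same record can be generated with a short script (a sketch; the output path is illustrative):

```python
# Build the employee record using Hive's default delimiters:
#   \001 (^A) separates fields,
#   \002 (^B) separates collection items,
#   \003 (^C) separates map keys from values.
FIELD, ITEM, KV = "\001", "\002", "\003"

fields = [
    "1000",                                  # eno
    "zhangsan",                              # ename
    "5000.0",                                # salary
    ITEM.join(["lisi", "wangwu"]),           # subordinates: array<string>
    ITEM.join(KV.join(kv) for kv in
              [("ptax", "200"), ("pension", "200")]),      # deductions: map
    ITEM.join(["shandong", "heze", "dingtao", "274106"]),  # address: struct values
]
line = FIELD.join(fields)

with open("data_default.txt", "w") as f:  # illustrative output path
    f.write(line + "\n")

print(len(line))  # 82 bytes; with the trailing newline, 83 -- the totalSize Hive reports
```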


Loading the data

hive> load data local inpath 'data/data_default.txt' into table employee;
Loading data to table mywork.employee
Table mywork.employee stats: [numFiles=1, numRows=0, totalSize=83, rawDataSize=0]
OK
Time taken: 0.426 seconds
hive> select * from employee;
OK
1000    zhangsan        5000.0  ["lisi","wangwu"]       {"ptax":200.0,"pension":200.0}  {"province":"shandong","city":"heze","street":"dingtao","zip":274106}
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive> -- complex-type columns are queried as follows
hive> select eno, ename, salary, subordinates[0], deductions['ptax'], address.province from employee;
OK
1000    zhangsan        5000.0  lisi    200.0   shandong
Time taken: 0.129 seconds, Fetched: 1 row(s)
hive>


Inspect the data layout on HDFS:

[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive/warehouse/
drwxrw-rw-   - hadoop hadoop          0 2015-12-09 00:00 /user/hive/warehouse/mywork.db
drwxrw-rw-   - hadoop hadoop          0 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee
-rwxrw-rw-   2 hadoop hadoop         83 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee/data_default.txt
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 23:03 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$ hdfs dfs -text /user/hive/warehouse/mywork.db/employee/data_default.txt
1000zhangsan5000.0lisiwangwuptax200pension200shandonghezedingtao274106
[hadoop@dnode1 ~]$
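The ^A/^B/^C bytes are invisible in the -text output, which is why the fields appear run together. Splitting the raw line on those bytes recovers the nested values, mirroring how Hive's delimited-text SerDe interprets the row (a sketch):

```python
# The raw warehouse line, with the default delimiters written explicitly.
raw = ("1000\001zhangsan\0015000.0\001"
       "lisi\002wangwu\001"
       "ptax\003200\002pension\003200\001"
       "shandong\002heze\002dingtao\002274106")

eno, ename, salary, subs, deds, addr = raw.split("\001")   # six columns
subordinates = subs.split("\002")                          # array items
deductions = {k: float(v)                                  # map entries
              for k, v in (kv.split("\003") for kv in deds.split("\002"))}
province, city, street, zip_code = addr.split("\002")      # struct fields, in order

print(subordinates, deductions, province, zip_code)
```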



Custom delimiters:

hive> create table employee_02 (
    > eno int comment 'the no of employee',
    > ename string comment 'name of employee',
    > salary float comment 'salary of employee',
    > subordinates array<string> comment 'employees managed by current employee',
    > deductions map<string, float> comment 'deductions',
    > address struct<province:string, city:string, street:string, zip:int> comment 'address'
    > ) comment 'This is table of employee info'
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n';
OK
Time taken: 0.228 seconds
hive> load data local inpath 'data/data_employee02.txt' into table employee_02;
Loading data to table mywork.employee_02
Table mywork.employee_02 stats: [numFiles=1, totalSize=99]
OK
Time taken: 0.371 seconds
hive> select * from employee_02;
OK
1000    'zhangsan'      5000.0  ["'lisi'","'wangwu'"]   {"'ptax'":200.0,"'pension'":200.0}      {"province":"'shandong'","city":"'heze'","street":"'dingtao'","zip":274106}
Time taken: 0.101 seconds, Fetched: 1 row(s)
hive>


The data/data_employee02.txt file contains:

[hadoop@nnode data]$ pwd
/usr/local/hive1.2.0/data
[hadoop@nnode data]$ cat data_employee02.txt 
1000    'zhangsan'      5000.0  'lisi','wangwu' 'ptax':200,'pension':200        'shandong','heze','dingtao',274106
[hadoop@nnode data]$

Note: because the text file contains single quotes, after the load they show up inside the double-quoted values in the query output. The single quotes are treated as part of the key or value, so watch out for this.
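The behavior is easy to reproduce: the SerDe splits only on the configured delimiters and keeps every other byte verbatim, so quotes stay attached to the values. A Python sketch mimicking the split:

```python
# A tab-delimited row from data_employee02.txt. Splitting on '\t', ','
# and ':' alone leaves the single quotes inside the keys and values,
# matching the select output shown above.
row = ("1000\t'zhangsan'\t5000.0\t"
       "'lisi','wangwu'\t"
       "'ptax':200,'pension':200\t"
       "'shandong','heze','dingtao',274106")

eno, ename, salary, subs, deds, addr = row.split("\t")
subordinates = subs.split(",")
deductions = {k: float(v) for k, v in (kv.split(":") for kv in deds.split(","))}

print(ename)              # the quotes are part of the value
print(subordinates)
print(sorted(deductions))
```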


Viewing the detailed table definition

# employee was created with the default settings
hive> describe formatted employee;
OK
# col_name              data_type               comment

eno                     int                     the no of employee
ename                   string                  name of employee
salary                  float                   salary of employee
subordinates            array<string>           employees managed by current employee
deductions              map<string,float>       deductions
address                 struct<province:string,city:string,street:string,zip:int>  address

# Detailed Table Information
Database:               mywork
Owner:                  hadoop
CreateTime:             Tue Dec 08 23:10:07 CST 2015
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://cluster/user/hive/warehouse/mywork.db/employee
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        comment                 This is table of employee info
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               83
        transient_lastDdlTime   1449590423

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.098 seconds, Fetched: 37 row(s)

# employee_02 was created with custom delimiters
hive> describe formatted employee_02;
OK
# col_name              data_type               comment

eno                     int                     the no of employee
ename                   string                  name of employee
salary                  float                   salary of employee
subordinates            array<string>           employees managed by current employee
deductions              map<string,float>       deductions
address                 struct<province:string,city:string,street:string,zip:int>  address

# Detailed Table Information
Database:               mywork
Owner:                  hadoop
CreateTime:             Wed Dec 09 00:12:53 CST 2015
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://cluster/user/hive/warehouse/mywork.db/employee_02
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        comment                 This is table of employee info
        numFiles                1
        totalSize               99
        transient_lastDdlTime   1449591260

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        colelction.delim        ,
        field.delim             \t
        line.delim              \n
        mapkey.delim            :
        serialization.format    \t
Time taken: 0.116 seconds, Fetched: 39 row(s)
hive>

(The misspelled colelction.delim above is not a transcription error; that is the property name Hive actually stores.)


Open issue:

hive> delete from student;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive>

Row-level UPDATE and DELETE require Hive's ACID support: a transactional table stored as ORC with bucketing, and the DbTxnManager transaction manager enabled in the configuration. The default setup used here meets none of those requirements, hence the error.


Caveat: if a SQL statement contains literal tab characters, problems will occur.

