A Detailed Guide to Hive Partitioned Tables and Bucketed Tables

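I. Partitioned Tables

1. Introduction

Each partition of a partitioned table corresponds to a separate directory on HDFS, and that directory holds all of the data files for the partition.
Partitioning in Hive simply means splitting data into directories: a large dataset is divided into smaller ones according to business needs.
When the WHERE clause of a query selects specific partitions, Hive reads only those partitions instead of scanning the whole table, which improves query efficiency.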

2. Basic Operations

2.1 Preparing the Data

Raw data files

-- dept_1.log
10  ACCOUNTING  1700
20  RESEARCH  1800

-- dept_2.log
30  SALES  1900
40  OPERATIONS  1700

-- dept_3.log
50  TEST  2000
60  DEV  1900

Create the partitioned table

-- Use partitioned by to define the partition column; it must not be a column that already exists in the table
create table if not exists dept_par
(
    deptNo int,
    deptName string,
    loc int
)
partitioned by (idate string)
row format delimited fields terminated by '\t';

Load data into the partitioned table

-- Use partition to specify the target partition when loading data
load data local inpath '/opt/module/hive/datas/dept_1.log' into table dept_par 
partition(idate='2022-04-01');

load data local inpath '/opt/module/hive/datas/dept_2.log' into table dept_par 
partition(idate='2022-04-02');

load data local inpath '/opt/module/hive/datas/dept_3.log' into table dept_par 
partition(idate='2022-04-03');

2.2 Querying a Partition

select * from dept_par where idate='2022-04-01';
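
One way to confirm that the filter really limits the scan to a single partition is to ask Hive which partitions the query depends on. The sketch below assumes the dept_par table created above and a warehouse location of /user/hive/warehouse/mydb.db (an assumption borrowed from the paths used in section 3.3; yours may differ).

```sql
-- List the partition directories on HDFS (the path is an assumed warehouse location).
dfs -ls /user/hive/warehouse/mydb.db/dept_par;

-- Show the tables and partitions the query would read, without running it.
explain dependency
select * from dept_par where idate='2022-04-01';
```

If partition pruning applies, the input_partitions entry of the result should mention only the idate=2022-04-01 partition.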

2.3 Adding Partitions

-- Add a single partition
alter table dept_par add partition(idate='2022-04-04');
-- Add multiple partitions (separated by spaces)
alter table dept_par add partition(idate='2022-04-05') partition(idate='2022-04-06');

2.4 Dropping Partitions

-- Drop a single partition
alter table dept_par drop partition(idate='2022-04-06');
-- Drop multiple partitions (separated by commas)
alter table dept_par drop partition(idate='2022-04-04'), partition(idate='2022-04-05');
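
Adding a partition that is already registered raises an error, which matters when a script is re-run. A minimal sketch of the guarded variants, using the same dept_par table (whether dropping a missing partition fails without the guard depends on the Hive configuration, so the if exists form simply makes the intent explicit):

```sql
-- Add the partition only if it is not registered yet.
alter table dept_par add if not exists partition(idate='2022-04-04');

-- Drop the partition only if it exists.
alter table dept_par drop if exists partition(idate='2022-04-04');
```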

2.5 Viewing Partitions

-- List the partitions of the partitioned table
show partitions dept_par;

-- View the structure of the partitioned table
desc formatted dept_par;

3. Two-Level Partitions

3.1 Preparing the Data

Create the two-level partitioned table

create table if not exists dept_par2
(
    deptNo int,
    deptName string,
    loc int
)
partitioned by (iDate string, hour string)
row format delimited fields terminated by '\t';

Load data into the two-level partitioned table

load data local inpath '/opt/module/hive/datas/dept_1.log' into table dept_par2
partition(iDate='20220401',hour='10');

load data local inpath '/opt/module/hive/datas/dept_2.log' into table dept_par2
partition(iDate='20220401',hour='11');

load data local inpath '/opt/module/hive/datas/dept_3.log' into table dept_par2
partition(iDate='20220402',hour='10');

3.2 Querying

select * from dept_par2 where iDate='20220401' and hour='10';
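
Filtering on only the first-level column is also possible; Hive then reads every second-level partition underneath it. A short sketch against the dept_par2 table loaded above:

```sql
-- Reads both hour partitions of 20220401 (hour=10 and hour=11).
select * from dept_par2 where iDate='20220401';

-- Lists only the partitions under the given first-level value.
show partitions dept_par2 partition(iDate='20220401');
```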

3.3 Keeping Partition Metadata in Sync with the Data

Problem:

-- Regular table
-- 1. Create the directory and upload the data with hadoop commands
hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept
hadoop fs -put dept_1.log /user/hive/warehouse/mydb.db/dept
-- 2. Create the table
create table if not exists dept(deptNo int,deptName string,loc string)
row format delimited fields terminated by '\t';
-- 3. Query
select * from dept;  -- returns data

-- Partitioned table
-- 1. Create the directory and upload the data with hadoop commands
hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_par3/iDate=20220403
hadoop fs -put dept_1.log /user/hive/warehouse/mydb.db/dept_par3/iDate=20220403
-- 2. Create the table
create table if not exists dept_par3(deptNo int,deptName string,loc string)
partitioned by (iDate string)
row format delimited fields terminated by '\t';
-- 3. Query
select * from dept_par3;  -- returns no data

-- Conclusion: a partitioned table also keeps partition information in the metastore. Creating a partition directory directly on HDFS does not register the partition there, so queries cannot match the directory and find no data

Solutions (any one of the following):

Repair the partitions
```
msck repair table dept_par3;
```
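
To check the effect of the repair, the partition list can be compared before and after. This is a sketch that assumes the dept_par3 table and the manually uploaded directory from the problem description above:

```sql
-- Before the repair: the manually created partition is not registered.
show partitions dept_par3;

-- Scan the table directory and register partitions missing from the metastore.
msck repair table dept_par3;

-- After the repair: the 20220403 partition should be listed and queryable.
show partitions dept_par3;
select * from dept_par3 where iDate='20220403';
```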

Run an add-partition command

alter table dept_par3 add partition(iDate='20220403');

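Load the data into the partition with load data instead of uploading it by hand, since load data registers the partition in the metastore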

load data local inpath '/opt/module/hive/datas/dept_1.log' into table dept_par3
partition(iDate='20220403');

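4. Dynamic Partitioning

Dynamic partitioning: when data is inserted into a partitioned table, Hive automatically routes each row to the matching partition based on the value of the partition column

4.1 Parameter Configuration

Enable dynamic partitioning (default: true, i.e. enabled)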
```
set hive.exec.dynamic.partition=true;
```
Switch to non-strict mode (the default is strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)
```
set hive.exec.dynamic.partition.mode=nonstrict;
```
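
Even in strict mode, an insert can use dynamic partitioning as long as at least one partition column is given a static value. A minimal sketch, assuming the dept and dept_par2 tables from earlier sections; the iDate value and the constant hour are made up for illustration:

```sql
-- iDate is static; hour is dynamic and is taken from the last select column.
-- The cast is there because dept.loc is a string while dept_par2.loc is an int.
insert into table dept_par2 partition(iDate='20220404', hour)
select deptNo, deptName, cast(loc as int) as loc, '10' as hour
from dept;
```

Dynamic partition columns are matched by position, so they must come last in the select list, in the same order as in the partition clause.
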
Maximum total number of dynamic partitions that can be created across all nodes running the MR job. Default: 1000
```
set hive.exec.max.dynamic.partitions=1000;
```
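Maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter must be set to more than 365; with the default of 100 the job fails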
```
set hive.exec.max.dynamic.partitions.pernode=100;
```
Maximum number of HDFS files that can be created in the whole MR job. Default: 100000
```
set hive.exec.max.created.files=100000;
```
Whether to throw an exception when an empty partition is generated. This usually does not need to be set. Default: false
```
set hive.error.on.empty.partition=false;
```

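4.2 Example

Requirement: insert the data from the dept table into the matching partitions of the target table dept_par_dy, partitioned by region (the loc column)

Implementation:

Create the target partitioned table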

create table dept_par_dy(deptNo int, deptName string) 
partitioned by (loc int) 
row format delimited fields terminated by '\t';

Import the data with dynamic partitioning

insert into table dept_par_dy partition(loc)
select deptNo, deptName, loc from dept;

List the resulting partitions
```
show partitions dept_par_dy;
```

4.3 New in Hive 3.x

The partition clause can be omitted when importing data

-- The two statements below are equivalent
insert into table dept_par_dy partition(loc)
select deptNo, deptName, loc from dept;

insert into table dept_par_dy
select deptNo, deptName, loc from dept;

Dynamic partitioning can also be performed in strict mode

set hive.exec.dynamic.partition.mode=strict;

insert into table dept_par_dy
select deptNo, deptName, loc from dept;

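II. Bucketed Tables

1. Introduction

For a table or a partition, Hive can further organize the data into buckets, which divides the data at a finer granularity.
Bucketing is another technique for breaking a dataset into more manageable parts.
Partitioning operates on the storage path of the data, whereas bucketing operates on the data files themselves.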

2. Basic Operations

2.1 Preparing the Data

Raw data

-- stu.txt
1001  ss1
1002  ss2
1003  ss3
1004  ss4
1005  ss5
1006  ss6
1007  ss7
1008  ss8
1009  ss9
1010  ss10
1011  ss11
1012  ss12
1013  ss13
1014  ss14
1015  ss15
1016  ss16

Create the bucketed table

create table if not exists stu_buck
(
    id int,
    name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

Load data into the bucketed table

load data local inpath '/opt/module/hive/datas/stu.txt' into table stu_buck;

2.2 Operations

View the bucketing information
```
desc formatted stu_buck;
```
Query the data
```
select * from stu_buck;
```

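2.3 Bucketing Rule

Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets; that remainder determines which bucket the record is stored in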
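
As a worked illustration of the rule, the bucket number can be approximated with Hive's built-in hash and pmod functions. This is only a sketch: it assumes the stu_buck table above with 4 buckets, and it reflects the classic hashing scheme, in which an int hashes to its own value. Newer table versions may use a different hash function internally, so the actual file assignment can differ.

```sql
-- pmod(hash(id), 4) computes the hash modulo the number of buckets.
-- Assuming an int hashes to itself: 1001 -> 1, 1002 -> 2, 1003 -> 3, 1004 -> 0.
select id, name, pmod(hash(id), 4) as bucket_no
from stu_buck
order by id;
```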

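2.4 Notes

Set the number of reducers to -1 so that the job decides how many reducers to use, or set it to a value greater than or equal to the number of buckets in the table
Load data into the bucketed table from HDFS to avoid problems with the local file not being found
Do not use local mode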
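
The notes above might translate into the following session settings. This is a sketch under assumptions: the HDFS path /datas/stu.txt is hypothetical, and the file is assumed to have been uploaded there beforehand (for example with hadoop fs -put).

```sql
-- Let the job choose the number of reducers itself.
set mapreduce.job.reduces=-1;

-- Keep Hive from automatically switching to local mode.
set hive.exec.mode.local.auto=false;

-- Load from HDFS (no `local` keyword); the path is a hypothetical example.
load data inpath '/datas/stu.txt' into table stu_buck;
```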
