Hive SQL 是基于 Hadoop 的数据仓库基础设施,用于提供数据查询和分析。下面介绍常用的 Hive SQL 语句,进行数据处理和分析。
1. 数据库操作
创建数据库 :CREATE DATABASE IF NOT EXISTS db_name;
使用数据库:USE db_name;
删除数据库:DROP DATABASE IF EXISTS db_name CASCADE;
2.表操作
创建表:
CREATE TABLE IF NOT EXISTS table_name (
col1 datatype,
col2 datatype,
col3 datatype
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
查看表结构:
DESCRIBE table_name;
删除表:
DROP TABLE IF EXISTS table_name;
3.数据导入
从本地文件系统导入数据:
LOAD DATA LOCAL INPATH '/path/to/file' INTO TABLE table_name;
从 HDFS 导入数据:
LOAD DATA INPATH '/path/in/hdfs' INTO TABLE table_name;
4.数据查询
基本查询:SELECT col1, col2 FROM table_name WHERE condition;
分组查询:SELECT col1, COUNT(*) FROM table_name GROUP BY col1;
排序查询:SELECT * FROM table_name ORDER BY col1 DESC;
5.连接操作
内连接:SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;
左连接:SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id;
6.窗口函数
计算累计和:
SELECT col1, SUM(col2) OVER (ORDER BY col1) AS running_total FROM table_name;
分组排名:
SELECT col1, col2, RANK() OVER (PARTITION BY col1 ORDER BY col2 DESC) AS rank FROM table_name;
7.分区表操作
创建分区表:
CREATE TABLE IF NOT EXISTS table_name ( col1 datatype, col2 datatype ) PARTITIONED BY (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
添加分区:
ALTER TABLE table_name ADD PARTITION (dt='2023-01-01');
8.插入数据
INSERT INTO TABLE table_name VALUES (val1, val2, val3);
9、聚合函数
计算平均值:
SELECT AVG(col1) FROM table_name;
计算最大值:
SELECT MAX(col1) FROM table_name;
10.其他常用操作
创建视图:
CREATE VIEW view_name AS SELECT col1, col2 FROM table_name;
设置Hive参数:
SET hive.execution.engine=tez;