🍋🍋大数据学习🍋🍋
🔥系列专栏: 👑哲学语录: 用力所能及,改变世界。
💖如果觉得博主的文章还不错的话,请点赞👍+收藏⭐️+留言📝支持一下博主哦🤞
一、基础窗口函数与分组聚合
1. 连续登录问题
题目:
找出连续登录至少 5 天的用户及其最长连续登录天数。
表结构:user_logs(user_id, login_date)
。参考答案:
WITH ranked_logs AS ( SELECT user_id, login_date, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn FROM user_logs ), consecutive_groups AS ( SELECT user_id, DATE_SUB(login_date, INTERVAL rn DAY) AS grp, COUNT(*) AS consecutive_days FROM ranked_logs GROUP BY user_id, grp ) SELECT user_id, MAX(consecutive_days) AS max_consecutive_days FROM consecutive_groups WHERE consecutive_days >= 5 GROUP BY user_id;
优化点:
- 对
user_id
和login_date
建立索引。 - 用
DISTINCT
去重(若存在同一天多次登录):SELECT DISTINCT user_id, login_date FROM user_logs
。
- 对
2. 移动平均计算
题目:
计算用户近 7 天的滚动平均消费金额(含当天)。
表结构:orders(user_id, order_date, amount)
。参考答案:
SELECT user_id, order_date, AVG(amount) OVER ( PARTITION BY user_id ORDER BY UNIX_TIMESTAMP(order_date) RANGE BETWEEN 6 * 24 * 3600 PRECEDING AND CURRENT ROW ) AS rolling_7day_avg FROM orders;
优化点:
- Hive 中
RANGE
对日期的支持有限,需用UNIX_TIMESTAMP
转换为秒。 - 预先过滤近 30 天数据以减少计算量:
WHERE order_date >= DATE_SUB(CURRENT_DATE, 30)
。
- Hive 中
二、日期函数与时间序列分析
3. 每月首次 / 最后一次登录
题目:
计算用户每月首次和最后一次登录日期。
表结构:user_logs(user_id, login_date)
。参考答案:
SELECT user_id, month, MIN(login_date) AS first_login, MAX(login_date) AS last_login FROM ( SELECT user_id, login_date, DATE_FORMAT(login_date, 'yyyy-MM') AS month FROM user_logs ) t GROUP BY user_id, month;
优化点:
- 使用
DISTRIBUTE BY user_id SORT BY month
优化数据分布。 - 若数据已按日期分区,直接按分区过滤(如
WHERE dt >= '2023-01-01'
)。
- 使用
4. 购买季节性分析
题目:
计算每个商品每周销量占全年总销量的比例。
表结构:sales(product_id, sale_date, quantity)
。参考答案:
WITH yearly_sales AS ( SELECT product_id, SUM(quantity) AS yearly_quantity FROM sales GROUP BY product_id ), weekly_sales AS ( SELECT product_id, YEAR(sale_date) AS sale_year, WEEKOFYEAR(sale_date) AS sale_week, SUM(quantity) AS weekly_quantity FROM sales GROUP BY product_id, YEAR(sale_date), WEEKOFYEAR(sale_date) ) SELECT ws.product_id, ws.sale_year, ws.sale_week, ws.weekly_quantity / ys.yearly_quantity AS weekly_ratio FROM weekly_sales ws JOIN yearly_sales ys ON ws.product_id = ys.product_id;
优化点:
- 使用
CUBE
或ROLLUP
同时计算周、月、年聚合:SELECT product_id, YEAR(sale_date) AS sale_year, WEEKOFYEAR(sale_date) AS sale_week, SUM(quantity) AS quantity, GROUPING__ID -- 0=完整分组,1=按年聚合,3=总聚合 FROM sales GROUP BY product_id, YEAR(sale_date), WEEKOFYEAR(sale_date) WITH ROLLUP;
- 使用
三、多表关联与复杂查询
5. 用户留存分析
题目:
计算新用户次日、7 日留存率(首日登录后,次日 / 7 日仍登录的用户比例)。
表结构:user_logs(user_id, login_date)
user_info(user_id, register_date)
参考答案:
WITH first_login AS ( SELECT user_id, MIN(login_date) AS first_login_date FROM user_logs GROUP BY user_id ), retention AS ( SELECT fl.user_id, fl.first_login_date, MAX(CASE WHEN DATEDIFF(ul.login_date, fl.first_login_date) = 1 THEN 1 ELSE 0 END) AS is_retained_day2, MAX(CASE WHEN DATEDIFF(ul.login_date, fl.first_login_date) = 7 THEN 1 ELSE 0 END) AS is_retained_day7 FROM first_login fl LEFT JOIN user_logs ul ON fl.user_id = ul.user_id GROUP BY fl.user_id, fl.first_login_date ) SELECT COUNT(*) AS total_users, SUM(is_retained_day2) / COUNT(*) AS retention_rate_day2, SUM(is_retained_day7) / COUNT(*) AS retention_rate_day7 FROM retention;
优化点:
- 对
user_logs
按user_id
分桶,user_info
按user_id
排序,加速JOIN
。 - 使用
MAPJOIN
优化小表关联:SELECT /*+ MAPJOIN(user_info) */ ... FROM user_logs JOIN user_info ON ...
- 对
6. 商品关联分析
题目:
找出经常与商品 A 一起购买的前 10 个商品(同订单中同时出现的频率最高)。
表结构:order_items(order_id, product_id)
。参考答案:
WITH a_orders AS ( SELECT DISTINCT order_id FROM order_items WHERE product_id = 'A' ), co_products AS ( SELECT oi.product_id AS co_product_id, COUNT(*) AS co_occurrence FROM a_orders ao JOIN order_items oi ON ao.order_id = oi.order_id WHERE oi.product_id != 'A' -- 排除商品 A 自身 GROUP BY oi.product_id ) SELECT co_product_id, co_occurrence FROM co_products ORDER BY co_occurrence DESC LIMIT 10;
优化点:
- 使用
BROADCAST JOIN
分发小表(a_orders
):SELECT /*+ BROADCAST(a_orders) */ ... FROM a_orders JOIN order_items ...
- 预先过滤低频商品:
WHERE product_id IN (SELECT product_id FROM order_items GROUP BY product_id HAVING COUNT(*) > 100)
。
- 使用
四、综合业务场景
7. 订单完成率分析
题目:
计算每月订单从创建到支付的平均耗时,并按年同比增长排序。
表结构:orders(order_id, create_time, pay_time, status)
order_status(status_id, status_name)
参考答案:
WITH monthly_avg AS ( SELECT YEAR(o.create_time) AS sale_year, MONTH(o.create_time) AS sale_month, AVG(UNIX_TIMESTAMP(o.pay_time) - UNIX_TIMESTAMP(o.create_time)) / 3600 AS avg_hours FROM orders o JOIN order_status os ON o.status = os.status_id WHERE os.status_name = '已支付' GROUP BY YEAR(o.create_time), MONTH(o.create_time) ), yoy_growth AS ( SELECT curr.sale_year, curr.sale_month, curr.avg_hours, prev.avg_hours AS prev_avg_hours, (curr.avg_hours - prev.avg_hours) / prev.avg_hours AS growth_rate FROM monthly_avg curr LEFT JOIN monthly_avg prev ON curr.sale_month = prev.sale_month AND curr.sale_year = prev.sale_year + 1 ) SELECT sale_year, sale_month, avg_hours, prev_avg_hours, growth_rate FROM yoy_growth ORDER BY growth_rate DESC;
优化点:
- 对
orders
表按status
分区,减少数据扫描:WHERE status = 'paid' AND dt >= '2023-01-01'
。 - 使用窗口函数替代
JOIN
计算同比:SELECT sale_year, sale_month, avg_hours, LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month) AS prev_avg_hours, (avg_hours - LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month)) / LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month) AS growth_rate FROM monthly_avg;
- 对
8. 高价值用户识别
题目:
找出RFM 模型中得分最高的前 10% 用户(最近购买时间 R、购买频率 F、消费金额 M)。
表结构:orders(user_id, order_date, amount)
。参考答案:
WITH user_rfm AS ( SELECT user_id, DATEDIFF(CURRENT_DATE, MAX(order_date)) AS recency, COUNT(*) AS frequency, SUM(amount) AS monetary FROM orders GROUP BY user_id ), rfm_scores AS ( SELECT user_id, recency, frequency, monetary, NTILE(4) OVER (ORDER BY recency) AS r_score, -- 最近购买:值越小排名越高 NTILE(4) OVER (ORDER BY frequency DESC) AS f_score, -- 购买频率:值越大排名越高 NTILE(4) OVER (ORDER BY monetary DESC) AS m_score -- 消费金额:值越大排名越高 FROM user_rfm ), rfm_combined AS ( SELECT user_id, r_score, f_score, m_score, r_score * 100 + f_score * 10 + m_score AS total_score FROM rfm_scores ) SELECT user_id, total_score FROM rfm_combined WHERE total_score >= (SELECT PERCENTILE(total_score, 0.9) FROM rfm_combined) ORDER BY total_score DESC;
优化点:
- 使用
BUCKETED BY user_id
分桶存储orders
表,加速聚合。 - 预先过滤无效订单:
WHERE order_date >= DATE_SUB(CURRENT_DATE, 365)
。
- 使用
五、性能优化专项练习
9. 大表 JOIN 优化
题目:
优化以下 SQL 的执行效率:sql
SELECT u.user_id, u.username, o.total_amount FROM users u JOIN ( SELECT user_id, SUM(amount) AS total_amount FROM orders GROUP BY user_id ) o ON u.user_id = o.user_id;
优化方案:
SET hive.auto.convert.join=true; -- 启用自动小表广播 SET hive.optimize.skewjoin=true; -- 优化数据倾斜 SELECT /*+ MAPJOIN(o) */ -- 手动指定 MAPJOIN u.user_id, u.username, o.total_amount FROM users u JOIN ( SELECT /*+ STREAMTABLE(u) */ -- 指定流式处理表 user_id, SUM(amount) AS total_amount FROM orders GROUP BY user_id ) o ON u.user_id = o.user_id;
优化点:
- 使用
MAPJOIN
将小表(如users
)广播到每个节点。 - 添加
DISTRIBUTE BY user_id
确保数据按user_id
均匀分布。 - 对
orders
表按user_id
分桶,与users
表分桶数一致。
- 使用
10. 子查询转窗口函数
题目:
将以下子查询改写为窗口函数,提升性能:SELECT user_id, order_date, amount, (SELECT SUM(amount) FROM orders WHERE user_id = o.user_id) AS total_amount FROM orders o;
优化方案:
SELECT user_id, order_date, amount, SUM(amount) OVER (PARTITION BY user_id) AS total_amount FROM orders;
优化点:
- 窗口函数避免了重复扫描表,只需一次全量扫描。
- 添加
CLUSTER BY user_id
优化数据分布,减少 shuffle。