Big Data Learning (125): Hive Data Analysis

Published: 2025-05-30



1. Consecutive-login variant
  • Problem
    Find users who logged in for exactly 3 consecutive days (longer runs do not count).
    Table: user_logs(user_id, login_date)

  • Reference answer

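    WITH daily_logs AS (
        -- One row per user per day, so repeated logins on a day do not break the numbering
        SELECT DISTINCT user_id, login_date
        FROM user_logs
    ),
    ranked_logs AS (
        SELECT
            user_id,
            login_date,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
        FROM daily_logs
    ),
    consecutive_groups AS (
        SELECT
            user_id,
            DATE_SUB(login_date, rn) AS grp,  -- constant within one run of consecutive days
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date,
            COUNT(*) AS days
        FROM ranked_logs
        -- Hive does not accept the SELECT alias here, so the expression is repeated
        GROUP BY user_id, DATE_SUB(login_date, rn)
    )
    SELECT user_id, start_date, end_date
    FROM consecutive_groups
    WHERE days = 3;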
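
To check the query against a small hand-made sample, the same "date minus row number" grouping can be mirrored in plain Python. This is only a sketch: the function name `runs_of_exact_length` and the sample rows are made up for illustration, and the sketch de-duplicates same-day logins before numbering the rows.

```python
from datetime import date, timedelta
from itertools import groupby


def runs_of_exact_length(rows, length=3):
    """Return (user_id, start, end) for every run of exactly `length` consecutive days.

    `rows` is an iterable of (user_id, login_date) pairs; duplicates are allowed.
    """
    result = []
    distinct = sorted(set(rows))  # one row per user per day, ordered by user then date
    for user_id, user_rows in groupby(distinct, key=lambda r: r[0]):
        days = [d for _, d in user_rows]
        # Same trick as the SQL: date minus row number is constant inside one run.
        keyed = [(d - timedelta(days=rn), d) for rn, d in enumerate(days, start=1)]
        for _, run in groupby(keyed, key=lambda k: k[0]):
            run_days = [d for _, d in run]
            if len(run_days) == length:
                result.append((user_id, run_days[0], run_days[-1]))
    return result


# Hypothetical sample data
sample_logs = [
    ("u1", date(2025, 1, 1)), ("u1", date(2025, 1, 2)), ("u1", date(2025, 1, 3)),
    ("u1", date(2025, 1, 3)),  # duplicate login on the same day
    ("u2", date(2025, 1, 1)), ("u2", date(2025, 1, 2)),
    ("u2", date(2025, 1, 3)), ("u2", date(2025, 1, 4)),  # a 4-day run, must be excluded
]
print(runs_of_exact_length(sample_logs))
```

Only u1 should come back here, because u2's run is four days long.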
    
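2. Consecutive days without login
  • Problem
    Find each user's longest run of consecutive days without a login (assume the table records login dates only).
    Table: user_logs(user_id, login_date)

  • Reference answer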

    WITH next_logs AS (
        SELECT 
            user_id,
            login_date,
            LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
        FROM user_logs
    )
    SELECT 
        user_id,
        MAX(DATEDIFF(next_login, login_date) - 1) AS max_consecutive_missing
    FROM next_logs
    WHERE next_login IS NOT NULL
    GROUP BY user_id;
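
A small Python reference can confirm the expected numbers before running the query on real data. The sketch below assumes rows arrive as (user_id, login_date) pairs; `max_missing_days` and the sample data are hypothetical names used only for this illustration.

```python
from datetime import date


def max_missing_days(rows):
    """Map each user to the longest run of days without a login between two logins."""
    by_user = {}
    for user_id, login_date in rows:
        by_user.setdefault(user_id, set()).add(login_date)

    result = {}
    for user_id, days in by_user.items():
        ordered = sorted(days)
        # Pair each login with the next one, like LEAD() in the query.
        gaps = [(nxt - cur).days - 1 for cur, nxt in zip(ordered, ordered[1:])]
        if gaps:  # a user with a single login day has no gap, as in the SQL
            result[user_id] = max(gaps)
    return result


# Hypothetical sample data
gap_sample = [
    ("u1", date(2025, 1, 1)), ("u1", date(2025, 1, 2)), ("u1", date(2025, 1, 10)),
    ("u2", date(2025, 1, 5)),
]
print(max_missing_days(gap_sample))  # {'u1': 7}
```

u1 is missing from January 3 to January 9, which is 7 days; u2 has a single login and therefore no row in the result.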
    

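II. Advanced Window Function Applications

3. Moving average
  • Problem
    Compute each user's average order amount over the most recent 7 days (sliding window).
    Table: orders(user_id, order_date, amount)

  • Reference answer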

    SELECT
        user_id,
        order_date,
        AVG(amount) OVER (
            PARTITION BY user_id
            -- Hive has no INTERVAL window bounds, so order by a day number and let RANGE count days
            ORDER BY DATEDIFF(order_date, '1970-01-01')
            RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7day_avg
    FROM orders;
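
The window is easier to reason about with a brute-force reference. The following Python sketch assumes (user_id, order_date, amount) tuples and averages every order of the same user that falls in the 7-day window ending on the current order's date; `rolling_7day_avg` and the sample rows are illustrative names, not part of the original post.

```python
from datetime import date, timedelta


def rolling_7day_avg(orders):
    """For each order, average the amounts of the same user's orders in the
    7-day window ending on that order's date (both ends inclusive)."""
    out = []
    for user_id, order_date, _ in orders:
        window_start = order_date - timedelta(days=6)
        amounts = [a for u, d, a in orders
                   if u == user_id and window_start <= d <= order_date]
        out.append((user_id, order_date, sum(amounts) / len(amounts)))
    return out


# Hypothetical sample data
orders_sample = [
    ("u1", date(2025, 1, 1), 10),
    ("u1", date(2025, 1, 4), 20),
    ("u1", date(2025, 1, 8), 30),  # January 1 has dropped out of the window by now
]
for row in rolling_7day_avg(orders_sample):
    print(row)
```

Note that this averages individual orders, not daily totals; if a per-day average is wanted, the amounts would need to be summed by day first.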
    
4. Growth rate
  • Problem
    Compute the month-over-month growth rate of each user's monthly spending.
    Table: orders(user_id, order_date, amount)

  • Reference answer

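    WITH monthly_sales AS (
        SELECT
            user_id,
            DATE_FORMAT(order_date, 'yyyy-MM') AS order_month,  -- Hive uses Java date patterns, not '%Y-%m'
            SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id, DATE_FORMAT(order_date, 'yyyy-MM')
    )
    SELECT
        user_id,
        order_month,
        total_amount,
        -- LAG reads the previous month that has orders, which may not be the previous calendar month
        (total_amount / LAG(total_amount) OVER (PARTITION BY user_id ORDER BY order_month) - 1) * 100 AS growth_rate
    FROM monthly_sales;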
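
As a worked example of the growth formula, (current / previous - 1) * 100, here is a minimal Python sketch. It assumes (user_id, order_date, amount) tuples and, like LAG in the query, compares each month with the previous month that has orders; `monthly_growth` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_growth(orders):
    """Return (user_id, month, total, growth_pct) rows ordered by user and month."""
    totals = defaultdict(float)
    for user_id, order_date, amount in orders:
        totals[(user_id, order_date.strftime("%Y-%m"))] += amount

    out = []
    prev_user, prev_total = None, None
    for (user_id, month), total in sorted(totals.items()):
        if user_id == prev_user and prev_total:
            growth = (total / prev_total - 1) * 100
        else:
            growth = None  # first month of a user, or previous total was zero
        out.append((user_id, month, total, growth))
        prev_user, prev_total = user_id, total
    return out


# Hypothetical sample data
growth_sample = [
    ("u1", date(2025, 1, 5), 100),
    ("u1", date(2025, 1, 20), 100),
    ("u1", date(2025, 2, 3), 300),
]
for row in monthly_growth(growth_sample):
    print(row)
```

January totals 200 and February 300, so February's growth is (300 / 200 - 1) * 100 = 50 percent.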
    

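III. Time Series Analysis

5. Filling missing dates
  • Problem
    Produce each user's daily login status (0 = not logged in, 1 = logged in), including the dates that are missing from the table.
    Table: user_logs(user_id, login_date)

  • Reference answer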

    WITH date_range AS (
        SELECT
            user_id,
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date
        FROM user_logs
        GROUP BY user_id
    ),
    all_dates AS (
        -- SPACE(n) split on ' ' yields n + 1 elements, so pos runs from 0 to n
        -- and covers every day from start_date to end_date
        SELECT
            dr.user_id,
            DATE_ADD(dr.start_date, t.pos) AS calendar_date
        FROM date_range dr
        LATERAL VIEW POSEXPLODE(SPLIT(SPACE(DATEDIFF(dr.end_date, dr.start_date)), ' ')) t AS pos, val
    ),
    daily_logs AS (
        -- One row per user per day, so the join does not duplicate calendar rows
        SELECT DISTINCT user_id, login_date
        FROM user_logs
    )
    SELECT
        ad.user_id,
        ad.calendar_date,
        IF(dl.login_date IS NULL, 0, 1) AS is_logged_in
    FROM all_dates ad
    LEFT JOIN daily_logs dl
    ON ad.user_id = dl.user_id AND ad.calendar_date = dl.login_date;
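
The expected output of the date fill is easy to state in Python: walk every day between a user's first and last login and flag whether it appears in the log. A minimal sketch, assuming (user_id, login_date) pairs; `daily_login_flags` and the sample rows are illustrative names.

```python
from datetime import date, timedelta


def daily_login_flags(rows):
    """Return (user_id, day, flag) for every day between each user's first and last login."""
    by_user = {}
    for user_id, login_date in rows:
        by_user.setdefault(user_id, set()).add(login_date)

    out = []
    for user_id in sorted(by_user):
        days = by_user[user_id]
        start, end = min(days), max(days)
        for offset in range((end - start).days + 1):
            day = start + timedelta(days=offset)
            out.append((user_id, day, 1 if day in days else 0))
    return out


# Hypothetical sample data
fill_sample = [("u1", date(2025, 1, 1)), ("u1", date(2025, 1, 3))]
for row in daily_login_flags(fill_sample):
    print(row)
```

For this sample, January 2 is the only filled-in day and carries the flag 0.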
    
6. Periodicity detection
  • Problem
    Find users who log in on the same weekday every week (for example, every Monday).
    Table: user_logs(user_id, login_date)

  • Reference answer

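    WITH day_of_week AS (
        SELECT DISTINCT
            user_id,
            DAYOFWEEK(login_date) AS dow,          -- 1 = Sunday ... 7 = Saturday
            NEXT_DAY(login_date, 'MO') AS week_id  -- identical for all dates of one Monday-to-Sunday week, also across years
        FROM user_logs
    ),
    user_weeks AS (
        -- Number of weeks in which the user logged in at all
        SELECT user_id, COUNT(DISTINCT week_id) AS total_weeks
        FROM day_of_week
        GROUP BY user_id
    )
    SELECT
        d.user_id,
        d.dow,
        COUNT(DISTINCT d.week_id) AS weeks_count,
        MAX(w.total_weeks) AS total_weeks
    FROM day_of_week d
    JOIN user_weeks w ON d.user_id = w.user_id
    GROUP BY d.user_id, d.dow
    HAVING COUNT(DISTINCT d.week_id) = MAX(w.total_weeks); -- logged in on this weekday in every week with any login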
    

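IV. Complex Business Scenarios

7. Purchase interval analysis
  • Problem
    Compute each user's average interval between purchases and find the users whose average interval exceeds 30 days.
    Table: orders(user_id, order_date)

  • Reference answer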

    WITH order_intervals AS (
        SELECT 
            user_id,
            order_date,
            DATEDIFF(order_date, LAG(order_date) OVER (PARTITION BY user_id ORDER BY order_date)) AS days_since_last
        FROM orders
    )
    SELECT 
        user_id,
        AVG(days_since_last) AS avg_interval
    FROM order_intervals
    WHERE days_since_last IS NOT NULL
    GROUP BY user_id
    HAVING AVG(days_since_last) > 30;
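
To sanity-check the averages by hand, the interval logic can be reproduced in a few lines of Python. This sketch assumes (user_id, order_date) pairs; `long_interval_users` and the sample rows are hypothetical.

```python
from datetime import date


def long_interval_users(orders, threshold=30):
    """Map user_id to the average gap in days between consecutive orders,
    keeping only users whose average exceeds `threshold`."""
    by_user = {}
    for user_id, order_date in orders:
        by_user.setdefault(user_id, []).append(order_date)

    result = {}
    for user_id, dates in by_user.items():
        dates.sort()
        # Difference to the previous order, like DATEDIFF(order_date, LAG(order_date))
        gaps = [(cur - prev).days for prev, cur in zip(dates, dates[1:])]
        if gaps and sum(gaps) / len(gaps) > threshold:
            result[user_id] = sum(gaps) / len(gaps)
    return result


# Hypothetical sample data
interval_sample = [
    ("u1", date(2025, 1, 1)), ("u1", date(2025, 2, 15)), ("u1", date(2025, 4, 1)),
    ("u2", date(2025, 1, 1)), ("u2", date(2025, 1, 11)),
]
print(long_interval_users(interval_sample))  # {'u1': 45.0}
```

u1 has two gaps of 45 days each, while u2's single 10-day gap stays under the threshold.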
    
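8. Active / churned user analysis
  • Problem
    Label each user's monthly status (active = logged in during the month, churned = no login for 3 consecutive months).
    Table: user_logs(user_id, login_date)

  • Reference answer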

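    WITH months AS (
        SELECT DISTINCT
            user_id,
            TRUNC(login_date, 'MM') AS month_start  -- first day of every month that has a login
        FROM user_logs
    ),
    status AS (
        SELECT
            user_id,
            month_start,
            LEAD(month_start) OVER (PARTITION BY user_id ORDER BY month_start) AS next_active_month
        FROM months
    )
    SELECT
        user_id,
        DATE_FORMAT(month_start, 'yyyy-MM') AS login_month,
        CASE
            -- More than 3 months to the next active month (or to the current month when
            -- there is none) means at least 3 full months in a row without a login
            WHEN MONTHS_BETWEEN(COALESCE(next_active_month, TRUNC(CURRENT_DATE, 'MM')), month_start) > 3
                THEN 'churned'
            ELSE 'active'
        END AS status
    FROM status;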
    

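V. Advanced Challenges

9. Longest consecutive event chain
  • Problem
    Find each user's longest chain of consecutive events of the same type (for example, consecutive likes or comments).
    Table: events(user_id, event_time, event_type)

  • Reference answer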

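    WITH ranked_events AS (
        SELECT
            user_id,
            event_time,
            event_type,
            -- Position among all of the user's events, and among events of the same type
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS rn_all,
            ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time) AS rn_type
        FROM events
    ),
    event_groups AS (
        SELECT
            user_id,
            event_type,
            rn_all - rn_type AS grp,  -- constant while the same event type repeats without interruption
            COUNT(*) AS chain_length
        FROM ranked_events
        GROUP BY user_id, event_type, rn_all - rn_type
    )
    SELECT
        user_id,
        event_type,
        MAX(chain_length) AS max_chain
    FROM event_groups
    GROUP BY user_id, event_type;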
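
A plain Python reference for what this query is meant to return can help when validating it on test data. The sketch assumes (user_id, event_time, event_type) tuples, with integers standing in for timestamps; `longest_chains` and the sample rows are made up for illustration.

```python
from itertools import groupby


def longest_chains(events):
    """Return {(user_id, event_type): length of the longest uninterrupted chain}."""
    best = {}
    ordered = sorted(events, key=lambda e: (e[0], e[1]))  # by user, then time
    # groupby only merges neighbours, so each group is one uninterrupted chain
    for (user_id, event_type), chain in groupby(ordered, key=lambda e: (e[0], e[2])):
        length = sum(1 for _ in chain)
        key = (user_id, event_type)
        best[key] = max(best.get(key, 0), length)
    return best


# Hypothetical sample data; event_time is a plain integer for brevity
events_sample = [
    ("u1", 1, "like"), ("u1", 2, "like"), ("u1", 3, "comment"),
    ("u1", 4, "like"), ("u1", 5, "like"), ("u1", 6, "like"),
]
print(longest_chains(events_sample))  # {('u1', 'like'): 3, ('u1', 'comment'): 1}
```

The comment at time 3 splits the likes into chains of 2 and 3, so the longest like chain is 3.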
    
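10. Session identification
  • Problem
    Group user actions into sessions (assume a new session starts after a gap of more than 30 minutes).
    Table: actions(user_id, action_time, action_type)

  • Reference answer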

    WITH time_diff AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            -- Hive has no TIMESTAMPDIFF; subtract epoch seconds and convert to minutes
            (UNIX_TIMESTAMP(action_time)
             - UNIX_TIMESTAMP(LAG(action_time) OVER (PARTITION BY user_id ORDER BY action_time))
            ) / 60 AS minutes_since_last
        FROM actions
    ),
    session_markers AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            IF(minutes_since_last > 30 OR minutes_since_last IS NULL, 1, 0) AS new_session
        FROM time_diff
    ),
    sessions AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            SUM(new_session) OVER (PARTITION BY user_id ORDER BY action_time) AS session_id
        FROM session_markers
    )
    SELECT * FROM sessions;
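
The flag-then-running-sum pattern above can be checked against a straightforward loop. This Python sketch assumes (user_id, action_time, action_type) tuples and starts a new session whenever the gap to the user's previous action is strictly greater than 30 minutes; `assign_sessions` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta


def assign_sessions(actions, gap=timedelta(minutes=30)):
    """Return (user_id, action_time, action_type, session_id) rows, with
    session_id counting up from 1 for each user."""
    out = []
    last_seen = {}   # user_id -> time of the previous action
    session_no = {}  # user_id -> current session number
    for user_id, action_time, action_type in sorted(actions, key=lambda a: (a[0], a[1])):
        prev = last_seen.get(user_id)
        if prev is None or action_time - prev > gap:
            session_no[user_id] = session_no.get(user_id, 0) + 1  # new session
        last_seen[user_id] = action_time
        out.append((user_id, action_time, action_type, session_no[user_id]))
    return out


# Hypothetical sample data
actions_sample = [
    ("u1", datetime(2025, 1, 1, 10, 0), "view"),
    ("u1", datetime(2025, 1, 1, 10, 20), "click"),
    ("u1", datetime(2025, 1, 1, 11, 0), "view"),   # 40 minutes later: new session
    ("u1", datetime(2025, 1, 1, 11, 30), "buy"),   # exactly 30 minutes: same session
]
print([row[3] for row in assign_sessions(actions_sample)])  # [1, 1, 2, 2]
```

A gap of exactly 30 minutes stays in the same session here, matching the `> 30` comparison in the query.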
    
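  1. Simulate the data by hand first: create a test table, insert a few rows, and verify that the logic is correct.
  2. Compare approaches: for consecutive-value problems, for example, try both LEAD() and DATE_SUB + ROW_NUMBER.
  3. Watch the boundary cases: NULLs, several records on the same day, and ranges that cross a month or year boundary.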
