sql 求助,果然只会 mysql 的写 hive 要命 - V2EX
V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
请不要在回答技术问题时复制粘贴 AI 生成的内容
Distand
V2EX    程序员

sql 求助,果然只会 mysql 的写 hive 要命

  •  
  •   Distand 2022-05-16 20:36:33 +08:00 2597 次点击
    这是一个创建于 1258 天前的主题,其中的信息可能已经有所发展或是发生改变。

    数据开发跑路了,只能自己去 hive 导数据,需求是:

    1. xfrom='sugg'的全要

    2. 其它 xfrom 的按 sub_region 分组根据 pubtime 一个月前取 xx 条,一个月内取 xx 条,都要按 click 倒序排

    3. 汇总到一起

    渣渣业务开发只能写出这种 sql 了,跑了下执行一次要 49 块钱,求个优化

    WITH total AS ( SELECT * FROM xx.xx WHERE log_date='20220515' ), other AS ( SELECT * FROM total WHERE xfrom != 'sugg' ), final AS ( SELECT * FROM total WHERE xfrom='sugg' UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='a' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) a1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='a' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) a2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='b' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) b1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='b' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) b2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='c' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 250) c1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='c' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 250) c2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='d' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) d1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='d' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) d2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='e' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) e1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='e' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) e2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='f' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) f1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='f' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) f2 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='g' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) g1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='g' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) g2 UNION ALL -- 这个 sub_region 有特殊要求,其它都是一样的条件,只是 limit 数量不同 SELECT * FROM (SELECT * FROM other WHERE sub_region='h' AND pubtime < '${yyyyMMdd,-1m}' AND (tag LIKE '%th%' OR tag LIKE '%ph%') ORDER BY click DESC LIMIT 150) h1 UNION ALL SELECT * FROM (SELECT * FROM other WHERE sub_region='h' AND pubtime >= '${yyyyMMdd,-1m}' AND (tag LIKE '%th%' OR tag LIKE '%ph%') ORDER BY click DESC LIMIT 150) h2 ) INSERT OVERWRITE TABLE yy.yy SELECT * FROM final; 
    9 条回复    2022-05-20 14:51:56 +08:00
    efaun
        1
    efaun  
       2022-05-16 21:07:32 +08:00
    看到第二条说什么分组什么取多少条, 你看看分区函数适不适合你
    https://www.cnblogs.com/linJie1930906722/p/6036053.html
    chineselittleboy
        2
    chineselittleboy  
       2022-05-16 21:55:15 +08:00 via Android
    最里面的 select * 改成想要的一些字段能便宜点
    zhusimaji
        3
    zhusimaji  
       2022-05-16 23:25:58 +08:00
    关于取数分组的语句可以参考 select * from (select XXX, row_number() over (partition by sub_region order by click desc ) as rank from XXX)a where rank <1000
    zhusimaji
        4
    zhusimaji  
       2022-05-16 23:26:46 +08:00
    所以后续一个月内和一个月外只要 union 一下就可以了,应该就可以输出结果了
    FYFX
        5
    FYFX  
       2022-05-16 23:30:44 +08:00
    你下面那一堆 sql 可以根据 pubtime 分成两部分,然后用窗口函数 partition by sub_region order by click 算一下就行了吧,而且你这只取一天数据要 49 块钱是不是有点离谱
    lyang
        6
    lyang  
       2022-05-17 08:36:42 +08:00
    49 块钱是什么意思
    Distand
        7
    Distand  
    OP
       2022-05-17 09:48:25 +08:00
    感谢各位老哥,先 case when pubtime < '${yyyyMMdd,-1m}' then 0 else 1 end as pt ,再 row_number() over(partition by sub_region,pt order by click desc) as rank 后只需要 2 块钱了
    512357301
        8
    512357301  
       2022-05-17 13:15:21 +08:00 via Android
    这是一条 SQL 就搞定的吧(通过辅助列判断需要哪些数据,需要用到开窗函数,然后 where 就行了,怎么你这搞了那么多的 union all ,hive 执行 union all 效率很低的
    shuianfendi6
        9
    shuianfendi6  
       2022-05-20 14:51:56 +08:00
    用 partition 和 row_number 吧

    spark-sql 执行效率会好不少
    关于     帮助文档     自助推广系统     博客     API     FAQ     Solana     2504 人在线   最高记录 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 24ms UTC 03:49 PVG 11:49 LAX 20:49 JFK 23:49
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86