导航：首页 > 数据库 >

对齐JOIN字段类型，使用数组代替字符串，降低字符串处理开销，列存降低扫描开销

发表于：2025-02-01 作者：千家信息网编辑

千家信息网最后更新 2025年02月01日，背景Greenplum通常被用作OLAP，在一些用户使用过程中，可能因为数据结构设计，SQL问题等原因导致性能不佳，虽然通过增加节点可以解决问题，但是如果能优化的话，可以节约不少硬件资源。例如1、对齐

千家信息网最后更新 2025年02月01日对齐JOIN字段类型，使用数组代替字符串，降低字符串处理开销，列存降低扫描开销

背景

Greenplum通常被用作OLAP，在一些用户使用过程中，可能因为数据结构设计，SQL问题等原因导致性能不佳，虽然通过增加节点可以解决问题，但是如果能优化的话，可以节约不少硬件资源。

例如

1、对齐JOIN字段类型。如果等值JOIN的字段类型不一致，无法使用HASH JOIN。

2、对齐where条件字段类型。同上，无法使用HASH JOIN，或者索引扫描。

3、使用数组代替字符串，降低字符串处理开销。如果字符串本身需要大量的格式化处理FILTER，那么使用数组的性能会好很多。

4、列存降低扫描开销，统计型的SQL由于涉及的字段有限，使用列存比行存储性能好很多。

例子

1、这个查询耗费230秒。

SELECT col4,count(DISTINCT c.col1) ptnum       from tbl1 a       INNER JOIN tbl2 b on b.col2=a.id       inner join tbl3 t2 on t2.ID <= (length(b.col3) - length(replace(b.col3,',',''))+1)        INNER JOIN tbl4 c        on replace(replace(Split_part(reverse(Split_part(reverse(Split_part(b.col3,',',cast(t2.id as int))),',',1)),':',1),'{',''),'"','') = c.id       INNER JOIN tbl5 s on a.col4=s.id       where replace(replace(reverse(Split_part(Split_part(reverse(Split_part(b.col3,',',cast(t2.id as int))),',',1),':',1)),'"',''),'}','') >'0'        and c.col1 not in ('xxxxxx')       GROUP BY col4;

2、使用explain analyze分析瓶颈

3、问题：

3.1、JOIN类型不一致，导致未使用HASH JOIN。

3.2、有两个表JOIN时产生笛卡尔积来进行不等于的判断，数据量叠加后需要计算几十万亿次。

tbl2.col3字符串格式如下(需要计算几十万亿次)

{"2":"1","10":"1","13":"1","16":"1","21":"1","26":"1","28":"1","30":"1","32":"1","33":"1","34":"1","35":"1","36":"1","37":"1","39":"1","40":"1","99":"2","100":"2","113":"1","61":"1","63":"4","65":"2"}

3.3、使用了行存储，查询时扫描的量较大，并且无法使用向量计算。

优化

1、使用列存代替行存(除nestloop的内表tbl3，继续使用索引FILTER)

create table tmp_tbl1 (like tbl1) WITH (APPENDONLY=true, ORIENTATION=column);  insert into tmp_tbl1 select * from tbl1;  create table tmp_tbl4 (like tbl4) WITH (APPENDONLY=true, ORIENTATION=column);  insert into tmp_tbl4 select * from tbl4;  create table tmp_tbl5 ( like tbl5) WITH (APPENDONLY=true, ORIENTATION=column);  insert into tmp_tbl5 select * from tbl5;  create table tmp_tbl2 (like tbl2) WITH (APPENDONLY=true, ORIENTATION=column) distributed by (col2);  insert into tmp_tbl2 select * from tbl2;

2、使用array代替text

alter table tmp_tbl2 alter column col3 type text[] using (case col3 when '[]' then '{}' else replace(col3,'"','') end)::text[];

修改后的类型、内容如下

digoal=> select col3 from tmp_tbl2  limit 2;                                                      col3                                                       ------------------------------------------------------------------------------------------------------------------------   {63:1,65:1,70:1,71:1,73:1,75:1,77:1,45:3,78:1,54:2,44:1,80:1,36:1,84:1,96:2}   {2:2,10:1,13:1,16:1,30:1,107:1,26:1,28:1,32:1,33:1,34:1,35:1,36:1,37:1,39:1,99:2,100:2,113:1,40:1,57:1,63:2,64:1,65:4}  (2 rows)

3、join 字段保持一致

alter table tmp_tbl2 alter column col2 type int8;

4、将原来的查询SQL修改成如下（字符串处理变成了数组）

（本例也可以使用二维数组，完全规避字符串处理。）

SELECT col4,count(DISTINCT c.col1) ptnum       from tmp_tbl1 a       INNER JOIN tmp_tbl2 b on b.col2=a.id       inner join tbl3 t2 on t2.ID <= array_length(col3,1)  -- 更改       INNER JOIN tmp_tbl4 c        on split_part(b.col3[cast(t2.id as int)], ':', 1) = c.id        INNER JOIN tmp_tbl5 s on a.col4=s.id       where split_part(b.col3[cast(t2.id as int)], ':', 2) > '0'        and c.col1 not in ('xxxxxx')       GROUP BY col4;

执行计划

                                                                                           QUERY PLAN                                                                                              -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------   Gather Motion 32:1  (slice7; segments: 32)  (cost=543258065.87..543259314.50 rows=41621 width=12)     ->  GroupAggregate  (cost=543258065.87..543259314.50 rows=1301 width=12)           Group By: a.col4           ->  Sort  (cost=543258065.87..543258169.93 rows=1301 width=12)                 Sort Key: a.col4                 ->  Redistribute Motion 32:32  (slice6; segments: 32)  (cost=542355803.38..543254872.50 rows=1301 width=12)                       Hash Key: a.col4                       ->  GroupAggregate  (cost=542355803.38..543254040.08 rows=1301 width=12)                             Group By: a.col4                             ->  Sort  (cost=542355803.38..542655042.19 rows=3740486 width=11)                                   Sort Key: a.col4                                   ->  Redistribute Motion 32:32  (slice5; segments: 32)  (cost=6247.23..518770960.13 rows=3740486 width=11)                                         Hash Key: c.col1                                         ->  Hash Join  (cost=6247.23..516377049.63 rows=3740486 width=11)                                               Hash Cond: split_part(b.col3[t2.id::integer], ':'::text, 1) = c.id::text                                               ->  Nested Loop  (cost=5494.14..476568597.41 rows=3852199 width=491)                                                     Join Filter: split_part(b.col3[t2.id::integer], ':'::text, 2) > '0'::text                                                     ->  Broadcast Motion 32:32  (slice3; segments: 32)  (cost=5494.14..115247.73 rows=277289 width=483)                                                           ->  Hash Join  (cost=5494.14..23742.36 rows=8666 width=483)                                                                 Hash Cond: b.col2 = a.id                                                                 ->  Seq Scan on tmp_tbl2 b  (cost=0.00..14088.89 rows=8666 width=487)                                                                 ->  Hash  (cost=4973.86..4973.86 rows=1301 width=12)                                                                       ->  Redistribute Motion 32:32  (slice2; segments: 32)  (cost=2280.93..4973.86 rows=1301 width=12)                                                                             Hash Key: a.id                                                                             ->  Hash Join  (cost=2280.93..4141.42 rows=1301 width=12)                                                                                   Hash Cond: s.id = a.col4                                                                                   ->  Append-only Columnar Scan on tmp_tbl5 s  (cost=0.00..1220.97 rows=1491 width=4)                                                                                   ->  Hash  (cost=1760.66..1760.66 rows=1301 width=12)                                                                                         ->  Redistribute Motion 32:32  (slice1; segments: 32)  (cost=0.00..1760.66 rows=1301 width=12)                                                                                               Hash Key: a.col4                                                                                               ->  Append-only Columnar Scan on tmp_tbl1 a  (cost=0.00..928.22 rows=1301 width=12)                                                     ->  Index Scan using idx_codeid on tbl3 t2  (cost=0.00..23.69 rows=42 width=8)                                                           Index Cond: t2.id <= array_length(b.col3, 1)                                               ->  Hash  (cost=364.69..364.69 rows=972 width=11)                                                     ->  Broadcast Motion 32:32  (slice4; segments: 32)  (cost=0.00..364.69 rows=972 width=11)                                                           ->  Append-only Columnar Scan on tmp_tbl4 c  (cost=0.00..44.26 rows=31 width=11)                                                                 Filter: col1 <> 'xxxxxx'::text   Settings:  effective_cache_size=8GB; enable_nestloop=off; gp_statistics_use_fkeys=on   Optimizer status: legacy query optimizer  (39 rows)