AnalyticDB for PostgreSQL 6 新特性解析 - Index Only Scan

2019-10-14 1465

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： PG中所有的索引都是二级索引，即在进行索引查询的过程中，需要同时访问索引数据和源表数据。Index Only Scan按照字面意思理解，即在查询过程中只需要扫描索引数据。这种扫描方式需要一个前提就是索引中包含了查询所需要的所有数据（也叫覆盖索引），如出现在SELECT, WHERE, ORDER BY中所引用的列。

原理介绍

PG中所有的索引都是二级索引，即在进行索引查询的过程中，需要同时访问索引数据和源表数据。Index Only Scan按照字面意思理解，即在查询过程中只需要扫描索引数据。这种扫描方式需要一个前提就是索引中包含了查询所需要的所有数据（也叫覆盖索引），如出现在SELECT, WHERE, ORDER BY中所引用的列。

由于PG的MVCC机制，在没有Index only scan之前，任何索引查询都需要经过通过源表数据进行可见性检查，如图所示：

在索引扫描到过程中，需要通过源表获取每个Record的可见性信息。
在PG9.2版本以后，支持了Index Only Scan，如果一个查询所需要的数据能够完全可以被索引覆盖，那么Index Only Scan就会成为一种新的扫描路径，并且通过Visibility map避免了通过获取源表进行可见性检查，提升了查询性能，如果所示：

这里主要依赖了Visibility map的机制，Visibility map中有一个标记位，标记了Page中的元组是否都是可见的，也就意味着如果表没有被delete、update过或者已经被vacuum过了。
如果Visibility map能够确认该Index entry所对应的Page都是可见的，那么就不再获取源表Record进行可见性判断了，否则还需要获取源表元组并进行可见性判断。

使用示例

GP6版本集成了PG9.4版本，因此也支持了Index Only Scan的特性。
例如存在一张表，并在其中一个列上创建了索引：

postgres=# \d customer_reviews_hp
         Table "public.customer_reviews_hp"
        Column        |      Type       | Modifiers
----------------------+-----------------+-----------
 customer_id          | text            |
 review_date          | date            |
 review_rating        | integer         |
 review_votes         | integer         |
 review_helpful_votes | integer         |
 product_id           | character(10)   |
 product_title        | text            |
 product_sales_rank   | bigint          |
 product_group        | text            |
 product_category     | text            |
 product_subcategory  | text            |
 similar_product_ids  | character(10)[] |
Indexes:
    "c_review_rating" btree (review_rating)
Distributed by: (customer_id)

查询：

postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating > 1 group by 2;
                                                                                       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------
 Gather Motion 4:1  (slice2; segments: 4)  (cost=49979.36..49979.50 rows=5 width=12) (actual time=782.673..782.726 rows=4 loops=1)
   ->  GroupAggregate  (cost=49979.36..49979.50 rows=2 width=12) (actual time=782.384..782.385 rows=2 loops=1)
         Group Key: customer_reviews_hp.review_rating
         ->  Sort  (cost=49979.36..49979.37 rows=2 width=12) (actual time=782.376..782.377 rows=8 loops=1)
               Sort Key: customer_reviews_hp.review_rating
               Sort Method:  quicksort  Memory: 132kB
               ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.18..49979.30 rows=2 width=12) (actual time=76.538..782.345 rows=8 loops=1)
                     Hash Key: customer_reviews_hp.review_rating
                     ->  GroupAggregate  (cost=0.18..49979.20 rows=2 width=12) (actual time=5.102..73.709 rows=4 loops=1)
                           Group Key: customer_reviews_hp.review_rating
                           ->  Index Only Scan using c_review_rating on customer_reviews_hp  (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.128..643.718 rows=1061311 lo
ops=1)
                                 Index Cond: (review_rating > 1)
                                 Heap Fetches: 0
 Planning time: 0.212 ms
   (slice0)    Executor memory: 220K bytes.
   (slice1)    Executor memory: 156K bytes avg x 4 workers, 156K bytes max (seg0).
   (slice2)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).  Work_mem: 33K bytes max.
 Memory used:  2047000kB
 Optimizer: Postgres query optimizer
 Execution time: 783.308 ms
(20 rows)

由此可见启用了Index Only Scan。

可以通过enable_indexonlyscan来控制是否使用Index Only Scan，例如同样上面的查询设置enable_indexonlyscan为off后，再次执行：

postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating > 1 group by 2;
                                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--
 Gather Motion 4:1  (slice2; segments: 4)  (cost=49979.36..49979.50 rows=5 width=12) (actual time=951.830..951.840 rows=4 loops=1)
   ->  GroupAggregate  (cost=49979.36..49979.50 rows=2 width=12) (actual time=951.566..951.567 rows=2 loops=1)
         Group Key: customer_reviews_hp.review_rating
         ->  Sort  (cost=49979.36..49979.37 rows=2 width=12) (actual time=951.556..951.556 rows=8 loops=1)
               Sort Key: customer_reviews_hp.review_rating
               Sort Method:  quicksort  Memory: 132kB
               ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.18..49979.30 rows=2 width=12) (actual time=75.010..951.527 rows=8 loops=1)
                     Hash Key: customer_reviews_hp.review_rating
                     ->  GroupAggregate  (cost=0.18..49979.20 rows=2 width=12) (actual time=5.211..77.359 rows=4 loops=1)
                           Group Key: customer_reviews_hp.review_rating
                           ->  Index Scan using c_review_rating on customer_reviews_hp  (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.118..817.460 rows=1061311 loops=1
)
                                 Index Cond: (review_rating > 1)
 Planning time: 0.217 ms
   (slice0)    Executor memory: 156K bytes.
   (slice1)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).
   (slice2)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).  Work_mem: 33K bytes max.
 Memory used:  2047000kB
 Optimizer: Postgres query optimizer
 Execution time: 952.473 ms
(19 rows)

只是用到了索引，不是Index Only Scan，执行时间上增加了将近200ms，下降了20%左右。
但同时需要注意的是，Index Only Scan并不是银弹，做到Index Only Scan往往需要创建联合索引，联合索引本身也会有性能问题，例如影响写入、更新性能等。需要具体问题具体分析，Index Only Scan只是多了一种可优化路径选择。

GP的限制

Orca优化器不支持Index Only Scan，GP6版本中，只有PG原生的优化器支持Index Only Scan。
列存表也不支持Index Only Scan，Index Only Scan依赖Visibility map机制实现，列存表显然做不到Index Only Scan。
GP上的Index Only Scan在explain analyze时，Heap Fetches显示不准确，例如：

create table test (a , b ,c);
create table test (a int, b int ,c int);
insert into test values(generate_series(1,100000),generate_series(1,100000),generate_series(1,100000));
create index a_ind on test(a,b,c);

-- Master上执行：
postgres=# explain analyze select * from test where a > 1 order by a;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.17..2463.87 rows=99990 width=12) (actual time=1.169..84.196 rows=99999 loops=1)
   Merge Key: a
   ->  Index Only Scan using a_ind on test  (cost=0.17..2463.87 rows=99990 width=12) (actual time=0.116..44.373 rows=99999 loops=1)
         Index Cond: (a > 1)
         Heap Fetches: 0
 Planning time: 0.685 ms
   (slice0)    Executor memory: 216K bytes.
   (slice1)    Executor memory: 148K bytes (seg0).
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution time: 96.809 ms
(11 rows)

显示Heap Fetchs为0，而直接连上segment进行explain analyze：

postgres=# explain analyze select * from test where a > 1 order by a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using a_ind on test  (cost=0.29..1255.62 rows=33334 width=12) (actual time=0.072..39.561 rows=99999 loops=1)
   Index Cond: (a > 1)
   Heap Fetches: 99999
 Planning time: 0.148 ms
   (slice0)
 Optimizer: Postgres query optimizer
 Execution time: 47.481 ms
(7 rows)

其实是存在Heap Fetches的，从执行时间上看，Master上的Heap Fetches项显示不对。
这种情况需要依赖Vacuum来做Visibility Map的清理工作了。正常情况下做下Vacuum analyze就能保证不不需要Heap Fetch。

参考

https://www.postgresql.org/docs/current/indexes-index-only-scans.html

AnalyticDB for PostgreSQL 6 新特性解析 - Index Only Scan

原理介绍

使用示例

GP的限制

参考

云原生数据仓库 AnalyticDB PostgreSQL版

热门文章

最新文章

相关产品

相关课程

相关电子书

相关实验场景

推荐镜像