PostgreSQL Automatic SQL Optimization Case Study: A Minimal Approach to Automatic Index Recommendation

2018-01-27

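Background

Automatic SQL optimization is an advanced database feature. It includes automatic SQL rewriting, automatic index recommendation, and automatic execution plan optimization.

1. Automatic execution plan optimization

Postgres Pro is currently working on this: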

https://postgrespro.com/roadmap/56513

Machine learning

Query planner selects “cheapest” query plan based on its cost estimation. But it’s done with many rough assumptions. This is why the estimated cost could be inadequate to real execution cost. One possibility is to improve the cost estimate mechanism itself by adding features like multivariate statistics. Another possibility is to use query execution feedback: see how estimated parameter values differ from actual parameter values. We can apply machine learning techniques to improve the cost estimates using this feedback, so DBMS would learn on its own mistakes.

We’ve already done this in a simple case, and further work is planned in the following directions:

Extend implemented model to cover more use cases,
Provide the infrastructure necessary to make our machine learning an extension.

Execution-time planning

Currently, query planning strictly precedes query execution. Sometimes it appears to be a serious limitation. When one part of a plan is already executed it could be possible to significantly improve the rest of the plan on the basis of gathered statistics. We can see two cases when this approach could be applied:

Online reordering of filter expressions. During sequential scan of large table it’s important to do the cheapest and the most selective checks first. However estimated selectivity and cost of filtering are inaccurate, and thus the order of applying filters based on estimates can be not optimal. But filter expressions could be reordered online on the base of statistics of their previous execution.
Some queries could be divided into sequence of steps when subsequent steps could be replanned on the base of results of previous steps. For instance, suppose that step 1 is a scan of table A, and step 2 is a join of tables A and B. Depending on row count and data distribution from the first step we could choose different join algorithm on the second step.

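2. Automatic SQL rewriting. PostgreSQL already does a lot of this, although more could be done. For examples, see:

"PostgreSQL optimizer logical inference: source code analysis"

3. Automatic index recommendation, which is the subject of this article.

EDB already supports it: https://www.enterprisedb.com/docs/en/9.5/asguide/EDB_Postgres_Advanced_Server_Guide.1.56.html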

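How to recommend indexes automatically

1. First, determine which tables may need an index, as follows:

1.1. Top SQL and slow SQL, which can be obtained from pg_stat_statements.

Slow SQL can also be obtained from the server log.

Reference:

"How to find top SQL in PostgreSQL (for example, the SQL with the highest I/O consumption)"

1.2. Statistics

The statistics views can also show which tables may need an index.

"PostgreSQL pg_stat_ and pg_statio_ statistics (scan, read, fetch, hit): source code walkthrough"

The analysis uses the following values: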

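A = pg_stat_all_tables.seq_scan, the number of sequential (full table) scans

B = pg_stat_all_tables.seq_tup_read, the number of rows fetched by sequential scans

C = pg_class.reltuples, the estimated number of rows in the table

D = pg_class.relpages, the size of the table in data blocks

E = pg_statio_all_tables.heap_blks_read + pg_statio_all_tables.heap_blks_hit, the number of heap blocks accessed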

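If A is large, B/A is small, and C or D is large, the table probably needs an index.

If E/A is large and D or C is large, the table probably needs an index.

In short, the criterion is:

A large table has many data blocks read by sequential scans, yet returns only a few rows.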
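
The two rules above can be sketched as a small Python function. This is only an illustration: the article does not say what counts as "large" or "small", so every threshold below is a hypothetical value, and the `TableStats` class is an assumed container for the values A to E.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Per-table counters, named after the letters used in the text."""
    name: str
    seq_scan: int      # A: pg_stat_all_tables.seq_scan
    seq_tup_read: int  # B: pg_stat_all_tables.seq_tup_read
    reltuples: float   # C: pg_class.reltuples
    relpages: int      # D: pg_class.relpages
    heap_blks: int     # E: heap_blks_read + heap_blks_hit


# Hypothetical thresholds (not from the article); tune them per workload.
MIN_SEQ_SCANS = 1000        # "A is large"
MAX_ROWS_PER_SCAN = 100     # "B/A is small"
MIN_TUPLES = 100_000        # "C is large"
MIN_PAGES = 10_000          # "D is large"
MIN_BLOCKS_PER_SCAN = 1000  # "E/A is large"


def needs_index(t: TableStats) -> bool:
    """Return True if either rule from the text flags the table."""
    if t.seq_scan == 0:
        return False  # never scanned sequentially, nothing to judge
    is_big = t.reltuples >= MIN_TUPLES or t.relpages >= MIN_PAGES
    if not is_big:
        return False
    rows_per_scan = t.seq_tup_read / t.seq_scan
    blocks_per_scan = t.heap_blks / t.seq_scan
    rule_1 = t.seq_scan >= MIN_SEQ_SCANS and rows_per_scan <= MAX_ROWS_PER_SCAN
    rule_2 = blocks_per_scan >= MIN_BLOCKS_PER_SCAN
    return rule_1 or rule_2
```

The input values would come from joining pg_stat_all_tables, pg_statio_all_tables, and pg_class on the table OID. Small tables are excluded first because a sequential scan over a few blocks is usually cheap regardless of how often it runs.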

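1.3. For each table that needs an index, find the corresponding SQL statements in pg_stat_statements.

2. Determine which columns need an index

Generate an execution plan for each SQL statement identified above.

In the execution plan, look for nodes that have a high cost and are sequential scans (Seq Scan), then find the columns used in their filter conditions.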
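
One way to automate this step is to request the plan with EXPLAIN (FORMAT JSON) and walk the resulting tree. The sketch below assumes the plan has already been parsed into Python objects. The function names and the `min_cost` cutoff are hypothetical, while the keys ("Node Type", "Relation Name", "Filter", "Total Cost", "Plans") are the ones PostgreSQL emits in JSON plans.

```python
from typing import Any, Dict, Iterator, List

SEQ_SCAN_TYPES = {"Seq Scan", "Parallel Seq Scan"}


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a plan node and all of its descendants, depth first."""
    yield node
    for child in node.get("Plans", []):
        yield from iter_nodes(child)


def costly_seq_scans(explain_json: List[Dict[str, Any]],
                     min_cost: float = 1000.0) -> List[Dict[str, Any]]:
    """Return filtered sequential scans at or above min_cost, costliest first.

    min_cost is an arbitrary cutoff chosen for illustration.
    """
    root = explain_json[0]["Plan"]  # EXPLAIN (FORMAT JSON) returns a one-element array
    hits = []
    for node in iter_nodes(root):
        if (node.get("Node Type") in SEQ_SCAN_TYPES
                and "Filter" in node
                and node.get("Total Cost", 0.0) >= min_cost):
            hits.append({
                "table": node.get("Relation Name"),
                "filter": node["Filter"],
                "cost": node["Total Cost"],
            })
    return sorted(hits, key=lambda h: h["cost"], reverse=True)
```

The "filter" value is the filter expression as text, for example `(customer_id = 42)`. Extracting the column names from it still needs parsing, which is left out here.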

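3. Determine which type of index to create (btree, brin, gin, gist, sp-gist, hash, bloom, ...)

For the columns found in step 2, choose the index type automatically based on the statistics and the characteristics of each index access method. The principle is described in:

"How to automatically choose the right index access method (btree, hash, gin, gist, sp-gist, brin, bitmap, ...)"

"Principles and use cases of the 9 PostgreSQL index types"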

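4. Record the execution plan of the slow SQL

Record the execution plan obtained in step 2.

5. Add a hypothetical index

A hypothetical index is not a real index: it does not affect the actual execution plan and consumes no resources. The method is described in:

"PostgreSQL hypothetical (virtual) indexes: HypoPG"

6. Check whether the execution plan of the slow SQL changes after the index is added

Generate the execution plan with the hypothetical index in place and compare it with the plan recorded in step 4 to see whether the index is used.

The method is described in:

"PostgreSQL hypothetical (virtual) indexes: HypoPG"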
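
Steps 5 and 6 can be sketched with the HypoPG functions `hypopg_create_index()` and `hypopg_reset()`. The helper below is an assumption-laden illustration: it expects a DB-API cursor that uses `%s` placeholders (as psycopg2 does), a session where the hypopg extension is installed, and a trusted `query` string, because the query is concatenated into the EXPLAIN statement. The function name is hypothetical.

```python
import json
from typing import Any, List


def explain_with_hypothetical_index(cur: Any, index_ddl: str, query: str) -> List[Any]:
    """Plan `query` as if the index described by `index_ddl` existed.

    cur is assumed to be a DB-API cursor with %s placeholders.
    The hypothetical index is dropped again before returning.
    """
    # Register the hypothetical index; nothing is built on disk.
    cur.execute("SELECT * FROM hypopg_create_index(%s)", (index_ddl,))
    try:
        # Plain EXPLAIN (without ANALYZE) takes hypothetical indexes into account.
        cur.execute("EXPLAIN (FORMAT JSON) " + query)
        plan = cur.fetchone()[0]
    finally:
        # Remove all hypothetical indexes of this session.
        cur.execute("SELECT hypopg_reset()")
    # Some drivers return the plan as text rather than parsed JSON.
    if isinstance(plan, str):
        plan = json.loads(plan)
    return plan
```

A call might look like `explain_with_hypothetical_index(cur, "CREATE INDEX ON orders (customer_id)", "SELECT * FROM orders WHERE customer_id = 42")`, where the table and column names are made up. Hypothetical indexes are only visible in the session that created them, so both statements have to run on the same connection.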

7. Output a report

Output a before-and-after comparison of the two execution plans.
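
A minimal sketch of such a report, assuming both plans were captured with EXPLAIN (FORMAT JSON) and parsed into Python objects. The function names, the report layout, and the rule "recommend when a new index appears and the total cost drops" are illustrative choices, not something the article prescribes.

```python
from typing import Any, Dict, Iterator, List, Set

INDEX_NODE_TYPES = {"Index Scan", "Index Only Scan", "Bitmap Index Scan"}


def walk(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a plan node and all of its descendants."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def index_names(plan_json: List[Dict[str, Any]]) -> Set[str]:
    """Collect the names of all indexes used anywhere in the plan."""
    return {
        n["Index Name"]
        for n in walk(plan_json[0]["Plan"])
        if n.get("Node Type") in INDEX_NODE_TYPES and "Index Name" in n
    }


def compare_plans(before: List[Dict[str, Any]],
                  after: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compare the step 4 plan (before) with the step 6 plan (after)."""
    cost_before = before[0]["Plan"]["Total Cost"]
    cost_after = after[0]["Plan"]["Total Cost"]
    new_indexes = sorted(index_names(after) - index_names(before))
    return {
        "cost_before": cost_before,
        "cost_after": cost_after,
        "new_indexes": new_indexes,
        # Illustrative rule: a new index is used and the estimate improved.
        "recommend": bool(new_indexes) and cost_after < cost_before,
    }


def format_report(query: str, result: Dict[str, Any]) -> str:
    """Render the comparison as plain text."""
    verdict = "create index" if result["recommend"] else "no change"
    return "\n".join([
        f"Query: {query}",
        f"Estimated cost before: {result['cost_before']}",
        f"Estimated cost after:  {result['cost_after']}",
        f"Indexes newly used: {', '.join(result['new_indexes']) or 'none'}",
        f"Recommendation: {verdict}",
    ])
```

Both costs are planner estimates, so the report shows an expected improvement rather than a measured one.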

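8. Drop the hypothetical index

9. Create the real index

Users can review the report and decide whether to create the index.

When creating the index, you can use the CONCURRENTLY option, which does not block DML operations.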

Command:     CREATE INDEX   
Description: define a new index   
Syntax:   
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]   
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )   
    [ WITH ( storage_parameter = value [, ... ] ) ]   
    [ TABLESPACE tablespace_name ]   
    [ WHERE predicate ]
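
As a sketch of the final step, the helper below assembles a CREATE INDEX CONCURRENTLY statement from a recommendation. The function names and the default `idx_<table>_<columns>` naming scheme are hypothetical. Only the built-in access methods are accepted, and schema-qualified table names are not handled.

```python
from typing import Optional, Sequence

# Built-in index access methods; extension methods such as bloom are left out.
ALLOWED_METHODS = {"btree", "hash", "gist", "spgist", "gin", "brin"}


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_create_index(table: str, columns: Sequence[str],
                       method: str = "btree",
                       name: Optional[str] = None) -> str:
    """Build a CREATE INDEX CONCURRENTLY statement for one table."""
    if not columns:
        raise ValueError("at least one column is required")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported index method: {method}")
    if name is None:
        # Hypothetical naming scheme, chosen for illustration only.
        name = "idx_" + table + "_" + "_".join(columns)
    cols = ", ".join(quote_ident(c) for c in columns)
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {quote_ident(name)} "
        f"ON {quote_ident(table)} USING {method} ({cols})"
    )
```

CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so the generated statement has to be executed on a connection in autocommit mode.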