Full-text Search Index Optimization - Alibaba Cloud RDS PostgreSQL Best Practices

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
云原生数据库 PolarDB 分布式版,标准版 2核8GB
简介: Will indexes be used in full-text searches that do not contain a certain keyword? Learn more about GIN, Generalized Inverted Index.

Alibaba_Cloud_Whitepaper_Securing_the_Data_Center_in_a_Cloud_First_World

Background

With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays.

Will indexes be used in full-text searches that do not contain a certain keyword?

GIN stands for Generalized Inverted Index. A query that does not contain a keyword is actually a scanning process that skips the token above the main tree.

It will be cost-effective if the skipped token contains a relatively large amount of data. PostgreSQL uses a CBO (cost-based optimizer) execution plan optimizer to automatically choose the best index.

Example 1. Full-text search - "Does Not Contain" query

1.Create a test table

postgres=# create table notcontain (id int, info tsvector);  
CREATE TABLE  

2.Create a function to generate random strings

CREATE OR REPLACE FUNCTION   
gen_rand_str(integer)    
 RETURNS text    
 LANGUAGE sql    
 STRICT    
AS $function$    
  select string_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);    
$function$;   

3.Insert 1 million entries of test data

postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector(gen_rand_str(256));

4.Create full-text index (a GIN index)

create index idx_notcontain_info on notcontain using gin (info);

5.Query one of the records

postgres=# select * from notcontain limit 1;  
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
id   | 1  
info | 'afbbeeccbf':3 'b':16 'bdcdfd':2 'bdcfbcecdeeaed':8 'bfedfecbfab':7 'cd':9 'cdcaefaccdccadeafadededddcbdecdaefbcfbdaefcec':14 'ceafecff':6 'd':17,18 'dbc':12 'dceabcdcbdca':10 'dddfdbffffeaca':13 'deafcccfbcdebdaecda':11 'dfbadcdebdedbfa':19 'eb':15 'ebe':1 'febdcbdaeaeabbdadacabdbbedfafcaeabbdcedaeca':5 'fedeecbcdfcdceabbabbfcdd':4  

6.Test - Does Not Contain a keyword

The database automatically chooses Full Table Scan, and does not use the GIN index.

Why wasn't an index used? As I have explained before, the number of data records containing the keyword is quite small. It is not cost-effective to use indexes for filtering in a query that does not contain a keyword.

(However, if the query contains a keyword, it will be very cost-effective to use the GIN indexes. "Contains" and "Does Not Contain" are inversions of each other, and so are the associated costs.)

select * from notcontain t1 where info @@ to_tsquery ('!eb');  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                             QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..318054.51 rows=950820 width=412) (actual time=0.016..1087.463 rows=947911 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!eb'::text))  
   Rows Removed by Filter: 52089  
   Buffers: shared hit=55549  
 Planning time: 0.131 ms  
 Execution time: 1134.571 ms  
(7 rows)  

7.Force disable Full Table Scan, and allow the database to choose the indexes.

We can see that, searches that use index are really slow. Most of the time we can trust the database to find the most efficient method to complete our queries. (All of these can be calibrated as long as cost factors and environmental performance measuring are accurate enough.)

postgres=# set enable_seqscan=off;  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=82294981.00..82600120.25 rows=950820 width=412) (actual time=1325.587..1540.145 rows=947911 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!eb'::text))  
   Heap Blocks: exact=55549  
   Buffers: shared hit=171948  
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..82294743.30 rows=950820 width=0) (actual time=1315.663..1315.663 rows=947911 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!eb'::text))  
         Buffers: shared hit=116399  
 Planning time: 0.135 ms  
 Execution time: 1584.670 ms  
(10 rows)  

Example 2. Full-text search - "Does Not Contain" query

In this example, I am going to create unevenly distributed data. This token contains a large number of repeated content, and I will filter them out with a "Does Not Contain" operator. Let's see if the index will be used.

1.Generate test data

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector('abc');  
INSERT 0 1000000  

2.Test search entries that do not contain ABC

The database automatically chooses indexed scanning and skips data blocks that do not need to be searched.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=220432.15..220433.71 rows=1 width=21) (actual time=107.936..107.936 rows=0 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!abc'::text))  
   Buffers: shared hit=268      
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..220432.15 rows=1 width=0) (actual time=107.933..107.933 rows=0 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!abc'::text))  
         Buffers: shared hit=268  
 Planning time: 0.183 ms  
 Execution time: 107.962 ms  
(9 rows)  

3.Force enable Full Table Scan. We can see that the performance is truly no match for indexed scanning, which proves that PostgreSQL is a cost-based optimizer and can automatically choose the most optimal execution plan.

postgres=# set enable_bitmapscan =off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..268870.00 rows=1 width=21) (actual time=1065.436..1065.436 rows=0 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!abc'::text))  
   Rows Removed by Filter: 1000000  
   Buffers: shared hit=6370  
 Planning time: 0.059 ms  
 Execution time: 1065.449 ms  
(7 rows)  

Example 3. Ordinary B-tree indexes - "Not Equal To" searches

This example is an ordinary search that uses a B-tree index. Let's see if PostgreSQL supports the "Not Equal To" indexed search.

The test method is similar to that of GIN test. In this test, I am going to use both unevenly and evenly distributed data.

1.For a query that uses the "Does Not Contain" operator on unevenly distributed data, the number of records filtered out with indexes is very small.

So far, "Does Not Contain" searches that uses the B-tree indexes are not supported at the kernel level. (Although this can be achieved by skipping Branch nodes that do not need to be scanned with INDEX SKIP SCAN.)

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# create index idx1 on notcontain (id);  
CREATE INDEX  
postgres=# set enable_bitmapscan =on;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                           QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..16925.00 rows=999999 width=36) (actual time=0.011..110.592 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.195 ms  
 Execution time: 156.013 ms  
(7 rows)  
  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                                   QUERY PLAN                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=10000000000.00..10000016925.00 rows=999999 width=36) (actual time=0.011..110.964 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.062 ms  
 Execution time: 156.461 ms  
(7 rows)  

2.Change the SQL write method to achieve indexed search. As INDEX SKIP SCAN is not used in this case, we need a JOIN process. The resulting performance is not very good.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Merge Anti Join  (cost=0.85..25497.28 rows=999999 width=36) (actual time=0.023..277.639 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Merge Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.009..148.520 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)  
         Output: t2.id  
         Index Cond: (t2.id = 1)  
         Heap Fetches: 1  
         Buffers: shared hit=4  
 Planning time: 0.223 ms  
 Execution time: 322.798 ms  
(14 rows)  
postgres=# set enable_mergejoin=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.05..27053.05 rows=999999 width=36) (actual time=0.060..251.232 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=4432  
   ->  Seq Scan on postgres.notcontain t1  (cost=0.00..14425.00 rows=1000000 width=36) (actual time=0.011..84.659 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=4425  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.147 ms  
 Execution time: 297.127 ms  
(18 rows)  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.48..35622.27 rows=999999 width=36) (actual time=0.036..324.401 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.017..149.383 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.141 ms  
 Execution time: 369.749 ms  
(18 rows)  

3.PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan.

Parallel scanning can remarkably improve the performance in case with a significantly large number of records.

postgres=# create  unlogged table ptbl(id int);  
CREATE TABLE  
postgres=# insert into ptbl select generate_series(1,100000000);  
  
postgres=# alter table ptbl set (parallel_workers =32);  
  
\timing  
  
Nonparallel query:  
postgres=# set max_parallel_workers_per_gather =0;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 11863.151 ms (00:11.863)  
  
并行查询:  
postgres=# set max_parallel_workers_per_gather =32;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 610.017 ms  

Performance see marked improvement when we use a parallel query.

Example 4. Ordinary partial B-tree indexes - "Not Equal To" searches

For fixed "Not Equal To" queries, we can use the partial index function of PostgreSQL.

create table tbl (id int, info text, crt_time timestamp, c1 int);

select * from tbl where c1<>1;

insert into tbl select generate_series(1,10000000), 'test', now(), 1;
insert into tbl values (1,'abc',now(),2);

create index idx_tbl_1 on tbl(id) where c1<>1;

Cool! It takes only 0.03 ms to run a "Not Equal To" search over 10 million entries of data using partial index.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1<>1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_tbl_1 on postgres.tbl  (cost=0.12..1.44 rows=1 width=21) (actual time=0.015..0.015 rows=1 loops=1)
   Output: id, info, crt_time, c1
   Buffers: shared hit=1 read=1
 Planning time: 0.194 ms
 Execution time: 0.030 ms
(5 rows)

Summary

With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays.

PostgreSQL uses a cost-based execution plan optimizer. It automatically chooses the best execution plan. When running a "Does Not Contain" search, PostgreSQL automatically chooses whether to use indexed scanning.

With regards to B-tree indexes, technically it can be used to implement the "Not Equal To" searches (INDEX SKIP SCAN), but it is currently not supported at the kernel level. We can use indexed scanning by adjusting the current SQL write methods.

PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan. Parallel scanning can remarkably improve the performance in case with a significantly large number of records.

PostgreSQL supports partial index, which supports partitioned or partial indexes. Performance is outstanding for "Not Equal To" queries with fixed conditions.

相关实践学习
跟我学:如何一键安装部署 PolarDB-X
《PolarDB-X 动手实践》系列第一期,体验如何一键安装部署 PolarDB-X。
目录
相关文章
|
7月前
|
SQL 关系型数据库 MySQL
将MySQL 数据迁移到 PostgreSQL
将MySQL 数据迁移到 PostgreSQL 可以采用以下步骤: 安装 PostgreSQL 数据库:首先,需要安装 PostgreSQL 数据库。可以从官方网站(https://www.postgresql.org/)下载最新版本的 PostgreSQL,并根据官方指南进行安装。 创建 PostgreSQL 数据库:在 PostgreSQL 中创建与 MySQL 数据库相对应的数据库。可以使用 pgAdmin 或命令行工具(如 psql)来创建数据库。例如,如果在 MySQL 中有一个名为 "mydb" 的数据库,那么可以在 PostgreSQL 中创建一个具有相同名称的数据库。 导
925 0
|
13天前
|
关系型数据库 MySQL 测试技术
【专栏】PostgreSQL数据库向MySQL迁移的过程、挑战及策略
【4月更文挑战第29天】本文探讨了PostgreSQL数据库向MySQL迁移的过程、挑战及策略。迁移步骤包括评估规划、数据导出与转换、创建MySQL数据库、数据导入。挑战包括数据类型不匹配、函数和语法差异、数据完整性和性能问题。应对策略涉及数据类型映射、代码调整、数据校验和性能优化。迁移后需进行数据验证、性能测试和业务验证,确保顺利过渡。在数字化时代,掌握数据库迁移技能对技术人员至关重要。
|
15天前
|
存储 关系型数据库 MySQL
【专栏】在众多开源数据库中,MySQL和PostgreSQL无疑是最受欢迎的两个
【4月更文挑战第27天】MySQL与PostgreSQL是两大主流开源数据库,各有特色。MySQL注重简单、便捷和高效,适合读操作密集场景,而PostgreSQL强调灵活、强大和兼容,擅长并发写入与复杂查询。MySQL支持多种存储引擎和查询缓存,PostgreSQL则具备扩展性、强事务支持和高可用特性。选择时应考虑项目需求、团队技能和预期收益。
|
24天前
|
SQL Oracle 关系型数据库
常用数据库的分页语句(mySQL、oracle、PostgreSQL、SQL Server)
常用数据库的分页语句(mySQL、oracle、PostgreSQL、SQL Server)
|
5月前
|
SQL 关系型数据库 MySQL
postgresql|数据库|MySQL数据库向postgresql数据库迁移的工具pgloader的部署和初步使用
postgresql|数据库|MySQL数据库向postgresql数据库迁移的工具pgloader的部署和初步使用
127 0
|
2月前
|
存储 关系型数据库 MySQL
TiDB与MySQL、PostgreSQL等数据库的比较分析
【2月更文挑战第25天】本文将对TiDB、MySQL和PostgreSQL等数据库进行详细的比较分析,探讨它们各自的优势和劣势。TiDB作为一款分布式关系型数据库,在扩展性、并发性能等方面表现突出;MySQL以其易用性和成熟性受到广泛应用;PostgreSQL则在数据完整性、扩展性等方面具有优势。通过对比这些数据库的特点和适用场景,帮助企业更好地选择适合自己业务需求的数据库系统。
|
7月前
|
存储 NoSQL 关系型数据库
深入探索地理空间查询:如何优雅地在MySQL、PostgreSQL及Redis中实现精准的地理数据存储与检索技巧
深入探索地理空间查询:如何优雅地在MySQL、PostgreSQL及Redis中实现精准的地理数据存储与检索技巧
782 0
|
4月前
|
canal 缓存 SpringCloudAlibaba
Springcloud Alibaba 使用Canal将MySql数据实时同步到Elasticsearch
本篇文章在Springcloud Alibaba使用Canal将Mysql数据实时同步到Redis保证缓存的一致性-CSDN博客 基础上使用canal将mysql数据实时同步到Elasticsearch。
|
4月前
|
canal 缓存 关系型数据库
Springcloud Alibaba使用Canal将Mysql数据实时同步到Redis保证缓存的一致性
canal [kə'næl] ,译意为水道/管道/沟渠,主要用途是基于 MySQL 数据库增量日志解析,提供增量数据订阅和消费。其诞生的背景是早期阿里巴巴因为杭州和美国双机房部署,存在跨机房同步的业务需求,实现方式主要是基于业务 trigger 获取增量变更。从 2010 年开始,业务逐步尝试数据库日志解析获取增量变更进行同步,由此衍生出了大量的数据库增量订阅和消费业务。
100 0
|
4月前
|
关系型数据库 MySQL 数据处理
MySQL vs. PostgreSQL:选择适合你的开源数据库
在当今信息时代,开源数据库成为许多企业和开发者的首选。本文将比较两个主流的开源数据库——MySQL和PostgreSQL,分析它们的特点、优势和适用场景,以帮助读者做出明智的选择。