备案控制台

开发者社区

开发者社区数据库关系型数据库文章正文

PostgreSQL 聚合函数讲解 - 6 分组排序聚合

2015-12-14 9279

版权

版权声明：

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

云原生数据库 PolarDB MySQL 版，Serverless 5000PCU 100GB

云原生数据库 PolarDB 分布式版，标准版 2核8GB

云数据库 RDS MySQL Serverless，0.5-2RCU 50GB

简介：

分组排序聚合的例子.

Table 9-51. Ordered-Set Aggregate Functions

Function	Direct Argument Type(s)	Aggregated Argument Type(s)	Return Type	Description
`mode() WITHIN GROUP (ORDER BYsort_expression)`		any sortable type	same as sort expression	returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)
`percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)`	double precision	double precisionor interval	same as sort expression	continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed
`percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression)`	double precision[]	double precisionor interval	array of sort expression's type	multiple continuous percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the value corresponding to that percentile
`percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)`	double precision	any sortable type	same as sort expression	discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction
`percentile_disc(fractions) WITHIN GROUP (ORDER BY sort_expression)`	double precision[]	any sortable type	array of sort expression's type	multiple discrete percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the input value corresponding to that percentile

All the aggregates listed in Table 9-51 ignore null values in their sorted input. For those that take a fraction parameter, the fraction value must be between 0 and 1; an error is thrown if not. However, a null fraction value simply produces a null result.

mode比较好理解, 就是取分组中出现频率最高的值或表达式, 如果最高频率的值有多个, 则随机取一个.

postgres=# create table test(id int, info text);

CREATE TABLE

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (1,'test2');

INSERT 0 1

postgres=# insert into test values (1,'test3');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# select * from test;

id | info

----+-------

1 | test1

1 | test1

1 | test2

1 | test3

2 | test1

2 | test1

2 | test1

3 | test4

3 | test4

3 | test4

3 | test4

3 | test4

(12 rows)

取出所有数据中, 出现频率最高的info, 有可能是test1也有可能是test4, 因为他们的出现频率一致.

mode的返回结果数据类型和order by后面的表达式一致.

postgres=# select mode() within group (order by info) from test;

mode

-------

test1

(1 row)

如果按INFO来分组的话, 取出出现频率最高的info, 实际上这个操作是没有任何意义的, 返回值就是所有记录的info的唯一值.

postgres=# select mode() within group (order by info) from test group by info;

mode

-------

test1

test2

test3

test4

(4 rows)

按id来分组, 取出组内出现频率最高的info值, 这个是有意义的.

postgres=# select mode() within group (order by info) from test group by id;

mode

-------

test1

test1

test4

(3 rows)

id=1 , 出现频率最高的info是test1. 出现2次.

如下 :

postgres=# select id,info,count(*) from test group by id,info;

id | info | count

----+-------+-------

1 | test1 | 2

1 | test3 | 1

3 | test4 | 5

1 | test2 | 1

2 | test1 | 3

(5 rows)

如果要返回mode()并返回频率次数. 可以使用row_number()窗口来实现. 如下.

postgres=# select id,info,cnt from (select id,info,cnt,row_number() over(partition by id order by cnt desc) as rn from (select id,info,count(*) cnt from test group by id,info) t) t where t.rn=1;

id | info | cnt

----+-------+-----

1 | test1 | 2

2 | test1 | 3

3 | test4 | 5

(3 rows)

其他, mode的返回结果数据类型和order by后面的表达式一致.

postgres=# select mode() within group (order by id) from test;

mode

------

3

(1 row)

postgres=# select mode() within group (order by id+1) from test;

mode

------

4

(1 row)

另外还有4个函数是和数据分布有关的, 需要指定从0到1的分布位置. 返回排序后, 在指定分布位置的值或表达式的值.

src/backend/utils/adt/orderedsetaggs.c

if (percentile < 0 || percentile > 1 || isnan(percentile))

ereport(ERROR,

(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),

errmsg("percentile value %g is not between 0 and 1",

percentile)));

同时还需要注意区分连续分布和离散分布.

postgres=# create table test(id int, info text);

CREATE TABLE

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test2');

INSERT 0 1

postgres=# insert into test values (3,'test2');

INSERT 0 1

postgres=# insert into test values (4,'test2');

INSERT 0 1

postgres=# insert into test values (5,'test2');

INSERT 0 1

postgres=# insert into test values (6,'test2');

INSERT 0 1

postgres=# insert into test values (7,'test2');

INSERT 0 1

postgres=# insert into test values (8,'test3');

INSERT 0 1

postgres=# insert into test values (100,'test3');

INSERT 0 1

postgres=# insert into test values (1000,'test4');

INSERT 0 1

postgres=# select * from test;

id | info

------+-------

1 | test1

2 | test2

3 | test2

4 | test2

5 | test2

6 | test2

7 | test2

8 | test3

100 | test3

1000 | test4

(10 rows)

取连续分布的中位数可以用percentile_cont(0.5)来获得.

postgres=# select percentile_cont(0.5) within group (order by id) from test;

percentile_cont

-----------------

5.5

(1 row)

这个5.5是怎么计算来的呢? 参考本文末尾 :

If (CRN = FRN = RN) then the result is

(value of expression from row at RN)

Otherwise the result is

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

解释 :

N = 当前分组的行数 = 10

RN = (1+传入参数*(N-1)) = (1+0.5*(10-1)) = 5.5

CRN = ceiling(RN) = 6

FRN = floor(RN) = 5

value of expression for row at FRN : 当前分组内第FRN行的值 = 5

value of expression for row at CRN : 当前分组内第CRN行的值 = 6

所以最终中位数值 :

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN) =

(6-5.5)*(5) + (5.5 - 5)*(6) = 5.5;

使用info分组 :

postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

4.5 | test2

54 | test3

1000 | test4

(4 rows)

验证这个值4.5 | test2 :

2 | test2

3 | test2

4 | test2

5 | test2

6 | test2

7 | test2

N = 当前分组的行数 = 6

RN = (1+传入参数*(N-1)) = (1+0.5*(6-1)) = 3.5

CRN = ceiling(RN) = 4

FRN = floor(RN) = 3

value of expression for row at FRN : 当前分组内第FRN行的值 = 4

value of expression for row at CRN : 当前分组内第CRN行的值 = 5

所以最终中位数值 :

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN) =

(4-3.5)*(4) + (3.5 - 3)*(5) = 4.5;

当输入参数为数组时, 返回值也是数组, 如下 :

postgres=# select percentile_cont(array[0.5, 1]) within group (order by id) from test;

percentile_cont

-----------------

{5.5,1000}

(1 row)

接下来看一下稀疏分布 :

返回行号大于等于指定百分比的值或表达式值.

例如 :

postgres=# select id from test;

id

------

1

2

3

4

5

6

7

8

100

1000

(10 rows)

当前组一共10行, 取位置在0.5的.即行号>=0.5*10的第一行的值或表达式的值.

postgres=# select percentile_disc(0.5) within group (order by id) from test;

percentile_disc

-----------------

5

(1 row)

postgres=# select percentile_disc(0.5) within group (order by id^2) from test;

percentile_disc

-----------------

25

(1 row)

输入0.11, 表示行号返回>=1.1的第一行的值.

postgres=# select percentile_disc(0.11) within group (order by id) from test;

percentile_disc

-----------------

2

(1 row)

再看个例子

postgres=# select id,info,count(*) over (partition by info) from test;

id | info | count

------+-------+-------

1 | test1 | 1

2 | test2 | 6

3 | test2 | 6

4 | test2 | 6

5 | test2 | 6

6 | test2 | 6

7 | test2 | 6

8 | test3 | 2

100 | test3 | 2

1000 | test4 | 1

(10 rows)

取分组的数据, 主要看test2 这个组一共有6行, 0.3*6=1.8, 所以它需要取第二行的数据.

postgres=# select info,percentile_disc(0.3) within group (order by id) from test group by info;

info | percentile_disc

-------+-----------------

test1 | 1

test2 | 3

test3 | 8

test4 | 1000

(4 rows)

[注意]

最终计算的是表达式的分布数, 而不是计算列值的分布数后再计算表达式.

验证如下 :

或者你可以看代码 :

postgres=# select percentile_cont(0.5) within group (order by id^2),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

20.5 | test2

5032 | test3

1000000 | test4

(4 rows)

postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

4.5 | test2

54 | test3

1000 | test4

(4 rows)

postgres=# select 4.5^2;

?column?

---------------------

20.2500000000000000

(1 row)

postgres=# select 54^2;

?column?

----------

2916

(1 row)

[参考]

1. http://www.postgresql.org/docs/devel/static/functions-aggregate.html

2. http://blog.163.com/digoal@126/blog/static/16387704020152223539859/

3. http://blog.163.com/digoal@126/blog/static/1638770402015224124337/

4. http://blog.163.com/digoal@126/blog/static/16387704020137124851944

5. src/backend/utils/adt/orderedsetaggs.c

6. 算法 :

PERCENTILE_CONT函数解释 :

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, you can compute the row number you are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+(P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

The final result will be:

If (CRN = FRN = RN) then the result is

(value of expression from row at RN)

Otherwise the result is

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

PERCENTILE_DISC函数解释 :

The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value. This expression must be constant within each aggregate group. The ORDER BY clause takes a single expression that can be of any type that can be sorted.

For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.

MEDIAN(中位数)详解, Oracle有单独的计算中位数的函数, 实际上就是PERCENTILE_CONT(0.5) :

MEDIAN is an inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

This function takes as arguments any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. If you specify only expr, then the function returns the same data type as the numeric data type of the argument. If you specify the OVER clause, then Oracle Database determines the argument with the highest numeric precedence, implicitly converts the remaining arguments to that data type, and returns that data type.

The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group, Oracle calculates the row number (RN) of interest with the formula RN = (1 + (0.5*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

The final result will be:

if (CRN = FRN = RN) then

(value of expression from row at RN)

else

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

文章标签：

云原生数据库 PolarDB

关系型数据库

移动开发

PostgreSQL

Oracle

关键词：

云原生数据库 PolarDB排序

云原生数据库 PolarDB聚合

云数据库 RDS PostgreSQL 版排序

云数据库 RDS PostgreSQL 版聚合

云数据库 RDS MySQL 版聚合函数讲解

相关实践学习

使用PolarDB和ECS搭建门户网站

本场景主要介绍基于PolarDB和ECS实现搭建门户网站。

阿里云数据库产品家族及特性

阿里云智能数据库产品团队一直致力于不断健全产品体系，提升产品性能，打磨产品功能，从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手，打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系，结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台，为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案，提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。

德哥,digoal

目录

相关文章

翎野君

|

3月前

|

关系型数据库 PostgreSQL

PostgreSQL排序字段不唯一导致分页查询结果出现重复数据

PostgreSQL排序字段不唯一导致分页查询结果出现重复数据

翎野君

41 0 0

yzshover

|

9月前

|

关系型数据库大数据 PostgreSQL

PostgreSQL16-新特性-并行聚合

PostgreSQL16-新特性-并行聚合

yzshover

102 0 0

德哥,digoal

|

关系型数据库 PostgreSQL

PostgreSQL listagg within group (order by) 聚合兼容用法 string_agg ( order by) - 行列变换，CSV构造...

标签 PostgreSQL , order-set agg , listagg , string_agg , order 背景 listagg — Rows to Delimited Strings The listagg function transforms values from a g...

德哥,digoal

5802 0 0

德哥,digoal

|

7月前

|

存储关系型数据库数据库

沉浸式学习PostgreSQL|PolarDB 13: 博客、网站按标签内容检索, 并按匹配度排序

本文主要教大家怎么用好数据库, 而不是怎么运维管理数据库、怎么开发数据库内核.

德哥,digoal

697 0 1

德哥,digoal

|

存储算法 Cloud Native

PolarDB 开源版通过 rum 实现高效率搜索和高效率排序的解决方案

PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力. 本文将介绍PolarDB 开源版通过 rum 实现高效率搜索和高效率排序的解决方案

德哥,digoal

174 0 0

传说中的德哥

|

关系型数据库 PostgreSQL

【重新发现PostgreSQL之美】- 48 聚合、窗口过滤器

大家好，这里是重新发现PostgreSQL之美 - 48 聚合、窗口过滤器

传说中的德哥

124 0 0

德哥,digoal

|

SQL 分布式计算并行计算

PostgreSQL 并行计算解说之9 - parallel 自定义并行聚合

标签 PostgreSQL , cpu 并行 , smp 并行 , 并行计算 , gpu 并行 , 并行过程支持背景 PostgreSQL 11 优化器已经支持了非常多场合的并行。简单估计，已支持27余种场景的并行计算。 parallel seq scan parallel index scan

德哥,digoal

552 0 0

德哥,digoal

|

SQL 关系型数据库 PostgreSQL

PostgreSQL 空间聚合性能 - 行政区、电子围栏空间聚合 - 时间、空间热力图

标签 PostgreSQL , 空间聚合 , 空间热力图 , 行政区 , 电子围栏背景某个时间区间(或其他条件)，出现在某些围栏、行政区(多边形信息)中的对象(空间点信息)有多少，按颜色深浅渲染这些多边形。

德哥,digoal

2395 0 0

德哥,digoal

|

物联网关系型数据库流计算

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

标签 PostgreSQL , IoT , 轨迹聚合 , pipelinedb , 流计算 , 实时聚合背景 IoT场景，车联网场景，共享单车场景，人的行为位点等，终端实时上报的是孤立的位点，我们需要将其补齐成轨迹。

德哥,digoal

1776 0 0

李博 bluemind

|

SQL 关系型数据库 PostgreSQL

PostgreSQL 10.1 手册_部分 II. SQL 语言_第 11 章索引_11.10. 索引和排序规则

11.10. 索引和排序规则一个索引在每一个索引列上只能支持一种排序规则。如果需要多种排序规则，你可能需要多个索引。考虑这些语句： CREATE TABLE test1c ( id integer, content varchar COLLATE "x" ); CREATE INDEX test1c_content_index ON test1c (content); 该索引自动使用下层列的排序规则。

李博 bluemind

1166 0 0

数据库

关系型数据库

热门文章

最新文章

你的数据库用对索引了吗？一文揭秘PolarDB XPlan索引选择

为PostgreSQL讨说法 - 浅析《UBER ENGINEERING SWITCHED FROM POSTGRES TO MYSQL》

德哥的PostgreSQL私房菜 - 史上最屌PG资料合集

什么场景应该用 MongoDB ？

PostgreSQL 如何实现批量更新、删除、插入

Linux 性能诊断 perf使用指南

PostgreSQL 9种索引的原理和应用场景

PostgreSQL 如何潇洒的处理每天上百TB的数据增量

Jedis常见异常汇总

时序数据库分析 - TimescaleDB时序数据库介绍

【设计模式系列笔记】建造者模式

【设计模式系列笔记】工厂模式

【设计模式系列笔记】抽象工厂模式

【设计模式系列笔记】单例模式

PolarDB闪电助攻，《香肠派对》百亿好友关系实现毫秒级查询

【MySQL系列笔记】MVCC

你的数据库用对索引了吗？一文揭秘PolarDB XPlan索引选择

【MySQL系列笔记】索引

【MySQL系列笔记】隔离锁

【MySQL系列笔记】事务

相关产品

云原生数据库 PolarDB

文档详情产品详情

相关课程

更多

MySQL企业常见架构与调优经验分享

云数据库MySQL版快速上手教程

阿里云云原生数据仓库AnalyticDB MySQL版使用教程

MySQL实战进阶

数据库及SQL/MySQL基础

云数据库MySQL快速入门

相关电子书

更多

搭建电商项目架构连接MySQL

搭建4层电商项目架构，实战连接MySQL

PolarDB MySQL引擎重磅功能及产品能力盛大发布

相关实验场景

更多

测试场景（RDS无优惠）

MySQL引擎及架构优化

基于EBS部署高性能的MySQL服务

使用Python完成RDS数据开发及分析

Excel文件转存到RDS数据库

使用CloudLens采集RDS日志并进行审计分析

推荐镜像

更多

mysql

postgresql

mariadb

下一篇

阿里云oss简介和使用流程