PivotalR: bridging R and PostgreSQL-like databases (e.g. Greenplum, and Hadoop access via HAWQ)

Overview:
PivotalR is an R package that translates R expressions into SQL, making it possible to mine big data from R. The data stays in a database such as PostgreSQL or Greenplum.
The user writes ordinary R syntax and never has to touch the database directly, because PivotalR translates the operations into SQL statements and returns the results to R.
No raw data is shipped to the R side in this process, so PivotalR can handle tasks that R alone cannot (R computes on data in memory, and data sets larger than memory are a problem).
PivotalR also wraps MADlib, which provides a large collection of machine-learning functions, regression functions, and more.
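
A minimal sketch of this workflow (the connection parameters and the table name "sales" below are placeholders, not from the package documentation):

library(PivotalR)
## connect to the database that holds the data (placeholder credentials)
cid <- db.connect(host = "localhost", port = 5432, dbname = "testdb",
                  user = "gpadmin", password = "")
x <- db.data.frame("sales", conn.id = cid) # wrap an existing table; no data moves yet
big <- x[x$amount > 1000, ]   # ordinary R subsetting, translated into a SQL WHERE clause
lk(big, 10)                   # run the query in the database, fetch only 10 rows into R
db.disconnect(conn.id = cid)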


Package documentation:
PivotalR-package
An R front-end to PostgreSQL and Greenplum databases, and a wrapper
for MADlib, the open-source library for in-database parallel and distributed
machine learning

Description
PivotalR is a package that enables users of R, the most popular open-source statistical programming
language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal
HD/HAWQ for Big Data analytics. It does so by providing an interface to the operations on tables/views
in the database. These operations are almost the same as those of data.frame. Thus the
users of R do not need to learn SQL when they operate on the objects in the database. The latest
code is available at https://github.com/madlib-internal/PivotalR. A training video and a
quick-start guide are available at http://zimmeee.github.io/gp-r/#pivotalr.

Details
Package: PivotalR
Type: Package
Version: 0.1.17
Date: 2014-09-15
License: GPL (>= 2)
Depends: methods, DBI, RPostgreSQL

This package enables R users to easily develop, refine and deploy R scripts that leverage the parallelism
and scalability of the database, as well as in-database analytics libraries, to operate on big
data sets that would otherwise not fit in R's memory - all without having to learn SQL, because
the package provides an interface R users are already familiar with.

The package also provides a wrapper for MADlib. MADlib is an open-source library for scalable
in-database analytics. It provides data-parallel implementations of mathematical, statistical and
machine-learning algorithms for structured and unstructured data. The number of machine learning
algorithms that MADlib covers is quickly increasing.

As an R front-end to PostgreSQL-like databases, this package minimizes the amount of data
transferred between the database and R. All the big data is stored in the database. The user writes
familiar R syntax, and the package translates it into SQL queries and sends them to the
database for parallel execution. The computation result, which is small (if it were as big as the original
data, what would be the point of big data analytics?), is returned to the user in R.
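
For example (a sketch reusing the abalone table that the Examples section below creates), an average is computed entirely inside the database and only a single number travels back to R:

x <- db.data.frame("abalone")  # wrapper for the existing table
avg.len <- mean(x$length)      # builds a db.Rquery, roughly SELECT avg(length) FROM abalone
lk(avg.len)                    # execute in the database and fetch the one-number result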

On the other hand, this package also gives regular SQL users access to the powerful
analytics and graphics functionality of R. Although the database itself has difficulty producing plots,
the result can be analyzed and presented beautifully with R.
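
A sketch of this division of labor, aggregating in the database and then plotting the small result with ordinary R graphics (the columns are from the abalone example data; that the grouping column comes first and the aggregate second in the returned data.frame is an assumption, so inspect it first):

x <- db.data.frame("abalone")
agg <- lk(by(x$rings, x$sex, mean)) # small data.frame: one row per sex
agg                                 # inspect the actual column names
barplot(agg[[2]], names.arg = agg[[1]], xlab = "sex", ylab = "mean rings")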

This current version of PivotalR provides the core R infrastructure and data frame functions as well
as over 50 analytical functions in R that leverage in-database execution (a short combined sketch follows the list). These include

* Data Connectivity - db.connect, db.disconnect, db.Rquery
* Data Exploration - db.data.frame, subsets
* R language features - dim, names, min, max, nrow, ncol, summary etc
* Reorganization Functions - merge, by (group-by), samples
* Transformations - as.factor, null replacement
* Algorithms - linear regression and logistic regression wrappers for MADlib
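
A short combined sketch of several of these, assuming a connection is already open and the abalone table exists (both are set up in the Examples section below):

x <- db.data.frame("abalone")
dim(x)               # number of rows and columns, counted in the database
names(x)             # column names
lk(min(x$length))    # in-database MIN, only the scalar is fetched
lk(max(x$height))    # in-database MAX
madlib.summary(x)    # per-column summary computed by MADlib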

Note
This package is different from PL/R, which is another way of using R with PostgreSQL-like
databases. PL/R enables the users to run R scripts from SQL. In the parallel Greenplum database,
one can use PL/R to implement parallel algorithms.

However, PL/R still requires non-trivial knowledge of SQL to use it effectively. It is mostly limited
to explicitly parallel jobs. And for the end user, it is still a SQL interface.

This package does not require any knowledge of SQL, and it works for both explicitly and implicitly
parallel jobs by employing the open-source MADlib library. It is much more scalable. And for the
end user, it is a pure R interface with the conventional R syntax.

Author(s)
Author: Predictive Analytics Team at Pivotal Inc. <user@madlib.net>, with contributions from
Data Scientist Team at Pivotal Inc.
Maintainer: Caleb Welton, Pivotal Inc. <cwelton@pivotal.io>

References
[1] MADlib website, http://madlib.net
[2] MADlib user docs, http://doc.madlib.net/master
[3] MADlib Wiki page, http://github.com/madlib/madlib/wiki
[4] MADlib contribution guide, https://github.com/madlib/madlib/wiki/Contribution-Guide
[5] MADlib on GitHub, https://github.com/madlib/madlib

See Also
madlib.lm Linear regression
madlib.glm Linear, logistic and multinomial logistic regressions
madlib.summary summary of a table in the database.

Examples
## Not run:
## get the help for the package
help("PivotalR-package")
## get help for a function
help(madlib.lm)
## create multiple connections to different databases
cid <- db.connect(port = 5433) # connection 1, default values for the other
                               # parameters; cid is used by delete() below
db.connect(dbname = "test", user = "qianh1", password = "", host = "remote.machine.com",
           madlib = "madlib07", port = 5432) # connection 2
db.list() # list the info for all the connections
## list all tables/views whose names contain "ornst"
db.objects("ornst")
## list all tables/views
db.objects(conn.id = 1)
## create a table and the R object pointing to the table
## using the example data that comes with this package
delete("abalone", conn.id = cid)
x <- as.db.data.frame(abalone, "abalone")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("abalone")
dim(x) # dimension of the data table
names(x) # column names of the data table
madlib.summary(x) # look at a summary for each column
lk(x, 20) # look at a sample of the data
## look at a sample sorted by id column
lookat(sort(x, decreasing = FALSE, x$id), 20)
lookat(sort(x, FALSE, NULL), 20) # look at a sample ordered randomly
## linear regression Examples --------
## fit a separate model to each group of data with the same sex
fit1 <- madlib.lm(rings ~ . - id | sex, data = x)
fit1 # view the result
lookat(mean((x$rings - predict(fit1, x))^2)) # mean square error
## plot the predicted values vs. the true values
ap <- x$rings # true values
ap$pred <- predict(fit1, x) # add a column which is the predicted values
## If the data set is very big, you do not want to load all the
## data points into R and plot. We can just plot a random sample.
random.sample <- lk(sort(ap, FALSE, "random"), 1000) # sort randomly
plot(random.sample) # plot a random sample
## fit a single model to all data treating sex as a categorical variable ---------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit2 <- madlib.lm(rings ~ . - id, data = y)
fit2 # view the result
lookat(mean((y$rings - predict(fit2, y))^2)) # mean square error
## logistic regression Examples --------
## fit a separate model to each group of data with the same sex
fit3 <- madlib.glm(rings < 10 ~ . - id | sex, data = x, family = "binomial")
fit3 # view the result
## the percentage of correct prediction
lookat(mean((x$rings < 10) == predict(fit3, x)))
## fit a single model to all data treating sex as a categorical variable ----------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit4 <- madlib.glm(rings < 10 ~ . - id, data = y, family = "binomial")
fit4 # view the result
## the percentage of correct prediction
lookat(mean((y$rings < 10) == predict(fit4, y)))
## Group by Examples --------
## mean value of each column except the "id" column
lk(by(x[,-1], x$sex, mean))
## standard deviation of each column except the "id" column
lookat(by(x[,-1], x$sex, sd))
## Merge Examples --------
## create two objects with different rows and columns
key(x) <- "id"
y <- x[1:300, 1:6]
z <- x[201:400, c(1,2,4,5)]
## get 100 rows
m <- merge(y, z, by = c("id", "sex"))
lookat(m, 20)
## operator Examples --------
y <- x$length + x$height + 2.3
z <- x$length * x$height / 3
lk(y < z, 20)
## ------------------------------------------------------------------------
## Deal with NULL values
delete("null_data")
x <- as.db.data.frame(null.data, "null_data")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("null_data")
dim(x)
names(x)
## ERROR, because of NULL values
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = x)
## remove NULL values
y <- x # make a copy
for (i in 1:10) y <- y[!is.na(y[i]),]
dim(y)
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = y)
fit
## Or we can replace all NULL values
x[is.na(x)] <- 45
## End(Not run)

Installation and usage:
> install.packages("PivotalR")
> library(PivotalR)
Loading required package: Matrix
Attaching package: ‘PivotalR’
The following objects are masked from ‘package:stats’:
    sd, var
The following object is masked from ‘package:base’:
    cbind
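
After loading, a quick first session might look like the following (all connection parameters are placeholders for your own database, and passing conn.id to db.disconnect is assumed to work as it does for db.objects):

> cid <- db.connect(host = "localhost", port = 5432, dbname = "postgres",
+                   user = "gpadmin", password = "")
> db.list()                    # show all open connections
> db.objects(conn.id = cid)    # list the tables/views visible on this connection
> db.disconnect(conn.id = cid) # close the connection when done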
