PivotalR: bridging R and PostgreSQL-like databases (e.g. Greenplum, and Hadoop access via HAWQ)

Overview:
PivotalR is an R package that translates R expressions into SQL, making it possible to mine big data from R. The data stays in a database such as PostgreSQL or Greenplum.
The user writes ordinary R syntax and never has to touch the database directly, because PivotalR translates the operations into SQL statements and returns the results to R.
No raw data is shipped to the R side in this process, so PivotalR can handle tasks that R alone cannot (R computes on data in memory, and data sets larger than memory are a problem).
PivotalR also wraps MADlib, which provides a large collection of machine-learning functions, regression functions, and more.
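
A minimal sketch of this workflow (the connection parameters and the table name "sales" below are placeholders, not from the package documentation):

library(PivotalR)
## connect to the database that holds the data (placeholder credentials)
cid <- db.connect(host = "localhost", port = 5432, dbname = "testdb",
                  user = "gpadmin", password = "")
x <- db.data.frame("sales", conn.id = cid) # wrap an existing table; no data moves yet
big <- x[x$amount > 1000, ]   # ordinary R subsetting, translated into a SQL WHERE clause
lk(big, 10)                   # run the query in the database, fetch only 10 rows into R
db.disconnect(conn.id = cid)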


Package documentation:
PivotalR-package
An R front-end to PostgreSQL and Greenplum databases, and a wrapper
for MADlib, the open-source library for in-database parallel and distributed
machine learning

Description
PivotalR is a package that enables users of R, the most popular open-source statistical programming
language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal
HD/HAWQ for Big Data analytics. It does so by providing an interface to the operations on tables/views
in the database. These operations are almost the same as those of data.frame. Thus the
users of R do not need to learn SQL when they operate on the objects in the database. The latest
code is available at https://github.com/madlib-internal/PivotalR. A training video and a
quick-start guide are available at http://zimmeee.github.io/gp-r/#pivotalr.

Details
Package: PivotalR
Type: Package
Version: 0.1.17
Date: 2014-09-15
License: GPL (>= 2)
Depends: methods, DBI, RPostgreSQL

This package enables R users to easily develop, refine and deploy R scripts that leverage the parallelism
and scalability of the database, as well as in-database analytics libraries, to operate on big
data sets that would otherwise not fit in R's memory - all without having to learn SQL, because
the package provides an interface R users are already familiar with.

The package also provides a wrapper for MADlib. MADlib is an open-source library for scalable
in-database analytics. It provides data-parallel implementations of mathematical, statistical and
machine-learning algorithms for structured and unstructured data. The number of machine learning
algorithms that MADlib covers is quickly increasing.

As an R front-end to PostgreSQL-like databases, this package minimizes the amount of data
transferred between the database and R. All the big data is stored in the database. The user writes
familiar R syntax, and the package translates it into SQL queries and sends them to the
database for parallel execution. The computation result, which is small (if it were as big as the original
data, what would be the point of big data analytics?), is returned to the user in R.
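
For example (a sketch reusing the abalone table that the Examples section below creates), an average is computed entirely inside the database and only a single number travels back to R:

x <- db.data.frame("abalone")  # wrapper for the existing table
avg.len <- mean(x$length)      # builds a db.Rquery, roughly SELECT avg(length) FROM abalone
lk(avg.len)                    # execute in the database and fetch the one-number result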

On the other hand, this package also gives regular SQL users access to the powerful
analytics and graphics functionality of R. Although the database itself has difficulty producing plots,
the result can be analyzed and presented beautifully with R.
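
A sketch of this division of labor, aggregating in the database and then plotting the small result with ordinary R graphics (the columns are from the abalone example data; that the grouping column comes first and the aggregate second in the returned data.frame is an assumption, so inspect it first):

x <- db.data.frame("abalone")
agg <- lk(by(x$rings, x$sex, mean)) # small data.frame: one row per sex
agg                                 # inspect the actual column names
barplot(agg[[2]], names.arg = agg[[1]], xlab = "sex", ylab = "mean rings")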

This current version of PivotalR provides the core R infrastructure and data frame functions as well
as over 50 analytical functions in R that leverage in-database execution (a short combined sketch follows the list). These include

* Data Connectivity - db.connect, db.disconnect, db.Rquery
* Data Exploration - db.data.frame, subsets
* R language features - dim, names, min, max, nrow, ncol, summary etc
* Reorganization Functions - merge, by (group-by), samples
* Transformations - as.factor, null replacement
* Algorithms - linear regression and logistic regression wrappers for MADlib
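
A short combined sketch of several of these, assuming a connection is already open and the abalone table exists (both are set up in the Examples section below):

x <- db.data.frame("abalone")
dim(x)               # number of rows and columns, counted in the database
names(x)             # column names
lk(min(x$length))    # in-database MIN, only the scalar is fetched
lk(max(x$height))    # in-database MAX
madlib.summary(x)    # per-column summary computed by MADlib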

Note
This package is different from PL/R, which is another way of using R with PostgreSQL-like
databases. PL/R enables the users to run R scripts from SQL. In the parallel Greenplum database,
one can use PL/R to implement parallel algorithms.

However, PL/R still requires non-trivial knowledge of SQL to use it effectively. It is mostly limited
to explicitly parallel jobs. And for the end user, it is still a SQL interface.

This package does not require any knowledge of SQL, and it works for both explicitly and implicitly
parallel jobs by employing the open-source MADlib library. It is much more scalable. And for the
end user, it is a pure R interface with the conventional R syntax.

Author(s)
Author: Predictive Analytics Team at Pivotal Inc. <user@madlib.net>, with contributions from
Data Scientist Team at Pivotal Inc.
Maintainer: Caleb Welton, Pivotal Inc. <cwelton@pivotal.io>

References
[1] MADlib website, http://madlib.net
[2] MADlib user docs, http://doc.madlib.net/master
[3] MADlib Wiki page, http://github.com/madlib/madlib/wiki
[4] MADlib contribution guide, https://github.com/madlib/madlib/wiki/Contribution-Guide
[5] MADlib on GitHub, https://github.com/madlib/madlib

See Also
madlib.lm Linear regression
madlib.glm Linear, logistic and multinomial logistic regressions
madlib.summary summary of a table in the database.

Examples
## Not run:
## get the help for the package
help("PivotalR-package")
## get help for a function
help(madlib.lm)
## create multiple connections to different databases
cid <- db.connect(port = 5433) # connection 1, default values for the other
                               # parameters; cid is used by delete() below
db.connect(dbname = "test", user = "qianh1", password = "", host = "remote.machine.com",
           madlib = "madlib07", port = 5432) # connection 2
db.list() # list the info for all the connections
## list all tables/views whose names contain "ornst"
db.objects("ornst")
## list all tables/views
db.objects(conn.id = 1)
## create a table and the R object pointing to the table
## using the example data that comes with this package
delete("abalone", conn.id = cid)
x <- as.db.data.frame(abalone, "abalone")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("abalone")
dim(x) # dimension of the data table
names(x) # column names of the data table
madlib.summary(x) # look at a summary for each column
lk(x, 20) # look at a sample of the data
## look at a sample sorted by id column
lookat(sort(x, decreasing = FALSE, x$id), 20)
lookat(sort(x, FALSE, NULL), 20) # look at a sample ordered randomly
## linear regression Examples --------
## fit a separate model to each group of data with the same sex
fit1 <- madlib.lm(rings ~ . - id | sex, data = x)
fit1 # view the result
lookat(mean((x$rings - predict(fit1, x))^2)) # mean square error
## plot the predicted values vs. the true values
ap <- x$rings # true values
ap$pred <- predict(fit1, x) # add a column which is the predicted values
## If the data set is very big, you do not want to load all the
## data points into R and plot. We can just plot a random sample.
random.sample <- lk(sort(ap, FALSE, "random"), 1000) # sort randomly
plot(random.sample) # plot a random sample
## fit a single model to all data treating sex as a categorical variable ---------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit2 <- madlib.lm(rings ~ . - id, data = y)
fit2 # view the result
lookat(mean((y$rings - predict(fit2, y))^2)) # mean square error
## logistic regression Examples --------
## fit a separate model to each group of data with the same sex
fit3 <- madlib.glm(rings < 10 ~ . - id | sex, data = x, family = "binomial")
fit3 # view the result
## the percentage of correct prediction
lookat(mean((x$rings < 10) == predict(fit3, x)))
## fit a single model to all data treating sex as a categorical variable ----------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit4 <- madlib.glm(rings < 10 ~ . - id, data = y, family = "binomial")
fit4 # view the result
## the percentage of correct prediction
lookat(mean((y$rings < 10) == predict(fit4, y)))
## Group by Examples --------
## mean value of each column except the "id" column
lk(by(x[,-1], x$sex, mean))
## standard deviation of each column except the "id" column
lookat(by(x[,-1], x$sex, sd))
## Merge Examples --------
## create two objects with different rows and columns
key(x) <- "id"
y <- x[1:300, 1:6]
z <- x[201:400, c(1,2,4,5)]
## get 100 rows
m <- merge(y, z, by = c("id", "sex"))
lookat(m, 20)
## operator Examples --------
y <- x$length + x$height + 2.3
z <- x$length * x$height / 3
lk(y < z, 20)
## ------------------------------------------------------------------------
## Deal with NULL values
delete("null_data")
x <- as.db.data.frame(null.data, "null_data")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("null_data")
dim(x)
names(x)
## ERROR, because of NULL values
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = x)
## remove NULL values
y <- x # make a copy
for (i in 1:10) y <- y[!is.na(y[i]),]
dim(y)
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = y)
fit
## Or we can replace all NULL values
x[is.na(x)] <- 45
## End(Not run)

Installation and usage:
> install.packages("PivotalR")
> library(PivotalR)
Loading required package: Matrix
Attaching package: ‘PivotalR’
The following objects are masked from ‘package:stats’:
    sd, var
The following object is masked from ‘package:base’:
    cbind
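
After loading, a quick first session might look like the following (all connection parameters are placeholders for your own database, and passing conn.id to db.disconnect is assumed to work as it does for db.objects):

> cid <- db.connect(host = "localhost", port = 5432, dbname = "postgres",
+                   user = "gpadmin", password = "")
> db.list()                    # show all open connections
> db.objects(conn.id = cid)    # list the tables/views visible on this connection
> db.disconnect(conn.id = cid) # close the connection when done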
