最近 CDH 5.2.0 发布了，想看看其做了哪些改进、带来哪些不兼容以及是否有必要升级现有的 hadoop 集群。

1. CDH 5.2.0 新特性

1.1. Apache Avro

Avro 版本使用1.7.6，重要的一些改变：

AVRO-1398。增加同步间隔，从16k 调整到64k，该参数可以在 mapreduce 的配置参数中通过 avro.mapred.sync.interval 参数来设置
AVRO-1355。schema 中不能包括相同的 field 名称。

1.2. Apache Hadoop

HDFS

提供新的功能：

HDFS Data at Rest Encryption。hdfs 数据的加密，该功能在5.2.0中还有一些限制，尚不能用于生产环境。
每一个 HDFS 文件上添加了一个新的属性：XAttrs，详见：https://issues.apache.org/jira/browse/HDFS-2006
使用 HTTP proxy server时的Authentication改进
增加了一个新的 Metrics sink，允许直接将监控数据写到 Graphite
Specification for Hadoop Compatible Filesystem effort
增加 OfflineImageViewer 通过 WebHDFS API 浏览 fsimage
对 NFS 支持的改进
hdfs daemons 的 web ui 改进

MapReduce

CDH 5.2 提供了一个 mapper 端 shuffle 的优化实现，使用该实现需要修改原来的实现类，默认未开启该实现。

可以修改 mapreduce.job.map.output.collector.class 参数为 org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator来开启该特性。

使用了自定义的可写的类型或者比较器时，无法使用该特性。

YARN

Fair Scheduler 新特性：
允许为每个队列设置 fairsharePreemptionThreshold 属性，该值在 fair-scheduler.xml 中设置，默认值为0.5
允许为每个队列设置 fairsharePreemptionTimeout 属性，该值在 fair-scheduler.xml 中设置
在 web ui 中可以显示 Steady Fair Share
Fair Scheduler 改进：
Fair Scheduler uses Instantaneous Fair Share (fairshare that considers only active queues) for scheduling decisions to improve the time to achieve steady state (fairshare).
maxAMShare 默认值设为0.5，意思是只有一半的集群资源可以被 Application Master 使用。该参数可以在 fair-scheduler.xml 中设置。
YARN 的 rest api 支持提交和杀掉 application 。
YARN 的 timeline store 和 Kerberos 集成

1.3 Apache Crunch

Scrunch 改进：

新的 join API
新的 aggregation API，支持 Algebird-based aggregation
增加新的模块 crunch-hive，用于使用 Crunch 读写 ORC 文件。

1.4 Apache Flume

集成 Kafka，添加 KafkaSource 和 KafkaSink。
Kite Sink 可以写数据到 hive 和 hbase。
Flume agent 可以通过 zookeeper 配置（试验中）。
嵌入式的 agent 支持拦截器。
syslog source 支持配置那个字段可以保留。
File Channel replay 速度变快
添加新的正则表达式查询替换拦截器
Backup checkpoint 可以可选的被压缩。

1.5 Hue

添加新的应用修改数据和表上的 Sentry 的角色和权限
arch App
添加 Heatmap, Tree, Leaflet 组件
Micro-analysis of fields
Exclusion facets
Oozie Dashboard: bulk actions, faster display
File Browser: drag-and-drop upload, history, ACLs edition
Hive and Impala: LDAP pass-through, query expiration, SSL (Hive), new graphs
Job Browser: YARN kill application button

1.6 Apache HBase

HBase 版本升级到 0.98.6

1.7 Apache Hive

hive 版本升级到 0.13，增加如下特性：
where 语句支持子查询
Common table expressions
Parquet 支持 timestamp
HiveServer2 可以配置 hiverc 文件，当连接的时候，自动执行该文件内容
Permanent UDFs
HiveServer2 添加 session 和操作超时
Beeline 接受一个 -i 参数执行初始化的 sql 文件
新的 join 语法(implicit joins)
建表语句支持 AVRO 存储格式
Hive 支持额外的数据类型：
hive 可以读 hive 和 impala 创建的 char 和 varchar 数据类型
Impala 可以读 hive 和 impala 创建的 char 和 varchar 数据类型
DESCRIBE DATABASE 命令添加两个新属性：owner_name 和 owner_type。

1.7 Impala

impala 版本升级到 2.0，改进包括：

添加 spill to disk 支持，当内存不够时自动转换到磁盘上进行处理。
子查询改进：
WHERE 语句中支持子查询，可以用于 in 查询
支持 EXISTS 和 NOT EXISTS 操作
子查询中可以使用 IN 和 NOT IN
where 语句可以使用如下语句： WHERE column = (SELECT MAX(some_other_column FROM table) 或者 WHERE column IN (SELECT some_other_column FROM table WHERE conditions)
Correlated subqueries let you cross-reference values from the outer query block and the subquery.
Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
添加几个聚合函数： RANK(), LAG(), LEAD(), FIRST_VALUE()
添加新的数据类型：
VARCHAR
char
Security方面的改进：
Using Multiple Authentication Methods with Impala
GRANT
REVOKE
CREATE ROLE
DROP ROLE
SHOW ROLES
–disk_spill_encryption
Impala 可以读取 gzip, bzip, 或 Snappy 的压缩数据
Query hints can now use comment notation, /* +hint_name */ or – +hint_name，更详细说明见：Hints
QUERY_TIMEOUT_S 用于设置查询超时时间。
添加 VAR_SAMP() 和 VAR_POP()，分别为 VARIANCE_SAMP() 和 VARIANCE_POP() 别名
添加新的日期和时间类型函数：DATE_PART()
APPX_COUNT_DISTINCT，如果设置该参数，impala 会用 NDV() 函数代替 COUNT(DISTINCT)，加快查询速度并支持多个 COUNT(DISTINCT) 操作
添加 DECODE() 函数，是 CASE() 函数的一种简写方式，详细参考 Impala Conditional Functions
STDDEV(), STDDEV_POP(), STDDEV_SAMP(), VARIANCE(), VARIANCE_POP(), VARIANCE_SAMP(), NDV() 返回 double 类型
Parquet 块大小默认值由 1G 改为256M，也可以通过 PARQUET_FILE_SIZE 参数设置
支持 Anti-joins，可以使用 LEFT ANTI JOIN 和 RIGHT ANTI JOIN 语句
impala-shell 中可以执行 set 语句，可以设置 PARQUET_FILE_SIZE、MEM_LIMIT和 SYNC_DDL 等参数，详细说明见：SET Statement
impala-shell 可以读取 $HOME/.impalarc 中的配置，详细说明见：impala-shell Configuration Options

1.8 Kite

Kite is an open source set of libraries, references, tutorials, and code samples for building data-oriented systems and applications.

1.9 Apache Parquet (incubating)

版本更新： 5.2 Parquet is rebased on Parquet 1.5 and Parquet-format 2.1.0.

1.10 Apache Spark

Apache Spark/Streaming 版本使用 1.1

稳定性和性能改进
新的 sort-based shuffle 实现，默认未开启。
Spark UI 更好的监控性能改进
PySpark 支持 Hadoop InputFormats
改进 Yarn 的支持，并修复一些 bug

1.11 Apache Sqoop

CDH 5.2 Sqoop 1 is rebased on Sqoop 1.4.5

Mainframe connector added.
Parquet support added.

2.0 不兼容改变

2.1 HDFS

当没有快照目录时，getSnapshottableDirListing() 方法返回 null
NameNode ` -finalize 启动参数被删除，为了完成集群的升级，应该使用 hdfs dfsadmin -finalizeUpgrade` 命令
libhdfs 函数返回正确的错误码
HDFS balancer 命令运行错误时候返回0，运行成功返回1
Disable symlinks temporarily
Files named .snapshot or .reserved must not exist within HDFS.

Change in High-Availability Support：

CDH5 中唯一的 HA 实现是基于 Quorum-based storage，使用 NFS 的共享存储不再支持。

2.2 MapReduce

CATALINA_BASE 变量不再用于决定一个组件是否配置为 YARN 或者 MRv1
YARN Fair Scheduler ACL change. Root queue defaults to everybody, and other queues default to nobody.
YARN 高可用配置参数修改了 key 名称
YARN_HOME 改为 HADOOP_YARN_HOME
yarn-site.xml 中的以下参数改名：
mapreduce.shuffle 改为 mapreduce_shuffle
yarn.nodemanager.aux-services.mapreduce.shuffle.class 改为 yarn.nodemanager.aux-services.mapreduce_shuffle.class
yarn.resourcemanager.resourcemanager.connect.max.wait.secs 改为 yarn.resourcemanager.connect.max-wait.secs
yarn.resourcemanager.resourcemanager.connect.retry_interval.secs 改名为 yarn.resourcemanager.connect.retry-interval.secs
yarn.resourcemanager.am.max-retries 改名为 yarn.resourcemanager.am.max-attempts

2.3 HBase

HBase 版本变化太大，这里不做说明。

2.4 Hive

CDH 5 提供一个新的离线命令用于升级元数据：

schemaTool -d <dbType> -upgradeSchema

CDH 4.x 和 CDH 5 中不兼容的地方：

CDH 4 JDBC 客户端和 CDH5 HiveServer2 不兼容
连接 HiveServer2 需要 CDH5 的 jar 包
因为权限和并发问题，hive 命令行和 hiveserver1 将删除不再使用，建议使用 HiveServer2 和 Beeline
CDH 5 Hue 不能用于 CDH 4 的HiveServer2
删除 npath 函数
Cloudera recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors, for serialization purposes. See HIVE-5380 for more details.
The SerDe interface has been changed which requires the custom SerDe modules to be reworked.
The decimal data type format has changed as of CDH 5 Beta 2 and is not compatible with CDH 4.

CDH 5 和 CDH 5.2.x 中不兼容的地方：

The CDH 5.2 Hive JDBC driver is not wire-compatible with the CDH 5.1

2.4 Apache Spark

3. 性能改进

1、Disabling Transparent Hugepage Compaction

查看是否开启

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

关闭该特性，并将其加入到 /etc/rc.local

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

2、设置 swap 交换

查看是否开启：

cat /proc/sys/vm/swappiness

On most systems, it is set to 60 by default. This is not suitable for Hadoop clusters nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons.

建议修改为0：

sysctl -w vm.swappiness=0

3、Improving Performance in Shuffle Handler and IFile Reader

Shuffle Handler，开启预先读取数据：

对于 YARN，设置 mapreduce.shuffle.readahead.bytes，默认值为4MB
对于 MRv1，设置 mapred.tasktracker.shuffle.readahead.bytes，默认值为4MB

IFile Reader，开启预先读取IFile文件可以改进合并文件性能，开启该特性，请设置 mapreduce.ifile.readahead property 为 true，默认值为 true，更进一步，可以设置mapreduce.ifile.readahead.bytes 参数值，该值默认为4MB

4、MapReduce配置最佳实践

设置 mapreduce.tasktracker.outofband.heartbeat 为 true，该值默认为 false

在一个小集群中，设置 JobTracker heartbeat 间隔到一个更小的值，参数为 apreduce.jobtracker.heartbeat.interval.min ，默认值为10

5、立即启动 MapReduce 的 JVM

对于小任务，设置 mapred.reduce.slowstart.completed.maps 值为0，对于比较大的任务，最大设置为 50%

6、调整 MRv1 日志级别

mapreduce.map.log.level
mapreduce.reduce.log.level

4. 存在的问题

4.1 Apache Flume

Flume does not provide a native sink that stores the data that can be directly consumed by Hive.
Fast Replay does not work with encrypted File Channel

4.2 Apache Hadoop

DistCp between unencrypted and encrypted locations fails
NameNode - KMS communication fails after long periods of inactivity
Spark fails when the KMS is configured to use SSL
Files inside encryption zones cannot be read in Hue
Cannot move encrypted files to trash
No error when changing permission to 777 on .snapshot directory
Snapshots do not retain directories’ quotas settings
NameNode cannot use wildcard address in a secure cluster
Permissions for dfs.namenode.name.dir incorrectly set.
hadoop fsck -move does not work in a cluster with host-based Kerberos
HttpFS cannot get delegation token without prior authenticated request.
DistCp does not work between a secure cluster and an insecure cluster in some cases
Using DistCp with Hftp on a secure cluster using SPNEGO requires that the dfs.https.port property be configured
Offline Image Viewer (OIV) tool regression: missing Delimited outputs.
Snapshot operations are not supported by ViewFileSystem

4.3 MapReduce

Starting an unmanaged ApplicationMaster may fail
No JobTracker becomes active if both JobTrackers are migrated to other hosts
Hadoop Pipes may not be usable in an MRv1 Hadoop installation done through tarballs
Task-completed percentage may be reported as slightly under 100% in the web UI, even when all of a job’s tasks have successfully completed.
Encrypted shuffle in MRv2 does not work if used with LinuxContainerExecutor and encrypted web UIs.
Link from ResourceManager to Application Master does not work when the Web UI over HTTPS feature is enabled.
Hadoop client JARs don’t provide all the classes needed for clean compilation of client code
The ulimits setting in /etc/security/limits.conf is applied to the wrong user if security is enabled.

4.4 Apache Hive

Hive’s Timestamp type cannot be stored in Parquet
Hive’s Decimal type cannot be stored in Parquet and Avro
Hive creates an invalid table if you specify more than one partition with alter table
PostgreSQL 9.0+ requires additional configuration，需要设置 standard_conforming_strings 为 off
Setting hive.optimize.skewjoin to true causes long running queries to fail
JDBC - executeUpdate does not returns the number of rows modified
Hive Auth (Grant/Revoke/Show Grant) statements do not support fully qualified table names (default.tab1)

4.5 Apache Parquet (incubating)

Parquet file writes run out of memory if (number of partitions) times (block size) exceeds available memory
Hive cannot read arrays in Parquet written by parquet-avro or parquet-thrift

5. 总结

本篇文章主要是翻译了 cloudera 官网上关于 CDH5.2 的新特性、不兼容变化、性能改进以及可能存在的问题等相关文档，以便清楚的了解 hadoop 各组件的特性并为是否升级 hadoop 版本做出决策支持。

CDH 5.2.0 的改变