HBase2.0官方文档翻译-RegionServer Sizing Rules of Thumb

易虹 2020-04-30

数据存储与数据库 大数据 hbase string timestamp

36. On the number of column families

HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see Compaction.


Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.


36.1. 列族基数(Cardinality of ColumnFamilies)

Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.


37. Rowkey Design

37.1. 热点(Hotspotting)

Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server as that host is unable to service the requested load. It is important to design data access patterns such that the cluster is fully and evenly utilized.


To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time. Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks.



Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key. In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would. The number of possible prefixes correspond to the number of regions you want to spread the data across. Salting can be helpful if you have a few "hot" row key patterns which come up over and over amongst other more evenly-distributed rows. Consider the following example, which shows that salting can spread write load across multiple RegionServers, and illustrates some of the negative implications for reads.


Example 11. Salting Example

Suppose you have the following list of row keys, and your table is split such that there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' is another. In this table, all rows starting with 'f' are in the same region. This example focuses on rows with keys like the following:



Now, imagine that you would like to spread these across four different regions. You decide to use four different salts: a, b, c, and d. In this scenario, each of these letter prefixes will be on a different region. After applying the salts, you have the following rowkeys instead. Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.

现在,想象下你需要将他们分散到不同的region去。你决定使用四种不同的盐:a, b, c, and d。


Then, if you add another row, it will randomly be assigned one of the four possible salt values and end up near one of the existing rows.



Since this assignment will be random, you will need to do more work if you want to retrieve the rows in lexicographic order. In this way, salting attempts to increase throughput on writes, but has a cost during reads.



Instead of a random assignment, you could use a one-way hash that would cause a given row to always be "salted" with the same prefix, in a way that would spread the load across the RegionServers, but allow for predictability during reads. Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.


Example 12. Hashing Example

Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key foo0003 to always, and predictably, receive the a prefix. Then, to retrieve that row, you would already know the key. You could also optimize things so that certain pairs of keys were always in the same region, for instance.


反转键(Reversing the Key)

A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.


See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, and article on Salted Tables from the Phoenix project, and the discussion in the comments of HBASE-11682 for more information about avoiding hotspotting.

查看https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, 和Phoenix项目中其它加盐表相关的文章,以及HBASE-11682中评论的讨论,以了解更多关于避免热点的信息。

37.2. 递增rowkey/时序数据(Monotonically Increasing Row Keys/Timeseries Data)

In the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) there is a an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table’s regions (and thus, a single node), then moving onto the next region, etc.

With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad.

The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it’s best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.

在Tom White关于Hadoop的书中的HBase章节中:在权威指南里面有一个优化说明,其中指出要注意这样一种现象,所有客户端的写入操作全部集中在表的某一个region(也即,单个节点),然后转换到下一个region,一直这样。

使用单向递增的rowkey时(例如,使用时间戳),这就会发生。参考IKai Lan的连载,关于为什么在BigTable类的数据库中单向递增的rowkey会是问题:monotonically increasing values are bad。


If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively metric_type, which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.


See schema.casestudies for some rowkey design examples.


37.3. 尽可能最小化row和column大小(Try to minimize row and column sizes)

In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it’ll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (StoreFile (HFile)) to facilitate random access may end up occupying large chunks of the HBase allotted RAM because the cell value coordinates are large. Mark in the above cited comment suggests upping the block size so entries in the store file index happen at a larger interval or modify the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.

在HBase中,value总是带有其坐标;cell的value在系统中处理时总是携带着row,column名称,以及时间戳。如果你的row和column名称很大,尤其是相对于value来说,那么你可能会碰到一些有意思的情景。在HBASE-3551的末尾Marc Limotte描述了这样的一个案例。
其中,由于cell的value坐标过大,storefiles中存储的用来加速随机访问的索引数据占用了大量的HBase可用内存。在之前的回复中,Mark建议增加block的大小,使得store file中能以更大的间隔产生index,或者修改表设计,使用更小的row和column名称。压缩也能够带来较大的索引。查看用户邮件列表中的这个主题:a question storefileIndexSize。

Most of the time small inefficiencies don’t matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.


See keyvalue for more information on HBase stores data internally to see why this is important.


37.3.1. 列族(Column Families)

Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).

See KeyValue for more information on HBase stores data internally to see why this is important.


37.3.2. 属性(Attributes)

Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.

See keyvalue for more information on HBase stores data internally to see why this is important.


37.3.3. 行键长度(Rowkey Length)

Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing rowkeys.


37.3.4. 字节模式(Byte Patterns)

A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String — presuming a byte per character — you need nearly 3x the bytes.


Not convinced? Below is some sample code that you can run on your own.


// long
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26

Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example, this is what you will see in the shell when you increment a value:


hbase(main):001:0> incr 't', 'r', 'f:q', 1

hbase(main):002:0> get 't', 'r'
COLUMN                                        CELL
 f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell makes a best effort to print a string, and it this case it decided to just print the hex. The same will happen to your row keys inside the region names. It can be okay if you know what’s being stored, but it might also be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.


37.4.反转时间戳(Reverse Timestamps)

反向scan接口(Reverse Scan API)

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

HBASE-4811实现了一个可以反向scan表或其中一个范围的接口,减少你因为正向或反向扫描而进行模式优化的需要。该功能在HBase 0.98或更高版本中可用。更多信息可查看Scan.setReversed()。

A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g. key.

The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.

This technique would be used instead of using Number of Versions where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.

数据库处理中有这样一个常见的问题,快速找到最新版本的一个值。在特定的与此有关的案例中,把反转时间戳作为key的一部分,会有很大的帮助。Tom White的hadoop书籍的HBase章节:权威指南,关于在任意key的末尾添加(Long.MAX_VALUE - timestamp)的技巧。



37.5. 行键和列族(Rowkeys and ColumnFamilies)

Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.


37.6. 行键的不变性(Immutability of Rowkeys)

Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you’ve inserted a lot of data).


37.7. 行键和region分片的关系(Relationship Between RowKeys and Region Splits)

If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in Admin.createTable(byte[] startKey, byte[] endKey, numRegions) for 10 regions will generate the following splits…​

如果你预拆分你的表,理解你的行键在region边界如何分布非常重要。考虑这个使用可见十六进制字符作为先导位的行键(比如,"0000000000000000" to "ffffffffffffffff")的例子,用来说明为什么这很重要。运行Bytes.split(使用Admin.createTable(byte[] startKey, byte[] endKey, numRegions )创建region时使用的分片策略)将该范围的行键分为10个region,会得到下面这些分片。

(note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.


The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table. '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) is required.

问题在于,所有数据会集中在前2个region以及最后1个region,因而带来了热点region问题。参考ASCII表来理解为什么。'0' 对应字节的值为48,'f'对应字节的值为102,但由于可能的值只有[0-9]和[a-f],其中有很大一部分字节值(58-96)不会出现在行键区间中,此时需要一个自定义分片策略(比如,不依赖内置的分片方法)。

Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with any keyspace. Know your data.


Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.


To conclude this example, the following is an example of how appropriate splits can be pre-created for hex-keys:.


public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  return splits;

38. Number of Versions

38.1. 最大版本数(Maximum Number of Versions)

The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because as described in Data Model section HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.


It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size.


38.2. 最小版本数(Minimum Number of Versions)

Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M


39. Supported Datatypes

HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes.


There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.


39.1. Counters

One supported datatype that deserves special mention are "counters" (i.e., the ability to do atomic increments of numbers). See Increment in Table.

特别值得一提的一种数据类型是"计数器"(用于实现数值原子递增)。See Increment in Table.

Synchronization on counters are done on the RegionServer, not in the client.


40. Joins

If you have multiple tables, don’t forget to factor in the potential for Joins into the schema design.


41. Time To Live (TTL)

ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.


Store files which contains only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting minimum number of versions to other than 0 also disables this.

See HColumnDescriptor for more information.

只包含已过期数据的Store files会在minor合并的时候 被删除。可将hbase.store.delete.expired.storefile设置为false来禁用此功能。也可以将最小版本数设置为大于0的值来禁用。

Recent versions of HBase also support setting time to live on a per cell basis. See HBASE-10560 for more information. Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied to all cells updated on the server by the operation. There are two notable differences between cell TTL handling and ColumnFamily TTLs:

Cell TTLs are expressed in units of milliseconds instead of seconds.

A cell TTLs cannot extend the effective lifetime of a cell beyond a ColumnFamily level TTL setting.




42. Keeping Deleted Cells

By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed.


ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes.


Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells. A new "raw" scan options returns all deleted rows and the delete markers.



hbase> hbase> alter ‘t1′, NAME => ‘f1′, KEEP_DELETED_CELLS => true



Let us illustrate the basic effect of setting the KEEP_DELETED_CELLS attribute on a table.
First, without:


create 'test', {NAME=>'e', VERSIONS=>2147483647}
put 'test', 'r1', 'e:c1', 'value', 10
put 'test', 'r1', 'e:c1', 'value', 12
put 'test', 'r1', 'e:c1', 'value', 14
delete 'test', 'r1', 'e:c1',  11

hbase(main):017:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
 r1                                              column=e:c1, timestamp=10, value=value
1 row(s) in 0.0120 seconds

hbase(main):018:0> flush 'test'
0 row(s) in 0.0350 seconds

hbase(main):019:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
1 row(s) in 0.0120 seconds

hbase(main):020:0> major_compact 'test'
0 row(s) in 0.0260 seconds

hbase(main):021:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
1 row(s) in 0.0120 seconds

Notice how delete cells are let go.


Now let’s run the same test only with KEEP_DELETED_CELLS set on the table (you can do table or per-column-family):


hbase(main):005:0> create 'test', {NAME=>'e', VERSIONS=>2147483647, KEEP_DELETED_CELLS => true}
0 row(s) in 0.2160 seconds

=> Hbase::Table - test
hbase(main):006:0> put 'test', 'r1', 'e:c1', 'value', 10
0 row(s) in 0.1070 seconds

hbase(main):007:0> put 'test', 'r1', 'e:c1', 'value', 12
0 row(s) in 0.0140 seconds

hbase(main):008:0> put 'test', 'r1', 'e:c1', 'value', 14
0 row(s) in 0.0160 seconds

hbase(main):009:0> delete 'test', 'r1', 'e:c1',  11
0 row(s) in 0.0290 seconds

hbase(main):010:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0550 seconds

hbase(main):011:0> flush 'test'
0 row(s) in 0.2780 seconds

hbase(main):012:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0620 seconds

hbase(main):013:0> major_compact 'test'
0 row(s) in 0.0530 seconds

hbase(main):014:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0650 seconds

KEEP_DELETED_CELLS is to avoid removing Cells from HBase when the only reason to remove them is the delete marker. So with KEEP_DELETED_CELLS enabled deleted cells would get removed if either you write more versions than the configured max, or you have a TTL and Cells are in excess of the configured timeout, etc.


43. Secondary Indexes and Alternate Query Paths

This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.


There is no single answer on the best way to handle this because it depends on…​


  • Number of users
  • Data size and data arrival rate
  • Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
  • Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)
  • 用户数量
  • 数据大小和到达速率
  • 报表需求的复杂度(比如,完全自由的日期选择 vs 预先配置范围)
  • 查询所需的执行速度(比如,对于一个ad-hoc报表,90秒可能是合理的,但是对于其它情况就太久了)

and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.


It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.


Pay attention to Apache HBase Performance Tuning when implementing any of these approaches.


Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase

此外,可查看David Butler在问题列表中的回复,HBase, mail # user - Stargate+hbase。

43.1. (过滤器查询)Filter Query

Depending on the case, it may be appropriate to use Client Request Filters. In this case, no secondary index is created. However, don’t try a full-scan on a large table like this from an application (i.e., single-threaded client).


43.2. (周期性更新二级索引)Periodic-Update Secondary Index

A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.

See mapreduce.example.readwrite for more information.



43.3. (多写二级索引)Dual-Write Secondary Index

Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this is approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see secondary.indexes.periodic).


43.4. (汇总表)Summary Tables

Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.

See mapreduce.example.summary for more information.



43.5. (协处理器二级索引)Coprocessor Secondary Index

Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see coprocessors


44. Constraints

HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (e.g. make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found at Constraint since version 0.94.


45. Schema Design Case Studies

The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.


It is highly recommended that you read the rest of the HBase and Schema Design first, before reading these case studies.

强烈推荐你在阅读这些学习案例之前,先读一读HBase and Schema Design的剩余内容。

The following case studies are described:

  • Log Data / Timeseries Data
  • Log Data / Timeseries on Steroids
  • Customer/Order
  • Tall/Wide/Middle Schema Design
  • List Data


  • 日志数据 / 时序数据
  • 日志数据 / 聚合时序数据
  • 客户/订单
  • 高/宽/中等 模式设计
  • 列表数据

45.1. 案例学习-日志和时序数据(Case Study - Log Data and Timeseries Data)

Assume that the following data elements are being collected.

  • Hostname
  • Timestamp
  • Log event
  • Value/message


  • 主机名
  • 时间戳
  • 日志事件
  • 值/消息

We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?


45.1.1. 时间戳位于前导位(Timestamp In The Rowkey Lead Position)

The rowkey timestamp[log-event] suffers from the monotonically increasing rowkey problem described in Monotonically Increasing Row Keys/Timeseries Data.

timestamp[log-event]组成的行键会遇到Monotonically Increasing Row Keys/Timeseries Data中所描述的单调递增行键问题。

There is another pattern frequently mentioned in the dist-lists about "bucketing" timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.


long bucket = timestamp % numBuckets;  
to construct:  

As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.


45.1.2. 主机名位于前导位(Host In The Rowkey Lead Position)

The rowkey hostname[timestamp] is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. This approach would be useful if scanning by hostname was a priority.


45.1.3. 时间戳,或反转时间戳(Timestamp, or Reverse Timestamp?)

If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE – timestamp) will create the property of being able to do a Scan on hostname to obtain the most recently captured events.

如果最重要的访问方式是得到最新的事件,那么以反转时间戳的方式存储的话(e.g., timestamp = Long.MAX_VALUE – timestamp),将产生这样的特性:在对hostname进行scan时可以获取最近得到的事件。

Neither approach is wrong, it just depends on what is most appropriate for the situation.


Reverse Scan API
HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

HBASE-4811实现了一个接口,用来反向扫描一个表或其中一个范围,以减少为能够反向扫描而所需的设计优化。在HBase 0.98及其后版本可用。See Scan.setReversed() for more information。

45.1.4. 变长 或 定长行键(Variable Length or Fixed Length Rowkeys?)

It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is a and the event type is e1 then the resulting rowkey would be quite small. However, what if the ingested hostname is myserver1.mycompany.com and the event type is com.package1.subpackage2.subsubpackage3.ImportantService?


It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:


Composite Rowkey With Hashes:

  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes


  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes

Composite Rowkey With Numeric Substitution:

For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:

  • type
  • [bytes] variable length bytes for raw hostname or event-type.

A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter

So the resulting composite rowkey would be:

  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes



  • type
  • [bytes] 代表原始主机名和事件的定长字节数组



  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes

In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.


45.2. 案例学习 - 日志数据和聚合时序数据(Case Study - Log Data and Timeseries Data on Steroids)

This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.

这实际上就是OpenTSDB采用的方法。它把数据进行重写并按照一定的时间周期将行打包成列。对其细节的解释, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.

But this is how the general concept works: data is ingested, for example, in this manner…​

with separate rowkeys for each detailed event, but is re-written like this…​

and each of the above events are converted into columns stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.






45.3. (案例学习 - 客户/订单)Case Study - Customer/Order

Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.

The Customer record type would include all the things that you’d typically expect:



  • Customer number
  • Customer name
  • Address (e.g., city, state, zip)
  • Phone numbers, etc.


  • Customer number
  • Order number
  • Sales date
  • A series of nested objects for shipping locations and line-items (see Order Object Design for details)

Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:


[customer number][order number]

for an ORDER table.

However, there are more design decisions to make: are the raw values the best choices for rowkeys?


The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:


Composite Rowkey With Hashes:

  • [MD5 of customer number] = 16 bytes
  • [MD5 of order number] = 16 bytes

Composite Numeric/Hash Combo Rowkey:

  • [substituted long for customer number] = 8 bytes
  • [MD5 of order number] = 16 bytes


  • [MD5 of customer number] = 16 bytes
  • [MD5 of order number] = 16 bytes


  • [substituted long for customer number] = 8 bytes
  • [MD5 of order number] = 16 bytes

45.3.1. (单个表?多个表?)Single Table? Multiple Tables?

A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).


Customer Record Type Rowkey:


[type] = type indicating `1' for customer record type

Order Record Type Rowkey:


[type] = type indicating `2' for order record type


The advantage of this particular CUSTOMER++ approach is that organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it’s not as easy to scan for a particular record-type.


45.3.2. (订单对象设计)Order Object Design

Now we need to address how to model the Order object. Assume that the class structure is as follows:

(an Order can have multiple ShippingLocations

(a ShippingLocation can have multiple LineItems

there are multiple options on storing this data.


完全标准化(Completely Normalized)

With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.


The ORDER table’s rowkey was described above: schema.casestudies.custorder

The SHIPPING_LOCATION’s composite rowkey would be something like this:


shipping location number

The LINE_ITEM table’s composite rowkey would be something like this:


shipping location number

line item number




[shipping location number](e.g., 1st location, 2nd, etc.)



[shipping location number](e.g., 1st location, 2nd, etc.)

[line item number](e.g., 1st lineitem, 2nd, etc.)

Such a normalized model is likely to be the approach with an RDBMS, but that’s not your only option with HBase. The cons of such an approach is that to retrieve information about any Order, you will need:

Get on the ORDER table for the Order

Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances

Scan on the LINE_ITEM for each ShippingLocation

granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you’re just more aware of this fact.






带有记录类型的单个表(Single Table With Record Types)

With this approach, there would exist a single table ORDER that would contain


The Order rowkey was described above: schema.casestudies.custorder


[ORDER record type]

The ShippingLocation composite rowkey would be something like this:


[SHIPPING record type]

shipping location number

The LineItem composite rowkey would be something like this:


[LINE record type]

shipping location number

line item number



[ORDER record type]



[SHIPPING record type]

[shipping location number](e.g., 1st location, 2nd, etc.)



[LINE record type]

[shipping location number](e.g., 1st location, 2nd, etc.)

[line item number](e.g., 1st lineitem, 2nd, etc.)


A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.




[LINE record type]

[line item number](e.g., 1st lineitem, 2nd, etc., care must be taken that there are unique across the entire order)





shipToLine1 (denormalized from ShippingLocation)

shipToLine2 (denormalized from ShippingLocation)

shipToCity (denormalized from ShippingLocation)

shipToState (denormalized from ShippingLocation)

shipToZip (denormalized from ShippingLocation)

The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.


Object BLOB

With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table’s rowkey was described above: schema.casestudies.custorder, and a single column called "order" would contain an object that could be deserialized that contained a container Order, ShippingLocations, and LineItems.

这个方法中,整个订单对象图,以这样或那样的方式,处理为BLOB。例如,订单表的行键如上所述:schema.casestudies.custorder,然后单个的称为order的列会包含一个可被反序列化的对象,包含Order, ShippingLocations, and LineItems.

There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes such that older persisted structures can still be read back out of HBase.

有多种选项:JSON, XML, Java Serialization, Avro, Hadoop Writables, 等等。它们都可以做到:将对象图编码为字节数组。对于该方法,需要注意的是,确保向后兼容,旧的数据结构在对象模型变化之后仍然能够从HBase中读取。

Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.

优点是可以通过很小的IO管理复杂的对象图(比如, 在该例中单个get请求就可以获取整个订单信息 ),但缺点如前所述,需要小心序列化方面的向后兼容,序列化的语言依赖(比如,java的序列化只能通过java的客户端),获取一点点数据也需要反序列化整个对象,以及类似Hive这样的框架难以处理此类自定义对象。

45.4. Case Study - "Tall/Wide/Middle" Schema Design Smackdown

This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.

这个章节将描述出现在dist-list中的另外一些设计问题,特别是关于高表和宽表。这些是一般性的指南而不是法律 - 每个应用必须考虑其自身所需。

45.4.1. 行 vs 版本(Rows vs. Versions)

A common question is whether one should prefer rows or HBase’s built-in-versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.

Preference: Rows (generally speaking).



45.4.2. 行 vs 列(Rows vs. Columns)

Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece.

Preference: Rows (generally speaking). To be clear, this guideline is in the context is in extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."



45.4.3. 行作为列(Rows as Columns)

The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see schema.casestudies.log-steroids.

行vs列的中间选择是针对一些特定的行,将其数据打包作为列。 OpenTSDB 就是一个最好的例子,单个行表示一个既定的时间范围,而离散的事件作为列。这种方法通常会更复杂,并且需要额外的复杂度去重写你的数据,但在I/O性能上有优势。对方法的概要说明,查看schema.casestudies.log-steroids

45.5. 案例学习 - 列表数据(Case Study - List Data)

The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.



We’re looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is store the majority of the data in a key, so we could have something like:


<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using:



where each row would contain multiple values. So in one case reading the first thirty values would be:


scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}

And in the second case it would be


get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have ⇐ 30 total values in these lists, and some users would have millions (i.e. power-law distribution)


The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?


My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we’ll always need the same page size. I’ve ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we’d need to update all subsequent rows).

Thanks for help / suggestions / follow-up questions.


  • ANSWER *

If I understand you correctly, you’re ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

如果我理解的没错,你本质上是想存储"user, valueid, value"的元组?类似这样:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).


And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you’re really sure it is needed.



Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you’re giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn’t sound like you need that. Doing it this way is generally recommended (see here https://hbase.apache.org/book.html#schema.smackdown).

你的两个选项反映了人们在设计HBase模式时的一个常见问题:应该用高表还是宽表?你第一个模式时高表:每一行代表一个用户的一个值;行键是user + valueid,且只有一个列限定符叫做"the value"。如果你想基于有序行键进行扫描的话,这很不错。你可以从任意的user+valueid开始一个scan,读取接下来的30行,就可以了。你所放弃的是对于某个用户所有行的事务保证方面的能力,但貌似你并不需要这个。这是通常所推荐的方式(看这里:https://hbase.apache.org/book.html#schema.smackdown)。

Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I’m guessing you jumped to the "paginated" version because you’re assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you’re not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn’t be fundamentally worse. The client has methods that allow you to get specific slices of columns.


Note that neither case fundamentally uses more disk space than the other; you’re just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George’s excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).

注意,没有哪个选项会占用更多的空间;你只是将值的标识信息放在左边(行键中)或右边(列限定符)。在底层,每个键值对仍然会存储整个行键和列名称。(如果有一些困惑,花一个小时看下Lars George关于理解HBase模式设计的视频:http://www.youtube.com/watch?v=_HLoH_PgrLk)

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don’t have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! )


46. Operational and Performance Configuration Options

46.1. 优化HBase 服务端RPC处理(Tune HBase Server RPC Handling)

  • Set hbase.regionserver.handler.count (in hbase-site.xml) to cores x spindles for concurrency.
  • Optionally, split the call queues into separate read and write queues for differentiated service. The parameter hbase.ipc.server.callqueue.handler.factor specifies the number of call queues:

    • 0 means a single shared queue
    • 1 means one queue for each handler.
    • A value between 0 and 1 allocates the number of queues proportionally to the number of handlers. For instance, a value of .5 shares one queue between each two handlers.
  • Use hbase.ipc.server.callqueue.read.ratio (hbase.ipc.server.callqueue.read.share in 0.98) to split the call queues into read and write queues:

    • 0.5 means there will be the same number of read and write queues
    • < 0.5 for more read than write
    • > 0.5 for more write than read
  • Set hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) to split read call queues into small-read and long-read queues:

    • 0.5 means that there will be the same number of short-read and long-read queues
    • < 0.5 for more short-read
    • > 0.5 for more long-read
  • 将hbase.regionserver.handler.count设置为cpu数量的倍数.
  • 可选的,针对不同服务将请求队列进行隔离,hbase.ipc.server.callqueue.handler.factor参数定义了请求队列的数量:

    • 0 代表共用1个队列。
    • 1 代表每个handler对应1个队列。
    • 0-1中间的值,代表根据handler的数量,按比例分配队列。比如,0.5意味着2个handler共用1个队列。
  • 使用hbase.ipc.server.callqueue.read.ratio将请求队列拆分为读和写队列:

    • 0.5 代表读队列和写队列数量一样
    • < 0.5 代表读队列更多
    • > 0.5 代表写队列更多
  • 配置hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) 将读队列拆分为short-read和long-read队列:

    • 0.5 代表short-read和long-read队列数量一样
    • < 0.5 代表short-read队列更多
    • > 0.5 代表long-read队列更多

46.2. 对RPC禁用Nagle(Disable Nagle for RPC)

Disable Nagle’s algorithm. Delayed ACKs can add up to ~200ms to RPC round trip time. Set the following parameters:

  • In Hadoop’s core-site.xml:

    • ipc.server.tcpnodelay = true
    • ipc.client.tcpnodelay = true
  • In HBase’s hbase-site.xml:

    • hbase.ipc.client.tcpnodelay = true
    • hbase.ipc.server.tcpnodelay = true

禁用Nagle算法. 延迟的ACKs会将RPC往返时间最多增加到200ms。 Set the following parameters:

  • In Hadoop’s core-site.xml:

    • ipc.server.tcpnodelay = true
    • ipc.client.tcpnodelay = true
  • In HBase’s hbase-site.xml:

    • hbase.ipc.client.tcpnodelay = true
    • hbase.ipc.server.tcpnodelay = true

46.3. 限制服务端错误影响(Limit Server Failure Impact)

Detect regionserver failure as fast as reasonable. Set the following parameters:

  • In hbase-site.xml, set zookeeper.session.timeout to 30 seconds or less to bound failure detection (20-30 seconds is a good start).

    • Notice: the sessionTimeout of zookeeper is limited between 2 times and 20 times the tickTime(the basic time unit in milliseconds used by ZooKeeper.the default value is 2000ms.It is used to do heartbeats and the minimum session timeout will be twice the tickTime).
  • Detect and avoid unhealthy or failed HDFS DataNodes: in hdfs-site.xml and hbase-site.xml, set the following parameters:

    • dfs.namenode.avoid.read.stale.datanode = true
    • dfs.namenode.avoid.write.stale.datanode = true

在合理范围内尽快发现regionserver的错误. 配置以下参数:

  • 在hbase-site.xml中, 将zookeeper.session.timeout设置为30秒或更少 (20-30秒是个不错的开始)。

    • 注意: zookeeper的会话超时时间被限制为tickTime的2倍到20倍之间(ZooKeeper使用的一个基本时间单位.默认值是2000ms.它被用来发送心跳,且最小的会话过期时间应2倍于此值)。
  • 发现和避免非健康或失败的HDFS节点: in hdfs-site.xml and hbase-site.xml, set the following parameters:

    • dfs.namenode.avoid.read.stale.datanode = true
    • dfs.namenode.avoid.write.stale.datanode = true

46.4. Optimize on the Server Side for Low Latency

Skip the network for local blocks when the RegionServer goes to read from HDFS by exploiting HDFS’s Short-Circuit Local Reads facility. Note how setup must be done both at the datanode and on the dfsclient ends of the conneciton — i.e. at the RegionServer and how both ends need to have loaded the hadoop native .so library. After configuring your hadoop setting dfs.client.read.shortcircuit to true and configuring the dfs.domain.socket.path path for the datanode and dfsclient to share and restarting, next configure the regionserver/dfsclient side.


  • In hbase-site.xml, set the following parameters:

    • dfs.client.read.shortcircuit = true
    • dfs.client.read.shortcircuit.skip.checksum = true so we don’t double checksum (HBase does its own checksumming to save on i/os. See hbase.regionserver.checksum.verify for more on this.
    • dfs.domain.socket.path to match what was set for the datanodes.
    • dfs.client.read.shortcircuit.buffer.size = 131072 Important to avoid OOME — hbase has a default it uses if unset, see hbase.dfs.client.read.shortcircuit.buffer.size; its default is 131072.
  • Ensure data locality. In hbase-site.xml, set hbase.hstore.min.locality.to.skip.major.compact = 0.7 (Meaning that 0.7 <= n <= 1)
  • Make sure DataNodes have enough handlers for block transfers. In hdfs-site.xml, set the following parameters:

    • dfs.datanode.max.xcievers >= 8192
    • dfs.datanode.handler.count = number of spindles

Check the RegionServer logs after restart. You should only see complaint if misconfiguration. Otherwise, shortcircuit read operates quietly in background. It does not provide metrics so no optics on how effective it is but read latencies should show a marked improvement, especially if good data locality, lots of random reads, and dataset is larger than available cache.


Other advanced configurations that you might play with, especially if shortcircuit functionality is complaining in the logs, include dfs.client.read.shortcircuit.streams.cache.size and dfs.client.socketcache.capacity. Documentation is sparse on these options. You’ll have to read source code.

另一个你可能需要处理的高级配置,尤其是当日志里出现关于短路功能异常时,包含dfs.client.read.shortcircuit.streams.cache.size 和 dfs.client.socketcache.capacity。它们的配置文档比较分散。你可能需要阅读源码。

For more on short-circuit reads, see Colin’s old blog on rollout, How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop. The HDFS-347 issue also makes for an interesting read showing the HDFS community at its best (caveat a few comments).

更多关于短路读的信息,可以查看Colin的旧博客,How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop。有兴趣的话可以阅读HDFS-347,其展示了HDFS社区在这上面的努力(一些评论值得关注)。

46.5. JVM Tuning

46.5.1. Tune JVM GC for low collection latencies

Use the CMS collector: -XX:+UseConcMarkSweepGC

Keep eden space as small as possible to minimize average collection time. Example:

Optimize for low collection latency rather than throughput: -Xmn512m

Collect eden in parallel: -XX:+UseParNewGC

Avoid collection under pressure: -XX:+UseCMSInitiatingOccupancyOnly

Limit per request scanner result sizing so everything fits into survivor space but doesn’t tenure. In hbase-site.xml, set hbase.client.scanner.max.result.size to 1/8th of eden space (with -Xmn512m this is ~51MB )

Set max.result.size x handler.count less than survivor space






限制单个请求的结果大小,从而都可以放到survivor区而不是tenure区。在hbase-site.xml中,配置hbase.client.scanner.max.result.size为eden区的八分之一(with -Xmn512m this is ~51MB)

使max.result.size x handler.count小于survivor区。

46.5.2. OS-Level Tuning

Turn transparent huge pages (THP) off:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Set vm.swappiness = 0

Set vm.min_free_kbytes to at least 1GB (8GB on larger memory systems)

Disable NUMA zone reclaim with vm.zone_reclaim_mode = 0x

47. Special Cases

47.1. 对于那些希望快速失败而非等待的应用(For applications where failing quickly is better than waiting)

In hbase-site.xml on the client side, set the following parameters:

Set hbase.client.pause = 1000

Set hbase.client.retries.number = 3

If you want to ride over splits and region moves, increase hbase.client.retries.number substantially (>= 20)

Set the RecoverableZookeeper retry count: zookeeper.recovery.retry = 1 (no retry)

In hbase-site.xml on the server side, set the Zookeeper session timeout for detecting server failures: zookeeper.session.timeout ⇐ 30 seconds (20-30 is good).

47.2. 对于那些能够容忍稍微过时信息的应用(For applications that can tolerate slightly out of date information)

HBase timeline consistency (HBASE-10070) With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster. One RegionServer services the default or primary replica, which is the only replica that can service writes. Other RegionServers serve the secondary replicas, follow the primary RegionServer, and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from seconds to milliseconds. Phoenix supports timeline consistency as of 4.4.0 Tips:

  • Deploy HBase 1.0.0 or later.
  • Enable timeline consistent replicas on the server side.
  • Use one of the following methods to set timeline consistency:

    • Set the connection property Consistency to timeline in the JDBC connect string

HBase时间线一致性(HBase -10070)在启用读副本的情况下,region的只读副本分布在集群中。 一个RegionServer提供默认的或主副本服务, 写服务只能由该副本提供. 其它RegionServers提供从副本服务, 跟进主RegionServer, 只对已提交的更新可见.从副本是只读的,但当主副本挂掉时,能够立即提供读服务,将读不可用的时间从秒级减少到毫秒级。 Phoenix从4.4.0开始支持时间线一致性:

  • 部署HBase 1.0.0之后的版本。
  • 在服务端启用时间线一致性.
  • 使用下述的方法之一来配置时间线一致性:

    • Set the connection property Consistency to timeline in the JDBC connect string
登录 后评论
2017-08-01 13:27:00
HBase read replicas 功能介绍系列
2018-03-31 13:32:43
hbase2.0 vs hbase1.x 延时比较
2018-05-13 12:16:45
Phoenix 构建二级索引
2018-06-13 09:50:00