
HBase DroppedSnapshotException

HBase failed. The HMaster log shows that an exception caused a RegionServer to abort:
ERROR org.apache.hadoop.hbase.master.MasterRpcServices: Region Server * report a fatal error:
ABORTING region server *: Replay of WAL required. Forcing server shutdown
Cause:
org.apache.hadoop.hbase.DroppedSnapshotException: region ...
...
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=75627217. WAL system stuck?
...
What is the cause of this failure, and how can it be avoided?

hbase小能手 2018-11-06 16:52:27 4166 0
2 answers
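  • My previous account, wangccsy@126.com, somehow became an enterprise account and cannot be changed back to a personal one, so I registered a new personal account. Veteran programmer. Proficient in Java, C#, and databases; familiar with software development processes and workflows. Passed software qualification exams including System Analyst, Project Manager, and System Architect. Happy to make progress together with everyone.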

    This is a sync timeout; the error is quite explicit. Is something in the synchronous call path taking a long time?

    2019-07-17 23:12:27
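
To make the "sync timeout" reading concrete, here is a minimal sketch of how a bounded wait on a sync result turns a stuck writer into an exception like the one in the question's log. It is not HBase's actual WAL code: `SyncTimeoutSketch`, `waitForSync`, and the use of `CompletableFuture` are assumptions made for illustration, and a plain `IOException` stands in for `TimeoutIOException`.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a WAL sync wait; NOT HBase's real implementation.
class SyncTimeoutSketch {

  /** Waits up to timeoutMs for the sync result; throws IOException if it does not arrive. */
  static long waitForSync(CompletableFuture<Long> syncResult, long timeoutMs, long sequence)
      throws IOException {
    try {
      return syncResult.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The writer never signalled completion, e.g. because the underlying storage is stalled.
      throw new IOException("Failed to get sync result after " + timeoutMs
          + " ms for sequence=" + sequence + ", sync stuck?", e);
    } catch (ExecutionException e) {
      // The sync itself failed; surface the underlying cause.
      throw new IOException("Sync failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for sync", e);
    }
  }

  public static void main(String[] args) throws Exception {
    // Healthy case: the sync result is already available.
    CompletableFuture<Long> healthy = CompletableFuture.completedFuture(42L);
    System.out.println("synced up to " + waitForSync(healthy, 100, 42L));

    // Stuck case: nothing ever completes this future, so the wait times out.
    CompletableFuture<Long> stuck = new CompletableFuture<>();
    try {
      waitForSync(stuck, 100, 43L);
    } catch (IOException e) {
      System.out.println("timed out: " + e.getMessage());
    }
  }
}
```

The point of the sketch is that the timeout is a symptom: the waiting thread is healthy, and the component that should complete the sync (in the question's case, the WAL write path on top of HDFS) is the one to investigate.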
  • Community administrator

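    DroppedSnapshotException is generally thrown from flushRegion. For background, see the fix for this exception in JIRA issue HBASE-644.
    In addition, when this exception occurs, check whether the memstore has exceeded its threshold, and check GC activity.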

    private boolean flushRegion(final Region region, final boolean emergencyFlush,
        boolean forceFlushAllStores) {
      long startTime = 0;
      synchronized (this.regionsInQueue) {
        FlushRegionEntry fqe = this.regionsInQueue.remove(region);
        // Use the start time of the FlushRegionEntry if available
        if (fqe != null) {
          startTime = fqe.createTime;
        }
        if (fqe != null && emergencyFlush) {
          // Need to remove from region from delay queue.  When NOT an
          // emergencyFlush, then item was removed via a flushQueue.poll.
          flushQueue.remove(fqe);
        }
      }
      if (startTime == 0) {
        // Avoid getting the system time unless we don't have a FlushRegionEntry;
        // shame we can't capture the time also spent in the above synchronized
        // block
        startTime = EnvironmentEdgeManager.currentTime();
      }
      lock.readLock().lock();
      try {
        notifyFlushRequest(region, emergencyFlush);
        FlushResult flushResult = region.flush(forceFlushAllStores);
        boolean shouldCompact = flushResult.isCompactionNeeded();
        // We just want to check the size
        boolean shouldSplit = ((HRegion)region).checkSplit() != null;
        if (shouldSplit) {
          this.server.compactSplitThread.requestSplit(region);
        } else if (shouldCompact) {
          server.compactSplitThread.requestSystemCompaction(
              region, Thread.currentThread().getName());
        }
        if (flushResult.isFlushSucceeded()) {
          long endTime = EnvironmentEdgeManager.currentTime();
          server.metricsRegionServer.updateFlushTime(endTime - startTime);
        }
      } catch (DroppedSnapshotException ex) {
        // Cache flush can fail in a few places. If it fails in a critical
        // section, we get a DroppedSnapshotException and a replay of wal
        // is required. Currently the only way to do this is a restart of
        // the server. Abort because hdfs is probably bad (HBASE-644 is a case
        // where hdfs was bad but passed the hdfs check).
        server.abort("Replay of WAL required. Forcing server shutdown", ex);
        return false;
      } catch (IOException ex) {
        LOG.error("Cache flush failed" + (region != null ? (" for region " +
            Bytes.toStringBinary(region.getRegionInfo().getRegionName())) : ""),
          RemoteExceptionHandler.checkIOException(ex));
        if (!server.checkFileSystem()) {
          return false;
        }
      } finally {
        lock.readLock().unlock();
        wakeUpIfBlocking();
      }
      return true;
    }

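    As the code comment shows, the official handling of DroppedSnapshotException is to restart the server. The cause may be an HDFS fault that the health check failed to detect. This is just my personal view.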

    2019-07-17 23:12:27
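
The control flow in the quoted `flushRegion` method can be reduced to a small, self-contained sketch. Everything here is a simplified stand-in rather than HBase code: `FlushAbortSketch`, `Outcome`, and `Flush` are hypothetical names, and the nested `DroppedSnapshotException` only mirrors the fact that the real class is a subclass of `IOException`. The sketch shows why the more specific `catch` clause has to come first and how the two failure modes lead to different outcomes.

```java
import java.io.IOException;

// Hypothetical stand-ins; only the exception name mirrors HBase, the bodies are illustrative.
class FlushAbortSketch {

  /** Stand-in for HBase's DroppedSnapshotException, which is also an IOException subclass. */
  static class DroppedSnapshotException extends IOException {
    DroppedSnapshotException(String msg) {
      super(msg);
    }
  }

  /** Simplified outcomes; the real method returns a boolean and may abort the server. */
  enum Outcome { FLUSHED, LOGGED_AND_CONTINUED, ABORT_SERVER }

  /** Hypothetical flush operation that may fail with an IOException. */
  interface Flush {
    void run() throws IOException;
  }

  static Outcome flushRegion(Flush flush) {
    try {
      flush.run();
      return Outcome.FLUSHED;
    } catch (DroppedSnapshotException ex) {
      // Must precede the IOException clause: the compiler rejects a subclass catch
      // placed after its superclass. The snapshot is lost, so only a WAL replay
      // (a server restart) can recover the data.
      return Outcome.ABORT_SERVER;
    } catch (IOException ex) {
      // Any other I/O failure: the quoted code logs it and checks the file system.
      return Outcome.LOGGED_AND_CONTINUED;
    }
  }

  public static void main(String[] args) {
    System.out.println(flushRegion(() -> { }));
    System.out.println(flushRegion(() -> {
      throw new IOException("transient I/O error");
    }));
    System.out.println(flushRegion(() -> {
      throw new DroppedSnapshotException("failed in critical section");
    }));
  }
}
```

Running `main` prints `FLUSHED`, `LOGGED_AND_CONTINUED`, and `ABORT_SERVER` in that order. In the question's log the third path was taken, which is why the RegionServer shut itself down instead of retrying the flush.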