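1. Background
When I got to the office this morning, the HBase cluster was misbehaving. The logs showed that HMaster could not connect to one of the RegionServers, so I set out to find the cause.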
2. Problem logs
RegionServer log:
2019-11-28 10:01:26,051 INFO [regionserver/node3:16020] hbase.ChoreService: Chore service for: regionserver/node3:16020 had [[ScheduledChore: Name: CompactedHFilesCleaner Period: 120000 Unit: MILLISECONDS], [ScheduledChore: Name: CompactionThroughputTuner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region node3,16020,1551923919360 Period: 120000 Unit: MILLISECONDS], [ScheduledChore: Name: MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS]] on shutdown
2019-11-28 10:01:26,051 INFO [regionserver/node3:16020.logRoller] regionserver.LogRoller: LogRoller exiting.
2019-11-28 10:01:26,051 INFO [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Split Thread to finish...
2019-11-28 10:01:26,052 INFO [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Large Compaction Thread to finish...
2019-11-28 10:01:26,052 INFO [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Small Compaction Thread to finish...
2019-11-28 10:01:26,053 INFO [regionserver/node3:16020] ipc.NettyRpcServer: Stopping server on /ip:16020
2019-11-28 10:01:26,106 INFO [regionserver/node3:16020] zookeeper.ZooKeeper: Session: 0x36952745d40005a closed
2019-11-28 10:01:26,106 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2019-11-28 10:01:26,106 INFO [regionserver/node3:16020] regionserver.HRegionServer: Exiting; stopping=node3,16020,1551923919360; zookeeper connection closed.
2019-11-28 10:01:26,106 INFO [shutdown-hook-0] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2019-11-28 10:01:26,107 INFO [shutdown-hook-0] regionserver.ShutdownHook: Shutdown hook finished.
HMaster log (拒绝连接 is the OS's localized "Connection refused" message):
2019-11-28 10:01:23,193 INFO [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=20, retries=106, started=229940 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on node3,16020,1551923919360
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:01:26,078 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [node4,16020,1551922317034]
......
......
2019-11-28 10:04:42,834 INFO [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=106, started=4349 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:04:46,867 INFO [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=106, started=8382 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:04:56,955 INFO [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=8, retries=106, started=18470 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
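3. Root cause
After some digging, the cause turned out to be clock skew between the HBase nodes. Time synchronization was probably overlooked when the cluster was first deployed, so the fix is simply to set it up now.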
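To confirm a diagnosis like this, it helps to measure how far each node's clock is from a reference clock. The following is a minimal sketch, not the method used in the original investigation. The function names are hypothetical, and the 30-second threshold is an assumption meant to mirror the default of hbase.master.maxclockskew (30000 ms in recent HBase versions, as far as I know); check your own configuration.

```shell
#!/bin/sh
# Threshold in seconds. Assumption: mirrors the 30000 ms default of
# hbase.master.maxclockskew; adjust it to your cluster's setting.
MAX_SKEW_SECONDS=30

# clock_skew EPOCH_A EPOCH_B
# Print the absolute difference between two epoch timestamps, in seconds.
clock_skew() {
  diff=$(( $1 - $2 ))
  echo "${diff#-}"
}

# check_skew NODE REMOTE_EPOCH LOCAL_EPOCH
# Print "NODE OK Ns" or "NODE SKEWED Ns" depending on the threshold.
check_skew() {
  skew=$(clock_skew "$2" "$3")
  if [ "$skew" -gt "$MAX_SKEW_SECONDS" ]; then
    echo "$1 SKEWED ${skew}s"
  else
    echo "$1 OK ${skew}s"
  fi
}

# Example (hypothetical host names, needs passwordless ssh):
#   for n in node1 node2 node3; do
#     check_skew "$n" "$(ssh "$n" date +%s)" "$(date +%s)"
#   done
```

The ssh round trip adds a little latency to the measurement, so this is only suitable for spotting skew of several seconds or more.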
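4. Solution
Synchronizing the clocks on all nodes fixes the problem. The steps below use ntp to do so.
- Install the ntp service
yum install ntp
- Enable ntpd at boot
chkconfig ntpd on
- Start the ntp service
service ntpd start
- Check the status of ntpd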
service ntpd status
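The status command only shows that the daemon is running, not that it has actually locked onto a time source. One way to check is to look at the output of ntpq -p, where the peer ntpd has selected is marked with a leading '*' and the offset column is in milliseconds. The helper below is a small sketch with a hypothetical name, assuming the standard ten-column ntpq -p layout.

```shell
#!/bin/sh
# peer_offset_ms: read `ntpq -p` output on stdin and print the offset
# (9th column, in milliseconds) of the peer ntpd has selected, which is
# the line starting with '*'. Prints nothing if no peer is selected yet.
peer_offset_ms() {
  awk '/^\*/ { print $9; exit }'
}

# Example:
#   ntpq -p | peer_offset_ms
```

An empty result shortly after starting ntpd is normal; it can take a few minutes before a peer is selected.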
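5. With internet access
Sync against a public time server (pick any one you like) by running the command below on every node. ntpdate refuses to run while ntpd is holding the NTP socket, so on nodes where ntpd is already running, stop it first with service ntpd stop.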
ntpdate ntp1.aliyun.com
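A single ntpdate run sets the clock once, and the clock will drift again over time. On nodes that rely on ntpdate rather than a running ntpd, one option is to repeat the sync from cron. The sketch below only generates the crontab line; the function name is hypothetical, and the /usr/sbin/ntpdate path and the 30-minute interval in the example are assumptions to adapt to your system.

```shell
#!/bin/sh
# ntpdate_cron_line SERVER MINUTES
# Print a crontab entry that re-syncs against SERVER every MINUTES minutes.
# Assumptions: ntpdate is installed at /usr/sbin/ntpdate (verify with
# `command -v ntpdate`); -u makes ntpdate use an unprivileged source port.
ntpdate_cron_line() {
  echo "*/$2 * * * * /usr/sbin/ntpdate -u $1 >/dev/null 2>&1"
}

# Example: append the entry to the current user's crontab (run as root):
#   ( crontab -l 2>/dev/null; ntpdate_cron_line ntp1.aliyun.com 30 ) | crontab -
```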
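6. Offline
Choose the server whose clock is closest to the real current time as the time server, then sync every other machine to it.
6.1 The machine acting as the time server must have ntpd running; the other machines do not need it. Start it with: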
service ntpd start
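One caveat for the offline case: an ntpd with no reachable upstream server may consider itself unsynchronized, and ntpdate on the other machines can then fail to find a suitable server. A common workaround, which is an assumption on my part and not part of the original steps, is to let ntpd fall back to its own system clock through the local clock driver at 127.127.1.0. The sketch below just prints the two ntp.conf lines; the function name is hypothetical.

```shell
#!/bin/sh
# local_clock_conf: print ntp.conf lines that let ntpd serve its own
# system clock when no upstream server is reachable.
# 127.127.1.0 is ntpd's local clock driver; stratum 10 is a deliberately
# high value so that any real upstream server added later is preferred.
local_clock_conf() {
  printf '%s\n' \
    'server 127.127.1.0' \
    'fudge 127.127.1.0 stratum 10'
}

# Example (as root, on the time server only; back up the file first):
#   local_clock_conf >> /etc/ntp.conf && service ntpd restart
```

After restarting ntpd it can take a few minutes before clients are able to sync against it.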
6.2 On each of the other machines, run the sync command in turn
ntpdate <time-server-ip>
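With more than a handful of machines, step 6.2 can be driven from one host over ssh. This is a rough sketch assuming passwordless root ssh to each node; sync_nodes and DRY_RUN are hypothetical names, and the addresses in the example are placeholders.

```shell
#!/bin/sh
# sync_nodes TIME_SERVER NODE...
# Run ntpdate against TIME_SERVER on each NODE over ssh, one at a time.
# With DRY_RUN=1 the commands are printed instead of executed.
sync_nodes() {
  server=$1
  shift
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh $node ntpdate $server"
    else
      # Keep going if one node fails, but report it on stderr.
      ssh "$node" ntpdate "$server" || echo "sync failed on $node" >&2
    fi
  done
}

# Example (placeholder addresses): preview first, then run for real.
#   DRY_RUN=1 sync_nodes 192.168.1.10 node2 node3
#   sync_nodes 192.168.1.10 node2 node3
```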
Once these steps are complete, the clocks are synchronized.