MHA failover GTID 专题

这里以masterha_master_switch为背景详解各种可能遇到的场景

假定环境(经典三节点)

host_1(host_1:3306) (current master)
 +--host_2(host_2:3306 slave[candidate master])
 +--host_3(host_3:3306 etl)

一、Master : MySQL down

1.1 etl 延迟8小时

配置文件中加上no_check_delay=0 即可忽略报错

1.2 slave(候选master)比etl还要落后更多

1.2.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB gtid状态

* master  host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_2.000002 |     2885 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


* slave (candidate master)  host_1

           Retrieved_Gtid_Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353
                Auto_Position: 1

* etl (other slave)  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353-446356
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356
                Auto_Position: 1



### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 10:43:49 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov  9 10:43:49 2017 - [info] Starting master failover.
Thu Nov  9 10:43:49 2017 - [info]
Thu Nov  9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 10:43:49 2017 - [info]
Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_2 is reachable.
Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_1 is reachable.
Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_3 is reachable.
Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)
Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Thu Nov  9 10:43:51 2017 - [info] GTID failover mode = 1
Thu Nov  9 10:43:51 2017 - [info] Dead Servers:
Thu Nov  9 10:43:51 2017 - [info]    host_2( host_2:3306)
Thu Nov  9 10:43:51 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov  9 10:43:51 2017 - [info]  ok.
Thu Nov  9 10:43:51 2017 - [info] Alive Servers:
Thu Nov  9 10:43:51 2017 - [info]    host_1( host_1:3306)
Thu Nov  9 10:43:51 2017 - [info]    host_3( host_3:3306)
Thu Nov  9 10:43:51 2017 - [info] Alive Slaves:
Thu Nov  9 10:43:51 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:51 2017 - [info]     GTID ON
Thu Nov  9 10:43:51 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:43:51 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:51 2017 - [info]     GTID ON
Thu Nov  9 10:43:51 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:51 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..
Thu Nov  9 10:43:51 2017 - [info]   done.
Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Thu Nov  9 10:43:51 2017 - [info]   done.
Thu Nov  9 10:43:51 2017 - [info] Starting GTID based failover.
Thu Nov  9 10:43:51 2017 - [info]
Thu Nov  9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov  9 10:43:51 2017 - [info]
Thu Nov  9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov  9 10:43:51 2017 - [info]
Thu Nov  9 10:43:51 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Thu Nov  9 10:43:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov  9 10:43:51 2017 - [info] Executing master IP deactivation script:
Thu Nov  9 10:43:51 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov  9 10:43:53 2017 - [info]  done.
Thu Nov  9 10:43:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov  9 10:43:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] The latest binary log file/position on all slaves is  host_2.000002:1115
Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
Thu Nov  9 10:43:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov  9 10:43:53 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:53 2017 - [info]     GTID ON
Thu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:43:53 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000002:230
Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
Thu Nov  9 10:43:53 2017 - [info] Oldest slaves:
Thu Nov  9 10:43:53 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:53 2017 - [info]     GTID ON
Thu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] Searching new master from slaves..
Thu Nov  9 10:43:53 2017 - [info]  Candidate masters from the configuration file:
Thu Nov  9 10:43:53 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:53 2017 - [info]     GTID ON
Thu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:43:53 2017 - [info]  Non-candidate masters:
Thu Nov  9 10:43:53 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:43:53 2017 - [info]     GTID ON
Thu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:43:53 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov  9 10:43:53 2017 - [info]   Not found.
Thu Nov  9 10:43:53 2017 - [info]  Searching from all candidate_master slaves..
Thu Nov  9 10:43:53 2017 - [info] New master is  host_1( host_1:3306)
Thu Nov  9 10:43:53 2017 - [info] Starting master failover..
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov  9 10:43:53 2017 - [info]
Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied..
Thu Nov  9 10:43:53 2017 - [info]   done.
Thu Nov  9 10:43:53 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied on the latest slave..
Thu Nov  9 10:43:53 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..
Thu Nov  9 10:43:53 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 10:43:54 2017 - [info]  Slave started.
Thu Nov  9 10:43:54 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..
Thu Nov  9 10:43:54 2017 - [info]  master_pos_wait( host_3.000049:18041) completed on  host_1( host_1:3306). Executed 0 events.
Thu Nov  9 10:43:54 2017 - [info]   done.
Thu Nov  9 10:43:54 2017 - [info]   done.
Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_2 started, pid: 150294
Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_1 started, pid: 150295
Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_3 started, pid: 150297
Thu Nov  9 10:43:54 2017 - [info]
Thu Nov  9 10:43:54 2017 - [info] Log messages from  host_1 ...
Thu Nov  9 10:43:54 2017 - [info]
Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_1..
Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov  9 10:43:54 2017 - [info] End of log messages from  host_1.
Thu Nov  9 10:43:54 2017 - [warning] Got error from  host_1.
Thu Nov  9 10:43:54 2017 - [info]
Thu Nov  9 10:43:54 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 10:43:54 2017 - [info]
Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_3..
Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov  9 10:43:54 2017 - [info] End of log messages from  host_3.
Thu Nov  9 10:43:54 2017 - [warning] Got error from  host_3.
Thu Nov  9 10:43:55 2017 - [info]
Thu Nov  9 10:43:55 2017 - [info] Log messages from  host_2 ...
Thu Nov  9 10:43:55 2017 - [info]
Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_2..
Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:43:55 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog succeeded.
Thu Nov  9 10:43:55 2017 - [info] End of log messages from  host_2.
Thu Nov  9 10:43:55 2017 - [info] Saved mysqlbinlog size from  host_2 is 6047 bytes.
Thu Nov  9 10:43:55 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog ..
Thu Nov  9 10:43:55 2017 - [info] Differential log apply from binlog server succeeded.
Thu Nov  9 10:43:55 2017 - [info] Getting new master's binlog name and position..
Thu Nov  9 10:43:55 2017 - [info]   host_1.000053:3624
Thu Nov  9 10:43:55 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov  9 10:43:55 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000053, 3624, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Thu Nov  9 10:43:55 2017 - [info] Executing master IP activate script:
Thu Nov  9 10:43:55 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov  9 10:43:57 2017 - [info]  OK.
Thu Nov  9 10:43:57 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..
Thu Nov  9 10:43:57 2017 - [info]  ok.
Thu Nov  9 10:43:57 2017 - [info] ** Finished master recovery successfully.
Thu Nov  9 10:43:57 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov  9 10:43:57 2017 - [info]
Thu Nov  9 10:43:57 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov  9 10:43:57 2017 - [info]
Thu Nov  9 10:43:57 2017 - [info]
Thu Nov  9 10:43:57 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov  9 10:43:57 2017 - [info]
Thu Nov  9 10:43:57 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 155162. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109104349.log if it takes time..
Thu Nov  9 10:43:58 2017 - [info]
Thu Nov  9 10:43:58 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 10:43:58 2017 - [info]
Thu Nov  9 10:43:57 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..
Thu Nov  9 10:43:57 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 10:43:58 2017 - [info]  Slave started.
Thu Nov  9 10:43:58 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Thu Nov  9 10:43:58 2017 - [info] End of log messages from  host_3.
Thu Nov  9 10:43:58 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Thu Nov  9 10:43:58 2017 - [info] All new slave servers recovered successfully.
Thu Nov  9 10:43:58 2017 - [info]
Thu Nov  9 10:43:58 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov  9 10:43:58 2017 - [info]
Thu Nov  9 10:43:58 2017 - [info] Resetting slave info on the new master..
Thu Nov  9 10:43:58 2017 - [info]   host_1: Resetting slave info succeeded.
Thu Nov  9 10:43:58 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.
Thu Nov  9 10:43:58 2017 - [info]
Thu Nov  9 10:43:58 2017 - [info] Sending mail..

1.2.2 当master的所有日志已经传递到1个etl，这时候master 上的MySQL挂了


### 模拟现场，现场的3台DB gtid状态

* master  host_1

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_1.000053 |     5229 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)



* slave (candidate master)  host_2

           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
                Auto_Position: 1

* etl (other slave)  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446357-446362
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
                Auto_Position: 1


### 切换日志

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 10:59:14 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov  9 10:59:14 2017 - [info] Starting master failover.
Thu Nov  9 10:59:14 2017 - [info]
Thu Nov  9 10:59:14 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 10:59:14 2017 - [info]
Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Thu Nov  9 10:59:15 2017 - [info] Binlog server  host_2 is reachable.
Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Thu Nov  9 10:59:15 2017 - [info] Binlog server  host_1 is reachable.
Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Thu Nov  9 10:59:16 2017 - [info] Binlog server  host_3 is reachable.
Thu Nov  9 10:59:16 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)
Thu Nov  9 10:59:16 2017 - [info] GTID failover mode = 1
Thu Nov  9 10:59:16 2017 - [info] Dead Servers:
Thu Nov  9 10:59:16 2017 - [info]    host_1( host_1:3306)
Thu Nov  9 10:59:16 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov  9 10:59:16 2017 - [info]  ok.
Thu Nov  9 10:59:16 2017 - [info] Alive Servers:
Thu Nov  9 10:59:16 2017 - [info]    host_2( host_2:3306)
Thu Nov  9 10:59:16 2017 - [info]    host_3( host_3:3306)
Thu Nov  9 10:59:16 2017 - [info] Alive Slaves:
Thu Nov  9 10:59:16 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:16 2017 - [info]     GTID ON
Thu Nov  9 10:59:16 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:59:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:16 2017 - [info]     GTID ON
Thu Nov  9 10:59:16 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:16 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:59:16 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..
Thu Nov  9 10:59:16 2017 - [info]   done.
Thu Nov  9 10:59:16 2017 - [info] Starting GTID based failover.
Thu Nov  9 10:59:16 2017 - [info]
Thu Nov  9 10:59:16 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov  9 10:59:16 2017 - [info]
Thu Nov  9 10:59:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov  9 10:59:16 2017 - [info]
Thu Nov  9 10:59:16 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Thu Nov  9 10:59:16 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov  9 10:59:16 2017 - [info] Executing master IP deactivation script:
Thu Nov  9 10:59:16 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov  9 10:59:20 2017 - [info]  done.
Thu Nov  9 10:59:20 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov  9 10:59:20 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] The latest binary log file/position on all slaves is  host_1.000053:5229
Thu Nov  9 10:59:20 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
Thu Nov  9 10:59:20 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov  9 10:59:20 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:20 2017 - [info]     GTID ON
Thu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:59:20 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000053:3624
Thu Nov  9 10:59:20 2017 - [info] Oldest slaves:
Thu Nov  9 10:59:20 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:20 2017 - [info]     GTID ON
Thu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] Searching new master from slaves..
Thu Nov  9 10:59:20 2017 - [info]  Candidate masters from the configuration file:
Thu Nov  9 10:59:20 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:20 2017 - [info]     GTID ON
Thu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 10:59:20 2017 - [info]  Non-candidate masters:
Thu Nov  9 10:59:20 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 10:59:20 2017 - [info]     GTID ON
Thu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 10:59:20 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov  9 10:59:20 2017 - [info]   Not found.
Thu Nov  9 10:59:20 2017 - [info]  Searching from all candidate_master slaves..
Thu Nov  9 10:59:20 2017 - [info] New master is  host_2( host_2:3306)
Thu Nov  9 10:59:20 2017 - [info] Starting master failover..
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov  9 10:59:20 2017 - [info]
Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied..
Thu Nov  9 10:59:20 2017 - [info]   done.
Thu Nov  9 10:59:20 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied on the latest slave..
Thu Nov  9 10:59:20 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..
Thu Nov  9 10:59:20 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 10:59:21 2017 - [info]  Slave started.
Thu Nov  9 10:59:21 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..
Thu Nov  9 10:59:21 2017 - [info]  master_pos_wait( host_3.000049:22035) completed on  host_2( host_2:3306). Executed 0 events.
Thu Nov  9 10:59:21 2017 - [info]   done.
Thu Nov  9 10:59:21 2017 - [info]   done.
Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_2 started, pid: 184482
Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_1 started, pid: 184483
Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_3 started, pid: 184487
Thu Nov  9 10:59:21 2017 - [info]
Thu Nov  9 10:59:21 2017 - [info] Log messages from  host_2 ...
Thu Nov  9 10:59:21 2017 - [info]
Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_2..
Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov  9 10:59:21 2017 - [info] End of log messages from  host_2.
Thu Nov  9 10:59:21 2017 - [warning] Got error from  host_2.
Thu Nov  9 10:59:21 2017 - [info]
Thu Nov  9 10:59:21 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 10:59:21 2017 - [info]
Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_3..
Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov  9 10:59:21 2017 - [info] End of log messages from  host_3.
Thu Nov  9 10:59:21 2017 - [warning] Got error from  host_3.
Thu Nov  9 10:59:22 2017 - [info]
Thu Nov  9 10:59:22 2017 - [info] Log messages from  host_1 ...
Thu Nov  9 10:59:22 2017 - [info]
Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_1..
Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 10:59:22 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog succeeded.
Thu Nov  9 10:59:22 2017 - [info] End of log messages from  host_1.
Thu Nov  9 10:59:22 2017 - [info] Saved mysqlbinlog size from  host_1 is 800 bytes.
Thu Nov  9 10:59:22 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog ..
Thu Nov  9 10:59:22 2017 - [info] Differential log apply from binlog server succeeded.
Thu Nov  9 10:59:22 2017 - [info] Getting new master's binlog name and position..
Thu Nov  9 10:59:22 2017 - [info]   host_2.000003:1680
Thu Nov  9 10:59:22 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov  9 10:59:22 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000003, 1680, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
Thu Nov  9 10:59:22 2017 - [info] Executing master IP activate script:
Thu Nov  9 10:59:22 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov  9 10:59:24 2017 - [info]  OK.
Thu Nov  9 10:59:24 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..
Thu Nov  9 10:59:24 2017 - [info]  ok.
Thu Nov  9 10:59:24 2017 - [info] ** Finished master recovery successfully.
Thu Nov  9 10:59:24 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov  9 10:59:24 2017 - [info]
Thu Nov  9 10:59:24 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov  9 10:59:24 2017 - [info]
Thu Nov  9 10:59:24 2017 - [info]
Thu Nov  9 10:59:24 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov  9 10:59:24 2017 - [info]
Thu Nov  9 10:59:24 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 189393. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109105914.log if it takes time..
Thu Nov  9 10:59:25 2017 - [info]
Thu Nov  9 10:59:25 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 10:59:25 2017 - [info]
Thu Nov  9 10:59:24 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..
Thu Nov  9 10:59:24 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 10:59:25 2017 - [info]  Slave started.
Thu Nov  9 10:59:25 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
Thu Nov  9 10:59:25 2017 - [info] End of log messages from  host_3.
Thu Nov  9 10:59:25 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Thu Nov  9 10:59:25 2017 - [info] All new slave servers recovered successfully.
Thu Nov  9 10:59:25 2017 - [info]
Thu Nov  9 10:59:25 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov  9 10:59:25 2017 - [info]
Thu Nov  9 10:59:25 2017 - [info] Resetting slave info on the new master..
Thu Nov  9 10:59:25 2017 - [info]   host_2: Resetting slave info succeeded.
Thu Nov  9 10:59:25 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.
Thu Nov  9 10:59:25 2017 - [info]
Thu Nov  9 10:59:25 2017 - [info] Sending mail..

1.3 slave(候选master)的日志是最新的，比etl要多

1.3.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Tue Nov  7 17:11:29 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov  7 17:11:29 2017 - [info] Starting master failover.
Tue Nov  7 17:11:29 2017 - [info]
Tue Nov  7 17:11:29 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov  7 17:11:29 2017 - [info]
Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Tue Nov  7 17:11:29 2017 - [info] Binlog server  host_2 is reachable.
Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Tue Nov  7 17:11:30 2017 - [info] Binlog server  host_1 is reachable.
Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Tue Nov  7 17:11:30 2017 - [info] Binlog server  host_3 is reachable.
Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)
Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Tue Nov  7 17:11:30 2017 - [info] GTID failover mode = 1
Tue Nov  7 17:11:30 2017 - [info] Dead Servers:
Tue Nov  7 17:11:30 2017 - [info]    host_1( host_1:3306)
Tue Nov  7 17:11:30 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov  7 17:11:30 2017 - [info]  ok.
Tue Nov  7 17:11:30 2017 - [info] Alive Servers:
Tue Nov  7 17:11:30 2017 - [info]    host_2( host_2:3306)
Tue Nov  7 17:11:30 2017 - [info]    host_3( host_3:3306)
Tue Nov  7 17:11:30 2017 - [info] Alive Slaves:
Tue Nov  7 17:11:30 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:30 2017 - [info]     GTID ON
Tue Nov  7 17:11:30 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 17:11:30 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:30 2017 - [info]     GTID ON
Tue Nov  7 17:11:30 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:30 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..
Tue Nov  7 17:11:30 2017 - [info]   done.
Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Tue Nov  7 17:11:30 2017 - [info]   done.
Tue Nov  7 17:11:30 2017 - [info] Starting GTID based failover.
Tue Nov  7 17:11:30 2017 - [info]
Tue Nov  7 17:11:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov  7 17:11:30 2017 - [info]
Tue Nov  7 17:11:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov  7 17:11:30 2017 - [info]
Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Tue Nov  7 17:11:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov  7 17:11:31 2017 - [info] Executing master IP deactivation script:
Tue Nov  7 17:11:31 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Tue Nov  7 17:11:33 2017 - [info]  done.
Tue Nov  7 17:11:33 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov  7 17:11:33 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] The latest binary log file/position on all slaves is  host_1.000051:13508
Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:3-8
Tue Nov  7 17:11:33 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov  7 17:11:33 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:33 2017 - [info]     GTID ON
Tue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 17:11:33 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000051:11918
Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:2-3,
Tue Nov  7 17:11:33 2017 - [info] Oldest slaves:
Tue Nov  7 17:11:33 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:33 2017 - [info]     GTID ON
Tue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Searching new master from slaves..
Tue Nov  7 17:11:33 2017 - [info]  Candidate masters from the configuration file:
Tue Nov  7 17:11:33 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:33 2017 - [info]     GTID ON
Tue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 17:11:33 2017 - [info]  Non-candidate masters:
Tue Nov  7 17:11:33 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 17:11:33 2017 - [info]     GTID ON
Tue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)
Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 17:11:33 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov  7 17:11:33 2017 - [info] New master is  host_2( host_2:3306)
Tue Nov  7 17:11:33 2017 - [info] Starting master failover..
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info]  Waiting all logs to be applied..
Tue Nov  7 17:11:33 2017 - [info]   done.
Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_2 started, pid: 54677
Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_1 started, pid: 54681
Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_3 started, pid: 54683
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_3 ...
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_3..
Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_3.
Tue Nov  7 17:11:33 2017 - [warning] Got error from  host_3.
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_2 ...
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_2..
Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_2.
Tue Nov  7 17:11:33 2017 - [warning] Got error from  host_2.
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_1 ...
Tue Nov  7 17:11:33 2017 - [info]
Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_1..
Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 17:11:33 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog succeeded.
Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_1.
Tue Nov  7 17:11:33 2017 - [info] Saved mysqlbinlog size from  host_1 is 8578 bytes.
Tue Nov  7 17:11:33 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog ..
Tue Nov  7 17:11:33 2017 - [info] Differential log apply from binlog server succeeded.
Tue Nov  7 17:11:33 2017 - [info] Getting new master's binlog name and position..
Tue Nov  7 17:11:33 2017 - [info]   host_2.000001:5048
Tue Nov  7 17:11:33 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov  7 17:11:33 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000001, 5048, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Tue Nov  7 17:11:33 2017 - [info] Executing master IP activate script:
Tue Nov  7 17:11:33 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Tue Nov  7 17:11:36 2017 - [info]  OK.
Tue Nov  7 17:11:36 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..
Tue Nov  7 17:11:36 2017 - [info]  ok.
Tue Nov  7 17:11:36 2017 - [info] ** Finished master recovery successfully.
Tue Nov  7 17:11:36 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov  7 17:11:36 2017 - [info]
Tue Nov  7 17:11:36 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov  7 17:11:36 2017 - [info]
Tue Nov  7 17:11:36 2017 - [info]
Tue Nov  7 17:11:36 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Nov  7 17:11:36 2017 - [info]
Tue Nov  7 17:11:36 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 58422. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107171129.log if it takes time..
Tue Nov  7 17:11:37 2017 - [info]
Tue Nov  7 17:11:37 2017 - [info] Log messages from  host_3 ...
Tue Nov  7 17:11:37 2017 - [info]
Tue Nov  7 17:11:36 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..
Tue Nov  7 17:11:36 2017 - [info]  Executed CHANGE MASTER.
Tue Nov  7 17:11:37 2017 - [info]  Slave started.
Tue Nov  7 17:11:37 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Tue Nov  7 17:11:37 2017 - [info] End of log messages from  host_3.
Tue Nov  7 17:11:37 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Tue Nov  7 17:11:37 2017 - [info] All new slave servers recovered successfully.
Tue Nov  7 17:11:37 2017 - [info]
Tue Nov  7 17:11:37 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov  7 17:11:37 2017 - [info]
Tue Nov  7 17:11:37 2017 - [info] Resetting slave info on the new master..
Tue Nov  7 17:11:37 2017 - [info]   host_2: Resetting slave info succeeded.
Tue Nov  7 17:11:37 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.
Tue Nov  7 17:11:37 2017 - [info]
Tue Nov  7 17:11:37 2017 - [info] Sending mail..

1.3.2 当master的所有日志已经传递slave，这时候master 上的MySQL挂了

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Tue Nov  7 15:56:11 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov  7 15:56:11 2017 - [info] Starting master failover.
Tue Nov  7 15:56:11 2017 - [info]
Tue Nov  7 15:56:11 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov  7 15:56:11 2017 - [info]
Tue Nov  7 15:56:11 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Tue Nov  7 15:56:12 2017 - [info] Binlog server  host_2 is reachable.
Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Tue Nov  7 15:56:12 2017 - [info] Binlog server  host_1 is reachable.
Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Tue Nov  7 15:56:13 2017 - [info] Binlog server  host_3 is reachable.
Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)
Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Tue Nov  7 15:56:13 2017 - [info] GTID failover mode = 1
Tue Nov  7 15:56:13 2017 - [info] Dead Servers:
Tue Nov  7 15:56:13 2017 - [info]    host_2( host_2:3306)
Tue Nov  7 15:56:13 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov  7 15:56:13 2017 - [info]  ok.
Tue Nov  7 15:56:13 2017 - [info] Alive Servers:
Tue Nov  7 15:56:13 2017 - [info]    host_1( host_1:3306)
Tue Nov  7 15:56:13 2017 - [info]    host_3( host_3:3306)
Tue Nov  7 15:56:13 2017 - [info] Alive Slaves:
Tue Nov  7 15:56:13 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:13 2017 - [info]     GTID ON
Tue Nov  7 15:56:13 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 15:56:13 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:13 2017 - [info]     GTID ON
Tue Nov  7 15:56:13 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:13 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..
Tue Nov  7 15:56:13 2017 - [info]   done.
Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Tue Nov  7 15:56:13 2017 - [info]   done.
Tue Nov  7 15:56:13 2017 - [info] Starting GTID based failover.
Tue Nov  7 15:56:13 2017 - [info]
Tue Nov  7 15:56:13 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov  7 15:56:13 2017 - [info]
Tue Nov  7 15:56:13 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov  7 15:56:13 2017 - [info]
Tue Nov  7 15:56:13 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Tue Nov  7 15:56:13 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov  7 15:56:13 2017 - [info] Executing master IP deactivation script:
Tue Nov  7 15:56:13 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
Tue Nov  7 15:56:16 2017 - [info]  done.
Tue Nov  7 15:56:16 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov  7 15:56:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] The latest binary log file/position on all slaves is  host_2.000049:11291
Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446352
Tue Nov  7 15:56:16 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov  7 15:56:16 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:16 2017 - [info]     GTID ON
Tue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 15:56:16 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000049:10703
Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446350
Tue Nov  7 15:56:16 2017 - [info] Oldest slaves:
Tue Nov  7 15:56:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:16 2017 - [info]     GTID ON
Tue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Searching new master from slaves..
Tue Nov  7 15:56:16 2017 - [info]  Candidate masters from the configuration file:
Tue Nov  7 15:56:16 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:16 2017 - [info]     GTID ON
Tue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov  7 15:56:16 2017 - [info]  Non-candidate masters:
Tue Nov  7 15:56:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov  7 15:56:16 2017 - [info]     GTID ON
Tue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)
Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov  7 15:56:16 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov  7 15:56:16 2017 - [info] New master is  host_1( host_1:3306)
Tue Nov  7 15:56:16 2017 - [info] Starting master failover..
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Nov  7 15:56:16 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info]  Waiting all logs to be applied..
Tue Nov  7 15:56:16 2017 - [info]   done.
Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_2 started, pid: 79759
Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_1 started, pid: 79768
Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_3 started, pid: 79770
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_1 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_1..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_1.
Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_1.
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_3 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_3..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_3.
Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_3.
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_2 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_2..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_2.
Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from  host_2 is 768 bytes.
Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..
Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.
Tue Nov  7 15:56:17 2017 - [info] Getting new master's binlog name and position..
Tue Nov  7 15:56:17 2017 - [info]   host_1.000051:11449
Tue Nov  7 15:56:17 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov  7 15:56:17 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000051, 11449, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
Tue Nov  7 15:56:17 2017 - [info] Executing master IP activate script:
Tue Nov  7 15:56:17 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Tue Nov  7 15:56:20 2017 - [info]  OK.
Tue Nov  7 15:56:20 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..
Tue Nov  7 15:56:20 2017 - [info]  ok.
Tue Nov  7 15:56:20 2017 - [info] ** Finished master recovery successfully.
Tue Nov  7 15:56:20 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov  7 15:56:20 2017 - [info]
Tue Nov  7 15:56:20 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov  7 15:56:20 2017 - [info]
Tue Nov  7 15:56:20 2017 - [info]
Tue Nov  7 15:56:20 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Nov  7 15:56:20 2017 - [info]
Tue Nov  7 15:56:20 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 85941. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107155611.log if it takes time..
Tue Nov  7 15:56:21 2017 - [info]
Tue Nov  7 15:56:21 2017 - [info] Log messages from  host_3 ...
Tue Nov  7 15:56:21 2017 - [info]
Tue Nov  7 15:56:20 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..
Tue Nov  7 15:56:20 2017 - [info]  Executed CHANGE MASTER.
Tue Nov  7 15:56:21 2017 - [info]  Slave started.
Tue Nov  7 15:56:21 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
Tue Nov  7 15:56:21 2017 - [info] End of log messages from  host_3.
Tue Nov  7 15:56:21 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Tue Nov  7 15:56:21 2017 - [info] All new slave servers recovered successfully.
Tue Nov  7 15:56:21 2017 - [info]
Tue Nov  7 15:56:21 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov  7 15:56:21 2017 - [info]
Tue Nov  7 15:56:21 2017 - [info] Resetting slave info on the new master..
Tue Nov  7 15:56:21 2017 - [info]   host_1: Resetting slave info succeeded.
Tue Nov  7 15:56:21 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.
Tue Nov  7 15:56:21 2017 - [info]
Tue Nov  7 15:56:21 2017 - [info] Sending mail..

1.4 slave(候选master）上面有大事务在跑

1000s的大查询

无影响，正常切换

flush tables with readlock


dba:(none)> show processlist;
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host                 | db   | Command | Time | State                        | Info                                                                                                 |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 63 | dba  | localhost            | NULL | Query   |    0 | starting                     | show processlist                                                                                     |
| 65 | dba  | xx:11164   | NULL | Sleep   |  121 |                              | NULL                                                                                                 |
| 83 | dba  | new master:49022 | NULL | Query   |  176 | Waiting for global read lock | BINLOG '
GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAEC
GpAKWh5YUy1LJwAAAI0HAAAAAG |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

1.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl （这是最严苛的情况）

binlog server 写3台

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_1 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_1..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_1.
Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_1.
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_3 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_3..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_3.
Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_3.
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_2 ...
Tue Nov  7 15:56:17 2017 - [info]
Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_2..
Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Tue Nov  7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.
Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_2.
Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from  host_2 is 768 bytes.
Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..
Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.

binlog server 只写master

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Thu Nov  9 11:20:04 2017 - [info] -- Saving binlog from host  host_2 started, pid: 117389
Thu Nov  9 11:20:05 2017 - [info]
Thu Nov  9 11:20:05 2017 - [info] Log messages from  host_2 ...
Thu Nov  9 11:20:05 2017 - [info]
Thu Nov  9 11:20:04 2017 - [info] Fetching binary logs from binlog server  host_2..
Thu Nov  9 11:20:04 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000004  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 11:20:05 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog succeeded.
Thu Nov  9 11:20:05 2017 - [info] End of log messages from  host_2.
Thu Nov  9 11:20:05 2017 - [info] Saved mysqlbinlog size from  host_2 is 4444 bytes.
Thu Nov  9 11:20:05 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog ..
Thu Nov  9 11:20:05 2017 - [info] Differential log apply from binlog server succeeded.

binlog server 只写slave

### 3台服务器的GTID状态

* master   host_1

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_1.000055 |     6016 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

* slave  host_2


            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                Auto_Position: 1


* etl  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446366-446369
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                Auto_Position: 1


### 切换日志

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 15:00:09 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov  9 15:00:09 2017 - [info] Starting master failover.
Thu Nov  9 15:00:09 2017 - [info]
Thu Nov  9 15:00:09 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 15:00:09 2017 - [info]
Thu Nov  9 15:00:09 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Thu Nov  9 15:00:09 2017 - [info] Binlog server  host_2 is reachable.
Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)
Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Thu Nov  9 15:00:10 2017 - [info] GTID failover mode = 1
Thu Nov  9 15:00:10 2017 - [info] Dead Servers:
Thu Nov  9 15:00:10 2017 - [info]    host_1( host_1:3306)
Thu Nov  9 15:00:10 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov  9 15:00:10 2017 - [info]  ok.
Thu Nov  9 15:00:10 2017 - [info] Alive Servers:
Thu Nov  9 15:00:10 2017 - [info]    host_2( host_2:3306)
Thu Nov  9 15:00:10 2017 - [info]    host_3( host_3:3306)
Thu Nov  9 15:00:10 2017 - [info] Alive Slaves:
Thu Nov  9 15:00:10 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:10 2017 - [info]     GTID ON
Thu Nov  9 15:00:10 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:10 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 15:00:10 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:10 2017 - [info]     GTID ON
Thu Nov  9 15:00:10 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:10 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..
Thu Nov  9 15:00:10 2017 - [info]   done.
Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Thu Nov  9 15:00:10 2017 - [info]   done.
Thu Nov  9 15:00:10 2017 - [info] Starting GTID based failover.
Thu Nov  9 15:00:10 2017 - [info]
Thu Nov  9 15:00:10 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov  9 15:00:10 2017 - [info]
Thu Nov  9 15:00:10 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov  9 15:00:10 2017 - [info]
Thu Nov  9 15:00:10 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Thu Nov  9 15:00:10 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov  9 15:00:10 2017 - [info] Executing master IP deactivation script:
Thu Nov  9 15:00:10 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov  9 15:00:17 2017 - [info]  done.
Thu Nov  9 15:00:17 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov  9 15:00:17 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] The latest binary log file/position on all slaves is  host_1.000055:4090
Thu Nov  9 15:00:17 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
Thu Nov  9 15:00:17 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov  9 15:00:17 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:17 2017 - [info]     GTID ON
Thu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 15:00:17 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000055:2806
Thu Nov  9 15:00:17 2017 - [info] Oldest slaves:
Thu Nov  9 15:00:17 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:17 2017 - [info]     GTID ON
Thu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] Searching new master from slaves..
Thu Nov  9 15:00:17 2017 - [info]  Candidate masters from the configuration file:
Thu Nov  9 15:00:17 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:17 2017 - [info]     GTID ON
Thu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 15:00:17 2017 - [info]  Non-candidate masters:
Thu Nov  9 15:00:17 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 15:00:17 2017 - [info]     GTID ON
Thu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)
Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 15:00:17 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov  9 15:00:17 2017 - [info]   Not found.
Thu Nov  9 15:00:17 2017 - [info]  Searching from all candidate_master slaves..
Thu Nov  9 15:00:17 2017 - [info] New master is  host_2( host_2:3306)
Thu Nov  9 15:00:17 2017 - [info] Starting master failover..
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov  9 15:00:17 2017 - [info]
Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied..
Thu Nov  9 15:00:17 2017 - [info]   done.
Thu Nov  9 15:00:17 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied on the latest slave..
Thu Nov  9 15:00:17 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..
Thu Nov  9 15:00:17 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 15:00:18 2017 - [info]  Slave started.
Thu Nov  9 15:00:18 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..
Thu Nov  9 15:00:18 2017 - [info]  master_pos_wait( host_3.000049:25843) completed on  host_2( host_2:3306). Executed 0 events.
Thu Nov  9 15:00:18 2017 - [info]   done.
Thu Nov  9 15:00:18 2017 - [info]   done.
Thu Nov  9 15:00:18 2017 - [info] -- Saving binlog from host  host_2 started, pid: 175683
Thu Nov  9 15:00:18 2017 - [info]
Thu Nov  9 15:00:18 2017 - [info] Log messages from  host_2 ...
Thu Nov  9 15:00:18 2017 - [info]
Thu Nov  9 15:00:18 2017 - [info] Fetching binary logs from binlog server  host_2..
Thu Nov  9 15:00:18 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000055  --start_pos=4090 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109150009.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Thu Nov  9 15:00:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov  9 15:00:18 2017 - [info] End of log messages from  host_2.
Thu Nov  9 15:00:18 2017 - [warning] Got error from  host_2.
Thu Nov  9 15:00:18 2017 - [info] Getting new master's binlog name and position..
Thu Nov  9 15:00:18 2017 - [info]   host_2.000005:1390
Thu Nov  9 15:00:18 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov  9 15:00:18 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000005, 1390, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
Thu Nov  9 15:00:18 2017 - [info] Executing master IP activate script:
Thu Nov  9 15:00:18 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov  9 15:00:22 2017 - [info]  OK.
Thu Nov  9 15:00:22 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..
Thu Nov  9 15:00:22 2017 - [info]  ok.
Thu Nov  9 15:00:22 2017 - [info] ** Finished master recovery successfully.
Thu Nov  9 15:00:22 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov  9 15:00:22 2017 - [info]
Thu Nov  9 15:00:22 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov  9 15:00:22 2017 - [info]
Thu Nov  9 15:00:22 2017 - [info]
Thu Nov  9 15:00:22 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov  9 15:00:22 2017 - [info]
Thu Nov  9 15:00:22 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 180681. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109150009.log if it takes time..
Thu Nov  9 15:00:23 2017 - [info]
Thu Nov  9 15:00:23 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 15:00:23 2017 - [info]
Thu Nov  9 15:00:22 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..
Thu Nov  9 15:00:22 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 15:00:23 2017 - [info]  Slave started.
Thu Nov  9 15:00:23 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
Thu Nov  9 15:00:23 2017 - [info] End of log messages from  host_3.
Thu Nov  9 15:00:23 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Thu Nov  9 15:00:23 2017 - [info] All new slave servers recovered successfully.
Thu Nov  9 15:00:23 2017 - [info]
Thu Nov  9 15:00:23 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov  9 15:00:23 2017 - [info]
Thu Nov  9 15:00:23 2017 - [info] Resetting slave info on the new master..
Thu Nov  9 15:00:23 2017 - [info]   host_2: Resetting slave info succeeded.
Thu Nov  9 15:00:23 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.
Thu Nov  9 15:00:23 2017 - [info]
Thu Nov  9 15:00:23 2017 - [info] Sending mail..

结论：由于binlog server没有配置master，所以会丢失master没有传递过来的事务日志
好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

binlog server 啥都不写

### 3台DB的GTID状态

* master  host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_2.000005 |     5785 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446378 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


* slave  host_1

           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                Auto_Position: 1

* etl  host_3


           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
                Auto_Position: 1


### 切换日志

Thu Nov  9 16:22:41 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov  9 16:22:41 2017 - [info] Starting master failover.
Thu Nov  9 16:22:41 2017 - [info]
Thu Nov  9 16:22:41 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 16:22:41 2017 - [info]
Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)
Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Thu Nov  9 16:22:41 2017 - [info] GTID failover mode = 1
Thu Nov  9 16:22:41 2017 - [info] Dead Servers:
Thu Nov  9 16:22:41 2017 - [info]    host_2( host_2:3306)
Thu Nov  9 16:22:41 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov  9 16:22:41 2017 - [info]  ok.
Thu Nov  9 16:22:41 2017 - [info] Alive Servers:
Thu Nov  9 16:22:41 2017 - [info]    host_1( host_1:3306)
Thu Nov  9 16:22:41 2017 - [info]    host_3( host_3:3306)
Thu Nov  9 16:22:41 2017 - [info] Alive Slaves:
Thu Nov  9 16:22:41 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:41 2017 - [info]     GTID ON
Thu Nov  9 16:22:41 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:41 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 16:22:41 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:41 2017 - [info]     GTID ON
Thu Nov  9 16:22:41 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:41 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..
Thu Nov  9 16:22:41 2017 - [info]   done.
Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Thu Nov  9 16:22:41 2017 - [info]   done.
Thu Nov  9 16:22:41 2017 - [info] Starting GTID based failover.
Thu Nov  9 16:22:41 2017 - [info]
Thu Nov  9 16:22:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov  9 16:22:41 2017 - [info]
Thu Nov  9 16:22:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov  9 16:22:41 2017 - [info]
Thu Nov  9 16:22:42 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Thu Nov  9 16:22:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov  9 16:22:42 2017 - [info] Executing master IP deactivation script:
Thu Nov  9 16:22:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
===================    swift vip :  tgw_vip from  host_2 is deleted  ==============================


--2017-11-09 16:22:42--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        9.79M=0s

2017-11-09 16:22:44 (9.79 MB/s) - 已写入标准输出 [38]

Thu Nov  9 16:22:44 2017 - [info]  done.
Thu Nov  9 16:22:44 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov  9 16:22:44 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] The latest binary log file/position on all slaves is  host_2.000005:4015
Thu Nov  9 16:22:44 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
Thu Nov  9 16:22:44 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov  9 16:22:44 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:44 2017 - [info]     GTID ON
Thu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 16:22:44 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000005:3130
Thu Nov  9 16:22:44 2017 - [info] Oldest slaves:
Thu Nov  9 16:22:44 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:44 2017 - [info]     GTID ON
Thu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] Searching new master from slaves..
Thu Nov  9 16:22:44 2017 - [info]  Candidate masters from the configuration file:
Thu Nov  9 16:22:44 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:44 2017 - [info]     GTID ON
Thu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov  9 16:22:44 2017 - [info]  Non-candidate masters:
Thu Nov  9 16:22:44 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov  9 16:22:44 2017 - [info]     GTID ON
Thu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)
Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov  9 16:22:44 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov  9 16:22:44 2017 - [info]   Not found.
Thu Nov  9 16:22:44 2017 - [info]  Searching from all candidate_master slaves..
Thu Nov  9 16:22:44 2017 - [info] New master is  host_1( host_1:3306)
Thu Nov  9 16:22:44 2017 - [info] Starting master failover..
Thu Nov  9 16:22:44 2017 - [info]
From:
 host_2( host_2:3306) (current master)
 +-- host_1( host_1:3306)
 +-- host_3( host_3:3306)

To:
 host_1( host_1:3306) (new master)
 +-- host_3( host_3:3306)
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov  9 16:22:44 2017 - [info]
Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied..
Thu Nov  9 16:22:44 2017 - [info]   done.
Thu Nov  9 16:22:44 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied on the latest slave..
Thu Nov  9 16:22:44 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..
Thu Nov  9 16:22:44 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 16:22:45 2017 - [info]  Slave started.
Thu Nov  9 16:22:45 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..
Thu Nov  9 16:22:45 2017 - [info]  master_pos_wait( host_3.000049:28663) completed on  host_1( host_1:3306). Executed 0 events.
Thu Nov  9 16:22:45 2017 - [info]   done.
Thu Nov  9 16:22:45 2017 - [info]   done.
Thu Nov  9 16:22:45 2017 - [info] Getting new master's binlog name and position..
Thu Nov  9 16:22:45 2017 - [info]   host_1.000056:1170
Thu Nov  9 16:22:45 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov  9 16:22:45 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000056, 1170, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
Thu Nov  9 16:22:45 2017 - [info] Executing master IP activate script:
Thu Nov  9 16:22:45 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to  host_1  is added  ==============================


Thu Nov  9 16:22:47 2017 - [info]  OK.
Thu Nov  9 16:22:47 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..
Thu Nov  9 16:22:47 2017 - [info]  ok.
Thu Nov  9 16:22:47 2017 - [info] ** Finished master recovery successfully.
Thu Nov  9 16:22:47 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov  9 16:22:47 2017 - [info]
Thu Nov  9 16:22:47 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov  9 16:22:47 2017 - [info]
Thu Nov  9 16:22:47 2017 - [info]
Thu Nov  9 16:22:47 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov  9 16:22:47 2017 - [info]
Thu Nov  9 16:22:47 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 112317. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109162241.log if it takes time..
Thu Nov  9 16:22:48 2017 - [info]
Thu Nov  9 16:22:48 2017 - [info] Log messages from  host_3 ...
Thu Nov  9 16:22:48 2017 - [info]
Thu Nov  9 16:22:47 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..
Thu Nov  9 16:22:47 2017 - [info]  Executed CHANGE MASTER.
Thu Nov  9 16:22:48 2017 - [info]  Slave started.
Thu Nov  9 16:22:48 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372) completed on  host_3( host_3:3306). Executed 0 events.
Thu Nov  9 16:22:48 2017 - [info] End of log messages from  host_3.
Thu Nov  9 16:22:48 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Thu Nov  9 16:22:48 2017 - [info] All new slave servers recovered successfully.
Thu Nov  9 16:22:48 2017 - [info]
Thu Nov  9 16:22:48 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov  9 16:22:48 2017 - [info]
Thu Nov  9 16:22:48 2017 - [info] Resetting slave info on the new master..
Thu Nov  9 16:22:49 2017 - [info]   host_1: Resetting slave info succeeded.
Thu Nov  9 16:22:49 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.
Thu Nov  9 16:22:49 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover  host_2( host_2:3306) to  host_1( host_1:3306) succeeded

Master  host_2( host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on  host_2( host_2:3306)
Selected  host_1( host_1:3306) as a new master.
 host_1( host_1:3306): OK: Applying all logs succeeded.
 host_1( host_1:3306): OK: Activated master IP address.
 host_3( host_3:3306): OK: Slave started, replicating from  host_1( host_1:3306)
 host_1( host_1:3306): Resetting slave info succeeded.
Master failover to  host_1( host_1:3306) completed successfully.

1.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

99%的场景都是可以重新执行的
1%的场景不能再次执行，执行会报错

一般这种场景就是：已经failover到最后的change master阶段，这样主从结构已经变更，MHA无法重新走一遍。
不过，即便到这步骤失败了，表示master的日志已经补完，由于是gtid模式，自己再让slave change master到最新的master即可，最后ACTIVE new ip和readonly=1就好了

Thu Nov  9 16:49:39 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov  9 16:49:39 2017 - [info] Starting master failover.
Thu Nov  9 16:49:39 2017 - [info]
Thu Nov  9 16:49:39 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 16:49:39 2017 - [info]
Thu Nov  9 16:49:39 2017 - [info] GTID failover mode = 1
Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master  host_1( host_1:3306) does not match with specified dead master  host_2( host_2:3306)!
Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53

1.7 Master：MySQL down小结

1. failover最终命令

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


2. binlog server建议

配置master就可以了
[binlog1]
$master_ip

只配置slave，或者没有配置，会导致丢失部分没有从master传递过来的日志事务

二、Master : Server down

2.1 etl 延迟8小时

 同1.1 结论

2.2 slave(候选master)比etl还要落后更多

2.2.1 当master的部分日志还没传递两个slave，这时候master server挂了

### 3台DB的GTID状态

* master  host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_2.000008 |     5445 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

* slave  host_1

           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446385
                Auto_Position: 1



* etl  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
                Auto_Position: 1




### 模拟故障场景

* 隔离master的网络，让其等同于down机

master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### 切换日志


masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Fri Nov 10 11:12:38 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 11:12:38 2017 - [info] Starting master failover.
Fri Nov 10 11:12:38 2017 - [info]
Fri Nov 10 11:12:38 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 11:12:38 2017 - [info]
Fri Nov 10 11:13:28 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 11:13:28 2017 - [warning] Failed to SSH to binlog server  host_2
Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to  host_1 is reachable.
Fri Nov 10 11:13:29 2017 - [info] Binlog server  host_1 is reachable.
Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Fri Nov 10 11:13:29 2017 - [info] Binlog server  host_3 is reachable.
Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)
Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Fri Nov 10 11:13:29 2017 - [info] GTID failover mode = 1
Fri Nov 10 11:13:29 2017 - [info] Dead Servers:
Fri Nov 10 11:13:29 2017 - [info]    host_2( host_2:3306)
Fri Nov 10 11:13:29 2017 - [info] Checking master reachability via MySQL(double check)...
Fri Nov 10 11:13:30 2017 - [info]  ok.
Fri Nov 10 11:13:30 2017 - [info] Alive Servers:
Fri Nov 10 11:13:30 2017 - [info]    host_1( host_1:3306)
Fri Nov 10 11:13:30 2017 - [info]    host_3( host_3:3306)
Fri Nov 10 11:13:30 2017 - [info] Alive Slaves:
Fri Nov 10 11:13:30 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:13:30 2017 - [info]     GTID ON
Fri Nov 10 11:13:30 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:13:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:13:30 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:13:30 2017 - [info]     GTID ON
Fri Nov 10 11:13:30 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:13:30 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..
Fri Nov 10 11:13:30 2017 - [info]   done.
Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Fri Nov 10 11:13:30 2017 - [info]   done.
Fri Nov 10 11:13:30 2017 - [info] Starting GTID based failover.
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:13:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:13:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:14:20 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 11:14:20 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Nov 10 11:14:20 2017 - [info] Executing master IP deactivation script:
Fri Nov 10 11:14:20 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stop
ssh: connect to host  host_2 port 22: Connection timed out
===================    swift vip :  tgw_vip from  host_2 is deleted  ==============================


--2017-11-10 11:14:27--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        11.4M=0s

2017-11-10 11:16:27 (11.4 MB/s) - 已写入标准输出 [38]

Fri Nov 10 11:16:27 2017 - [info]  done.
Fri Nov 10 11:16:27 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Nov 10 11:16:27 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3: Master Recovery Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] The latest binary log file/position on all slaves is  host_2.000008:4265
Fri Nov 10 11:16:27 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
Fri Nov 10 11:16:27 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Nov 10 11:16:27 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info]     GTID ON
Fri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 11:16:27 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000008:3380
Fri Nov 10 11:16:27 2017 - [info] Oldest slaves:
Fri Nov 10 11:16:27 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info]     GTID ON
Fri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: Determining New Master Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] Searching new master from slaves..
Fri Nov 10 11:16:27 2017 - [info]  Candidate masters from the configuration file:
Fri Nov 10 11:16:27 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info]     GTID ON
Fri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:16:27 2017 - [info]  Non-candidate masters:
Fri Nov 10 11:16:27 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info]     GTID ON
Fri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 11:16:27 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Fri Nov 10 11:16:27 2017 - [info]   Not found.
Fri Nov 10 11:16:27 2017 - [info]  Searching from all candidate_master slaves..
Fri Nov 10 11:16:27 2017 - [info] New master is  host_1( host_1:3306)
Fri Nov 10 11:16:27 2017 - [info] Starting master failover..
Fri Nov 10 11:16:27 2017 - [info]
From:
 host_2( host_2:3306) (current master)
 +-- host_1( host_1:3306)
 +-- host_3( host_3:3306)

To:
 host_1( host_1:3306) (new master)
 +-- host_3( host_3:3306)
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied..
Fri Nov 10 11:16:27 2017 - [info]   done.
Fri Nov 10 11:16:27 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied on the latest slave..
Fri Nov 10 11:16:27 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..
Fri Nov 10 11:16:27 2017 - [info]  Executed CHANGE MASTER.
Fri Nov 10 11:16:28 2017 - [info]  Slave started.
Fri Nov 10 11:16:28 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..
Fri Nov 10 11:16:28 2017 - [info]  master_pos_wait( host_3.000049:40136) completed on  host_1( host_1:3306). Executed 0 events.
Fri Nov 10 11:16:28 2017 - [info]   done.
Fri Nov 10 11:16:28 2017 - [info]   done.
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_2 started, pid: 43038
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_1 started, pid: 43039
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_3 started, pid: 43041
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_2 ...
Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_2.
Fri Nov 10 11:16:28 2017 - [warning] SSH is not reachable on  host_2. Skipping
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_1 ...
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server  host_1..
Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
 at /usr/bin/save_binary_logs line 123
    eval {...} called at /usr/bin/save_binary_logs line 70
    main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_1.
Fri Nov 10 11:16:28 2017 - [warning] Got error from  host_1.
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_3 ...
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server  host_3..
Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
 at /usr/bin/save_binary_logs line 123
    eval {...} called at /usr/bin/save_binary_logs line 70
    main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_3.
Fri Nov 10 11:16:28 2017 - [warning] Got error from  host_3.
Fri Nov 10 11:16:28 2017 - [info] Getting new master's binlog name and position..
Fri Nov 10 11:16:28 2017 - [info]   host_1.000058:4059
Fri Nov 10 11:16:28 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Nov 10 11:16:28 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000058, 4059, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
Fri Nov 10 11:16:28 2017 - [info] Executing master IP activate script:
Fri Nov 10 11:16:28 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to  host_1  is added  ==============================


Fri Nov 10 11:16:30 2017 - [info]  OK.
Fri Nov 10 11:16:30 2017 - [info] ** Finished master recovery successfully.
Fri Nov 10 11:16:30 2017 - [info] * Phase 3: Master Recovery Phase completed.
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] * Phase 4: Slaves Recovery Phase..
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 46878. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110111238.log if it takes time..
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] Log messages from  host_3 ...
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..
Fri Nov 10 11:16:30 2017 - [info]  Executed CHANGE MASTER.
Fri Nov 10 11:16:31 2017 - [info]  Slave started.
Fri Nov 10 11:16:31 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388) completed on  host_3( host_3:3306). Executed 0 events.
Fri Nov 10 11:16:31 2017 - [info] End of log messages from  host_3.
Fri Nov 10 11:16:31 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Fri Nov 10 11:16:31 2017 - [info] All new slave servers recovered successfully.
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] * Phase 5: New master cleanup phase..
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] Resetting slave info on the new master..
Fri Nov 10 11:16:31 2017 - [info]   host_1: Resetting slave info succeeded.
Fri Nov 10 11:16:31 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.
Fri Nov 10 11:16:31 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover  host_2( host_2:3306) to  host_1( host_1:3306) succeeded

Master  host_2( host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on  host_2( host_2:3306)
Selected  host_1( host_1:3306) as a new master.
 host_1( host_1:3306): OK: Applying all logs succeeded.
 host_1( host_1:3306): OK: Activated master IP address.
 host_3( host_3:3306): OK: Slave started, replicating from  host_1( host_1:3306)
 host_1( host_1:3306): Resetting slave info succeeded.
Master failover to  host_1( host_1:3306) completed successfully.
Fri Nov 10 11:16:31 2017 - [info] Sending mail..


### 最后一步很重要

如果dead master之后又活过来了，那么这一步要做

dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==》 tgw章节详细描述

结论：由于master 已挂，然而最后的日志没有传递到其他服务器，所以会丢失master没有传递过来的事务日志
好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

2.2.2 当master的所有日志已经传递到1个etl，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master上的所有日志全部传递到etl，所以最后是不会丢失master上任何数据的。

2.3 slave(候选master)的日志是最新的，比etl要多

2.3.1 当master的部分日志还没传递两个slave，这时候master server挂了

测试省略，和2.2.1基本一样

2.3.2 当master的所有日志已经传递slave，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master上的所有日志全部传递到slave，所以最后是不会丢失master上任何数据的。

2.4 slave(候选master）上面有大事务在跑

1000s的大查询

同1.4结论

flush tables with readlock


同1.4结论

2.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl （这是最严苛的情况）

binlog server 写3台

2.2.1 测试的就是这种情况，详细日志切换请看2.2.1

结论：由于binlog server配置了3台，但是由于master server已经挂掉，无法从master的binlog server上获取日志，所以会丢失master上没有传递的日志事务

binlog server 只写master


### 3台DB的gtid 状态

* master  host_1

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_1.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


* slave  host_2

           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                Auto_Position: 1


* etl  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                Auto_Position: 1



### 模拟故障

master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP


### 故障切换

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Fri Nov 10 14:15:51 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 14:15:51 2017 - [info] Starting master failover.
Fri Nov 10 14:15:51 2017 - [info]
Fri Nov 10 14:15:51 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 14:15:51 2017 - [info]
Fri Nov 10 14:16:41 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 14:16:41 2017 - [warning] Failed to SSH to binlog server  host_1
Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln239] Binlog Server is defined but there is no alive server.
Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm line 2082

结论： binlog server 必须要配置一个活的 server，如果只配置master，如果master挂了，那么就等于一个都没有，MHA不会切换

binlog server 只写slave


### 3台DB的gtid 状态

* master  host_1

dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
|  host_1.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


* slave  host_2

           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                Auto_Position: 1


* etl  host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                Auto_Position: 1



### 模拟故障

master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP


### 故障切换

Fri Nov 10 14:29:50 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 14:29:50 2017 - [info] Starting master failover.
Fri Nov 10 14:29:50 2017 - [info]
Fri Nov 10 14:29:50 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 14:29:50 2017 - [info]
Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to  host_2 is reachable.
Fri Nov 10 14:29:50 2017 - [info] Binlog server  host_2 is reachable.
Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to  host_3 is reachable.
Fri Nov 10 14:29:50 2017 - [info] Binlog server  host_3 is reachable.
Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)
Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)
Fri Nov 10 14:29:50 2017 - [info] GTID failover mode = 1
Fri Nov 10 14:29:50 2017 - [info] Dead Servers:
Fri Nov 10 14:29:50 2017 - [info]    host_1( host_1:3306)
Fri Nov 10 14:29:50 2017 - [info] Checking master reachability via MySQL(double check)...
Fri Nov 10 14:29:51 2017 - [info]  ok.
Fri Nov 10 14:29:51 2017 - [info] Alive Servers:
Fri Nov 10 14:29:51 2017 - [info]    host_2( host_2:3306)
Fri Nov 10 14:29:51 2017 - [info]    host_3( host_3:3306)
Fri Nov 10 14:29:51 2017 - [info] Alive Slaves:
Fri Nov 10 14:29:51 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:29:51 2017 - [info]     GTID ON
Fri Nov 10 14:29:51 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:29:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:29:51 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:29:51 2017 - [info]     GTID ON
Fri Nov 10 14:29:51 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:29:51 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..
Fri Nov 10 14:29:51 2017 - [info]   done.
Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..
Fri Nov 10 14:29:52 2017 - [info]   done.
Fri Nov 10 14:29:52 2017 - [info] Starting GTID based failover.
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:29:52 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:29:52 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:30:42 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 14:30:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Nov 10 14:30:42 2017 - [info] Executing master IP deactivation script:
Fri Nov 10 14:30:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stop
ssh: connect to host  host_1 port 22: Connection timed out
===================    swift vip :  tgw_vip from  host_1 is deleted  ==============================


--2017-11-10 14:30:49--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        12.1M=0s

2017-11-10 14:32:47 (12.1 MB/s) - 已写入标准输出 [38]

Fri Nov 10 14:32:47 2017 - [info]  done.
Fri Nov 10 14:32:47 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Nov 10 14:32:47 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3: Master Recovery Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] The latest binary log file/position on all slaves is  host_1.000058:6912
Fri Nov 10 14:32:47 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
Fri Nov 10 14:32:47 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Nov 10 14:32:47 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info]     GTID ON
Fri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 14:32:47 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000058:5307
Fri Nov 10 14:32:47 2017 - [info] Oldest slaves:
Fri Nov 10 14:32:47 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info]     GTID ON
Fri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: Determining New Master Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] Searching new master from slaves..
Fri Nov 10 14:32:47 2017 - [info]  Candidate masters from the configuration file:
Fri Nov 10 14:32:47 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info]     GTID ON
Fri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:32:47 2017 - [info]  Non-candidate masters:
Fri Nov 10 14:32:47 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info]     GTID ON
Fri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)
Fri Nov 10 14:32:47 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Fri Nov 10 14:32:47 2017 - [info]   Not found.
Fri Nov 10 14:32:47 2017 - [info]  Searching from all candidate_master slaves..
Fri Nov 10 14:32:47 2017 - [info] New master is  host_2( host_2:3306)
Fri Nov 10 14:32:47 2017 - [info] Starting master failover..
Fri Nov 10 14:32:47 2017 - [info]
From:
 host_1( host_1:3306) (current master)
 +-- host_2( host_2:3306)
 +-- host_3( host_3:3306)

To:
 host_2( host_2:3306) (new master)
 +-- host_3( host_3:3306)
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied..
Fri Nov 10 14:32:47 2017 - [info]   done.
Fri Nov 10 14:32:47 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..
Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied on the latest slave..
Fri Nov 10 14:32:47 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..
Fri Nov 10 14:32:47 2017 - [info]  Executed CHANGE MASTER.
Fri Nov 10 14:32:48 2017 - [info]  Slave started.
Fri Nov 10 14:32:48 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..
Fri Nov 10 14:32:48 2017 - [info]  master_pos_wait( host_3.000049:42954) completed on  host_2( host_2:3306). Executed 0 events.
Fri Nov 10 14:32:48 2017 - [info]   done.
Fri Nov 10 14:32:48 2017 - [info]   done.
Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host  host_2 started, pid: 76664
Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host  host_3 started, pid: 76665
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Log messages from  host_2 ...
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server  host_2..
Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
 at /usr/bin/save_binary_logs line 123
    eval {...} called at /usr/bin/save_binary_logs line 70
    main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 14:32:48 2017 - [info] End of log messages from  host_2.
Fri Nov 10 14:32:48 2017 - [warning] Got error from  host_2.
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Log messages from  host_3 ...
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server  host_3..
Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
 at /usr/bin/save_binary_logs line 123
    eval {...} called at /usr/bin/save_binary_logs line 70
    main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 14:32:48 2017 - [info] End of log messages from  host_3.
Fri Nov 10 14:32:48 2017 - [warning] Got error from  host_3.
Fri Nov 10 14:32:48 2017 - [info] Getting new master's binlog name and position..
Fri Nov 10 14:32:48 2017 - [info]   host_2.000008:6895
Fri Nov 10 14:32:48 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Nov 10 14:32:48 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000008, 6895, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Fri Nov 10 14:32:48 2017 - [info] Executing master IP activate script:
Fri Nov 10 14:32:48 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to  host_2  is added  ==============================


Fri Nov 10 14:32:51 2017 - [info]  OK.
Fri Nov 10 14:32:51 2017 - [info] ** Finished master recovery successfully.
Fri Nov 10 14:32:51 2017 - [info] * Phase 3: Master Recovery Phase completed.
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] * Phase 4: Slaves Recovery Phase..
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 80398. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110142950.log if it takes time..
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] Log messages from  host_3 ...
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..
Fri Nov 10 14:32:51 2017 - [info]  Executed CHANGE MASTER.
Fri Nov 10 14:32:52 2017 - [info]  Slave started.
Fri Nov 10 14:32:52 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392) completed on  host_3( host_3:3306). Executed 0 events.
Fri Nov 10 14:32:52 2017 - [info] End of log messages from  host_3.
Fri Nov 10 14:32:52 2017 - [info] -- Slave on host  host_3( host_3:3306) started.
Fri Nov 10 14:32:52 2017 - [info] All new slave servers recovered successfully.
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] * Phase 5: New master cleanup phase..
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] Resetting slave info on the new master..
Fri Nov 10 14:32:52 2017 - [info]   host_2: Resetting slave info succeeded.
Fri Nov 10 14:32:52 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.
Fri Nov 10 14:32:52 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover  host_1( host_1:3306) to  host_2( host_2:3306) succeeded

Master  host_1( host_1:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on  host_1( host_1:3306)
Selected  host_2( host_2:3306) as a new master.
 host_2( host_2:3306): OK: Applying all logs succeeded.
 host_2( host_2:3306): OK: Activated master IP address.
 host_3( host_3:3306): OK: Slave started, replicating from  host_2( host_2:3306)
 host_2( host_2:3306): Resetting slave info succeeded.
Master failover to  host_2( host_2:3306) completed successfully.
Fri Nov 10 14:32:52 2017 - [info] Sending mail..

结论： binlog server 配置成多台slave，这是正确的方案。由于master 挂了，master没有传递过来的binlog会丢失，这是没办法的. 好在，其余slave自动补齐现有日志

binlog server 啥都不写


会切换成功，由于master 挂了，master没有传递过来的binlog会丢失

好在，其余slave自动补齐现有日志

2.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

同1.6结论

三、遇到的坑

3.1 交互模式下，如果没有及时敲'YES'，则终止切换

四、总结

MHA + GTID 模式，重点配置和用法如下：

1. command
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


2. binlog server

在配置文件中对 master，slave，etl 都写在binlog server中。对MySQL down 和 DB server down  综合考虑下，建议这样配置。


3. tgw 清理

dead master 如果还可以起来，那么必须在上面执行： /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

原因可参看：http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节

五、流程简介


* Phase 1: Configuration Check Phase..
    HealthCheck: SSH  N台DB是否reachable
    Binlog server： N台DB 是否reachable
    GTID failover mode = ？
    Dead Servers is ？
    Primary candidate for the new Master (candidate_master is set) ？

* Phase 2: Dead Master Shutdown Phase..
    Executing master IP deactivation script：
        TGW-vip delete操作
    shutdown_script： ？

* Phase 3: Master Recovery Phase..
    * Phase 3.1: Getting Latest Slaves Phase..
        Latest slaves ，file position ？
        Oldest slaves ， file position ？

    * Phase 3.3: Determining New Master Phase..
        选择哪个slave为new master

    * Phase 3.3: New Master Recovery Phase..
        Replicating from the latest slave and waiting to apply..  --让new master change master 到 latest slave
        Waiting all logs to be applied on the latest slave..  --让new master跟 latest slave的日志保持一致
        Saving binlog from binlog server。。。  --根据配置的binlog server，开始生成latest slave和dead master之间的diff日志
        Applying differential binlog  --apply这些差异日志到new master，让new master执行完所有缺失的日志
        Getting new master's binlog name and position..  --获取new master现在的binlog file pos
            All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST, MASTER_PORT=3306, MASTER_AUTO_POSITION=1
        Executing master IP activate script:  --TGW-vip 激活操作，并且设置readonly=0


* Phase 4: Slaves Recovery Phase..  (并行操作)
        Resetting slave  and starting replication from the new master


* Phase 5: New master cleanup phase..
    Resetting slave info on the new master.

MHA failover GTID 专题

MHA failover GTID 专题

假定环境(经典三节点)

一、Master : MySQL down

1.1 etl 延迟8小时

1.2 slave(候选master)比etl还要落后更多

1.3 slave(候选master)的日志是最新的，比etl要多

1.4 slave(候选master）上面有大事务在跑

1.5 binlog server 不同场景的测试

1.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

1.7 Master：MySQL down小结

二、Master : Server down

2.1 etl 延迟8小时

2.2 slave(候选master)比etl还要落后更多

2.3 slave(候选master)的日志是最新的，比etl要多

2.4 slave(候选master）上面有大事务在跑

2.5 binlog server 不同场景的测试

2.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

三、遇到的坑

3.1 交互模式下，如果没有及时敲'YES'，则终止切换

四、总结

五、流程简介

热门文章

最新文章

相关电子书