sql_slave_skip_counter,你真的用对了吗?
写在前面:
最近一段时间都在做传统主从复制相关的测试,思考了很多上线主从复制架构后,可能会发生的问题,然后针对性设置了这些故障,再然后思考如何在保证业务可用,或者对业务冲击尽可能的小的前提下,进行故障的恢复,也算是小有所得,现在呢,笔者就主从复制故障时,需要跳过故障点时所用到的 sql_slave_skip_counte 变量进行记录描述
sql_slave_skip_counter 介绍:
摘自MySQL官方的解释(强烈建议阅读英文原文。中文版,是笔者自己的理解,只能说仁者见仁)
SET GLOBAL sql_slave_skip_counter Syntax:
SET GLOBAL sql_slave_skip_counter = N
This statement skips the next N events from the master. This is useful for recovering from replication stops caused by a statement.
跳过N个events。注意:以event为单位,而不是以事务为单位,只有在由单条语句组成的事务时,两者才等价。
如:一个事务由多个EVENT组成,BEGIN;INSERT;UPDATE;DELETE;COMMOIT; 这种情况下,两者绝不相等
This statement is valid only when the slave threads are not running. Otherwise, it produces an error.
When using this statement, it is important to understand that the binary log is actually organized as a sequence of groups known as event groups. Each event group consists of a sequence of events.
For transactional tables, an event group corresponds to a transaction.
对于事务表,一个event group对应一个事务
or nontransactional tables, an event group corresponds to a single SQL statement.
对于非事务表,一个event group对应一条SQL
When you use SET GLOBAL sql_slave_skip_counter to skip events and the result is in the middle of a group, the slave continues to skip events until it reaches the end of the group. Execution then starts with the next event group
当你跳过event的时候,如果N的值,处于event group之中,那么slave会继续跳过event,直至跳过这个event group,从下一个event group开始
对于事务表使用sql_slave_skip_counter的情况:
1、跳过1032复制错误(update/delete error)
跳过由单条SQL组成的事务:
在Slave主机上人为的删除两条数据:
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '9';
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '11';
而Master在变更上述两条记录的时候会报错,导致复制中断:
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('孙权', '吴国', '妹妹');
UPDATE `edusoho_e`.`t1` SET xname='游戏' WHERE id=7;
UPDATE `edusoho_e`.`t1` SET age=40 WHERE id=11; #报错
DELETE FROM `edusoho_e`.`t1` WHERE age=40; #报错
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('曹丕', '魏国', '甄姬');
DELETE FROM `edusoho_e`.`t1` WHERE id=1;
UPDATE `edusoho_e`.`t1` SET hobby='Games' WHERE id=3;
在Slave查看主从复制状态时,就会发现报错信息:
mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 2176
Exec_Master_Log_Pos: 874
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1127
Slave_IO_Running: Yes
Slave_SQL_Running: No
在Master主机上查看position做了什么操作:
mysql> show binlog events in 'mysql-bin.000002' from 874;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 874 | Query | 2 | 956 | BEGIN |
| mysql-bin.000002 | 956 | Table_map | 2 | 1017 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 1017 | Update_rows | 2 | 1127 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1127 | Xid | 2 | 1158 | COMMIT /* xid=437 */ |
| mysql-bin.000002 | 1158 | Query | 2 | 1240 | BEGIN |
| mysql-bin.000002 | 1240 | Table_map | 2 | 1301 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 1301 | Delete_rows | 2 | 1407 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1407 | Xid | 2 | 1438 | COMMIT /* xid=446 */ |
| mysql-bin.000002 | 1438 | Query | 2 | 1520 | BEGIN |
| mysql-bin.000002 | 1520 | Table_map | 2 | 1581 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 1581 | Write_rows | 2 | 1644 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1644 | Xid | 2 | 1675 | COMMIT /* xid=455 */ |
| mysql-bin.000002 | 1675 | Query | 2 | 1757 | BEGIN |
| mysql-bin.000002 | 1757 | Table_map | 2 | 1818 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 1818 | Delete_rows | 2 | 1880 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1880 | Xid | 2 | 1911 | COMMIT /* xid=464 */ |
| mysql-bin.000002 | 1911 | Query | 2 | 1993 | BEGIN |
| mysql-bin.000002 | 1993 | Table_map | 2 | 2054 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 2054 | Update_rows | 2 | 2145 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 2145 | Xid | 2 | 2176 | COMMIT /* xid=473 */ |
+------------------+------+-------------+-----------+-------------+---------------------------------+
在Slave跳过第一个Update_rows event复制报错:
mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 1158
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1407
成功跳过第一个events group
在Slave继续跳过第二个Delete_rows event复制报错:
mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_SQL_Errno: 0
Last_SQL_Error:
成功跳过第二个events group
注意:
虽然主从复制出现的故障成功跳过了,但只是暂时恢复了正常的主从复制状态,需要尽快的对Slave缺失的数据进行补齐,不然Master对Slave不存在的数据做的变更,仍然会重复导致主从复制故障,笔者觉得如果你的数据量差异不是太大的话,可以考虑使用pt-table-checksum和pt-table-sync工具进行恢复,如果你的数据量很大且数据差异很多,还是建议重做Slave较好,因为使用工具会锁表,会对线上业务造成一定的影响,具体情况,请自行考量。
跳过由多条SQL(event)组成的事务:
在Slave主机上人为的删除一条数据:
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';
在Master主机上产生一个由多条SQL组成的事务:
BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('懒死', '不知道', '吃了睡睡了吃');
COMMIT;
因为Slave主机上已经删除id=7的数据,在Slave查看主从复制状态时,就会发现报错信息:
mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 6840
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 7049
在Master主机上查看position做了什么操作:
mysql> show binlog events in 'mysql-bin.000002' from 6840;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 6840 | Query | 2 | 6922 | BEGIN |
| mysql-bin.000002 | 6922 | Table_map | 2 | 6983 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 6983 | Delete_rows | 2 | 7049 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7049 | Table_map | 2 | 7110 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 7110 | Write_rows | 2 | 7188 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7188 | Xid | 2 | 7219 | COMMIT /* xid=825 */ |
+------------------+------+-------------+-----------+-------------+---------------------------------+
可以看到,这个事务是由两个SQL(event)组成的
如果使用 sql_slave_skip_counter=N 跳过由多条SQL组成的事务会怎样呢?
mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 7219
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
发现问题没有,在使用sql_slave_skip_counter跳过由多条SQL(event)组成的事务时,从在Master上执行的 show binlog events 可以看到,如果只是跳过出报错SQL语句,那么 Exec_Master_Log_Pos 值应该为7110,但是现在为7219,说明将整个event group跳过了,但是7110的SQL数据是我们需要的,所以,和单条SQL组成的事务一样,主从复制状态虽然恢复,但是数据仍处于不一致状态,要抓紧时间补齐数据或重做Slave
2、由多条SQL(event)组成的事务时,仅跳过一个event,而不是一个event group:
在Slave主机上人为的删除一条数据:
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';
在Master主机上产生一个由多条SQL组成的事务:
BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('我是谁', '不知道', '吃了睡睡了吃');
COMMIT;
因为Slave主机上已经删除id=17的数据,在Slave查看主从复制状态时,就会发现报错信息:
Exec_Master_Log_Pos: 120
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 341
在Master主机上查看position做了什么操作:
mysqlbinlog -v --base64-output=decode --start-position=120 mysql-bin.000004
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 120
#190507 13:52:05 server id 2 end_log_pos 202 CRC32 0x0ca0c280 Query thread_id=3 exec_time=0 error_code=0
SET TIMESTAMP=1557208325/*!*/;
SET @@session.pseudo_thread_id=3/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 202
#190507 13:52:05 server id 2 end_log_pos 263 CRC32 0x20d2e89d Table_map: `edusoho_e`.`t1` mapped to number 216
# at 263
#190507 13:52:05 server id 2 end_log_pos 341 CRC32 0xbec6fd45 Delete_rows: table id 216 flags: STMT_END_F
### DELETE FROM `edusoho_e`.`t1`
### WHERE
### @1=17
### @2='懒死'
### @3='不知道'
### @4=1
### @5='吃了睡睡了吃'
### @6=18
# at 341
#190507 13:52:05 server id 2 end_log_pos 402 CRC32 0xa37bc5c9 Table_map: `edusoho_e`.`t1` mapped to number 216
# at 402
#190507 13:52:05 server id 2 end_log_pos 483 CRC32 0x0d774707 Write_rows: table id 216 flags: STMT_END_F
### INSERT INTO `edusoho_e`.`t1`
### SET
### @1=21
### @2='我是谁'
### @3='不知道'
### @4=1
### @5='吃了睡睡了吃'
### @6=18
# at 483
#190507 13:52:05 server id 2 end_log_pos 514 CRC32 0x8c333b30 Xid = 411
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
可以看到,绿色的部分就是我们需要跳过的,而第二个event是需要我们保留的
这个时候,就需要用到slave_exec_mode这个变量了,至于slave_exec_mode详细介绍,还是请参考MySQL官网资料
mysql> set global slave_exec_mode='IDEMPOTENT';
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 514
去Slave上edusoho_e.t1表上查看,数据id=21的数据已经过去了,此时,数据处于一致性状态
3、跳过主键冲突1062错误(Duplicate entry):
在Slave主键上先插入一条id值:
INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'小玩子', '明朝', '皇后', '25');
因为Slave已经占用了Master要自动产生的主键值id=19,所以Slave主机会报错:
INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'朱棣', '明朝', '皇帝', '36');
查看Slave主从复制状态发现已经发生了主从复制报错:
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table edusoho_e.t1; Duplicate entry '19' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000002, end_log_pos 7425
Exec_Master_Log_Pos: 7219
查看Master binlog:
mysql> show binlog events in 'mysql-bin.000002' from 7219;
+------------------+------+------------+-----------+-------------+---------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+------------------+------+------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 7219 | Query | 2 | 7301 | BEGIN |
| mysql-bin.000002 | 7301 | Table_map | 2 | 7362 | table_id: 213 (edusoho_e.t1) |
| mysql-bin.000002 | 7362 | Write_rows | 2 | 7425 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7425 | Xid | 2 | 7456 | COMMIT /* xid=893 */ |
+------------------+------+------------+-----------+-------------+---------------------------------+
思考:
因为Slave这条数据已经存在,如果在Slave主机上把这条数据删除了,Slave会不会直接同步过来?(答案是:不会。需要重启Slave thread):
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '19';
mysql> stop slave;
mysql> start slave user='repliter' password='123456';
验证的时候,发现数据已经同步过去了
题外:
以上是笔者对于单条SQL组成的事务、多条SQL组成的事务,及在这些单/多条SQL组成的事务下,人为设置的1032和1062复制错误和解决方法,还有sql_slave_skip_counter和slave_exec_mode各自的用法和跳过的范围,当然了,笔者呢,做的只是线上应用前的部署测试,并没有经过任何的实战检测。一方面,仅为广大同行做个参考;另一方面,记录笔者自己的心得和针对问题解决的思路做个总结,当问题真正发生的时候,有个方向可以进行参考,而不至于手忙脚乱,不知所措,所以,对其中有误之处和理解不到位的地方,望请下方留言指正,不胜感激!
还有,笔者做的,只是针对事务表,做的sql_slave_skip_counter和slave_exec_mode测试,对于非事务表,sql_slave_skip_counter和slave_exec_mode用途会稍有不同,请自行百度吧