From e0b1b53f21095707af87a095934e971d788a90c7 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Wed, 11 Dec 2019 22:01:13 +0100
Subject: Document further possible problems with MySQL replication

---
 docs/troubleshooting.txt | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index a47d133..fd57150 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -319,11 +319,40 @@ MySQL
 =====
 - MySQL may stop replicating from the master. There is some kind of deadlock in the multi-threaded SLAVE SQL.
   This can be seen by executing (which should show a lot of slave threads waiting on the coordinator to provide
-  load).
+  the load).
      SHOW PROCESSLIST;
   The remedy is to restart the slave MySQL with 'slave_parallel_workers=0', give it some time to catch up, and then
-  restart back in the standard multithreading mode.
+  restart it back in the standard multi-threaded mode. This can be achieved by editing 'statefulset/mysql-slave-0'
+  and setting the environment variable 'MYSQL_SLAVE_WORKERS' to 0 and then back to the original value (currently 16).
+- This may not be the end of the story. The execution of statements from the relay log can get stuck because of
+  some "bad" transaction. This can be detected by looking at the replication status on the slave database:
+     SHOW SLAVE STATUS\G
+  It may happen that while 'Retrieved_Gtid_Set' keeps growing (the binary log is successfully transferred from the
+  master and contains new transactions), 'Executed_Gtid_Set' stays constant, i.e. new transactions are not applied
+  on the slave server. In this case 'Relay_Log_Pos' and 'Exec_Master_Log_Pos' will not advance either, and
+  'Slave_SQL_Running_State' will likely be stuck in the 'Reading event from the relay log' state. The solution is
+  to skip the misbehaving transaction and continue from the next one. The following assumes that GTID_MODE is 'ON',
+  as it is on the current KaaS platform.
+  * First, we need to find the GTID of the stuck transaction. Likely it is just an increment of 'Executed_Gtid_Set',
+    i.e. if 'Executed_Gtid_Set=3beaee24-2e55-11e8-9612-0a580a80000c:1-953787306' on the slave, the next GTID would
+    be '3beaee24-2e55-11e8-9612-0a580a80000c:953787307' (yes, the "1-" also goes away). But this is not guaranteed.
+    The more reliable way is to look into the binary log on the master server, starting from the position reported
+    by SHOW SLAVE STATUS on the slave. In '/var/lib/mysql/data' on the master run
+       mysqlbinlog --start-position=<Exec_Master_Log_Pos> <Relay_Master_Log_File> | head -n 50
+    Here you will find which transaction has likely caused the problem. Furthermore, there will be a line like
+       SET @@SESSION.GTID_NEXT='4ab8feff-5272-11e8-9320-08002715584a:201840'
+    This is the GTID of the next transaction to be executed.
+  * Then, the following commands should be executed on the slave MySQL server (for details see
+    https://www.thegeekdiary.com/how-to-skip-a-transaction-on-mysql-replication-slave-when-gtids-are-enabled/):
+       STOP SLAVE;
+       SET @@SESSION.GTID_NEXT='<gtid found above>';
+       BEGIN;
+       COMMIT;
+       SET GTID_NEXT='AUTOMATIC';
+       START SLAVE;
+  * It is also possible to review the stuck transaction on the slave MySQL node. In '/var/lib/mysql/data' run
+       mysqlbinlog --start-position=<Relay_Log_Pos> <Relay_Log_File>
 
 Administration
 ==============
 - Some management tasks may require to login on ipekatrin* nodes. Thereafter, the password-less execution of
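
As a sketch of the 'MYSQL_SLAVE_WORKERS' toggle described in the patch above: it assumes the slave runs as
'statefulset/mysql-slave-0' (as named above), that an OpenShift 'oc' client (or 'kubectl') is logged in to the
project hosting MySQL, and that 'set env' is available for statefulsets; 16 is the standard worker count quoted
in the text.

   # Switch the slave SQL thread to single-threaded mode (slave_parallel_workers=0)
   oc set env statefulset/mysql-slave-0 MYSQL_SLAVE_WORKERS=0
   # Depending on the statefulset update strategy the pod may be recreated automatically
   # or may need to be deleted by hand to pick up the new environment.
   # Wait for replication to catch up (SHOW SLAVE STATUS\G), then restore the standard mode:
   oc set env statefulset/mysql-slave-0 MYSQL_SLAVE_WORKERS=16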
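To confirm the "stuck SQL thread" condition, the relevant fields of SHOW SLAVE STATUS can be compared over time.
A minimal sketch, assuming it is run inside the slave MySQL container where 'mysql' connects without extra
options (otherwise add the usual -u/-p credentials):

   FIELDS='Retrieved_Gtid_Set|Executed_Gtid_Set|Exec_Master_Log_Pos|Slave_SQL_Running_State'
   mysql -e 'SHOW SLAVE STATUS\G' | grep -E "$FIELDS" > /tmp/slave-status.1
   sleep 60
   mysql -e 'SHOW SLAVE STATUS\G' | grep -E "$FIELDS" > /tmp/slave-status.2
   # If only Retrieved_Gtid_Set differs between the two snapshots while the other
   # fields are frozen, the slave is receiving but not applying transactions.
   diff /tmp/slave-status.1 /tmp/slave-status.2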
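The mysqlbinlog coordinates used above can be taken directly from SHOW SLAVE STATUS: 'Relay_Master_Log_File'
names the master binary log file and 'Exec_Master_Log_Pos' the position of the last executed event. A sketch,
again assuming passwordless 'mysql' access on the slave and the '/var/lib/mysql/data' layout from the text:

   # On the slave: extract the master-side coordinates of the last executed event
   LOG=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Relay_Master_Log_File:/ {print $2}')
   POS=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Exec_Master_Log_Pos:/ {print $2}')
   echo "inspect $LOG starting at position $POS on the master"
   # On the master node, in /var/lib/mysql/data:
   mysqlbinlog --start-position="$POS" "$LOG" | head -n 50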
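The skip procedure itself has to run in a single client session, because GTID_NEXT is a session variable. A
sketch wrapping the statements listed above into one 'mysql' invocation; the GTID value is the example from the
text and must be replaced with the one actually found, and passwordless 'mysql' access on the slave is assumed:

   GTID='4ab8feff-5272-11e8-9320-08002715584a:201840'   # example from the text above; use the GTID actually found
   # All statements go through one connection, so the session-scoped GTID_NEXT applies to BEGIN/COMMIT
   mysql -e "STOP SLAVE;
             SET @@SESSION.GTID_NEXT='$GTID';
             BEGIN; COMMIT;
             SET GTID_NEXT='AUTOMATIC';
             START SLAVE;"
   # Verify that the SQL thread runs and Executed_Gtid_Set starts advancing again
   mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_SQL_Running|Executed_Gtid_Set'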