Thursday, July 4, 2019

Troubleshooting MySQL InnoDB Cluster GTID Inconsistency

This tutorial shows how to manually fix a GTID inconsistency, but doing so can be VERY RISKY and can easily break the InnoDB Cluster.   It serves purely as a tutorial on how to troubleshoot and identify potential issues.

Environment
MySQL InnoDB Cluster set up on a single machine with 3 nodes
Node 1 : Port 3306
Node 2 : Port 3316
Node 3 : Port 3326

The InnoDB Cluster Member Status is as follows

Retrieve the executed GTID set on each member with "SELECT @@gtid_executed;"

The three GTID sets are identical and the 3-node InnoDB Cluster is running properly.
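The same check can be done programmatically. A minimal Python sketch, assuming the @@gtid_executed values have been copied out of the three clients as plain strings (the UUIDs below are the ones appearing later in this post); it only normalizes whitespace and ordering, so this is a textual comparison, not a full GTID-set parser:

```python
# Normalize a @@gtid_executed value: drop line breaks and whitespace,
# then sort the "UUID:interval" parts so ordering does not matter.
def normalize_gtid_set(gtid_executed: str) -> tuple:
    parts = [p.strip() for p in gtid_executed.replace("\n", "").split(",") if p.strip()]
    return tuple(sorted(parts))

# Example values as the three healthy nodes might return them (same set,
# formatted with different line breaks / ordering by the client).
node1 = "115217b5-9e24-11e9-8686-0800271b198a:1-17,\na7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034"
node2 = "115217b5-9e24-11e9-8686-0800271b198a:1-17, a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034"
node3 = "a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034,115217b5-9e24-11e9-8686-0800271b198a:1-17"

assert normalize_gtid_set(node1) == normalize_gtid_set(node2) == normalize_gtid_set(node3)
print("all members have the same executed GTID set")
```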

Let's simulate a server breakdown (kill the instance with PORT 3326).
1.  Change the persisted variable group_replication_start_on_boot to off first
     mysql> SET PERSIST group_replication_start_on_boot = off;
     mysql> RESTART;

2. Check the status of the InnoDB Cluster


The node (3326) is down.

Now run a FLUSH TABLES on the server instance "3326" and observe the effect:
1. SELECT @@GTID_EXECUTED;
2. FLUSH TABLES;
3. SELECT @@GTID_EXECUTED;
4. SHOW VARIABLES LIKE '%READ_ONLY%';


It clearly shows that the executed GTID set has changed on server "3326": FLUSH TABLES is written to the binary log, and because the node is no longer part of the group, the transaction is assigned a GTID under the server's own local UUID instead of the group name.

Change "group_replication_start_on_boot=on" back so that node 3326 automatically rejoins when restarted.  However, this time the server fails to REJOIN because its GTID set is inconsistent with the rest of the InnoDB Cluster.


Check the error logs on the nodes.  The node 3326 error log shows the following errors:
2019-07-04T14:34:11.465132Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: 0e632a11-9e24-11e9-a2ca-0800271b198a:1,
115217b5-9e24-11e9-8686-0800271b198a:1-17,
a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034 > Group transactions: 115217b5-9e24-11e9-8686-0800271b198a:1-17,
a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034'
2019-07-04T14:34:11.465213Z 0 [ERROR] [MY-011522] [Repl] Plugin group_replication reported: 'The member contains transactions not present in the group. The member will now exit the group.'

It clearly shows that the error is due to "This member has more executed transactions than those present in the group".

To identify the GTID, compare @@GTID_EXECUTED across all the nodes; it shows that the following transaction is the EXTRA one:
Local transactions: 0e632a11-9e24-11e9-a2ca-0800271b198a:1
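This comparison is what MySQL's GTID_SUBSET()/GTID_SUBTRACT() functions do internally. A minimal Python sketch of the subtraction, using the values from the error log above; the simple parser assumes at most one interval per UUID (real GTID sets can carry several, e.g. 'uuid:1-5:7-9', which it does not cover):

```python
# Expand a GTID set string such as "uuid:1-17,uuid2:1-20034" into a set of
# (uuid, sequence_number) pairs.  Handles single numbers and simple ranges only.
def expand_gtid_set(gtid_set: str) -> set:
    txns = set()
    for part in gtid_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, _, interval = part.partition(":")
        lo, _, hi = interval.partition("-")
        for seq in range(int(lo), int(hi or lo) + 1):
            txns.add((uuid, seq))
    return txns

local = expand_gtid_set(
    "0e632a11-9e24-11e9-a2ca-0800271b198a:1,"
    "115217b5-9e24-11e9-8686-0800271b198a:1-17,"
    "a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034")
group = expand_gtid_set(
    "115217b5-9e24-11e9-8686-0800271b198a:1-17,"
    "a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034")

extra = local - group
print(extra)  # {('0e632a11-9e24-11e9-a2ca-0800271b198a', 1)}
```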

Connect to the server 3326 and deduce which action produced this GTID:
1.  mysql -uroot -h127.0.0.1 -P3326 -pmysql -e "show master status\G"

2. The current binlog file is named mysqllog.000006; we can use the following command to show its events:


mysql> show binlog events in 'mysqllog.000006';
+-----------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| Log_name        | Pos | Event_type     | Server_id | End_log_pos | Info                                                                                                                            |
+-----------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| mysqllog.000006 |   4 | Format_desc    |       103 |         124 | Server ver: 8.0.16-commercial, Binlog ver: 4                                                                                    |
| mysqllog.000006 | 124 | Previous_gtids |       103 |         271 | 0e632a11-9e24-11e9-a2ca-0800271b198a:1,
115217b5-9e24-11e9-8686-0800271b198a:1-17,
a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034 |
+-----------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

It does not tell us anything about the GTID 0e632a11-9e24-11e9-a2ca-0800271b198a:1 itself; the Previous_gtids event only indicates that this transaction was executed before the current binlog file started, so we have to look at the previous file.

Continue checking the binlog files backwards with 'mysqllog.000005':

mysql> show binlog events in 'mysqllog.000005';
+-----------------+-----+----------------+-----------+-------------+-----------------------------------------------------------------------------------------+
| Log_name        | Pos | Event_type     | Server_id | End_log_pos | Info                                                                                    |
+-----------------+-----+----------------+-----------+-------------+-----------------------------------------------------------------------------------------+
| mysqllog.000005 |   4 | Format_desc    |       103 |         124 | Server ver: 8.0.16-commercial, Binlog ver: 4                                            |
| mysqllog.000005 | 124 | Previous_gtids |       103 |         231 | 115217b5-9e24-11e9-8686-0800271b198a:1-17,
a7a6cc05-9e0f-11e9-ad3e-0800271b198a:1-20034 |
| mysqllog.000005 | 231 | Gtid           |       103 |         304 | SET @@SESSION.GTID_NEXT= '0e632a11-9e24-11e9-a2ca-0800271b198a:1'                       |
| mysqllog.000005 | 304 | Query          |       103 |         383 | flush tables                                                                            |
| mysqllog.000005 | 383 | Stop           |       103 |         402 |                                                                                         |
+-----------------+-----+----------------+-----------+-------------+-----------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)


The GTID '0e632a11-9e24-11e9-a2ca-0800271b198a:1' is located, and its "Query" event is "flush tables".

The inconsistency is identified: it was caused by this exact "flush tables" statement executed on node 3326 while it was out of the group.

To recover from this, because the 'flush tables' is a simple, harmless gap, we can fix it by adding this GTID back to the existing INNODB CLUSTER.
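Why this works can be sketched with simple set arithmetic: Group Replication refuses a joining member whose executed GTID set is not a subset of the group's; executing the transaction on the primary under that exact GTID adds it to the group's set, so the subset check passes. A toy Python model of the check (each transaction modeled as a (uuid, sequence) pair):

```python
# The extra transaction recorded on node 3326 while it was out of the group.
EXTRA = ("0e632a11-9e24-11e9-a2ca-0800271b198a", 1)

# The group's executed GTID set before the fix (ranges expanded to pairs).
group_before = {("115217b5-9e24-11e9-8686-0800271b198a", n) for n in range(1, 18)} \
             | {("a7a6cc05-9e0f-11e9-ad3e-0800271b198a", n) for n in range(1, 20035)}
member = group_before | {EXTRA}

# Before the fix: the member executed a transaction the group never saw,
# so the subset check fails and the rejoin is refused.
assert not member <= group_before

# After running the statement on the primary with GTID_NEXT set, the GTID
# joins the group's executed set and the member is a subset again.
group_after = group_before | {EXTRA}
assert member <= group_after
print("member's GTID set is contained in the group -> rejoin allowed")
```

An alternative with the same effect is to inject an empty transaction (BEGIN; COMMIT;) on the primary under that GTID, which closes the gap without re-running the statement.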

Log in to Node 3306 (the PRIMARY node) and add back the transaction:
mysql -uroot -h127.0.0.1 -P3306 -pmysql
mysql> SET @@SESSION.GTID_NEXT= '0e632a11-9e24-11e9-a2ca-0800271b198a:1';
mysql> flush tables;
mysql> exit

And now RESTART the node 3326 

Check the InnoDB Cluster status again.  It comes back as NORMAL.


IMPORTANT : Please note that this is a manual fix and it can be VERY RISKY.    There is no guarantee it will work, because real situations can differ from this tutorial.

Enjoy.

2 comments:

  1. I know this is risky. Can you suggest a better way? Also, when using InnoDB Cluster, can we do things manually, or will it affect the cluster?

  2. Doing a clone (backup & restore) works if there is one instance trusted as the true source of data. However, there are cases where certain data sets live only on instance1 and others only on instance2; then any missing data across the members has to be identified and fixed manually.
