Mysql事务忘记rollback导致的死锁分析

原创内容,转载请注明出处

Posted by Weakyon Blog on December 21, 2018

做分布式事务的时候用到了存储过程加事务,由于忘记在存储过程中捕获异常rollback,导致了死锁

不过我select … for update锁的是A表,实际线上却是B表被锁死

sql一块之前理解不深入,debug的过程重新复习一下

一 bug重现

简化逻辑重现核心bug

1.1 锁A表

client1:
mysql> start transaction;
Query OK, 0 rows affected (0.01 sec)

mysql> select count(1) from xids where xid = "xid_{1825}_wtf" and status = 0 for update;
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (0.01 sec)

查看innodb_trx

mysql> use information_schema
Database changed
mysql> select * from innodb_trx \G;
*************************** 1. row ***************************
                    trx_id: 21059318
                 trx_state: RUNNING
               trx_started: 2018-12-21 17:12:16
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 2
       trx_mysql_thread_id: 22129
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 1
          trx_lock_structs: 2
     trx_lock_memory_bytes: 1136
           trx_rows_locked: 1
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.01 sec)

通过trx_tables_locked字段,可以看到锁住了一个表,也就是我说的表A

1.2 事务执行失败

client 1:
mysql> update users set balance = balance - convert(1,unsigned) where user_id = 100000000003;
ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`dsp_adlib`.`users`.`balance` - cast(1 as unsigned))'

client 2:
update users set balance = 0 where user_id = 100000000003;

由于balance是有符号数,和一个无符号相减并且结果小于0导致溢出(之前触发器bug也遇到过。。同一个坑踩了两次)

这个时候client2的update语句等待锁,也就是client1死锁了

查看下事务和锁情况

mysql> select * from innodb_trx \G;
*************************** 1. row ***************************
                    trx_id: 21059551
                 trx_state: LOCK WAIT
               trx_started: 2018-12-21 17:19:41
     trx_requested_lock_id: 21059551:566:3:43
          trx_wait_started: 2018-12-21 17:19:41
                trx_weight: 2
       trx_mysql_thread_id: 22131
                 trx_query: update users set balance = 0 where user_id = 100000000003
       trx_operation_state: starting index read
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 2
     trx_lock_memory_bytes: 1136
           trx_rows_locked: 1
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
                    trx_id: 21059318
                 trx_state: RUNNING
               trx_started: 2018-12-21 17:12:16
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 4
       trx_mysql_thread_id: 22129
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 2
          trx_lock_structs: 4
     trx_lock_memory_bytes: 1136
           trx_rows_locked: 2
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
2 rows in set (0.01 sec)

ERROR:
No query specified

mysql> select * from innodb_locks \G
*************************** 1. row ***************************
    lock_id: 21059551:566:3:43
lock_trx_id: 21059551
  lock_mode: X
  lock_type: RECORD
 lock_table: `dsp_adlib`.`users`
 lock_index: PRIMARY
 lock_space: 566
  lock_page: 3
   lock_rec: 43
  lock_data: 100000000003
*************************** 2. row ***************************
    lock_id: 21059318:566:3:43
lock_trx_id: 21059318
  lock_mode: X
  lock_type: RECORD
 lock_table: `dsp_adlib`.`users`
 lock_index: PRIMARY
 lock_space: 566
  lock_page: 3
   lock_rec: 43
  lock_data: 100000000003
2 rows in set, 1 warning (0.01 sec)

mysql> select * from innodb_lock_waits \G
*************************** 1. row ***************************
requesting_trx_id: 21059551
requested_lock_id: 21059551:566:3:43
  blocking_trx_id: 21059318
 blocking_lock_id: 21059318:566:3:43
1 row in set, 1 warning (0.01 sec)

可以看到client2事务id 21059551,由于client1事务id 21059318死锁无法继续进行。

client1的事务trx_tables_locked字段由1变为2,即表B被锁死

二 死锁分析

再次简化流程以后,尝试update成功是否会死锁

client1:
start transaction;
update users set balance = balance - 1 where user_id = 100000000003;

然后client2执行:
update users set balance = 0 where user_id = 100000000003;

发现balance表依然被锁死,那么这个和for update锁没有关系了。

只要事务过程中修改的表,就会被锁

重新阅读了美团的文章Innodb中的事务隔离级别和锁的关系

快照读:就是select
select * from table ....;
当前读:特殊的读操作,插入/更新/删除操作,属于当前读,处理的都是当前的数据,需要加锁。
select * from table where ? lock in share mode;
select * from table where ? for update;
insert;
update ;
delete;
...
RR级别中,事务A在update后加锁,事务B无法插入新数据,这样事务A在update前后读的数据保持一致,避免了幻读。这个锁,就是Gap锁。

之前只理解了这篇文章关于RC和RR模式下MVCC的分析(快照读的部分),没有仔细关注过GAP锁。

现在才发现原来事务中的update会导致锁表,而且锁也分很多种,

record锁,单条索引记录上加锁,如果没有索引会导致锁表

gap锁,在索引记录之间的间隙中加锁,或者是在某一条索引记录之前或者之后加锁,并不包括该索引记录本身。

gap lock的机制主要是解决RR模式下的幻读问题

next-key锁,每个next-key锁由record锁和2个gap锁组成

MySQL InnoDB中唯一索引和非唯一索引时的加锁情况

唯一索引和非唯一索引的范围查询会加多个next-key锁

非唯一索引的精确查询会加一个next-key锁

唯一索引的精确查询会加一个record锁

参考资料

查看事务锁:innodb_trx + innodb_locks + innodb_lock_waits

Innodb中的事务隔离级别和锁的关系

MySQL InnoDB中唯一索引和非唯一索引时的加锁情况

21 Dec 2018