Mysql事务忘记rollback导致的死锁分析

做分布式事务的时候用到了存储过程加事务,由于忘记在存储过程中捕获异常rollback,导致了死锁

不过我select ... for update锁的是A表,实际线上却是B表被锁死

sql一块之前理解不深入,debug的过程重新复习一下

bug重现

简化逻辑重现核心bug

锁A表

1
2
3
4
5
6
7
8
9
10
11
client1:
mysql> start transaction;
Query OK, 0 rows affected (0.01 sec)

mysql> select count(1) from xids where xid = "xid_{1825}_wtf" and status = 0 for update;
+----------+
| count(1) |
+----------+
| 1 |
+----------+
1 row in set (0.01 sec)

查看innodb_trx

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
mysql> use information_schema
Database changed
mysql> select * from innodb_trx \G;
*************************** 1. row ***************************
trx_id: 21059318
trx_state: RUNNING
trx_started: 2018-12-21 17:12:16
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 2
trx_mysql_thread_id: 22129
trx_query: NULL
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 1
trx_lock_structs: 2
trx_lock_memory_bytes: 1136
trx_rows_locked: 1
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.01 sec)

通过trx_tables_locked字段,可以看到锁住了一个表,也就是我说的表A

事务执行失败

1
2
3
4
5
6
client 1:
mysql> update users set balance = balance - convert(1,unsigned) where user_id = 100000000003;
ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`dsp_adlib`.`users`.`balance` - cast(1 as unsigned))'

client 2:
update users set balance = 0 where user_id = 100000000003;

由于balance是有符号数,和一个无符号相减并且结果小于0导致溢出(之前触发器bug也遇到过。。同一个坑踩了两次)

这个时候client2的update语句等待锁,也就是client1死锁了

查看下事务和锁情况

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
mysql> select * from innodb_trx \G;
*************************** 1. row ***************************
trx_id: 21059551
trx_state: LOCK WAIT
trx_started: 2018-12-21 17:19:41
trx_requested_lock_id: 21059551:566:3:43
trx_wait_started: 2018-12-21 17:19:41
trx_weight: 2
trx_mysql_thread_id: 22131
trx_query: update users set balance = 0 where user_id = 100000000003
trx_operation_state: starting index read
trx_tables_in_use: 1
trx_tables_locked: 1
trx_lock_structs: 2
trx_lock_memory_bytes: 1136
trx_rows_locked: 1
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
trx_id: 21059318
trx_state: RUNNING
trx_started: 2018-12-21 17:12:16
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 4
trx_mysql_thread_id: 22129
trx_query: NULL
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 2
trx_lock_structs: 4
trx_lock_memory_bytes: 1136
trx_rows_locked: 2
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
2 rows in set (0.01 sec)

ERROR:
No query specified

mysql> select * from innodb_locks \G
*************************** 1. row ***************************
lock_id: 21059551:566:3:43
lock_trx_id: 21059551
lock_mode: X
lock_type: RECORD
lock_table: `dsp_adlib`.`users`
lock_index: PRIMARY
lock_space: 566
lock_page: 3
lock_rec: 43
lock_data: 100000000003
*************************** 2. row ***************************
lock_id: 21059318:566:3:43
lock_trx_id: 21059318
lock_mode: X
lock_type: RECORD
lock_table: `dsp_adlib`.`users`
lock_index: PRIMARY
lock_space: 566
lock_page: 3
lock_rec: 43
lock_data: 100000000003
2 rows in set, 1 warning (0.01 sec)

mysql> select * from innodb_lock_waits \G
*************************** 1. row ***************************
requesting_trx_id: 21059551
requested_lock_id: 21059551:566:3:43
blocking_trx_id: 21059318
blocking_lock_id: 21059318:566:3:43
1 row in set, 1 warning (0.01 sec)

可以看到client2事务id 21059551,由于client1事务id 21059318死锁无法继续进行。

client1的事务trx_tables_locked字段由1变为2,即表B被锁死

死锁分析

再次简化流程以后,尝试update成功是否会死锁

1
2
3
4
5
6
client1:
start transaction;
update users set balance = balance - 1 where user_id = 100000000003;

然后client2执行:
update users set balance = 0 where user_id = 100000000003;

发现balance表依然被锁死,那么这个和for update锁没有关系了。

只要事务过程中修改的表,就会被锁

重新阅读了美团的文章Innodb中的事务隔离级别和锁的关系

1
2
3
4
5
6
7
8
9
10
快照读:就是select
select * from table ....;
当前读:特殊的读操作,插入/更新/删除操作,属于当前读,处理的都是当前的数据,需要加锁。
select * from table where ? lock in share mode;
select * from table where ? for update;
insert;
update ;
delete;
...
RR级别中,事务A在update后加锁,事务B无法插入新数据,这样事务A在update前后读的数据保持一致,避免了幻读。这个锁,就是Gap锁。

之前只理解了这篇文章关于RC和RR模式下MVCC的分析(快照读的部分),没有仔细关注过GAP锁。

现在才发现原来事务中的update会导致锁表,而且锁也分很多种,

record锁,单条索引记录上加锁,如果没有索引会导致锁表

gap锁,在索引记录之间的间隙中加锁,或者是在某一条索引记录之前或者之后加锁,并不包括该索引记录本身。

gap lock的机制主要是解决RR模式下的幻读问题

next-key锁,每个next-key锁由record锁和2个gap锁组成

MySQL InnoDB中唯一索引和非唯一索引时的加锁情况

唯一索引和非唯一索引的范围查询会加多个next-key锁

非唯一索引的精确查询会加一个next-key锁

唯一索引的精确查询会加一个record锁

参考资料

查看事务锁:innodb_trx + innodb_locks + innodb_lock_waits

Innodb中的事务隔离级别和锁的关系

MySQL InnoDB中唯一索引和非唯一索引时的加锁情况