分析一个由于线程私有变量生命周期导致的coredump

最近实在太忙啦,好几个月没写博客了,趁着五一放假补一篇

最近运维同学调整了告警策略,将连续coredump才告警,改成了每次coredump必告警

业务部门顿时向我报障了taf框架的coredump

一开始core在了tcmalloc,因为tcmalloc不会第一时间coredump,所以内存问题会跑一段时间才出现

让业务重编了不带tcmalloc的taf框架以后,反馈每次压测以后再重启程序,必现coredump,core在了程序退出的时候

上层的堆栈依然跑崩了,不是core的第一现场,事后回想应该是读取到了已经被析构的变量,这个变量没有太离谱,所以依然能够运行一段时间

必现的coredump就很好解决了,我重编了不带tcmalloc并且启用asan的taf框架来复现问题

报错如下(和谐掉了敏感信息):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
=================================================================
==5514==ERROR: AddressSanitizer: heap-use-after-free on address 0x621000006500 at pc 0x7fd764a7458d bp 0x7fff0b4c7d00 sp 0x7fff0b4c74a8
WRITE of size 24 at 0x621000006500 thread T0
#0 0x7fd764a7458c in __interceptor_memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790
#1 0x7fd76457e647 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0x141647)
#2 0x7fd7645709b3 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0x1339b3)
#3 0x5644733a870c in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) /usr/include/c++/9/ostream:570
#4 0x5644733a870c in taf::TC_Logger<taf::TimeWriteT, taf::TC_RollByTime>::stream(int) /root/taf/include/util/tc_logger.h:880
#5 0x5644733aac6f in taf::TC_Logger<taf::TimeWriteT, taf::TC_RollByTime>::log(int) /root/taf/include/util/tc_logger.h:792
#6 0x5644721b8a89 in EventMQHelper::terminate() /home/tafjce/和谐/和谐Server/EventMQHelper.h:1593
#7 0x5644721b79ab in EventMQHelper::~EventMQHelper() /home/tafjce/和谐/和谐Server/EventMQHelper.h:1533
#8 0x5644721b7b07 in EventMQHelper::~EventMQHelper() /home/tafjce/和谐/和谐Server/EventMQHelper.h:1534
#9 0x5644727cb1fa in taf::CreateUsingNew<EventMQHelper>::destroy(EventMQHelper*) (/data/app/taf/tafnode/data/和谐SZ.和谐Server/bin/和谐Server+0x21681fa)
#10 0x5644726c0155 in taf::TC_Singleton<EventMQHelper, taf::CreateUsingNew, taf::DefaultLifetime>::destroySingleton() (/data/app/taf/tafnode/data/和谐SZ.和谐Server/bin/和谐Server+0x205d155)
#11 0x7fd76412aa26 (/lib/x86_64-linux-gnu/libc.so.6+0x49a26)
#12 0x7fd76412abdf in exit (/lib/x86_64-linux-gnu/libc.so.6+0x49bdf)
#13 0x7fd7641080b9 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b9)
#14 0x564470ac1edd in _start (/data/app/taf/tafnode/data/和谐SZ.和谐Server/bin/和谐Server+0x45eedd)

0x621000006500 is located 0 bytes inside of 4096-byte region [0x621000006500,0x621000007500)
freed by thread T0 here:
#0 0x7fd764ae96ef in operator delete[](void*) ../../../../src/libsanitizer/asan/asan_new_delete.cc:168
#1 0x564473d57bd3 in taf::LoggerBuffer::~LoggerBuffer() /root/taf/src/libutil/tc_logger.cpp:217
#2 0x7fd76412b43e in __call_tls_dtors (/lib/x86_64-linux-gnu/libc.so.6+0x4a43e)

previously allocated by thread T0 here:
#0 0x7fd764ae8787 in operator new[](unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:107
#1 0x564473d47789 in taf::LoggerBuffer::reserve(long) /root/taf/src/libutil/tc_logger.cpp:262
#2 0x564473d47789 in taf::LoggerBuffer::reserve(long) /root/taf/src/libutil/tc_logger.cpp:248
#3 0x564473d47a7a in taf::LoggerBuffer::overflow(int) /root/taf/src/libutil/tc_logger.cpp:288
#4 0x7fd76457e679 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0x141679)

tc_logger.h:880是对一个日志类所拥有的LoggerBuffer的写入操作

再根据下面的信息freed by thread T0,可知是这个变量已经被析构了

初步定位问题

所以问题大概是在程序退出的时候,EventMQHelper这个单例类在析构函数里面打印了一行日志,但此时日志类已经提前析构了,导致了coredump

这就有点不太符合我对C++的认知了

以我的理解,C++的全局变量(包括只是可见性不同的函数内静态局部变量),应该是越早构造的越晚析构

日志单例类这个全局变量,显然会比业务单例类的全局变量早构造的多,自然应该是更晚析构不应该出问题

打日志定位根本问题

既然是生命周期问题,那么把所有涉及到的变量打印一下生命周期就行了

简单的日志工具类

但问题出在日志类相关上,所以只能轮一个简单的日志类来打印日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
#include <fstream>
#include <mutex>
#include <map>
#include <util/tc_timeprovider.h>
#include <syscall.h>
#include <unistd.h>
#include "util/tc_unwind.h"

class SafeAppendFile {
public:
SafeAppendFile(const string& filename) : filename(filename) {
file.open(filename, ios_base::app);
if (!file) {
throw runtime_error("Could not open file: " + filename);
}
}

template <typename T>
SafeAppendFile& operator<<(const T& v) {
lock_guard<mutex> guard(mutex_);
file << v;
return *this;
}

SafeAppendFile& operator<<(ostream& (*f)(ostream&)) {
lock_guard<mutex> guard(mutex_);
file << f;
file.flush();
return *this;
}

~SafeAppendFile() {
file.close();
}

private:
string filename;
ofstream file;
mutex mutex_;
};

// Usage function
inline SafeAppendFile& getFile(const string& filename) {
// Static to use the same files and mutexes across calls
static map<string, SafeAppendFile> files;
static mutex mutex_;
lock_guard<mutex> guard(mutex_);
auto iter = files.find(filename);
if (iter == files.end()) { // if file not exists
files.emplace(
piecewise_construct, forward_as_tuple(filename),
forward_as_tuple(filename + "_" + to_string(TNOWMS)));
iter = files.find(filename);
}
return iter->second;
}

加日志

我在单例类的构造和析构上加上了如下日志

1
2
3
4
5
6
7
8
9
TC_Singleton(){
getFile("singleton")
<< TNOWMS << "|" << syscall(SYS_gettid) << "|" << typeid(T).name()
<< "|created|" << getBacktraceInternal() << endl;
}
virtual ~TC_Singleton(){
getFile("singleton") << TNOWMS << "|" << syscall(SYS_gettid) << "|"
<< typeid(T).name() << "|removed" << endl;
};

依次打印了当前时间,线程id,类名,创建或者删除事件

最后如果是创建事件,还打印了当前堆栈以供查看如果业务单例类太早构造以至于太晚析构,是因为什么逻辑从哪里开始构造的

LoggerBuffer类也是一样的逻辑

1
2
3
4
5
6
7
8
9
LoggerBuffer() {
getFile("singleton")
<< TNOWMS << "|" << syscall(SYS_gettid) << "|LoggerBuffer created|"
<< getBacktraceInternal() << endl;
}
~LoggerBuffer() {
getFile("singleton") << TNOWMS << "|" << syscall(SYS_gettid)
<< "|LoggerBuffer removed" << endl;
}

再复现

打印日志如下:

由于最终问题不是由于业务的单例类太早构造导致的,因此略过了打印堆栈信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
1714483932866|553036|N3taf13TracerManagerE|created|
1714483932870|553036|N3taf19CommunicatorFactoryE|created|
1714483932875|553036|N3taf13TafTimeLoggerE|created|
1714483932880|553036|N3taf8AppCacheE|created|
1714483932886|553036|N3taf13TafRollLoggerE|created|
1714483932896|553050|LoggerBuffer created|
1714483932913|553036|LoggerBuffer created|
1714483932920|553036|N3taf15TafLoggerThreadE|created|
1714483932926|553036|N3taf13TafSegvMemLogE|created|
1714483932930|553036|N3taf15TafRemoteConfigE|created|
1714483932936|553036|N3taf15TafRemoteNotifyE|created|
1714483932942|553036|N3taf14TafNodeFHelperE|created|
1714483932950|553036|N3taf20ServantHelperManagerE|created|
1714483932960|553036|N3taf14NotifyObserverE|created|
...略过一些created事件
1714483938364|553099|LoggerBuffer created|
1714483938365|553102|LoggerBuffer created|
1714483938455|553098|LoggerBuffer created|
1714483938455|553105|LoggerBuffer created|
1714483938455|553103|LoggerBuffer created|
1714483938456|553097|LoggerBuffer created|
1714483938458|553108|LoggerBuffer created|
1714483938458|553107|LoggerBuffer created|
1714483938459|553104|LoggerBuffer created|
1714483938459|553111|LoggerBuffer created|
1714483938460|553109|LoggerBuffer created|
1714483938463|553100|LoggerBuffer created|
1714483938467|553106|LoggerBuffer created|
1714483938467|553110|LoggerBuffer created|
1714483938469|553101|LoggerBuffer created|
1714483938475|553114|LoggerBuffer created|
...略过一些created事件
1714483984300|553106|13EventMQHelper|created|
...略过一些created事件
1714483993042|553059|LoggerBuffer created|
1714483998802|553122|LoggerBuffer created|
1714484001783|553169|LoggerBuffer created|
1714484035917|553115|LoggerBuffer created|
1714484053963|553255|LoggerBuffer created|
1714484053964|553255|LoggerBuffer removed
1714484053964|553115|LoggerBuffer removed
1714484053964|553097|LoggerBuffer removed
1714484053966|553099|LoggerBuffer removed
1714484053968|553107|LoggerBuffer removed
1714484053969|553101|LoggerBuffer removed
1714484053970|553111|LoggerBuffer removed
1714484053970|553110|LoggerBuffer removed
1714484053971|553114|LoggerBuffer removed
1714484053972|553103|LoggerBuffer removed
1714484053972|553109|LoggerBuffer removed
1714484053973|553106|LoggerBuffer removed
1714484053973|553104|LoggerBuffer removed
1714484053974|553100|LoggerBuffer removed
1714484053974|553098|LoggerBuffer removed
1714484053974|553102|LoggerBuffer removed
1714484053974|553105|LoggerBuffer removed
1714484053974|553108|LoggerBuffer removed
1714484053976|553036|LoggerBuffer removed
1714484054055|553036|13EventMQHelper|removed

可以发现上面的日志中,LoggerBuffer是一个线程局部变量,所以每个线程都有创建和析构

而查看coredump信息core-和谐-553036-1714484054,所以是core在了553036上

聚焦于553036的关键事件如下

1
2
3
4
5
1714483932875|553036|N3taf13TafTimeLoggerE|created|
1714483932913|553036|LoggerBuffer created|
1714483984300|553106|13EventMQHelper|created|
1714484053976|553036|LoggerBuffer removed
1714484054055|553036|13EventMQHelper|removed

可以看到早于EventMQHelper就创建了日志单例类TafTimeLogger,在core的时候他也确实没有析构,符合预期

但是早于EventMQHelper创建的LoggerBuffer这个线程局部变量,却在EventMQHelper析构前就析构了,导致了coredump

再根据大量的LoggerBuffer析构事件可以大胆推测

线程局部变量总是会先于全局变量析构(事后回看一下,对了一半,不太准确)

写demo验证问题

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#include <iostream>
#include <thread>
#include <unistd.h>
#include <syscall.h>

using namespace std;

template <typename T>
struct Singleton {
static T& getInstance() {
static T v;
return v;
}
};

struct A : public Singleton<A> {
A() {
cout << syscall(SYS_gettid) << "|A" << endl;
}
~A() {
cout << syscall(SYS_gettid) << "|~A" << endl;
}
};

struct B {
B() {
cout << syscall(SYS_gettid) << "|B" << endl;
}
~B() {
cout << syscall(SYS_gettid) << "|~B" << endl;
}
};

static B& getB() {
static thread_local B c;
return c;
}

int main() {
getB();
std::thread([]() {
getB();
A::getInstance();
}).join();
}

在gcc 9.0和clang 14.0下编译均输出

1
2
3
4
5
6
146052|B
146058|B
146058|A
146058|~B
146052|~B
146052|~A

看起来似乎是线程局部变量总是会先于全局变量析构么,严谨起见再构造一个线程不退出的例子

1
2
3
4
5
6
7
8
9
int main() {
getB();
std::thread([]() {
getB();
A::getInstance();
sleep(2); //这里sleep了2秒,因此主线程退出的时候子线程还未退出
}).detach();
sleep(1);
}

输出

1
2
3
4
5
146693|B
146699|B
146699|A
146693|~B
146693|~A

可以发现子线程的线程私有变量根本没有析构,也就是必须线程析构了,子线程的线程变量才会析构

从这两个例子可以总结:

  • 全局变量不管在哪个线程构造,总是在主线程析构

  • 线程局部变量在线程退出时才退出

    查阅https://en.cppreference.com/w/cpp/language/storage_duration可以看到

    Storage duration

    thread storage duration. The storage for the object is allocated when the thread begins and deallocated when the thread ends. Each thread has its own instance of the object. Only objects declared thread_local have this storage duration. thread_local can appear together with static or extern to adjust linkage. See non-local variables and static local variables for details on initialization of objects with this storage duration.

    线程(thread)存储期。这类对象的存储在线程开始时分配,并在线程结束时解分配。每个线程拥有它自身的对象实例。只有声明为 thread_local 的对象拥有此存储期。thread_local 能与 static 或 extern 一同出现,它们用于调整链接。关于具有此存储期的对象的初始化的细节,见非局部变量静态局部变量

当所有子线程如果先于主线程结束,那么所有线程私有变量会先析构,随后才到主线程去析构全局变量

因此准确的说法,是子线程如果先于主线程结束,那么线程局部变量总是会先于全局变量析构

在一个严谨的程序里面,子线程如果先于主线程结束应当是成立的,所以大部分情况下后面的结论也能成立

修复demo

根据刚才得到的结论,很容易可以想到,记录下线程局部变量的析构事件即可

当发生线程局部变量的析构事件,对其降级,不再依赖他即可

修复demo的代码如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
#include <iostream>
#include <thread>
#include <unistd.h>
#include <syscall.h>

using namespace std;

template <typename T>
struct Singleton {
static T& getInstance() {
static T v;
return v;
}
};

class TLsDestoryRecorder : public Singleton<TLsDestoryRecorder> {
class Recorder {
TLsDestoryRecorder *owner_;

public:
Recorder(TLsDestoryRecorder *owner) : owner_(owner) {}
~Recorder() { owner_->isDestoryed_ = true; }
};
bool isDestoryed_ = false;

public:
TLsDestoryRecorder() { static thread_local Recorder recorder{this}; }
bool isDestoryed() {
return isDestoryed_;
}
};

struct A : public Singleton<A> {
A() {
cout << syscall(SYS_gettid) << "|A" << endl;
}
~A() {
cout << syscall(SYS_gettid) << "|~A" << endl;
}
void doSomething();
};

struct B {
B() {
cout << syscall(SYS_gettid) << "|B" << endl;
}
~B() {
cout << syscall(SYS_gettid) << "|~B" << endl;
}
};

static B& getB() {
static thread_local B c;
return c;
}

void A::doSomething() {
if (TLsDestoryRecorder::getInstance().isDestoryed()) {
cout << "downgrade" << endl;
}else {
cout << "not downgrade" << endl;
getB();
}
}

struct MyExitClass : public Singleton<MyExitClass> {
~MyExitClass()
{
std::cout << "Exit logic in destructor\n";
A::getInstance().doSomething();
}
};

int main() {
TLsDestoryRecorder::getInstance();
getB();
std::thread([]() {
getB();
A::getInstance();
MyExitClass::getInstance();
}).join();
}

这里的MyExitClass在A后构造,因此会在A析构之前析构,以执行A::getInstance().doSomething()验证是否降级成功

输出

1
2
3
4
5
6
7
8
148853|B
148859|B
148859|A
148859|~B
148853|~B
Exit logic in destructor
downgrade
148853|~A

OK,降级成功~

taf框架的日志类根据这个逻辑进行优化,验证了一下,OK,业务服务不再coredump

总结

本篇根据一个自己轮的简单日志类打印所有的单例和部分线程私有变量的生命周期

大胆假设线程局部变量总是会先于全局变量析构

随后根据写demo验证,证明了如果所有子线程如果先于主线程结束,那么所有线程局部变量总是会先于全局变量析构

从而最终修复了问题