Pitfalls of tcmalloc: tracking down a coredump

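Our in-house serverless platform has had a long-standing problem: ever since CPython was introduced, tcmalloc could no longer be used.

Enabling it causes an immediate coredump. Until this is fixed, users of the platform have no way to analyze memory leaks.

The problem finally blew up in a mixed Python and C++ script maintained by several departments. Memory was allocated in one department's module and freed in another's, which made cross-team memory debugging far too difficult.

So the problem had to be solved on the platform side.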

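The coredump problem

As a first step, I asked the business team to remove the Python part and check whether the pure C++ code leaked memory. Unfortunately, the coredump still occurred.

The key stack trace suggests that the problem is triggered when dlopen loads a shared library:
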
#0  0x00007fd71df27428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007fd71df2902a in __GI_abort () at abort.c:89
#2 0x00007fd71df697ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fd71e082ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007fd71df7237a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7fd71e07fcaf "free(): invalid pointer", action=3) at malloc.c:5006
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867
#5 0x00007fd71df7653c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#6 0x00007fd6e809616c in ?? () from so/libnuma.so.1
#7 0x00007fd7253c267a in call_init (l=0xd65ca00, argc=argc@entry=1, argv=argv@entry=0x7ffe88af4708, env=env@entry=0xa20e6c0) at dl-init.c:58
#8 0x00007fd7253c27cb in call_init (env=0xa20e6c0, argv=0x7ffe88af4708, argc=1, l=<optimized out>) at dl-init.c:30
#9 _dl_init (main_map=main_map@entry=0xd654800, argc=1, argv=0x7ffe88af4708, env=0xa20e6c0) at dl-init.c:120
#10 0x00007fd7253c78e2 in dl_open_worker (a=a@entry=0x7ffe88af0a50) at dl-open.c:575
#11 0x00007fd7253c2564 in _dl_catch_error (objname=objname@entry=0x7ffe88af0a40, errstring=errstring@entry=0x7ffe88af0a48, mallocedp=mallocedp@entry=0x7ffe88af0a3f,
operate=operate@entry=0x7fd7253c74d0 <dl_open_worker>, args=args@entry=0x7ffe88af0a50) at dl-error.c:187
#12 0x00007fd7253c6da9 in _dl_open (file=0xa9b9140 "npm/OneLeafProxy@7.2.20/lib/libhycodecsa.so", mode=-2147483638, caller_dlopen=
0x1054469 <leafcore::Loader::sysDLOpen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+111>, nsid=-2, argc=<optimized out>,
argv=<optimized out>, env=0xa20e6c0) at dl-open.c:660
#13 0x00007fd7251aef09 in dlopen_doit (a=a@entry=0x7ffe88af0c80) at dlopen.c:66
#14 0x00007fd7253c2564 in _dl_catch_error (objname=0xa1c4010, errstring=0xa1c4018, mallocedp=0xa1c4008, operate=0x7fd7251aeeb0 <dlopen_doit>, args=0x7ffe88af0c80) at dl-error.c:187
#15 0x00007fd7251af571 in _dlerror_run (operate=operate@entry=0x7fd7251aeeb0 <dlopen_doit>, args=args@entry=0x7ffe88af0c80) at dlerror.c:163
#16 0x00007fd7251aefa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#17 0x0000000001054469 in leafcore::Loader::sysDLOpen (this=0xa8ec780, filename="npm/OneLeafProxy@7.2.20/lib/libhycodecsa.so") at src/loader.cpp:145
#18 0x0000000001054c40 in leafcore::Loader::loadDynamicLibraries (this=0xa8ec780, dylibs=std::vector of length 23, capacity 32 = {...}, libraryDir="",
handles=std::map with 9 elements = {...}) at src/loader.cpp:197
#19 0x000000000105f086 in leafcore::CppLoader::load (this=0xa8ec780,
binary="\006\000\026\000*\a\000\261vX\177ELF\002\001\001\000\000\000\000\000\000\000\000\000\001\000>\000\001", '\000' <repeats 19 times>, "\330\354\244\000\000\000\000\000\000\000\000\000@\000\000\000\000\000@\000&2\001\000UH\211\345AVSH\203\354 H\211}\320H\211u\330L\213u\320H\270\000\000\000\000\000\000\000\000L\211\367\377\320H\270\000\000\000\000\000\000\000\000H\203\300\020I\211\006L\211\363H\203\303\bH\213u\330H\270\000\000\000\000\000\000\000\000H\211\337\377\320\353\000A\307F(\000\000\000\000H\270\000\000\000\000\000\000\000\000L\211\367\377\320\353\000H\203\304 [A^]\303H\211E\340\211U\354\353\026"..., libraryDir="") at cpp/cpp_loader.cpp:113
#20 0x0000000000bf7fc7 in Engine::addModule (this=this@entry=0xa296000, scriptInfo=..., errMsg="", doPrepare=doPrepare@entry=0) at Engine.cpp:418
#21 0x0000000000bfdc7e in Engine::doLoadModule (this=0xa296000, sScriptName="97b893b5-2aa1-4def-8b9a-6b9e2446c8ea_0", sVersion="2",
iError=@0x7ffe88af458c: HUYA::PreinstallRetValue_Success, errMsg="", doPrepare=0) at Engine.cpp:932
#22 0x0000000000b00dda in main (argc=1, argv=0x7ffe88af4708) at main.cpp:121

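However, my simple local test case, linked against tcmalloc, could open the same shared library without any problem.

Looking closer, the stack contains one key piece of information: inside dlopen, a call to glibc's free failed with an "invalid pointer" error:
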
#3  0x00007fd71df7237a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7fd71e07fcaf "free(): invalid pointer", action=3) at malloc.c:5006
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867
#5 0x00007fd71df7653c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
...
#16 0x00007fd7251aefa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87

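Reproducing the problem

Hmm. free should have been hooked by tcmalloc, and tcmalloc allocates its memory with mmap and sbrk, so where does glibc's free come from?

Could it be a tcmalloc bug? Did the hook stop working?

The serverless platform does do something unusual when it dlopens shared libraries (it passes RTLD_DEEPBIND and RTLD_LOCAL), so I wrote a simple demo to reproduce the problem.

The idea is to mimic the coredump scenario: allocate memory in the main program and free it in a shared library.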

flowchart LR
    Main[Main program]
    Lib[Shared library]
    Mem[Memory block#40;allocated by the main program#41;]

    Main -->|malloc#40;ptr#41;| Mem
    Main -->|freeMemory#40;ptr#41;| Lib
    Lib -->|free#40;ptr#41;| Mem

The shared library code is simple: it frees the pointer inside the library by calling free

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_coredump/dynamic_lib.cpp

#include <cstdlib>

extern "C" void freeMemory(void* ptr) {
    free(ptr);
}

Next comes the main program. To make the comparison clear, it calls the library both through direct linking and through dlopen/dlsym

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_coredump/main.cpp

#include <gperftools/tcmalloc.h>
#include <dlfcn.h>
#include <cstdlib>

extern "C" void freeMemory(void* ptr);

using namespace std;

void my_free(void* ptr) {
    tc_free(ptr);
}

int main() {
    // 1) Allocate and free with tcmalloc in the main program
    void* ptr = malloc(128);
    if (!ptr) {
        return 1;
    }
    free(ptr);

    // 2) Allocate with tcmalloc in the main program, free in the directly linked library (not opened with dlopen RTLD_DEEPBIND)
    ptr = malloc(128);
    if (!ptr) {
        return 1;
    }
    // Hand the memory to the shared library to free
    freeMemory(ptr);

    // 3) Allocate with tcmalloc in the main program, free in the library opened with dlopen RTLD_DEEPBIND
    // A copy of the library is used here so that dlopen does not return the already-loaded one
    ptr = malloc(128);
    if (!ptr) {
        return 1;
    }
    void* handle = dlopen("./libdynamic1.so", RTLD_NOW | RTLD_DEEPBIND | RTLD_LOCAL);
    if (!handle) {
        return 1;
    }
    using FreeMemoryFuncT = decltype(&freeMemory);
    FreeMemoryFuncT freeMemory1 = (FreeMemoryFuncT)dlsym(handle, "freeMemory");
    if (!freeMemory1) {
        return 1;
    }
    freeMemory1(ptr);

    return 0;
}

The libdynamic1.so opened by dlopen here is a copy produced by the Makefile

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_coredump/Makefile

MAIN_DIR := $(shell git rev-parse --show-toplevel)

all: libdynamic.so
	g++ -std=c++14 -g -o main main.cpp -L. -ldynamic -ltcmalloc -lpthread -ldl
	patchelf --set-rpath . main

libdynamic.so: dynamic_lib.cpp
	g++ -g -shared -fPIC -o libdynamic.so dynamic_lib.cpp
	cp libdynamic.so libdynamic1.so

clean:
	rm -f main libdynamic.so libdynamic1.so
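Why go to this trouble?

Because calling dlopen on a shared library that is already loaded (a directly linked library counts as already loaded) just returns the existing copy, even if the flags passed this time differ from the first load.

Opening it through a different path or a symlink does not help either, because the loader still recognizes it as the same underlying file. Only a real copy of the file is loaded as a separate object.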
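The reuse of an already-loaded object can be checked with a small sketch. It assumes a glibc system where the library name passed in (for example libm.so.6) can be loaded by name, and the helper name reopenReturnsSameHandle is made up for illustration. It only demonstrates that the second dlopen hands back the same handle even though the flags differ.

```cpp
#include <cassert>
#include <dlfcn.h>

// Hypothetical helper: opens `path` twice with different flags and reports
// whether both calls returned the same handle, i.e. whether the second
// dlopen reused the already-loaded object instead of loading it again.
bool reopenReturnsSameHandle(const char* path) {
    assert(path != nullptr);
    void* first = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
    if (!first) {
        return false;  // library could not be loaded at all
    }
    void* second = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    bool same = (second != nullptr) && (first == second);
    if (second) {
        dlclose(second);
    }
    dlclose(first);
    return same;
}
```

On older glibc versions this needs to be linked with -ldl, just like the demo above.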

Build and run it: sure enough, it cores. Here is what gdb shows:

(gdb) bt
#0 0x00007f79ff5e7428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f79ff5e902a in __GI_abort () at abort.c:89
#2 0x00007f79ff6297ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f79ff742ed8 "*** Error in `%s': %s: 0x%s ***\n")
at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007f79ff63237a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7f79ff73fcaf "free(): invalid pointer", action=3) at malloc.c:5006
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867
#5 0x00007f79ff63653c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#6 0x00007f7a00190168 in freeMemory (ptr=0x172a000) at dynamic_lib.cpp:4
#7 0x0000000000401289 in main () at main.cpp:44

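The crash is reproduced, and the stack is identical to the one from production.

Root cause analysis

So passing RTLD_DEEPBIND and RTLD_LOCAL seems to make tcmalloc misbehave.

Testing each flag separately shows:

  • dlopen("./libdynamic1.so", RTLD_NOW | RTLD_DEEPBIND);

    cores

  • dlopen("./libdynamic1.so", RTLD_NOW | RTLD_LOCAL);

    does not core

So RTLD_DEEPBIND is the culprit. Here is what man dlopen says: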

RTLD_DEEPBIND (since glibc 2.3.4)

Place the lookup scope of the symbols in this shared object ahead of the global scope. This means that a self-contained object will use its own symbols in preference to global symbols with the same name contained in objects that have already been loaded.

What is "the lookup scope of the symbols in this shared object"?

Take libdynamic.so as an example. The shared libraries it depends on are recorded in its ELF file:

~ readelf -d libdynamic.so|grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]

With ldd, the paths found by the system's library search are listed as well:

~ ldd libdynamic.so 
linux-vdso.so.1 => (0x00007ffffaf49000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2a20cf8000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2a210c2000)

A more complex example is ssh:

~ readelf -d /usr/bin/ssh|grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libselinux.so.1]
0x0000000000000001 (NEEDED) Shared library: [libcrypto.so.1.0.0]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
0x0000000000000001 (NEEDED) Shared library: [libresolv.so.2]
0x0000000000000001 (NEEDED) Shared library: [libgssapi_krb5.so.2]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]

Because its shared libraries depend on further shared libraries, the ldd output is much longer:

~ ldd /usr/bin/ssh
linux-vdso.so.1 => (0x00007ffc493ca000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f7ae1732000)
libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f7ae12ee000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7ae10ea000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f7ae0ed0000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f7ae0cb5000)
libgssapi_krb5.so.2 => /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007f7ae0a6b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7ae06a1000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f7ae0431000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7ae1c04000)
libkrb5.so.3 => /usr/lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007f7ae015f000)
libk5crypto.so.3 => /usr/lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007f7adff30000)
libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007f7adfd2c000)
libkrb5support.so.0 => /usr/lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007f7adfb21000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7adf904000)
libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007f7adf700000)

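So "the lookup scope of the symbols in this shared object" is the object itself plus the shared libraries it depends on.

Placing that scope ahead of the global scope means: no matter what a symbol resolves to in the global scope, the library binds to the definition found in its own dependencies first.

tcmalloc hooks glibc's memory allocation functions such as malloc and free by means of dynamic-linking symbol interposition.

The underlying details, such as the GOT, are covered in an earlier post of mine: https://weakyon.com/2022/09/12/magical-effect-of-hook.html

I won't go into them here. In short, tcmalloc defines its own function named malloc, and because it takes priority over glibc's in the lookup order, it overrides glibc's malloc.

RTLD_DEEPBIND breaks this simple hooking rule: a library loaded with it looks in its own dependencies first, finds free in libc.so.6, and never sees tcmalloc's version. A pointer allocated by tcmalloc is therefore handed to glibc's free, which rejects it as an invalid pointer.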

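Solution

Apply a deeper hook: patch the machine code of glibc's malloc, free and the other allocation functions in memory so that they jump to the hook functions. That solves the problem.

The hook code is here:

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_fix_coredump/hook.cpp

The idea is to change the page protection from read-only to writable, and then write the following at the start of the hooked function:

  • the 6-byte instruction FF 25 00 00 00 00, an indirect jump through an absolute address (its meaning is explained in another post: https://weakyon.com/2025/08/28/analyzing-the-source-of-LLVM-MCJIT.html#ff-25jmpq)
  • the 8-byte address of the hook function

14 bytes in total
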
#include <sys/mman.h>
#include <dlfcn.h>
#include <cstring>
#include <stdlib.h>
#include <stdint.h>

// Overwrite the first 14 bytes of the original function with an absolute indirect jump to the target function
void simple_hook(void* sym, void* targetFunc) {
    unsigned char patch[14] = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};
    memcpy(&patch[6], &targetFunc, 8);

    // Change the page protection (mprotect needs a 4K-aligned address), write the patch, then restore the protection
    void* pstart = reinterpret_cast<void*>(reinterpret_cast<uint64_t>(sym) &
                                           0xFFFFFFFFFFFFF000);

    if (mprotect(pstart, 4096, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        abort();

    memcpy(sym, patch, sizeof(patch));

    if (mprotect(pstart, 4096, PROT_READ | PROT_EXEC) != 0)
        abort();
}
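One caveat worth sketching: simple_hook changes the protection of a single 4 KiB page, so if the 14 patched bytes happened to straddle a page boundary, the tail would land on a page that was never made writable. The helper below is only an illustration of how the range passed to mprotect could be computed to cover both pages; the names PageSpan and coveringPages are hypothetical and are not part of the repository's hook.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PageSpan {
    std::uintptr_t start;  // page-aligned start address
    std::size_t length;    // number of bytes to pass to mprotect
};

// Computes the page-aligned range that covers [addr, addr + len).
// pageSize must be a power of two (4096 on typical x86-64 Linux).
PageSpan coveringPages(std::uintptr_t addr, std::size_t len, std::size_t pageSize) {
    assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0);
    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(pageSize) - 1);
    const std::uintptr_t start = addr & mask;                       // round down
    const std::uintptr_t end = (addr + len + pageSize - 1) & mask;  // round up
    return PageSpan{start, static_cast<std::size_t>(end - start)};
}
```

With such a helper, a 14-byte patch at an address ending in 0xffa would yield a two-page span, while a patch well inside a page still yields exactly one page.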

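main needs to load the address of the free to be hooked from glibc and pass it as sym.

Likewise, it needs the address of the hook target, tc_free, to pass as targetFunc.

Since the ordinary hook already works in the main program, looking up free through RTLD_DEFAULT yields tcmalloc's free.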
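Before patching anything, it can be reassuring to confirm which loaded object each address really belongs to. The following is a minimal sketch using dladdr; the helper name objectContaining is made up here, and the exact paths it returns depend on the system.

```cpp
#include <cassert>
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns the path of the loaded object that contains
// `addr`, or an empty string if dladdr cannot resolve the address.
std::string objectContaining(void* addr) {
    Dl_info info{};
    if (addr != nullptr && dladdr(addr, &info) != 0 && info.dli_fname != nullptr) {
        return info.dli_fname;
    }
    return std::string();
}
```

In a program linked against tcmalloc, objectContaining(dlsym(RTLD_DEFAULT, "free")) would be expected to name the tcmalloc library, while the address looked up through a handle to libc.so.6 should name libc.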

The complete code is as follows:

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_fix_coredump/main.cpp

void hookGlibc() {
    const char* library_path = "/lib/x86_64-linux-gnu/libc.so.6";
    void* handle = dlopen(library_path, RTLD_LOCAL | RTLD_DEEPBIND | RTLD_NOW);
    if (!handle) {
        cerr << "Failed to open library: " << dlerror() << endl;
        quick_exit(1);
    }
    void* symbol = dlsym(handle, "free");
    // void* symbol = dlsym(RTLD_NEXT, "free"); => resolves to glibc's free
    if (symbol) {
        cout << "Symbol found in " << library_path << " at address: " << symbol << endl;
    } else {
        cerr << "Failed to find symbol in " << library_path << ": " << dlerror() << endl;
        quick_exit(1);
    }
    void* hookSymbol = dlsym(RTLD_DEFAULT, "free");  // => resolves to tcmalloc's free
    if (hookSymbol) {
        cout << "Symbol found in RTLD_DEFAULT at address: " << hookSymbol << endl;
    } else {
        cerr << "Failed to find symbol in RTLD_DEFAULT: " << dlerror() << endl;
        quick_exit(1);
    }
    simple_hook(symbol, hookSymbol);
}

int main() {
    hookGlibc();
    // ... rest omitted
}

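Verification

After this change the program no longer coredumps; the problem is solved.

Let's verify in gdb that glibc's free really has been overwritten.

The free symbol with and without tcmalloc

First, write a simple program to see which symbol free resolves to when tcmalloc is not linked:

#include <stdlib.h>
int main() {
    void* ptr = malloc(128);
    free(ptr);
    return 0;
}

Build and run:
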
~ g++ -std=c++14 -g -o main main.cpp 
~ gdb main
(gdb) b main
Breakpoint 1 at 0x401419: file main.cpp, line 2.
(gdb) r
Starting program: /root/tcmalloc_hook_debug/tcmalloc_fix_coredump/main

Breakpoint 1, main () at main.cpp:37
void* ptr = malloc(128);
(gdb) n
free(ptr)
(gdb) n
return 0;

At this point free has already been executed, so its GOT entry has been filled in.

Searching for the PLT stubs of free turns up several of them:

(gdb) info func free\@plt
All functions matching regular expression "free\@plt":

Non-debugging symbols:
0x00000000004010b0 free@plt
0x00007ffff7bd3ce0 free@plt
0x00007ffff78d8db0 free@plt
0x00007ffff6f6a7e0 free@plt

The first one looks like it belongs to the main program. Let's confirm:

(gdb) info symbol 0x00000000004010b0
free@plt in section .plt of /root/tcmalloc_hook_debug/tcmalloc_fix_coredump/main

Next, see which GOT entry the PLT stub's code jumps through:

(gdb) disassemble 0x00000000004010b0
Dump of assembler code for function free@plt:
0x00000000004010b0 <+0>: jmpq *0x2f8a(%rip) # 0x404040
0x00000000004010b6 <+6>: pushq $0x8
0x00000000004010bb <+11>: jmpq 0x401020

Then print the real address of free stored in that GOT entry:

(gdb) x/gx 0x404040
0x404040: 0x00007ffff750b4f0
(gdb) disassemble 0x00007ffff750b4f0
Dump of assembler code for function __GI___libc_free:
0x00007ffff750b4f0 <+0>: push %r13
0x00007ffff750b4f2 <+2>: push %r12
0x00007ffff750b4f4 <+4>: push %rbp
0x00007ffff750b4f5 <+5>: push %rbx
0x00007ffff750b4f6 <+6>: sub $0x28,%rsp
0x00007ffff750b4fa <+10>: mov 0x33f9f7(%rip),%rax # 0x7ffff784aef8

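So glibc's free is __GI___libc_free. Repeating the same steps with tcmalloc linked in shows that free then resolves to tc_free.

What needs to be verified, then, is that __GI___libc_free correctly jumps to tc_free.

Verifying that the hook works

Print __GI___libc_free before the hook is applied:
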
(gdb) b main
Breakpoint 1 at 0x401409: file main.cpp, line 36.
(gdb) r
Starting program: /root/tcmalloc_hook_debug/tcmalloc_fix_coredump/main

Breakpoint 1, main () at main.cpp:36
36 hookGlibc();
(gdb) disassemble __GI___libc_free
Dump of assembler code for function __GI___libc_free:
0x00007ffff71144f0 <+0>: push %r13
0x00007ffff71144f2 <+2>: push %r12
0x00007ffff71144f4 <+4>: push %rbp
0x00007ffff71144f5 <+5>: push %rbx
0x00007ffff71144f6 <+6>: sub $0x28,%rsp
0x00007ffff71144fa <+10>: mov 0x33f9f7(%rip),%rax # 0x7ffff7453ef8

Then print glibc's free again after the hook:

(gdb) n
Symbol found in /lib/x86_64-linux-gnu/libc.so.6 at address: 0x7ffff71144f0
Symbol found in RTLD_DEFAULT at address: 0x7ffff7a16f00
40 void *ptr = malloc(128);
(gdb) disassemble __GI___libc_free
Dump of assembler code for function __GI___libc_free:
0x00007ffff71144f0 <+0>: jmpq *0x0(%rip) # 0x7ffff71144f6 <__GI___libc_free+6>

It has become jmpq *0x0(%rip), that is, a jump to the address stored at 0x7ffff71144f6, immediately after the instruction.

Let's see what is stored there:

(gdb) x/gx 0x7ffff71144f6
0x7ffff71144f6 <__GI___libc_free+6>: 0x00007ffff7a16f00
(gdb) disassemble 0x00007ffff7a16f00
Dump of assembler code for function tc_free(void*):
0x00007ffff7a16f00 <+0>: mov 0x3b81f9(%rip),%rax # 0x7ffff7dcf100 <_ZN4base8internal13delete_hooks_E>
0x00007ffff7a16f07 <+7>: test %rax,%rax
0x00007ffff7a16f0a <+10>: jne 0x7ffff7a16fa0 <tc_free(void*)+160>
0x00007ffff7a16f10 <+16>: mov 0x211e11(%rip),%rax # 0x7ffff7c28d28

That is indeed tc_free. Verification complete.
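The same check can also be done in code instead of gdb. As a rough sketch (the function name decodeAbsoluteJump is hypothetical), the first 14 bytes of a patched function can be decoded to recover the jump target, which should then equal the address of tc_free:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns true and stores the jump target in *target if `code` starts with
// FF 25 00 00 00 00 (jmpq *0x0(%rip)) followed by an 8-byte absolute address.
bool decodeAbsoluteJump(const unsigned char* code, std::uint64_t* target) {
    assert(code != nullptr && target != nullptr);
    static const unsigned char kJmp[6] = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};
    if (std::memcmp(code, kJmp, sizeof(kJmp)) != 0) {
        return false;  // not the patch written by simple_hook
    }
    std::memcpy(target, code + 6, sizeof(*target));  // address stored in native byte order
    return true;
}
```

Passing the address returned by dlsym(handle, "free"), cast to const unsigned char*, would correspond to the x/gx step shown in the gdb session.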

The list of functions to hook

The APIs that need hooking are all listed in one tcmalloc file

https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/gperftools/tcmalloc.h.in#L87-L106

A bit of simple code that hooks every one of them is enough

#include <dlfcn.h>
#include <cstdlib>
#include <iostream>
#include <gperftools/tcmalloc.h>

// Defined in hook.cpp
void simple_hook(void* sym, void* targetFunc);

void* getGlibc() {
    const char* library_path = "/lib/x86_64-linux-gnu/libc.so.6";
    void* handle = dlopen(library_path, RTLD_LOCAL | RTLD_DEEPBIND | RTLD_NOW);
    if (!handle) {
        std::cerr << "Failed to open library: " << dlerror() << std::endl;
        std::quick_exit(1);
    }
    return handle;
}

// RTLD_DEFAULT resolves to tcmalloc's functions, RTLD_NEXT to glibc's.
// Look up libc_func in glibc and patch it to jump to tcmalloc_func.
#define HOOK_FUNC_RENAME(libc_func, tcmalloc_func)                      \
    do {                                                                \
        void* f = dlsym(getGlibc(), #libc_func);                        \
        if (!f) {                                                       \
            std::cerr << "hook " #libc_func " failed" << std::endl;     \
            std::quick_exit(1);                                         \
        }                                                               \
        simple_hook(f, (void*)tcmalloc_func);                           \
    } while (0)

// Common case: the tcmalloc counterpart is named tc_<libc_func>
#define HOOK_FUNC(libc_func) HOOK_FUNC_RENAME(libc_func, tc_##libc_func)

void hookAll() {
    HOOK_FUNC(malloc);
    HOOK_FUNC(free);
    HOOK_FUNC(realloc);
    HOOK_FUNC(calloc);
    HOOK_FUNC(cfree);
    HOOK_FUNC(memalign);
    HOOK_FUNC(posix_memalign);
    HOOK_FUNC(valloc);
    HOOK_FUNC(pvalloc);
    HOOK_FUNC(malloc_stats);
    HOOK_FUNC(mallopt);
    HOOK_FUNC_RENAME(malloc_usable_size, tc_malloc_size);
}

Incomplete memory statistics

I had celebrated too early, though. In some scenarios part of the memory was missing from the profile.

I reduced it to the following case.

First, the shared library allocates some memory with mmap

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_miss_mmap_hook/dynamic_lib.cpp

#include <cstdlib>
#include <sys/mman.h>

extern "C" void* test() {
    return mmap(NULL, 4 * 1024 * 1024, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
}

The main program then loads the library with dlopen, calls it, and dumps the memory usage

#include <iostream>
#include <fstream>
#include <dlfcn.h>
#include <assert.h>

using namespace std;

#include <gperftools/heap-profiler.h>
int main() {
    void *handle =
        dlopen("./libdynamic.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        return 1;
    }
    using testFuncT = void*(*)();
    testFuncT testFunc = (testFuncT)dlsym(handle, "test");
    if (!testFunc) {
        return 1;
    }

    HeapProfilerStart("");

    for (int i = 0;i < 10;i++) {
        testFunc();
    }
    // Using MallocExtension::instance()->GetHeapSample also fails to capture mmap
    string s = GetHeapProfile();
    HeapProfilerStop();

    fstream f;
    f.open("./allbin.hprof", ios_base::out);
    f << s;
    f.close();

    return 0;
}

According to the documentation at https://gperftools.github.io/gperftools/heapprofile.html

HEAP_PROFILE_MMAP default: false Profile mmap, mremap and sbrk calls in addition to malloc, calloc, realloc, and new. NOTE: this causes the profiler to profile calls internal to tcmalloc, since tcmalloc and friends use mmap and sbrk internally for allocations. One partial solution is to filter these allocations out when running pprof, with something like pprof --ignore='DoAllocWithArena|SbrkSysAllocator::Alloc|MmapSysAllocator::Alloc'.

So set the environment variable HEAP_PROFILE_MMAP to turn on mmap profiling manually and run:

~ HEAP_PROFILE_MMAP=1 ./main

Parse the resulting allbin.hprof with pprof (the --ignore option is the one mentioned in the HEAP_PROFILE_MMAP documentation):

~ pprof --ignore='DoAllocWithArena|SbrkSysAllocator::Alloc|MmapSysAllocator::Alloc' --text --lines ./main allbin.hprof
Using local file ./main.
Using local file allbin.hprof.
Total: 44.6 MB
40.0 97.6% 97.6% 40.0 97.6% test /root/tcmalloc_hook_debug/tcmalloc_miss_mmap_hook/dynamic_lib.cpp:7
1.0 2.4% 100.0% 1.0 2.4% base::subtle::NoBarrier_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:81
0.0 0.0% 100.0% 1.0 2.4% GetHeapProfile /root/tcmalloc_hook_debug/gperftools/build/../src/heap-profiler.cc:213
0.0 0.0% 100.0% 1.0 2.4% SpinLock::Lock (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:69
0.0 0.0% 100.0% 1.0 2.4% SpinLockHolder::SpinLockHolder (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:133
0.0 0.0% 100.0% 41.0 100.0% __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
0.0 0.0% 100.0% 41.0 100.0% _start ??:0
0.0 0.0% 100.0% 1.0 2.4% base::subtle::Acquire_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:109
0.0 0.0% 100.0% 40.0 97.6% main /root/tcmalloc_hook_debug/tcmalloc_miss_mmap_hook/main.cpp:24
0.0 0.0% 100.0% 1.0 2.4% main /root/tcmalloc_hook_debug/tcmalloc_miss_mmap_hook/main.cpp:27

The profile shows 40 MB allocated through mmap in dynamic_lib.cpp, which is as expected.

Now take dlopen("./libdynamic.so", RTLD_NOW | RTLD_LOCAL);

add RTLD_DEEPBIND so that it becomes dlopen("./libdynamic.so", RTLD_NOW | RTLD_DEEPBIND | RTLD_LOCAL); and try again:

~ pprof --ignore='DoAllocWithArena|SbrkSysAllocator::Alloc|MmapSysAllocator::Alloc' --text --lines ./main allbin.hprof
Using local file ./main.
Using local file allbin.hprof.
Total: 4.6 MB
1.0 100.0% 100.0% 1.0 100.0% base::subtle::NoBarrier_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:81
0.0 0.0% 100.0% 1.0 100.0% GetHeapProfile /root/tcmalloc_hook_debug/gperftools/build/../src/heap-profiler.cc:213
0.0 0.0% 100.0% 1.0 100.0% SpinLock::Lock (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:69
0.0 0.0% 100.0% 1.0 100.0% SpinLockHolder::SpinLockHolder (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:133
0.0 0.0% 100.0% 1.0 100.0% __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
0.0 0.0% 100.0% 1.0 100.0% _start ??:0
0.0 0.0% 100.0% 1.0 100.0% base::subtle::Acquire_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:109
0.0 0.0% 100.0% 1.0 100.0% main /root/tcmalloc_hook_debug/tcmalloc_miss_mmap_hook/main.cpp:27

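Reproduced: the mmap calls in dynamic_lib.cpp are not captured.

Solution

A potential problem with simple_hook

A hook is needed again, but the situation is different this time.

For the functions hooked in the previous section, tcmalloc provides a complete implementation of each glibc API, so glibc's free can simply be redirected to tcmalloc's free, leaving glibc's free unused.

mmap, however, is the channel through which tcmalloc itself obtains memory. If it were disabled outright, tcmalloc could not work either.

The approach: write a small program to find out which symbols are used when mmap and sbrk are called normally, and then check whether tcmalloc itself uses them.

That program is https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_mmap_sbrk/main.cpp

#include <unistd.h>
#include <sys/mman.h>
int main() {
    sbrk(10);
    mmap(NULL, 4 * 1024 * 1024, PROT_READ | PROT_WRITE,
         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    return 0;
}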

First, the symbol used for sbrk:

(gdb) b main
Breakpoint 1 at 0x40116a: file main.cpp, line 4.
(gdb) r
Starting program: /root/tcmalloc_hook_debug/tcmalloc_fix_mmap_hook/tmp/main

Breakpoint 1, main () at main.cpp:4
4 sbrk(10);
(gdb) n
6 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
(gdb) n
7 return 0;
(gdb) info func sbrk\@
All functions matching regular expression "sbrk\@":

Non-debugging symbols:
0x0000000000401050 sbrk@plt
(gdb) disassemble 0x0000000000401050
Dump of assembler code for function sbrk@plt:
0x0000000000401050 <+0>: jmpq *0x2fba(%rip) # 0x404010
0x0000000000401056 <+6>: pushq $0x2
0x000000000040105b <+11>: jmpq 0x401020
End of assembler dump.
(gdb) x/gx 0x404010
0x404010: 0x00007ffff7b09e80
(gdb) disassemble 0x00007ffff7b09e80
Dump of assembler code for function __GI___sbrk:
0x00007ffff7b09e80 <+0>: push %r12
0x00007ffff7b09e82 <+2>: mov 0x2c703f(%rip),%r12 # 0x7ffff7dd0ec8
0x00007ffff7b09e89 <+9>: push %rbp

Then mmap:

(gdb) info func mmap\@
All functions matching regular expression "mmap\@":

Non-debugging symbols:
0x0000000000401030 mmap@plt
(gdb) disassemble 0x0000000000401030
Dump of assembler code for function mmap@plt:
0x0000000000401030 <+0>: jmpq *0x2fca(%rip) # 0x404000
0x0000000000401036 <+6>: pushq $0x0
0x000000000040103b <+11>: jmpq 0x401020
End of assembler dump.
(gdb) x/gx 0x404000
0x404000: 0x00007ffff7b0e680
(gdb) disassemble 0x00007ffff7b0e680
Dump of assembler code for function __mmap:
0x00007ffff7b0e680 <+0>: test %rdi,%rdi
0x00007ffff7b0e683 <+3>: push %r15
0x00007ffff7b0e685 <+5>: mov %r9,%r15

Both symbols belong to glibc:

(gdb) info symbol __mmap
mmap64 in section .text of /lib/x86_64-linux-gnu/libc.so.6
(gdb) info symbol __GI___sbrk
sbrk in section .text of /lib/x86_64-linux-gnu/libc.so.6

Now look at the source to see how tcmalloc implements sbrk and mmap.

tcmalloc's sbrk implementation

https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/malloc_hook_mmap_linux.h#L211-L218

// libc's version:
extern "C" void* __sbrk(intptr_t increment);

extern "C" void* sbrk(intptr_t increment) __THROW {
  MallocHook::InvokePreSbrkHook(increment);
  void *result = __sbrk(increment);
  MallocHook::InvokeSbrkHook(result, increment);
  return result;
}

It calls glibc's symbol directly, so glibc's sbrk cannot simply be overwritten.

tcmalloc's mmap implementation

https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/malloc_hook_mmap_linux.h#L173-L184

extern "C" void* mmap(void *start, size_t length, int prot, int flags,
                      int fd, off_t offset) __THROW {
  MallocHook::InvokePreMmapHook(start, length, prot, flags, fd, offset);
  void *result;
  if (!MallocHook::InvokeMmapReplacement(
          start, length, prot, flags, fd, offset, &result)) {
    result = do_mmap64(start, length, prot, flags, fd,
                       static_cast<size_t>(offset)); // avoid sign extension
  }
  MallocHook::InvokeMmapHook(result, start, length, prot, flags, fd, offset);
  return result;
}

Next, do_mmap64

https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/malloc_hook_mmap_linux.h#L61C1-L65C2

static inline void* do_mmap64(void *start, size_t length,
                              int prot, int flags,
                              int fd, __off64_t offset) __THROW {
  return sys_mmap(start, length, prot, flags, fd, offset);
}

Ah, this one calls straight into the Linux interface

https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/base/linux_syscall_support.h#L2800-L2805

#if defined(__x86_64__)
  /* Need to make sure __off64_t isn't truncated to 32-bits under x32. */
  LSS_INLINE void* LSS_NAME(mmap)(void *s, size_t l, int p, int f, int d,
                                  int64_t o) {
    LSS_BODY(6, void*, mmap, LSS_SYSCALL_ARG(s), LSS_SYSCALL_ARG(l),
             LSS_SYSCALL_ARG(p), LSS_SYSCALL_ARG(f),
             LSS_SYSCALL_ARG(d), (uint64_t)(o));
  }

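So it does not use glibc's mmap wrapper and goes directly to the system call.

Summary so far

mmap can therefore still be handled with simple_hook, while sbrk needs a more powerful hooking tool.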
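To illustrate why overwriting glibc's mmap does not cut tcmalloc off from memory, here is a small sketch that maps memory without calling the mmap symbol at all. It assumes x86-64 Linux, where SYS_mmap takes the same six arguments as mmap. Unlike linux_syscall_support.h, which uses inline assembly, this sketch goes through glibc's generic syscall() function, and the name rawAnonymousMmap is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

// Maps `length` bytes of anonymous private memory by issuing the mmap system
// call via syscall(2), so a patched glibc mmap function is never entered.
// Returns MAP_FAILED on error (errno is set by syscall()).
void* rawAnonymousMmap(std::size_t length) {
    assert(length > 0);
    long ret = syscall(SYS_mmap,
                       static_cast<void*>(nullptr),
                       length,
                       static_cast<long>(PROT_READ | PROT_WRITE),
                       static_cast<long>(MAP_ANONYMOUS | MAP_PRIVATE),
                       -1L,
                       0L);
    return ret == -1 ? MAP_FAILED : reinterpret_cast<void*>(ret);
}
```

Memory obtained this way is invisible to any hook placed on glibc's mmap, which is the same reason tcmalloc's own mappings keep working after the patch.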

PFishHook

That tool is https://github.com/Menooker/PFishHook

PFishHook copies a few bytes at the head of the target function to a new "shadown function". Then it replace the head of the target function with a jump to the function specified by the user. And it returns the address of the "shadown function" to users.

The "shadown function" has the same functionality of the original function.

Take hooking free as an example.

free is hooked to my_free, and my_free in turn calls the shadow function created by PFishHook

HookStatus HookIt(void* oldfunc, void** poutold, void* newfunc);

using free_t = decltype(&free);
static free_t free_f = nullptr;
void my_free(void* ptr) {
    free_f(ptr);
}
int main() {
    HookIt(dlsym(getGlibc(), "free"), (void**)&free_f, (void*)&my_free);
}

Before the hook

(gdb) x/10i __GI___libc_free
0x7ffff6dcf4f0 <__GI___libc_free>: push %r13
0x7ffff6dcf4f2 <__GI___libc_free+2>: push %r12
0x7ffff6dcf4f4 <__GI___libc_free+4>: push %rbp
0x7ffff6dcf4f5 <__GI___libc_free+5>: push %rbx
0x7ffff6dcf4f6 <__GI___libc_free+6>: sub $0x28,%rsp
0x7ffff6dcf4fa <__GI___libc_free+10>: mov 0x33f9f7(%rip),%rax # 0x7ffff710eef8
0x7ffff6dcf501 <__GI___libc_free+17>: mov (%rax),%rax
0x7ffff6dcf504 <__GI___libc_free+20>: test %rax,%rax
0x7ffff6dcf507 <__GI___libc_free+23>: jne 0x7ffff6dcf5e0 <__GI___libc_free+240>
0x7ffff6dcf50d <__GI___libc_free+29>: test %rdi,%rdi

After the hook

(gdb) x/10i __GI___libc_free
0x7ffff6dcf4f0 <__GI___libc_free>: jmpq *0x0(%rip) # 0x7ffff6dcf4f6 <__GI___libc_free+6>
0x7ffff6dcf4f6 <__GI___libc_free+6>: xchg %eax,%esi
0x7ffff6dcf4f7 <__GI___libc_free+7>: push %rbp
0x7ffff6dcf4f8 <__GI___libc_free+8>: add %al,(%rax)
0x7ffff6dcf4fb <__GI___libc_free+11>: add %al,(%rax)
0x7ffff6dcf4fd <__GI___libc_free+13>: add %cl,%ah
0x7ffff6dcf4ff <__GI___libc_free+15>: int3
0x7ffff6dcf500 <__GI___libc_free+16>: int3
0x7ffff6dcf501 <__GI___libc_free+17>: mov (%rax),%rax
0x7ffff6dcf504 <__GI___libc_free+20>: test %rax,%rax

Here too, an FF 25 jump is used, going to the address stored at 0x7ffff6dcf4f6.

The value stored there is:

(gdb) x/gx 0x7ffff6dcf4f6
0x7ffff6dcf4f6 <__GI___libc_free+6>: 0x0000000000405596

That value points to my_free:

(gdb) x/x 0x405596
0x405596 <my_free(void*)>: 0x55

Now look at the shadow function that my_free calls:

(gdb) x/10i free_f
0x7ffff654b018: push %r13
0x7ffff654b01a: push %r12
0x7ffff654b01c: push %rbp
0x7ffff654b01d: push %rbx
0x7ffff654b01e: sub $0x28,%rsp
0x7ffff654b022: mov 0xbc3ecf(%rip),%rax # 0x7ffff710eef8
0x7ffff654b029: jmpq *0x0(%rip) # 0x7ffff654b02f
0x7ffff654b02f: add %esi,%ebp
0x7ffff654b031: fdiv %st,%st(6)
0x7ffff654b033: (bad)

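Indeed, it starts with the same instructions as the original free (the RIP-relative mov has been adjusted so that it still references 0x7ffff710eef8), and then jumps back into the original function at __GI___libc_free+17.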
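The adjusted displacement in the shadow function can be reproduced with a little arithmetic. The sketch below is not PFishHook's code, only the calculation implied by the two dumps; the function name relocateRipDisplacement is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Given a RIP-relative instruction located at oldInsnAddr with 32-bit
// displacement oldDisp, returns the displacement that makes an otherwise
// identical copy at newInsnAddr reference the same absolute address.
// The instruction length cancels out because it is the same in both places.
std::int32_t relocateRipDisplacement(std::int32_t oldDisp,
                                     std::uint64_t oldInsnAddr,
                                     std::uint64_t newInsnAddr) {
    const std::int64_t delta = static_cast<std::int64_t>(oldInsnAddr - newInsnAddr);
    const std::int64_t newDisp = static_cast<std::int64_t>(oldDisp) + delta;
    assert(newDisp >= INT32_MIN && newDisp <= INT32_MAX);  // must still fit in rel32
    return static_cast<std::int32_t>(newDisp);
}
```

Plugging in the values from the dumps (displacement 0x33f9f7 at 0x7ffff6dcf4fa, copied to 0x7ffff654b022) gives 0xbc3ecf, which matches the mov in the shadow function.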

Fixing the sbrk hook

tcmalloc's sbrk code is hard-wired and cannot be changed, so the only option is to copy that code out and make glibc's sbrk point to the copy

graph LR
  new_glibc_sbrk["new_glibc_sbrk"]
  my_sbrk["my_sbrk(copied from tcmalloc's sbrk, with the call into glibc replaced)"]
  old_glibc_sbrk["old_glibc_sbrk(PFishHook shadow function)"]

  new_glibc_sbrk --> my_sbrk
  my_sbrk --> old_glibc_sbrk

The code is as follows

#include <gperftools/malloc_hook.h>
#include <unistd.h>

// Shadow copy of glibc's sbrk, filled in by HookIt (declared here for completeness)
static decltype(&sbrk) sbrk_f = nullptr;

void MallocHook::InvokePreSbrkHook(ptrdiff_t increment) {
    InvokePreSbrkHookSlow(increment);
}
void MallocHook::InvokeSbrkHook(const void* result, ptrdiff_t increment) {
    InvokeSbrkHookSlow(result, increment);
}
// Same as tcmalloc's sbrk, except that it calls the shadow function instead of __sbrk
extern "C" void *my_sbrk(intptr_t increment) __THROW {
    MallocHook::InvokePreSbrkHook(increment);
    void *result = sbrk_f(increment);
    MallocHook::InvokeSbrkHook(result, increment);
    return result;
}
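The shape of this wrapper can be tried out in isolation. The following toy model uses a fake "original" function and a plain event log in place of glibc's sbrk and tcmalloc's hooks, so every name in it (observedSbrk, fakeSbrk, g_original, g_events) is hypothetical. It only shows the pattern of forwarding to a saved original and reporting the call afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using sbrk_like_t = void* (*)(std::intptr_t);

struct SbrkEvent {
    std::intptr_t increment;
    void* result;
};

static std::vector<SbrkEvent> g_events;   // what a profiler would have seen
static sbrk_like_t g_original = nullptr;  // plays the role of sbrk_f

// Plays the role of my_sbrk: forward to the saved original, then report.
void* observedSbrk(std::intptr_t increment) {
    assert(g_original != nullptr);
    void* result = g_original(increment);
    g_events.push_back(SbrkEvent{increment, result});
    return result;
}

// Fake original: a bump allocator over a static buffer, so the sketch can run
// without moving the real program break.
void* fakeSbrk(std::intptr_t increment) {
    static char arena[1024];
    static std::intptr_t used = 0;
    void* previous = arena + used;
    used += increment;
    return previous;
}
```

In the real fix, g_original corresponds to the shadow function pointer returned by PFishHook and the event log corresponds to MallocHook's sbrk hooks.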

Verification

The complete example is here

https://github.com/tedcy/tcmalloc_hook_debug/blob/master/tcmalloc_fix_mmap_hook/main.cpp

Build and run

~ ./main 
Starting tracking the heap
~ pprof --ignore='DoAllocWithArena|SbrkSysAllocator::Alloc|MmapSysAllocator::Alloc' --text --lines ./main allbin.hprof
Using local file ./main.
Using local file allbin.hprof.
Total: 166.8 MB
40.0 49.4% 49.4% 40.0 49.4% test /root/tcmalloc_hook_debug/tcmalloc_fix_mmap_hook/dynamic_lib.cpp:5
40.0 49.4% 98.8% 40.0 49.4% test /root/tcmalloc_hook_debug/tcmalloc_fix_mmap_hook/dynamic_lib.cpp:8
1.0 1.2% 100.0% 1.0 1.2% base::subtle::NoBarrier_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:81
0.0 0.0% 100.0% 1.0 1.2% GetHeapProfile /root/tcmalloc_hook_debug/gperftools/build/../src/heap-profiler.cc:213
0.0 0.0% 100.0% 1.0 1.2% SpinLock::Lock (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:69
0.0 0.0% 100.0% 1.0 1.2% SpinLockHolder::SpinLockHolder (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/spinlock.h:133
0.0 0.0% 100.0% 81.0 100.0% __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
0.0 0.0% 100.0% 81.0 100.0% _start ??:0
0.0 0.0% 100.0% 1.0 1.2% base::subtle::Acquire_CompareAndSwap (inline) /root/tcmalloc_hook_debug/gperftools/build/../src/base/atomicops-internals-x86.h:109
0.0 0.0% 100.0% 80.0 98.8% main /root/tcmalloc_hook_debug/tcmalloc_fix_mmap_hook/main.cpp:128
0.0 0.0% 100.0% 1.0 1.2% main /root/tcmalloc_hook_debug/tcmalloc_fix_mmap_hook/main.cpp:133

As the output shows, the mmap usage in dynamic_lib.cpp is now counted as well.

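Summary

This post covered a problem where tcmalloc's hooks stop working once dlopen is called with the RTLD_DEEPBIND flag, which leads to coredumps and incomplete heap statistics.

The fix is to install the hooks on tcmalloc's behalf:

  • PFishHook solves the sbrk problem

  • simple_hook handles hooking the other symbols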

Appendix

GetHeapProfile and GetHeapSample

There are two ways to dump a profile that pprof can read:

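HeapProfilerStart("");

for (int i = 0; i < 10; i++) {
    testFunc();
}
// GetHeapProfile() returns a malloc()-allocated C string; the caller must free() it
char* profile = GetHeapProfile();
string s(profile);
free(profile);
HeapProfilerStop();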

Or:

for (int i = 0; i < 10; i++) {
    testFunc();
}
string s;
MallocExtension::instance()->GetHeapSample(&s);

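What is the difference between the two?

Quick source walkthrough

A quick pass over the source, based on gperftools-2.7, in https://github.com/gperftools/gperftools/blob/gperftools-2.7/src/tcmalloc.cc

The call path from the entry function tc_malloc looks roughly like this: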

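  • tc_malloc->malloc_fast_path

    • If new hooks are registered (base::internal::new_hooks_ is non-empty) -> take the slow path

    • If ThreadCache::GetFastPathCache() fails (the thread cache is not ready yet) -> take the slow path

    • If the sizemap has no size class for size (a "large object", above the small-object limit) -> take the slow path

    • If cache->TryRecordAllocationFast(allocated_size) returns false -> take the slow path

      TryRecordAllocationFast does a minimal counter/budget update internally (it returns false when a sample is due or when objects must be fetched from the central cache)

    • If all of the checks above pass, cache->Allocate(allocated_size, cl, OOMHandler) takes an object straight from the thread cache and the allocation is done, without touching the sampling logic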

  • Slow path:

    dispatch_allocate_full(size)->allocate_full_malloc_oom(size)->do_allocate_full(size)

    • do_malloc(size)

      Branches on bool res = Static::sizemap()->GetSizeClass(size, &cl)

      which looks up the size class for size in the existing sizemap

      • res = false: large-object path

        do_malloc_pages(cache, size)

        • If heap->SampleAllocation(size) returns true

          Sample the allocation with DoSampledAllocation(size)

          • GetStackTrace() captures the stack
          • The stack and size are stored in an object from Static::stacktrace_allocator()->New()
        • Allocate pages: Static::pageheap()->New(num_pages)

      • res = true: small-object path

        • If cache->SampleAllocation(allocated_size) returns true

          Sample the allocation with DoSampledAllocation(size)

          • GetStackTrace() captures the stack
          • The stack and size are stored in an object from Static::stacktrace_allocator()->New()
        • Allocate from the thread-local cache: cache->Allocate(allocated_size, cl, nop_oom_handler)

    • On failure

      • Call OOMHandler(size)
    • On success

      • Run the hooks in base::internal::new_hooks_
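
The order of the fast-path checks listed above can be sketched as a small decision function. This is only a toy model: AllocState, ChoosePath and the field names are hypothetical, the 256KB small-object limit is an assumed value for illustration, and the budget check stands in for TryRecordAllocationFast without reproducing it.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the order of checks in malloc_fast_path (hypothetical names).
// It models only the decision, not any real allocation.
enum class Path { kFast, kSlow };

struct AllocState {
    bool new_hooks_registered;   // true once something like HeapProfilerStart added a hook
    bool thread_cache_ready;     // stands in for GetFastPathCache() != NULL
    std::size_t max_small_size;  // assumed small-object limit for this sketch
    std::size_t sample_budget;   // bytes left before the next sample is due
};

inline Path ChoosePath(const AllocState& st, std::size_t size) {
    assert(st.max_small_size > 0);
    if (st.new_hooks_registered) return Path::kSlow;   // hooks have to run
    if (!st.thread_cache_ready) return Path::kSlow;    // no thread cache yet
    if (size > st.max_small_size) return Path::kSlow;  // no size class: large object
    if (st.sample_budget < size) return Path::kSlow;   // a sample may be due
    return Path::kFast;                                // plain thread-cache allocation
}
```

In this model a 64-byte request stays on the fast path only while no hook is registered. Once new_hooks_registered is true, every request goes to the slow path regardless of size.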

GetHeapSample

Whether sampling is triggered is decided as follows:

inline bool ThreadCache::SampleAllocation(size_t k) {
    return !sampler_.RecordAllocation(k);
}
inline bool Sampler::RecordAllocation(size_t k) {
    if (static_cast<size_t>(bytes_until_sample_) < k) {
        bool result = RecordAllocationSlow(k);
        return result;
    } else {
        bytes_until_sample_ -= k;
        return true;
    }
}

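As the code shows, the sampler keeps a bytes_until_sample_ counter. Since sampler_ is a member of ThreadCache, the counter is per thread rather than truly global. The sampling period is controlled by the environment variable TCMALLOC_SAMPLE_PARAMETER, and the recommended value is 512KB.

In other words, a sample is taken roughly once for every 512KB allocated. For example:

  • Every 1MB allocation is sampled (it is larger than 512KB)
  • If function A allocates 1KB 511 times and function B then allocates 1KB, B's allocation is the one that gets sampled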
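
The byte-budget rule can be tried out with a deterministic toy sampler. ToySampler and ShouldSample are hypothetical names and this is not the gperftools Sampler: it assumes a fixed period, whereas the real implementation randomizes the next sampling point around that period.

```cpp
#include <cassert>
#include <cstddef>

// Toy, deterministic version of the byte-budget sampler described above.
// Assumption: a fixed period; the real Sampler randomizes the next sampling point.
class ToySampler {
public:
    explicit ToySampler(std::size_t period)
        : period_(period), bytes_until_sample_(period) {
        assert(period > 0);
    }

    // Returns true when this allocation of k bytes gets sampled.
    bool ShouldSample(std::size_t k) {
        if (bytes_until_sample_ < k) {
            bytes_until_sample_ = period_;  // start a new sampling window
            return true;
        }
        bytes_until_sample_ -= k;  // budget not exhausted yet, just charge it
        return false;
    }

private:
    std::size_t period_;
    std::size_t bytes_until_sample_;
};
```

With a 512KB period, a single 1MB request is sampled immediately. With 1KB requests, the budget first has to be used up by one caller, and the next request is the one that is sampled, whichever function makes it.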

GetHeapProfile

Before using GetHeapProfile, HeapProfilerStart must be called:

extern "C" void HeapProfilerStart(const char* prefix) {
    ...
    MallocHook::AddNewHook(&NewHook);
    MallocHook::AddDeleteHook(&DeleteHook);
}

At this point base::internal::new_hooks_ is populated with the hook functions,

so every allocation and free is forced onto the slow path.
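
To make the effect concrete, here is a minimal sketch of a hook list that diverts allocations once a hook is registered. All names (ToyMalloc, AddToyNewHook, RecordHook and the g_ counters) are hypothetical and this is not the MallocHook implementation. It only illustrates that a non-empty hook list adds work to every single allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy hook list (hypothetical; not the gperftools MallocHook implementation).
using NewHookFn = void (*)(const void* ptr, std::size_t size);

static std::vector<NewHookFn> g_new_hooks;  // stand-in for new_hooks_
static std::size_t g_recorded_bytes = 0;    // what a profiler-style hook might track
static std::size_t g_slow_path_calls = 0;   // how often the hooked path was taken

// Example hook: accumulates the size of every allocation it is told about.
void RecordHook(const void*, std::size_t size) { g_recorded_bytes += size; }

void AddToyNewHook(NewHookFn fn) {
    assert(fn != nullptr);
    g_new_hooks.push_back(fn);
}

void* ToyMalloc(std::size_t size) {
    if (g_new_hooks.empty()) {
        return std::malloc(size);  // "fast path": no bookkeeping at all
    }
    ++g_slow_path_calls;           // "slow path": allocate, then run each hook
    void* p = std::malloc(size);
    if (p != nullptr) {
        for (NewHookFn hook : g_new_hooks) hook(p, size);
    }
    return p;
}
```

Before AddToyNewHook(&RecordHook) is called, ToyMalloc does no bookkeeping. Afterwards every call pays for the hook invocation, which is the same shape of cost the heap profiler introduces, although the real hook also captures a stack trace.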

Performance test

#include <iostream>
#include <new>
#include <vector>
#include <chrono>
#include <gperftools/heap-profiler.h>

// Test parameters
constexpr int kAllocCount = 5000000; // number of allocations
constexpr int kSize = 64;            // size of each allocation in bytes

// One round of allocate + free
double test_alloc_free() {
    std::vector<void*> ptrs;
    ptrs.reserve(kAllocCount);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < kAllocCount; ++i) {
        ptrs.push_back(::operator new(kSize));
    }
    for (int i = 0; i < kAllocCount; ++i) {
        ::operator delete(ptrs[i]);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    return diff.count();
}

int main() {
    std::cout << "Running allocation benchmark with "
              << kAllocCount << " x " << kSize << " bytes...\n\n";

    // baseline: HeapProfiler not enabled
    double t1 = test_alloc_free();
    std::cout << "[No profiler] Time = " << t1 << " s" << std::endl;

    // start HeapProfiler
    HeapProfilerStart("");

    double t2 = test_alloc_free();
    HeapProfilerStop();

    std::cout << "[With HeapProfiler] Time = " << t2 << " s" << std::endl;
    std::cout << "\nOverhead: " << (t2 / t1 - 1.0) * 100.0 << " % slower\n";

    return 0;
}

Results on a secondhand Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine:

~ HEAP_PROFILE_ALLOCATION_INTERVAL=0 HEAP_PROFILE_INUSE_INTERVAL=0 ./main
Running allocation benchmark with 5000000 x 64 bytes...

[No profiler] Time = 0.748285 s
Starting tracking the heap
[With HeapProfiler] Time = 201.31 s

Overhead: 26802.9 % slower

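That is a gap of 2 to 3 orders of magnitude. After HeapProfilerStart, each allocation (together with its matching free) costs

201.31 / 5000000 * 1000 * 1000 = 40.262 us

which is on the order of 4 random SSD reads/writes, or 400 memory accesses (assuming roughly 10 us per SSD access and 100 ns per memory access)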
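
The arithmetic above can be reproduced with two small helpers. The function names are hypothetical and the inputs are simply the numbers printed by the benchmark run above.

```cpp
#include <cassert>

// Per-iteration cost in microseconds for a run that took `seconds` over `iterations`.
double MicrosPerIteration(double seconds, long iterations) {
    assert(iterations > 0);
    return seconds / static_cast<double>(iterations) * 1e6;
}

// Relative slowdown in percent, computed the same way the benchmark prints it.
double OverheadPercent(double baseline_s, double profiled_s) {
    assert(baseline_s > 0.0);
    return (profiled_s / baseline_s - 1.0) * 100.0;
}
```

MicrosPerIteration(201.31, 5000000) gives about 40.26 us with the profiler on, while MicrosPerIteration(0.748285, 5000000) gives about 0.15 us for the baseline, which is where the roughly 270x slowdown comes from.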

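Conclusion

HeapProfilerStart turns on MallocHook-based heap profiling that records every single allocation. It does not change tcmalloc's own sampled-allocation logic (ThreadCache::SampleAllocation and DoSampledAllocation). Neither replaces the other, and the two can coexist.

The performance cost is substantial, so use it with care.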