之前写过一篇c++ 分析 gperftools 总结
对普通的性能优化来说,gperftools已经足够了
但是如果要深入优化,还是需要借助linux内置的perf工具
这个工具的功能包括但不限于:
协助优化代码中的cpu热点
gpertools最多只能精确到某一行热点,perf是汇编级别的,因此可以协助优化生成的汇编代码
协助优化代码的分支预测命中率
协助优化代码的cpu高速缓存命中率
协助优化代码中对内核部分的使用
协助理解代码运行时在linux的调度情况
安装
1 apt-get install -y linux-tools-$(uname -r) linux-tools-generic
perf本体的安装比较简单,但是要充分利用perf,深入到glibc甚至内核的话,需要安装各种提供符号的调试软件包
ubuntu下debug包安装
根据官方文档
1 2 3 4 5 6 7 ~# echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \ tee -a /etc/apt/sources.list.d/ddebs.list ~# apt install -y ubuntu-dbgsym-keyring ~# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F2EDC64DC5AEE1F6B9C621F0C8CAB6595FDFF622 ~# apt-get update
libstd++和libc++的debug包安装
根据how-to-install-libstdc6-debug-symbols-on-ubuntu-20-04
先查看自己的libstdc++版本
1 2 ~# dpkg --list | grep libstdc++ libstdc++6:amd64 10.3.0-1ubuntu1~20.04 amd64 GNU Standard C++ Library v3
是libstdc++6
,所以
1 apt-get install -y libstdc++6-dbgsym
对libc++同理
1 2 ~# dpkg --list | grep libc++ libc++1-10:amd64 1:10.0.0-4ubuntu1 amd64 LLVM C++ Standard library
是libc++1-10
,所以
1 apt-get install -y libc++1-10-dbgsym
现在gdb就可以深入到string内部了
内核debug包和源码包安装
内核debug包安装很简单,利用uname确定linux的版本即可
1 2 3 ~# uname -r 5.4.0-48-generic ~# apt-get install -y linux-image-$(uname -r)-dbgsym
但是源码包需要debug包的Version信息才能安装,因此需要自己选定版本号,单行命令安装如下
1 apt-get install -y linux-source-$(uname -r | cut -f1 -d-)=$(apt-cache show linux-image-$(uname -r)-dbgsym|grep Version|cut -f2 -d' ')
命令拆分如下
1 2 3 4 5 6 7 8 ~# apt-cache show linux-image-$(uname -r)-dbgsym ... Version: 5.4.0-48.52 ... ~# apt list --all-versions linux-source-5.4.0|grep 5.4.0-48.52 linux-source-5.4.0/focal-updates,focal-security,now 5.4.0-48.52 all ~# apt install -y linux-source-5.4.0=5.4.0-48.52 ~# cd /usr/src/linux-source-5.4.0 && tar xvf linux-source-5.4.0.tar.bz2
使用
perf的用途非常广,使用list子命令可以展示所有支持的event
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ~# perf list List of pre-defined events (to be used in -e): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware cache event] L1-dcache-stores [Hardware cache event] L1-icache-load-misses [Hardware cache event] LLC-load-misses [Hardware cache event] LLC-loads [Hardware cache event] LLC-store-misses [Hardware cache event] LLC-stores [Hardware cache event] branch-load-misses [Hardware cache event] branch-loads [Hardware cache event] dTLB-load-misses [Hardware cache event] dTLB-loads [Hardware cache event] dTLB-store-misses [Hardware cache event]
这里只展示了一部分,下图展示了perf的event对应linux的各部分
img
命令行使用
简单介绍一下,例如对ls指令进行perf,详细用法可以看brendangregg
stat子命令可以查看一个粗略情况
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ~# perf stat ls Performance counter stats for 'ls': 0.80 msec task-clock # 0.215 CPUs utilized 2 context-switches # 0.003 M/sec 0 cpu-migrations # 0.000 K/sec 91 page-faults # 0.114 M/sec 2056567 cycles # 2.574 GHz 1552259 instructions # 0.75 insn per cycle 315784 branches # 395.212 M/sec 14554 branch-misses # 4.61% of all branches 0.003712417 seconds time elapsed 0.001071000 seconds user 0.000000000 seconds sys
可以看到上下文切换,运行了多少指令,分支预测的命中率等等
record子命令可以录入更详细的一些情况
1 2 3 ~# perf record -g -F 99 ls [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.012 MB perf.data (5 samples) ]
这里-g
是为了生成堆栈信息,-F
是采样频率,采样频率越高对程序本身的分析影响越大
此时生成了一个perf.data文件,可以通过report,annotate,script命令查看,这里以perf report举例
1 2 3 4 5 6 7 8 9 10 ~# perf report --stdio # Samples: 5 of event 'cycles' # Event count (approx.): 20796061 # # Overhead Command Shared Object Symbol # ........ ....... ................ ............................ # 99.78% ls [aufs] [k] 0x0000000000014975 0.22% perf [kernel.vmlinux] [k] __x86_indirect_thunk_rax 0.00% perf [kernel.vmlinux] [k] native_write_msr
封装支持单个函数perf
perf本身作为一个命令行工具,是只能针对整个程序来分析的
这样一些准备工作也会被误统计在内,并且也不利于多个函数对比性能差异
因此我对perf进行了简单的封装来便于使用
思路来自perf-stat-for-part-of-program
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 namespace profile{#include <unistd.h> #include <stdio.h> #include <signal.h> inline void perfFork (const std::function<void ()> &cb, const std::function<string(int )> &pid2Str) { static atomic<int > l = {0 }; assert (l++ == 0 ); int pid = getpid (); int cpid = fork(); string fileName; if (cpid == 0 ) { string command = pid2Str (pid); execl ("/bin/sh" , "sh" , "-c" , command.c_str (), nullptr ); }else { setpgid (cpid, 0 ); sleep (1 ); cb (); kill (-cpid, SIGINT); sleep (1 ); } l--; } void perfStat (const string &filePrefix, const std::function<void ()>& cb) { cout << "perf stat " << filePrefix << " start" << endl; string filePath = "/tmp/" + filePrefix + "_stat.log" ; perfFork (cb, [&filePath](int pid) { stringstream ss; ss << "echo 0 > /proc/sys/kernel/nmi_watchdog && " "perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles," "instructions,branches,branch-misses,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L ss << " -p " << pid << " > " << filePath << " 2 >&1 "; return ss.str(); }); string result; fstream f(filePath); if (!f.is_open()) { cout << LOGV(filePath) << " open failed" << endl; return; } copy(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>(), std::ostreambuf_iterator<char>(cout)); perfFork(cb, [&filePath](int pid) { stringstream ss; ss << " echo 1 > /proc/sys/kernel/nmi_watchdog"; return ss.str(); }); cout << " perf stat " << filePrefix << " end" << endl; } void perfRecord(const string &filePrefix, const std::function<void()>& cb, const string& event) { cout << " perf record " << filePrefix << LOGV(event) << " start" << endl; string filePath = " /tmp/" + filePrefix + " _record.perf.data"; perfFork(cb, [&filePath, &event](int pid) { stringstream ss; if (event.empty()) { ss << " perf record -F 99 -g -e task-clock,context-switches,cpu-migrations,page-faults,cycles," " instructions,branches,branch-misses,cache-references,cache-misses"; }else { ss << " perf record -F 99 -g -e " << event; } ss << " -p " << pid << " -o " << filePath << " > /dev/null 2 >&1 "; return ss.str(); }); cout << " perf stat " << filePrefix << LOGV(event) << " end" << endl; cout << " Run any of this following" << endl; cout << " perf report -i " << filePath << endl; cout << " perf annotate -i " << filePath << endl; cout << " If need flamegraph, specify event when call perfRecord" << endl; cout << " perf script -i " << filePath << " |/root/FlameGraph/stackcollapse-perf.pl|/root/FlameGraph/flamegraph.pl > " << filePath << " .svg" << endl; } }//namespace profile end
下面就以cpu高速缓存命中率的优化来看perf的用法
例子:优化cpu高速缓存命中率
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 #include <cstring> const int64_t size = 10000 ;int64_t vs[size][size];int main () { for (auto i = 0 ; i < size;i++) { for (auto j = 0 ;j < size;j++) { vs[i][j] = 0 ; } } auto innerFirstF = []() { int64_t sum = 0 ; for (auto i = 0 ;i < size;i++) { for (auto j = 0 ;j < size;j++) { sum += vs[i][j]; } } }; auto outerFirstF = []() { int64_t sum = 0 ; for (auto j = 0 ;j < size;j++) { for (auto i = 0 ;i < size;i++) { sum += vs[i][j]; } } }; profile::perfStat ("outerFirst_test" , outerFirstF); profile::perfStat ("innerFirst_test" , innerFirstF); }
innerFirst对二维数组按行访问,outerFirstF对二维数组按列访问
这两个lambda哪个执行的更快呢?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 perf stat outerFirst_test start Performance counter stats for process id '55547': 1907.26 msec task-clock # 0.668 CPUs utilized 4 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 1 page-faults # 0.001 K/sec 4935944518 cycles # 2.588 GHz (39.60%) 1484281008 instructions # 0.30 insn per cycle (49.67%) 216467939 branches # 113.497 M/sec (59.73%) 25665 branch-misses # 0.01% of all branches (59.78%) 172501006 cache-references # 90.445 M/sec (59.99%) 13511580 cache-misses # 7.833 % of all cache refs (60.20%) 614233484 L1-dcache-loads # 322.051 M/sec (39.43%) 202013498 L1-dcache-load-misses # 32.89% of all L1-dcache hits (20.09%) 172090035 LLC-loads # 90.229 M/sec (19.88%) 13728104 LLC-load-misses # 7.98% of all LL-cache hits (29.74%) 2.853302550 seconds time elapsed perf stat outerFirst_test end perf stat innerFirst_test start Performance counter stats for process id '55547': 306.90 msec task-clock # 0.246 CPUs utilized 3 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 1 page-faults # 0.003 K/sec 794022115 cycles # 2.587 GHz (37.57%) 1403810285 instructions # 1.77 insn per cycle (47.99%) 202822088 branches # 660.873 M/sec (58.41%) 13848 branch-misses # 0.01% of all branches (59.72%) 836003 cache-references # 2.724 M/sec (61.02%) 462287 cache-misses # 55.297 % of all cache refs (62.32%) 612783117 L1-dcache-loads # 1996.686 M/sec (36.37%) 12809356 L1-dcache-load-misses # 2.09% of all L1-dcache hits (19.43%) 524248 LLC-loads # 1.708 M/sec (18.25%) 883747 LLC-load-misses # 168.57% of all LL-cache hits (27.37%) 1.245627548 seconds time elapsed perf stat innerFirst_test end
可以看到按列访问执行时间是按行访问的两倍
按列访问的L1缓存的miss率是32.89%,而按行访问的L1缓存的miss率只有2.09%
这是因为高速缓存一次性会缓存多个数据,按行访问的局部性更好
但是按行访问的cache-misses达到了55.297 %,远高于按列访问的7.833 %
根据查询到的资料Linux Perf 简单试用
如果存取在L1-dcache-stores命中,则不会记入cache-references,因此当程序数据局部性提升后,在L1层就已经能取到了,也就导致cache-references急剧减少,相对也就导致cache-misses/cache-references
比例反而上升了
how-does-linux-perf-calculate-the-cache-references-and-cache-misses-events
If a cacheable memory access missed in the L1 and the L2, then it will be counted by cache-references
. Otherwise, if it hits in the L1, then, no, it will not be counted by cache-references
可以看到,我的测试结果cache-references和cache-misses基本和LLC-loads还有LLC-load-misses能对应上
LLC全称Last Level Cache,也就是3级缓存,cache-references主要统计的是这个,所以是不全的
因此对代码的存储结构局部性优化的时候,不仅要关注cache-references和cache-misses,也需要关注L1-dcache-loads和L1-dcache-load-misses指标
不过有个很奇怪的地方是perf看不到Middle Level Cache也就是2级缓存,通过lscpu可以看到我电脑上是有的
1 2 3 4 5 ~# lscpu L1d cache: 512 KiB L1i cache: 512 KiB L2 cache: 4 MiB L3 cache: 40 MiB
例子:内核执行时变量分析
对内核的tcp_sendmsg的参数和返回值进行分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ~# perf probe -V tcp_sendmsg //列出可用变量 @<tcp_sendmsg+0> int ret size_t size struct msghdr* msg struct sock* sk ~# perf probe --add 'tcp_sendmsg size' //对size进行分析 Added new event: probe:tcp_sendmsg (on tcp_sendmsg with size) ~# perf record -e probe:tcp_sendmsg -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.068 MB perf.data (102 samples) ] ~# perf script sshd 51381 [031] 2882680.117201: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x34 sshd 51381 [031] 2882680.216572: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x34 ping 0 [024] 2882680.274319: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x18e sshd 51381 [031] 2882680.315973: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x34 sshd 51381 [031] 2882680.415371: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x34 sshd 51381 [031] 2882680.514757: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x34 ping 0 [008] 2882680.537833: probe:tcp_sendmsg: (ffffffffaf7c6cd0) size=0x2eb
甚至可以根据源码的行数,针对某个变量来分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ~# perf probe -L tcp_sendmsg --source /usr/src/linux-source-5.4.0/linux-source-5.4.0 <tcp_sendmsg@/usr/src/linux-source-5.4.0/linux-source-5.4.0//net/ipv4/tcp.c:0> 0 int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) 1 { 2 int ret; 4 lock_sock(sk); 5 ret = tcp_sendmsg_locked(sk, msg, size); 6 release_sock(sk); 8 return ret; } EXPORT_SYMBOL(tcp_sendmsg); ~# perf probe -V tcp_sendmsg:5 Available variables at tcp_sendmsg:5 @<tcp_sendmsg+31> int ret size_t size struct msghdr* msg struct sock* sk ~# perf probe --add 'tcp_sendmsg:5 ret' Added new event: probe:tcp_sendmsg (on tcp_sendmsg:5 with ret) ~# perf record -e probe:tcp_sendmsg -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.065 MB perf.data (88 samples) ] ~# perf script sshd 51381 [015] 2883164.805408: probe:tcp_sendmsg: (ffffffffaf7c6cef) ret=-905020008 sshd 51381 [015] 2883164.904783: probe:tcp_sendmsg: (ffffffffaf7c6cef) ret=-905020008 python3 55841 [028] 2883164.959670: probe:tcp_sendmsg: (ffffffffaf7c6cef) ret=-913736176 python3 55841 [028] 2883164.959734: probe:tcp_sendmsg: (ffffffffaf7c6cef) ret=-913736176
例子:调度器分析
检查进程在调度器的切换次数,调度时延等等
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~# perf sched record -- sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 2.208 MB perf.data (19037 samples) ] ~# perf sched latency ----------------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ----------------------------------------------------------------------------------------------------------------- :33539:33539 | 0.101 ms | 10 | avg: 0.010 ms | max: 0.031 ms | max at: 2883308.367008 s :140:140 | 0.012 ms | 1 | avg: 0.010 ms | max: 0.010 ms | max at: 2883307.615766 s :30434:30434 | 0.016 ms | 1 | avg: 0.010 ms | max: 0.010 ms | max at: 2883308.366990 s :33562:33562 | 0.105 ms | 10 | avg: 0.008 ms | max: 0.014 ms | max at: 2883307.666567 s :33556:33556 | 0.129 ms | 10 | avg: 0.008 ms | max: 0.025 ms | max at: 2883307.866712 s
总结
本篇简单介绍了perf的使用,有一些高级用法仅做简单了解,以防不时之需
参考资料
关于缓存原理
在优化缓存利用率的时候对缓存原理做了一些简单了解,参考资料记录在下面