RTO

The effect of RTO on TCP timeouts

Original content. Please credit the source when reposting.

Posted by Weakyon Blog on July 30, 2015

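1 The RFC standards

2 Packet-capture analysis of the actual Linux implementation

3 How Linux decides that a connection has timed out

4 How Linux actually computes the RTO

5 Ways to adjust the RTO manually

6 Summary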

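Production servers regularly report timeouts. Leaving aside bugs in your own code, the network can cause timeouts in many different ways:

Hardware: micro-bursts on switches at any layer, chip forwarding, congestion on links or NICs, and so on.

Software: under high load, virtualization such as KVM cannot keep up (load-testing two KVM guests on the same host produces a large number of timeouts).

Network-induced timeouts can be observed crudely with ping -f, but subtler ones need finer-grained tools.

Most of the network problems I have run into can be verified with the following code:

tcp connect test

The code simply has a client open a connection to a server and reports how long the connection took to establish.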
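The linked test code is not reproduced on this page. As a stand-in, here is a minimal Python sketch of the same idea, not the author's original program. The function name `time_connect` and its default timeout are assumptions made for illustration.

```python
import socket
import time


def time_connect(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the number of seconds taken to establish one TCP connection."""
    start = time.monotonic()
    # create_connection() performs the three-way handshake and raises
    # OSError (socket.timeout is a subclass) if the connection fails.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start
```

Calling `time_connect("192.0.2.10", 80)` in a loop (the address is a placeholder) and logging every result well above the usual RTT is enough to spot handshakes that needed a SYN retransmission.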

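When I ran this test, most of the abnormal connections took about 1 s to establish.

A packet capture shows that the packets on those connections were retransmitted, and that the retransmission delay was a fixed 1 second.

Can that delay be shortened, and would shortening it improve the perceived network quality?

Conversely, when clients sit on a poor mobile network, can the delay be lengthened to reduce the retransmission load on the server?

These are the questions this article discusses (the code analysis is based on the CentOS kernel 2.6.32-358).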


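The RFC standards

With this question in mind I first went back to TCP/IP Illustrated, Volume 1. Section 18.3, "Timeout of Connection Establishment" (page 178), says that the first retransmission happens after 6 seconds and the second after 24 seconds, which clearly does not match what I observed.

Chapter 21 of the book, "TCP Timeout and Retransmission", introduces the concepts of RTO and RTT.

RTT = Round Trip Time

The round-trip time of a connection. It is made up of three parts: the propagation time on the links, the processing time in the end systems, and the queuing and processing time in router buffers. It reflects the current state of the network.

RTO = Retransmission TimeOut

The retransmission timeout of a connection. It is adjusted continuously from the RTT so that it is neither so short that too many packets are sent nor so long that the application layer responds sluggishly.

After skimming the chapter I suspected that its calculations were out of date (which later turned out to be true).

So there was nothing for it but to read the kernel implementation directly.

The comments in the kernel source refer to the RFC documents: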

  
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC2988bis initial RTO value */
#define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
                         * used as a fallback RTO for the
                         * initial data transmission if no
                         * valid RTT sample has been acquired,
                         * most likely due to retrans in 3WHS.
                         */

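Looking through the RFCs, the most recent one on the retransmission timeout is RFC6298, which updates RFC1122 and obsoletes RFC2988. The RFC2988bis mentioned in the code comment is essentially the same as RFC6298: it is the revised version of RFC2988.

Here is a brief summary of its contents; follow the link if you want the details.

RFC6298

1 It restates the basic RTO computation:

G is the clock granularity, R is the first RTT measurement and R' is each subsequent measurement.
Initial value, before any measurement:
RTO = 1 second

First measurement:
SRTT = R
RTTVAR = R/2
RTO = SRTT + max(G,4*RTTVAR)

Subsequent measurements:
RTTVAR = 0.75*RTTVAR + 0.25*|SRTT - R'|
SRTT = 0.875*SRTT + 0.125*R'
RTO = SRTT + max(G,4*RTTVAR)

A computed RTO below 1 second should be rounded up to 1 second. An upper bound may be applied, but it must be at least 60 seconds.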
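The computation above can be sketched in a few lines of Python. This is an illustrative floating-point model of the formulas, not kernel code; the class name, the 1 ms granularity and the 60-second cap are assumptions chosen for the example.

```python
class Rfc6298Estimator:
    """Floating-point model of the RTO computation summarized above."""

    def __init__(self, granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.g = granularity      # clock granularity G, in seconds (assumed)
        self.min_rto = min_rto    # recommended lower bound
        self.max_rto = max_rto    # optional upper bound, at least 60 s
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0            # initial RTO before any measurement

    def on_measurement(self, r):
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        rto = self.srtt + max(self.g, 4 * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto
```

With the recommended 1-second floor, a 100 ms measurement still yields an RTO of 1 second; passing `min_rto=0.0` exposes the raw value SRTT + 4*RTTVAR instead.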

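2 RTT samples must be taken using Karn's algorithm: a retransmitted segment must not be used as an RTT sample, unless the timestamps option is enabled (with that option the RTT can be computed unambiguously). When the same segment is retransmitted repeatedly, the RTO is doubled on each timeout (the exponential backoff described under item 4).

3 When 4*RTTVAR tends toward 0, the variance term must be rounded up to the clock granularity G:

RTO = SRTT + max(G,4*RTTVAR)

Experience shows that finer clock granularities work better, ideally 100 ms or less.

4 Managing the RTO timer

(1) Whenever data is sent (including retransmissions), check whether the timer is running and start it if it is not. When all outstanding data has been acknowledged, turn the timer off.

(2) On each timeout, back off with RTO = RTO * 2.

(3) The new FALLBACK behaviour: if the timer expires while waiting for the ACK of a SYN and the TCP implementation uses an RTO of less than 3 seconds, the RTO of that connection must be re-initialized to 3 seconds. The reset RTO is used for real data transmission (that is, after the three-way handshake completes).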
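Rules (2) and (3) are small enough to write down directly. The following is a hedged Python sketch; the function names and the 60-second cap are assumptions for illustration, not part of the RFC text.

```python
FALLBACK_RTO = 3.0   # seconds, the pre-RFC6298 initial value
MAX_RTO = 60.0       # assumed cap; the RFC only requires at least 60 s


def back_off(rto, max_rto=MAX_RTO):
    """Rule (2): double the RTO after a timeout, up to the cap."""
    return min(rto * 2, max_rto)


def rto_for_first_data(initial_rto, syn_timer_expired):
    """Rule (3): choose the RTO used when data transmission begins."""
    if syn_timer_expired and initial_rto < FALLBACK_RTO:
        return FALLBACK_RTO
    return initial_rto
```

A connection whose SYN was retransmitted therefore starts its data phase at 3 seconds, while a clean handshake keeps the 1-second initial value until the first RTT sample arrives.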

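The appendix of RFC6298 explains why the initial value was lowered from 3 seconds to 1 second:

1 Networks were much slower when the 3-second initial value was chosen (it was defined in RFC1122, published in 1989).

2 Measurements show that 97.5% of connections have an RTT below 1 second, so the 1-second value comes from practice.

3 Studies also show that the retransmission rate during the three-way handshake is only about 2%, so lowering the initial RTO is beneficial.

That still leaves 2.5% of connections whose RTT exceeds 1 second. With an initial RTO of 1 second, those connections will see one retransmission.

This is where the new FALLBACK behaviour comes in: it resets the initial RTO of the data-transfer phase to 3 seconds, which limits the impact of that retransmission:

Only one extra SYN is retransmitted into the network and, per RFC5681, once a retransmission occurs during the three-way handshake the initial congestion window is limited to 1 segment. This keeps the established connection from ending up in an abnormal state.

In addition, if the timestamps option is in use, a correct RTT can still be obtained after a spurious retransmission (a retransmission of a segment that was only delayed, not lost), so the RTT estimate converges to the right value.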


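Packet-capture analysis of the actual Linux implementation

The actual Linux implementation roughly follows the RFC.

The one thing to note is that it sets the minimum RTO to 200 ms (the RFC recommends 1 second) and the maximum to 120 seconds (the RFC requires at least 60 seconds).

Such a small RTO is an aggressive choice. I did not find a stated rationale for this value; perhaps Linux is mostly used on servers, where the network is assumed to be better.

I verified this with a simple iptables setup (every chain in the filter table set to DROP):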

*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]

Sending the SYN of the three-way handshake

01:00:00.129688 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:00:01.129065 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:00:03.129063 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:00:07.129074 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:00:15.129072 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:00:31.129128 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0

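The interval starts at 1 second and doubles each time.

Note that after the fifth retransmission the kernel waits out a sixth timeout before reporting the connection timeout to the upper layer, so the total is 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds.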
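The schedule seen in the capture can be reproduced with a short calculation. This is an illustrative Python sketch; `retransmit_offsets` is a hypothetical helper, not a kernel function.

```python
def retransmit_offsets(initial_rto, retries):
    """Return (offsets, give_up).

    offsets holds the time of each retransmission relative to the first
    transmission; give_up is when the sender finally reports a timeout.
    """
    offsets = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto          # wait one RTO, then retransmit
        offsets.append(elapsed)
        rto *= 2                # exponential backoff
    # One last full timeout is waited out after the final retransmission.
    return offsets, elapsed + rto
```

For `retransmit_offsets(1.0, 5)` the offsets are 1, 3, 7, 15 and 31 seconds, which line up with the SYN timestamps in the capture (01:00:01, :03, :07, :15, :31), and the give-up time is 63 seconds.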

Sending the SYN-ACK of the three-way handshake

Use iptables to filter out the ACK packets:
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
-A OUTPUT -p tcp --tcp-flags SYN,ACK SYN,ACK -j ACCEPT
-A INPUT -p tcp --tcp-flags SYN SYN -j ACCEPT

01:17:20.084839 IP 172.16.3.15.2535 > 172.16.3.14.80: Flags [S], seq 1297135388, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:20.084908 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:21.284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:23.284088 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:27.284095 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:35.284097 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:17:51.284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0

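Again the interval starts at about 1 second (1.2 seconds for the first retransmission in this capture) and doubles each time.

Sending an ordinary data packet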

01:32:20.443757 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:20.644600 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:21.046579 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:21.850632 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:23.458555 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:26.674594 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:33.106601 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:45.970567 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:33:11.698415 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:34:03.154300 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:35:46.065892 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:37:46.065382 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:39:46.064917 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:41:46.064466 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:43:46.064060 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:45:46.063675 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11

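The interval starts at 0.2 seconds and doubles up to a maximum of 120 seconds, for 15 retransmissions in total.

Note that the sequence starts at minute 32 and only ends at minute 47 (one more 120-second wait follows the last retransmission), about 15 minutes 25 seconds in all.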
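To check the doubling claim against the capture itself, the gaps between consecutive timestamps can be computed. This is a small Python sketch; the helper name `retransmit_gaps` is an assumption, and the timestamps are copied from the capture above.

```python
from datetime import datetime

# Timestamps of the 16 identical data segments in the capture above.
CAPTURE = [
    "01:32:20.443757", "01:32:20.644600", "01:32:21.046579",
    "01:32:21.850632", "01:32:23.458555", "01:32:26.674594",
    "01:32:33.106601", "01:32:45.970567", "01:33:11.698415",
    "01:34:03.154300", "01:35:46.065892", "01:37:46.065382",
    "01:39:46.064917", "01:41:46.064466", "01:43:46.064060",
    "01:45:46.063675",
]


def retransmit_gaps(stamps):
    """Return the gap in seconds between consecutive tcpdump timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


gaps = retransmit_gaps(CAPTURE)
print([round(g, 1) for g in gaps])
```

The first gaps come out at roughly 0.2, 0.4, 0.8 and 1.6 seconds, and the last five sit at 120 seconds, which is the cap.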

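To see whether Linux supports the FALLBACK behaviour, here is a simple test.

With iptables enabled on the server, the client connects to the server; iptables is then turned off before the 5 retries are used up: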
23:35:01.036565 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:02.036152 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:04.036126 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:08.036127 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:16.036131 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:16.036842 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [S.], seq 3634006739, ack 2364912155, win 14600, options [mss 1460], length 0
23:35:16.036896 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [.], ack 3634006740, win 14600, length 0

Next, iptables is enabled on the server again and the client sends a data packet; iptables is turned off before the 15 retries are used up:
23:35:48.129273 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
23:35:51.129120 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
23:35:57.129070 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
23:36:09.129068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
23:36:09.129802 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], ack 2364912156, win 14600, length 0

Next, with iptables off on the server, the client sends a data packet:
23:36:15.217231 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912156:2364912157, ack 3634006740, win 14600, length 1
23:36:15.217766 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], ack 2364912157, win 14600, length 0

Next, iptables is enabled on the server and the client sends a data packet:
23:36:26.658172 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:26.859055 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:27.261065 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:28.065106 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:29.673132 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:32.889068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:39.321091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:36:52.185135 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
23:37:17.913091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1

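This test shows that when the RTT during the three-way handshake exceeds 1 second (so the SYN is retransmitted), the RTO of the data-transfer phase starts at 3 seconds (the same happens when the server's SYN-ACK times out).

After one normal RTT measurement, the RTO converges back to around 200 ms.

Now let's see how timestamps are handled.

With iptables enabled on the server, the client connects to the server; iptables is then turned off before the 5 retries are used up: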
23:47:47.754316 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336007392 ecr 0,nop,wscale 7], length 0
23:47:48.754079 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336008392 ecr 0,nop,wscale 7], length 0
23:47:50.754088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336010392 ecr 0,nop,wscale 7], length 0
23:47:54.754083 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336014392 ecr 0,nop,wscale 7], length 0
23:48:02.754094 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336022392 ecr 0,nop,wscale 7], length 0
23:48:02.754683 IP 172.16.10.40.12345 > 172.16.3.14.8603: Flags [S.], seq 697602971, ack 479022249, win 14480, options [mss 1460,nop,nop,TS val 4044659641 ecr 2336022392], length 0
23:48:02.754742 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [.], ack 697602972, win 14600, options [nop,nop,TS val 2336022392 ecr 4044659641], length 0

Next, iptables is enabled on the server again and the client sends a data packet; iptables is turned off before the 15 retries are used up:
23:48:11.944170 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031582 ecr 4044659641], length 1
23:48:12.145036 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031783 ecr 4044659641], length 1
23:48:12.547084 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032185 ecr 4044659641], length 1
23:48:13.351106 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032989 ecr 4044659641], length 1
23:48:14.959080 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336034597 ecr 4044659641], length 1
23:48:18.175092 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336037813 ecr 4044659641], length 1
23:48:24.607088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336044245 ecr 4044659641], length 1

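With timestamps enabled, the FALLBACK reset of the RTO to 3 seconds does not take effect.

The principle behind timestamps: when A sends a packet to B it includes a timestamp t1, and when B returns the ACK it echoes t1 back.

A then computes the RTT simply as now - t1.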
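In code, the echo-based measurement is a single subtraction. The sketch below is illustrative Python; the function name is hypothetical, and the 32-bit mask reflects the fact that the timestamp fields are 32-bit counters that may wrap around.

```python
def rtt_from_echo(now_ticks, tsecr):
    """RTT in sender clock ticks, from the echoed timestamp (TSecr).

    Both values are 32-bit counters, so the subtraction is done modulo
    2**32 to stay correct across a wrap-around.
    """
    return (now_ticks - tsecr) & 0xFFFFFFFF
```

Because the echoed value identifies exactly which transmission the ACK answers, the measurement stays valid for a retransmitted segment. In the capture above, the SYN-ACK echoes `ecr 2336022392`, the TS val of the last SYN retransmission rather than that of the first SYN.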

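The benefit is a more accurate RTT, but there are situations where this does not work well.

With both tcp_tw_recycle and tcp_timestamps enabled, Linux requires the timestamps in connect requests from the same source IP to be increasing within 60 s (the TIME_WAIT duration).

A SYN whose timestamp is smaller than that of an earlier packet is dropped (see the kernel function tcp_v4_conn_request).

tw_recycle exists to make TIME_WAIT sockets expire quickly; the expiry time depends on the RTO, and the logic is in tcp_time_wait().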

[root@localhost.localdomain ~]# netstat -s|grep reject

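The netstat command shows how many packets were dropped this way.

Behind NAT (LVS, for example), many clients reach the same service through one NAT gateway. Their timestamps are not synchronized with each other, so they interfere and connections are frequently dropped.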
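The NAT problem is easy to see in a toy model of the per-host check. The Python below is a deliberately simplified sketch, not the kernel's logic: the class, its method and the two constants are assumptions modelled on the 60-second rule described above, and timestamp wrap-around is ignored.

```python
PAWS_MSL = 60      # seconds during which the last timestamp is remembered
PAWS_WINDOW = 1    # tolerated backward step, in timestamp ticks


class PerHostTimestampCheck:
    """Toy model: remember the last SYN timestamp seen per source IP."""

    def __init__(self):
        self._last = {}  # source IP -> (timestamp value, time seen)

    def accept_syn(self, src_ip, tsval, now):
        last = self._last.get(src_ip)
        if last is not None:
            last_ts, seen_at = last
            # A clearly older timestamp within the window looks like a
            # stale duplicate, so the SYN is rejected.
            if now - seen_at < PAWS_MSL and last_ts - tsval > PAWS_WINDOW:
                return False
        self._last[src_ip] = (tsval, now)
        return True
```

Two machines behind one gateway share a source IP but not a clock. If the first sends TS val 5000 and the second sends 1000 a second later, the second SYN is rejected even though it is perfectly valid.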

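In that case one of the two options has to be turned off.

I think setting tw_recycle to 0 and tcp_timestamps to 1 is the better choice.

Reusing TIME_WAIT sockets is unsafe in theory: for two connections on the same port pair, the SYN sequence numbers are now random, so the sequence number cannot be used to decide which packets to discard.

Packets from the old connection on a port pair can therefore affect the new one, producing ambiguous packets.

"Analysis of a failure caused by TCP reuse: a very practical case"

That article describes a problem on the A10 system, which uses incrementing SYN sequence numbers, unlike Linux's random ones. I happened to come across it and found it interesting, so I mention it here.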


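How Linux decides that a connection has timed out

The following function decides, based on the total time spent retransmitting, whether to give up on a connection.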

  
/* This function calculates a "timeout" which is equivalent to the timeout of a
 * TCP connection after "boundary" unsucessful, exponentially backed-off
 * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
 * syn_set flag is set.
 */
static inline bool retransmits_timed_out(struct sock *sk, 
                     unsigned int boundary,
                     unsigned int timeout,
                     bool syn_set)
{
    unsigned int linear_backoff_thresh, start_ts;

    //for a SYN the base timeout is TCP_TIMEOUT_INIT, otherwise TCP_RTO_MIN
    unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;

    //nothing has been retransmitted yet, so there is nothing to give up: return false
    if (!inet_csk(sk)->icsk_retransmits)
        return false;

    //get the time the first packet was sent
    if (unlikely(!tcp_sk(sk)->retrans_stamp))
        start_ts = TCP_SKB_CB(tcp_write_queue_head(sk))->when;
    else 
        start_ts = tcp_sk(sk)->retrans_stamp;

    //compute the total time under exponential (doubling) backoff
    if (likely(timeout == 0)) {
        linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);

        if (boundary <= linear_backoff_thresh)
            timeout = ((2 << boundary) - 1) * rto_base;
        else 
            timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
                  (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
    }    
    return (tcp_time_stamp - start_ts) >= timeout;
}

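The timeout argument comes from the user-configured timeout (the TCP_USER_TIMEOUT option). It is 0 during the three-way handshake, and of course also 0 when the user has not set it.

In that case the function compares boundary (the maximum number of retransmissions allowed) with log2(TCP_RTO_MAX/rto_base). Up to that threshold the total timeout grows exponentially; beyond it, each additional retransmission adds a fixed TCP_RTO_MAX, so the growth is linear.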

  
#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC2988bis initial RTO value */
#define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
                         * used as a fallback RTO for the
                         * initial data transmission if no
                         * valid RTT sample has been acquired,
                         * most likely due to retrans in 3WHS.
                         */

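These macros are all expressed in terms of HZ, the number of clock ticks per second, so HZ ticks equal 1 second. This kernel is built with HZ = 1000, which can be seen in /boot/config-<kernel version>: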

[root@localhost.localdomain ~]# grep CONFIG_HZ /boot/config-2.6.32-358.el6.x86_64 
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
[root@localhost.localdomain ~]# cat /proc/sys/net/ipv4/tcp_syn_retries 
5
[root@localhost.localdomain ~]# cat /proc/sys/net/ipv4/tcp_retries2
15

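During the three-way handshake the backoff is exponential for up to log2(120/1), that is 6, retransmissions. By default Linux retransmits a SYN 5 times (/proc/sys/net/ipv4/tcp_syn_retries), so all of them fall in the exponential range.

The total time is (2^6 - 1) * TCP_TIMEOUT_INIT, which is the 63 seconds observed earlier.

During data transfer the backoff is exponential for up to log2(120/0.2), that is 9, retransmissions. By default Linux retransmits 15 times (/proc/sys/net/ipv4/tcp_retries2), so retransmissions 10 through 15 grow linearly.

The total time is (2^10 - 1) * TCP_RTO_MIN + (15 - 9) * TCP_RTO_MAX = 204.6 + 720 = 924.6 seconds, which is roughly the 15 minutes 25 seconds observed earlier.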
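The two totals can be checked by porting the timeout arithmetic of retransmits_timed_out() to Python. This is a sketch of the calculation only (the timeout == 0 path), using the macro values quoted above with HZ = 1000; the function name `give_up_timeout` is made up for this example.

```python
HZ = 1000                    # ticks per second on this kernel
TCP_RTO_MAX = 120 * HZ
TCP_RTO_MIN = HZ // 5
TCP_TIMEOUT_INIT = 1 * HZ


def give_up_timeout(boundary, syn_set):
    """Total ticks allowed before giving up after `boundary` retries."""
    rto_base = TCP_TIMEOUT_INIT if syn_set else TCP_RTO_MIN
    # Integer log2, equivalent to the kernel's ilog2() for positive values.
    thresh = (TCP_RTO_MAX // rto_base).bit_length() - 1
    if boundary <= thresh:
        return ((2 << boundary) - 1) * rto_base
    return ((2 << thresh) - 1) * rto_base + (boundary - thresh) * TCP_RTO_MAX
```

`give_up_timeout(5, True)` is 63000 ticks (63 seconds) and `give_up_timeout(15, False)` is 924600 ticks (924.6 seconds). Note that `(2 << boundary) - 1` equals 2^(boundary+1) - 1, which is where the 2^6 and 2^10 in the text come from.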


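How Linux actually computes the RTO

Recall the RFC formulas.

Call this formula 1: SRTT' = 7/8 SRTT + 1/8 R'

Call this formula 2: RTTVAR' = 3/4 RTTVAR + 1/4 |SRTT - R'|

In this section we work through the code to check whether it matches these two formulas.

Whenever either side sends a packet and receives an ACK for it, tcp_ack() is entered.

That function calls tcp_clean_rtx_queue() to remove acknowledged segments from the retransmission queue.

If the ACK is for a retransmitted packet, seq_rtt = -1.

Otherwise seq_rtt = now - scb->when, which gives the duration of one RTT.

seq_rtt is then passed to tcp_ack_update_rtt(sk, flag, seq_rtt) to update the RTT.

Here sk is the struct sock object, and flag records information such as whether the packet is a SYN or a data packet.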

  
static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
                      const s32 seq_rtt)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    /* Note that peer MAY send zero echo. In this case it is ignored. (rfc1323) */
    if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
        tcp_ack_saw_tstamp(sk, flag);
    else if (seq_rtt >= 0)
        tcp_ack_no_tstamp(sk, seq_rtt, flag);
}

There is a branch here. If the timestamps option is in use (the ACK carries a timestamp echo), tcp_ack_saw_tstamp() is called.

Otherwise tcp_ack_no_tstamp() is called.

Branch 1

  
/* Read draft-ietf-tcplw-high-performance before mucking
 * with this code. (Supersedes RFC1323)
 */
static void tcp_ack_saw_tstamp(struct sock *sk, int flag)
{
    /* RTTM Rule: A TSecr value received in a segment is used to
     * update the averaged RTT measurement only if the segment
     * acknowledges some new data, i.e., only if it advances the
     * left edge of the send window.
     *
     * See draft-ietf-tcplw-high-performance-00, section 3.3.
     * 1998/04/10 Andrey V. Savochkin <saw@msu.ru>
     *
     * Changed: reset backoff as soon as we see the first valid sample.
     * If we do not, we get strongly overestimated rto. With timestamps
     * samples are accepted even from very old segments: f.e., when rtt=1
     * increases to 8, we retransmit 5 times and after 8 seconds delayed
     * answer arrives rto becomes 120 seconds! If at least one of segments
     * in window is lost... Voila.              --ANK (010210)
     */
    struct tcp_sock *tp = tcp_sk(sk);

    tcp_valid_rtt_meas(sk, tcp_time_stamp - tp->rx_opt.rcv_tsecr);
}

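The comment is clear and even names the reference document. The function computes RTT = tcp_time_stamp - rcv_tsecr from the timestamps option and then calls tcp_valid_rtt_meas().

Here the RTT can be computed even for a retransmitted packet.

The details of the timestamp option are not discussed here.

Branch 2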

  
static void tcp_ack_no_tstamp(struct sock *sk, u32 seq_rtt, int flag)
{
    /* We don't have a timestamp. Can only use
     * packets that are not retransmitted to determine
     * rtt estimates. Also, we must not reset the
     * backoff for rto until we get a non-retransmitted
     * packet. This allows us to deal with a situation
     * where the network delay has increased suddenly.
     * I.e. Karn's algorithm. (SIGCOMM '87, p5.)
     */

    if (flag & FLAG_RETRANS_DATA_ACKED)
        return;

    tcp_valid_rtt_meas(sk, seq_rtt);
}

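For a retransmitted packet seq_rtt is set to -1, and the FLAG_RETRANS_DATA_ACKED check returns early, so it cannot be used to update the RTT.

As you can see, this branch also calls tcp_valid_rtt_meas().

Clearly that is where the RTO is computed.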

  
void tcp_valid_rtt_meas(struct sock *sk, u32 seq_rtt)
{
    tcp_rtt_estimator(sk, seq_rtt);//update the intermediate variables from the RTT
    tcp_set_rto(sk);//compute the RTO from the intermediate variables
    inet_csk(sk)->icsk_backoff = 0;//reset the backoff counter
}

First, how the RTO is derived from the intermediate variables:

  
static inline void tcp_set_rto(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
    tcp_bound_rto(sk);
}

static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
    return (tp->srtt >> 3) + tp->rttvar;
}

static inline void tcp_bound_rto(const struct sock *sk)
{
    if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
        inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
}

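This shows that RTO = 1/8 * srtt + rttvar.

In the RFC, RTO = SRTT + 4 * RTTVAR.

So in the implementation srtt is 8 times the RFC variable SRTT, and rttvar is 4 times the RFC variable RTTVAR.

From here on I distinguish the two by case: lowercase names refer to the Linux implementation, uppercase names to the variables defined in the RFC.

The implementation stores scaled values to avoid floating-point arithmetic.

Next, how the intermediate variables srtt and rttvar are produced: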

  
/* Called to compute a smoothed rtt estimate. The data fed to this
 * routine either comes from timestamps, or from segments that were
 * known _not_ to have been retransmitted [see Karn/Partridge
 * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
 * piece by Van Jacobson.
 * NOTE: the next three routines used to be one big routine.
 * To save cycles in the RFC 1323 implementation it was better to break
 * it up into three procedures. -- erics
 */
static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt; /* RTT */
    //mrtt is the seq_rtt obtained above, i.e. the newly measured RTT

    /*  The following amusing code comes from Jacobson's
     *  article in SIGCOMM '88.  Note that rtt and mdev
     *  are scaled versions of rtt and mean deviation.
     *  This is designed to be as fast as possible
     *  m stands for "measurement".
     *
     *  On a 1990 paper the rto value is changed to:
     *  RTO = rtt + 4 * mdev
     *
     * Funny. This algorithm seems to be very broken.
     * These formulae increase RTO, when it should be decreased, increase
     * too slowly, when it should be increased quickly, decrease too quickly
     * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
     * does not matter how to _calculate_ it. Seems, it was trap
     * that VJ failed to avoid. 8)
     */
    if (m == 0)
        m = 1;
    if (tp->srtt != 0) {
        m -= (tp->srtt >> 3);   /* m is now error in rtt est */
        tp->srtt += m;      /* rtt = 7/8 rtt + 1/8 new */
        if (m < 0) {
            m = -m;     /* m is now abs(error) */
            m -= (tp->mdev >> 2);   /* similar update on mdev */
            /* This is similar to one of Eifel findings.
             * Eifel blocks mdev updates when rtt decreases.
             * This solution is a bit different: we use finer gain
             * for mdev in this case (alpha*beta).
             * Like Eifel it also prevents growth of rto,
             * but also it limits too fast rto decreases,
             * happening in pure Eifel.
             */
            if (m > 0)
                m >>= 3;
        } else {
            m -= (tp->mdev >> 2);   /* similar update on mdev */
        }
        tp->mdev += m;          /* mdev = 3/4 mdev + 1/4 new */
        if (tp->mdev > tp->mdev_max) {
            tp->mdev_max = tp->mdev;
            if (tp->mdev_max > tp->rttvar)
                tp->rttvar = tp->mdev_max;
        }
        if (after(tp->snd_una, tp->rtt_seq)) {
            if (tp->mdev_max < tp->rttvar)
                tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
            tp->rtt_seq = tp->snd_nxt;
            tp->mdev_max = tcp_rto_min(sk);
        }
    } else {
        /* no previous measure. */
        tp->srtt = m << 3;  /* take the measured time to be rtt */
        tp->mdev = m << 1;  /* make sure rto = 3*rtt */
        tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        tp->rtt_seq = tp->snd_nxt;
    }
}

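This function lives in net/ipv4/tcp_input.c and deserves a closer look.

On the first call, the check on line 649 finds that tp->srtt has not been set yet, so execution goes to the else branch on line 680.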

  
649     if (tp->srtt != 0) {
            ...
680     } else {
            ...
682         tp->srtt = m << 3;  /* take the measured time to be rtt */
683         tp->mdev = m << 1;  /* make sure rto = 3*rtt */
            ...

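At this point

srtt = 8m, mdev = 2m (in RFC terms SRTT = m and RTTVAR = m/2, which matches the RFC's first measurement)

When tcp_rtt_estimator is called again: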

  
649     if (tp->srtt != 0) {
650         m -= (tp->srtt >> 3);   /* m is now error in rtt est */
651         tp->srtt += m;      /* rtt = 7/8 rtt + 1/8 new */

At this point

m' = m - SRTT, 8SRTT' = 8SRTT + m - SRTT

SRTT' = 7/8 SRTT + 1/8 m, which matches formula 1.

That proves formula 1.

Moving on:

  
652         if (m < 0) {
653             m = -m;     /* m is now abs(error) */
654             m -= (tp->mdev >> 2);   /* similar update on mdev */
665         } else {
666             m -= (tp->mdev >> 2);   /* similar update on mdev */
667         }
668         tp->mdev += m;          /* mdev = 3/4 mdev + 1/4 new */

Under certain conditions mdev and rttvar are equal (see tweak 2 for when they differ),

so mdev = rttvar = 4RTTVAR

4RTTVAR' = 4RTTVAR + |m - SRTT| - RTTVAR

4RTTVAR' = 3RTTVAR + |m - SRTT|

RTTVAR' = 3/4 RTTVAR + 1/4 |SRTT - m|, which matches formula 2.

That proves formula 2.
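The two proofs can be sanity-checked numerically. The Python sketch below puts the scaled integer update next to the RFC formulas; it covers only the plain case where the sample is not below SRTT, and both function names are assumptions made for illustration.

```python
def kernel_update(srtt8, mdev4, r):
    """One scaled update: srtt8 = 8*SRTT, mdev4 = 4*RTTVAR, sample r >= SRTT."""
    err = r - (srtt8 >> 3)           # error against the current SRTT
    srtt8 += err                     # 8*SRTT' = 8*SRTT + (r - SRTT)
    mdev4 += abs(err) - (mdev4 >> 2) # 4*RTTVAR' = 4*RTTVAR + |err| - RTTVAR
    return srtt8, mdev4


def rfc_update(srtt, rttvar, r):
    """The same step written with the RFC formulas."""
    rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
    srtt = 0.875 * srtt + 0.125 * r
    return srtt, rttvar
```

With SRTT = 100, RTTVAR = 20 and a new sample of 140, the scaled version returns (840, 100), which is 8 * 105 and 4 * 25, and the RFC version returns (105.0, 25.0).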

So the core algorithm is the same, with a few tweaks:

Tweak 1

Consider this piece of code:

  
663             if (m > 0)
664                 m >>= 3;

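This branch is only reached when the new sample m is smaller than SRTT (the error is negative). If in addition |m - SRTT| - RTTVAR > 0, the sample deviates from the SRTT history by more than RTTVAR, so (|m - SRTT| - RTTVAR) is divided by 8.

Then 4RTTVAR' = 4RTTVAR + (|m - SRTT| - RTTVAR)/8

RTTVAR' = 31/32 RTTVAR + 1/32 |m - SRTT|

The gain is multiplied by 1/8, so such a sample has less influence on RTTVAR. This keeps RTTVAR, and therefore the RTO, smooth.

Tweak 2

rttvar is not simply equal to mdev. When mdev decreases, rttvar is lowered gradually so that the RTO does not drop too sharply.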

  
649     if (tp->srtt != 0) {
            ...
680     } else {
            ...
684         tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
            ...

The initial value of mdev_max is the larger of tp->mdev and tcp_rto_min().

Moving on:

  
669         if (tp->mdev > tp->mdev_max) {
670             tp->mdev_max = tp->mdev;
671             if (tp->mdev_max > tp->rttvar)
672                 tp->rttvar = tp->mdev_max;
673         }

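If the new mdev is larger than mdev_max, mdev_max is set to mdev.

Because mdev_max is reset to tcp_rto_min() at the start of every window (see below), it has a lower bound: it is never smaller than the minimum RTO.

rttvar is then set to the larger of its previous value and this lower-bounded mdev_max.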

  
675             if (tp->mdev_max < tp->rttvar)
676                 tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
678             tp->mdev_max = tcp_rto_min(sk);

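rttvar' = rttvar - ((rttvar - mdev_max) >> 2) = 3/4 * rttvar + mdev_max/4

This is a smoothing formula with a gain of 1/4.

If mdev_max is smaller than rttvar, the deviation is shrinking. In that case rttvar is moved only a quarter of the way toward mdev_max, so the RTO does not fall too quickly. In the full function above this step is guarded by after(tp->snd_una, tp->rtt_seq), so it runs about once per round trip.

Why is the increasing case not smoothed as well? A larger RTO is harmless, but a large sudden decrease could cause spurious retransmissions (for this term, see the RFC mentioned above).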
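To experiment with both tweaks, tcp_rtt_estimator() and the RTO calculation can be ported to Python almost line by line. The sketch below is a model under stated assumptions, not the kernel code: values are in milliseconds (HZ = 1000), the class and method names are made up, and the after(snd_una, rtt_seq) test is replaced by a `new_window` flag supplied by the caller.

```python
class RttEstimator:
    """Integer model of tcp_rtt_estimator() plus the RTO calculation."""

    def __init__(self, rto_min=200, rto_max=120_000):
        self.srtt = 0        # 8 * SRTT
        self.mdev = 0        # 4 * RTTVAR before smoothing
        self.mdev_max = 0
        self.rttvar = 0
        self.rto_min = rto_min
        self.rto_max = rto_max

    def sample(self, mrtt, new_window=True):
        m = mrtt or 1                      # a zero measurement counts as 1
        if self.srtt != 0:
            m -= self.srtt >> 3            # error against SRTT
            self.srtt += m
            if m < 0:
                m = -m
                m -= self.mdev >> 2
                if m > 0:
                    m >>= 3                # tweak 1: damp the update
            else:
                m -= self.mdev >> 2
            self.mdev += m
            if self.mdev > self.mdev_max:
                self.mdev_max = self.mdev
                if self.mdev_max > self.rttvar:
                    self.rttvar = self.mdev_max
            if new_window:                 # stands in for after(snd_una, rtt_seq)
                if self.mdev_max < self.rttvar:
                    # tweak 2: lower rttvar only a quarter of the way
                    self.rttvar -= (self.rttvar - self.mdev_max) >> 2
                self.mdev_max = self.rto_min
        else:
            self.srtt = m << 3
            self.mdev = m << 1
            self.mdev_max = self.rttvar = max(self.mdev, self.rto_min)

    def rto(self):
        return min((self.srtt >> 3) + self.rttvar, self.rto_max)
```

Feeding it a single 27 ms sample with `rto_min=20` gives an RTO of 81 ms. Feeding further identical samples lets mdev shrink, and rttvar follows it down a quarter step per window until it settles just above rto_min.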

To close this section, here is the call chain of the functions involved in computing the RTO:
             tcp_ack
                |
       tcp_clean_rtx_queue
                |
        tcp_ack_update_rtt
        |               |
tcp_ack_saw_tstamp tcp_ack_no_tstamp
        |               |
        tcp_valid_rtt_meas 
                |
        tcp_rtt_estimator --- tcp_set_rto
                                  |
                            __tcp_set_rto --- tcp_bound_rto

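Ways to adjust the RTO manually

Back to the original question: can the RTO be changed?

In this kernel the initial RTO (including the FALLBACK value) cannot be changed, because it is hard-coded.

The RTO of the data-transfer phase, however, can be adjusted.

Assume the network is stable and the rtt is always m.

After the first sample srtt is 8m and mdev is 2m, and srtt stays at 8m from then on (see the previous section). mdev, on the other hand, does not stay at 2m: with a zero error every later sample multiplies it by 3/4, so the values below describe the state right after the first sample.

Therefore rto = 8m/8 + rttvar = m + rttvar

so the rto is determined entirely by rttvar.

In tcp_rtt_estimator, mdev_max is re-initialized to tcp_rto_min() on every window: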

  
static inline u32 tcp_rto_min (struct sock *sk)
{
    const struct dst_entry *dst = __sk_dst_get(sk);
    u32 rto_min = TCP_RTO_MIN;

    if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
        rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
    return rto_min;
}

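This looks up the route (dst entry) of the socket. If RTO_MIN is set and locked on that route, as dst_metric_locked() checks, it replaces TCP_RTO_MIN.

Call the value you configure ROUTE_MIN.

As the previous section showed, rttvar is determined by mdev and ROUTE_MIN.

If ROUTE_MIN is larger than mdev (2m after the first sample),

rttvar is set to ROUTE_MIN and stays there,

so rto = m + ROUTE_MIN

If ROUTE_MIN is smaller than mdev (2m after the first sample),

then rttvar = mdev = 2m and rto = m + rttvar = 3m right after the first sample. As further samples with the same rtt arrive, mdev shrinks and rttvar is smoothed down toward ROUTE_MIN, so the rto drifts from 3m toward m + ROUTE_MIN.

That was the code analysis. Now for how to set ROUTE_MIN.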

[root@localhost.localdomain ~]# ping www.baidu.com
PING www.a.shifen.com (180.97.33.107) 56(84) bytes of data.
64 bytes from 180.97.33.107: icmp_seq=1 ttl=51 time=30.8 ms
64 bytes from 180.97.33.107: icmp_seq=2 ttl=51 time=29.9 ms

After getting Baidu's IP:

[root@localhost.localdomain ~]# ip route add 180.97.33.108/32 via 172.16.3.1 rto_min 20
[root@localhost.localdomain ~]# nc www.baidu.com 80
[root@localhost.localdomain ~]# ss -eipn '( dport = :www )'
State       Recv-Q Send-Q                                                              Local Address:Port                                                                Peer Address:Port 
ESTAB       0      0                                                                     172.16.3.14:14149                                                              180.97.33.108:80     users:(("nc",7162,3)) ino:48057454 sk:ffff88023905adc0
         sack cubic wscale:7,7 rto:81 rtt:27/13.5 cwnd:10 send 4.3Mbps rcv_space:14600

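Because ROUTE_MIN < 2rtt, rto = 3rtt = 27 * 3 = 81 (this connection has only just been established, so the estimator has seen a single sample).

On an internal network the rtt is very small: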

[root@localhost.localdomain ~]# ip route add 172.16.3.16/32 via 172.16.3.1 rto_min 20               
[root@localhost.localdomain ~]# nc 172.16.3.16 22
SSH-2.0-OpenSSH_5.3
[root@localhost.localdomain ~]# ss -eipn '( dport = :22 )'
State       Recv-Q Send-Q                                                              Local Address:Port                                                                Peer Address:Port 
ESTAB       0      0                                                                     172.16.3.14:57578                                                                172.16.3.16:22     users:(("nc",7272,3)) ino:48059707 sk:ffff88023b7c7000
         sack cubic wscale:7,7 rto:21 rtt:1/0.5 ato:40 cwnd:10 send 116.8Mbps rcv_space:14600

Because ROUTE_MIN > 2rtt, rto = rtt + ROUTE_MIN = 1 + 20 = 21
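Both ss readings follow from the first-sample branch of the estimator. Here is a minimal Python sketch of that one step, with values in milliseconds; `first_sample_rto` is a hypothetical helper, not a kernel function.

```python
def first_sample_rto(rtt_ms, route_min_ms):
    """RTO right after the first RTT sample on a new connection."""
    srtt = rtt_ms << 3                 # srtt holds 8 * SRTT
    mdev = rtt_ms << 1                 # mdev starts at 2 * rtt
    rttvar = max(mdev, route_min_ms)   # lower-bounded by the route's rto_min
    return (srtt >> 3) + rttvar
```

`first_sample_rto(27, 20)` gives 81 and `first_sample_rto(1, 20)` gives 21, matching the `rto:81` and `rto:21` fields in the two ss outputs above.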

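If you are confident in the whole internal network, you can also skip the per-destination route and apply the setting to all connections, like this: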

ip route change dev eth0 rto_min 20ms

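Summary

1 The SYN retransmission time of a connection cannot be adjusted without recompiling the kernel, but the retransmission time of data (PSH) packets can.

2 On a fairly stable network, let the rtt be m and the configured minimum rto be ROUTE_MIN.

If ROUTE_MIN is larger than 2m, every rto = m + ROUTE_MIN.

If ROUTE_MIN is smaller than 2m, rto = 3m right after the first sample and then decays toward m + ROUTE_MIN while the rtt stays stable.

3 Adjusting ROUTE_MIN appropriately can reduce the load on the server, or (when KVM has I/O performance problems) improve the perceived network quality.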
