Nov 25, 2010

Google and Microsoft Cheat on Slow-Start. Should You?


A Quest for Speed

I decided a couple of weeks ago that I wanted to build an app, most likely a web app. Being a premature optimizer by nature, my first order of business (after deciding I need to learn to draw) was to find the absolute fastest way to serve up a web page. The Google home page is the fastest-loading page I know of, so I thought a good place to start would be to figure out how they do it and then replicate their strategy.

The full story of my search is below, but the short version is that to match Google's page load times you have to cheat on the tcp slow-start algorithm. It appears that stretching the parameters a little bit is fairly common, but Google and Microsoft push it a lot further than most. This may well be common knowledge in web development circles, but it was news to me.

Some Sleuthing

My first step was to measure the load time of www.google.com over my home cable modem connection. As a first pass, I timed the download with curl:

$ time curl www.google.com > /dev/null  
  % Total  % Received % Xferd Average Speed  Time  Time   Time Current  
                  Dload Upload  Total  Spent  Left Speed  
 100 8885  0 8885  0   0  115k   0 --:--:-- --:--:-- --:--:-- 173k  
 real     0m0.085s

Holy smokes, that was fast! We were able to open a tcp connection, make an http request, receive an 8KB response, and close the connection all in 85ms! That's even faster than I expected, and demonstrates that it should be possible to build an app with a page load time below the threshold that humans perceive as instantaneous (about 150ms, according to one study). Sign me up.

Curious about how they pulled that off (did someone sneak into my house and install a GGC node in the attic?), I fired up tcpdump to take a closer look. What I saw surprised me:

$ tcpdump -ttttt host www.google.com

# 3-way handshake (RTT 16ms)
00:00:00.000000 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [S], seq 2726806947, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 949329348 ecr 0,sackOK,eol], length 0
00:00:00.016255 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [S.], seq 3505557820, ack 2726806948, win 5672, options [mss 1430,sackOK,TS val 688795316 ecr 949329348,nop,wscale 6], length 0
00:00:00.016376 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 1, win 65535, options [nop,nop,TS val 949329348 ecr 688795316], length 0

# client sends request and server acks
00:00:00.017437 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [P.], seq 1:180, ack 1, win 65535, options [nop,nop,TS val 949329348 ecr 688795316], length 179
00:00:00.037139 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], ack 180, win 106, options [nop,nop,TS val 688795338 ecr 949329348], length 0

# server sends 8 segments in the space of 3ms (interspersed with client acks)
00:00:00.067151 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 1:1419, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 1
00:00:00.069693 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 1419:2837, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 2
00:00:00.069814 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 2837, win 65405, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.069918 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 2837:4255, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 3
00:00:00.070374 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 4255:4711, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 456 # segment 4
00:00:00.070486 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 4711, win 65525, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070796 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 4711:6129, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 5
00:00:00.070847 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 6129:7547, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 6
00:00:00.070853 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 7547:8109, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 562 # segment 7
00:00:00.070876 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 7547, win 65228, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070900 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 8109, win 65512, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070962 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 8109:9501, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1392 # segment 8
00:00:00.070990 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 9501, win 65408, options [nop,nop,TS val 949329349 ecr 688795368], length 0

# connection close (RTT 22 ms)
00:00:00.071300 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [F.], seq 180, ack 9501, win 65535, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.093299 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [F.], seq 9501, ack 181, win 106, options [nop,nop,TS val 688795393 ecr 949329349], length 0
00:00:00.093469 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 9502, win 65535, options [nop,nop,TS val 949329349 ecr 688795393], length 0

On the performance front, this is really exciting. They actually managed to deliver the whole response in just 70ms, 30ms of which was spent generating the response (come on Google, you can do better than 30ms). That means that a load time under 50ms should be possible.
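The arithmetic behind those numbers can be checked against the trace. The sketch below is only illustrative: the timestamps are copied from the first trace, and treating the gap between the server's ack of the request and its first data segment as "generation time" is an assumption, since some of that gap could be queueing or load-balancer overhead.

```python
# Timestamps in milliseconds, copied from the first tcpdump trace above.
REQUEST_ACKED_MS = 37.139  # server acks the client's request
FIRST_DATA_MS = 67.151     # first response segment arrives
LAST_DATA_MS = 70.962      # last response segment arrives


def breakdown(request_acked_ms, first_data_ms, last_data_ms):
    """Split the load time into server think time and everything else.

    Assumption: the gap between the server's ack and its first data
    segment is time spent generating the response.
    """
    think = first_data_ms - request_acked_ms
    return {
        "server_think_ms": think,
        "total_ms": last_data_ms,
        "without_think_ms": last_data_ms - think,
    }


result = breakdown(REQUEST_ACKED_MS, FIRST_DATA_MS, LAST_DATA_MS)
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the think time comes out at roughly 30ms and the remainder at roughly 41ms, which is where the "under 50ms" estimate comes from.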

How they accomplished that is what surprised me. The rate at which a server can send data over a new connection is limited by the tcp slow-start algorithm, which works as follows: The server maintains a congestion window which controls how many tcp segments it can send before receiving acks from the client. The server starts with a small initial window (IW), and then for each ack received from the client, increases the window size by one segment (so the window roughly doubles every round trip) until it reaches the slow-start threshold, reaches the client's receive window size, or encounters congestion. This allows the server to discover the true bandwidth of the path in a way that's fair to other flows and minimizes congestion.
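The per-ack growth described above can be sketched as a toy model. This is not how any real TCP stack is implemented: it assumes no loss, a receive window that never limits the sender, and that the window grows by one segment for every ack that arrives before the next flight. The `segments_per_ack` parameter is a hypothetical knob for modelling clients that ack every second segment, as the client in these traces appears to do.

```python
def flights_needed(total_segments, initial_window, segments_per_ack=1):
    """Count the round-trip 'flights' a sender needs to deliver a response.

    Toy model of slow-start: each flight sends up to cwnd segments, and
    the window then grows by one segment per ack received. A leftover
    odd segment is assumed not to produce its own ack in time.
    """
    if total_segments < 0 or initial_window < 1 or segments_per_ack < 1:
        raise ValueError("invalid parameters")
    cwnd = initial_window
    remaining = total_segments
    flights = 0
    while remaining > 0:
        sent = min(cwnd, remaining)
        remaining -= sent
        flights += 1
        cwnd += sent // segments_per_ack  # one segment of growth per ack
    return flights


# An 8-segment response, like the Google home page in the trace above.
print(flights_needed(8, 3))                       # ack per segment
print(flights_needed(8, 3, segments_per_ack=2))   # ack every 2 segments
print(flights_needed(8, 8, segments_per_ack=2))   # whole page in the IW
```

In this model an 8-segment response needs two flights with an IW of 3 when every segment is acked, three flights when only every second segment is acked, and a single flight once the IW covers the whole response.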

If you look at the trace, though, you'll notice that the server is actually sending the entire 8-segment response before there's time for the first client ack to reach it. This is a clear violation of RFC 3390, which defines the following algorithm for determining the max IW:

The upper bound for the initial window is given more precisely in
   (1): min (4*MSS, max (2*MSS, 4380 bytes))  

Note: Sending a 1500 byte packet indicates a maximum segment size
(MSS) of 1460 bytes (assuming no IP or TCP options).  Therefore,
limiting the initial window's MSS to 4380 bytes allows the sender to
transmit three segments initially in the common case when using 1500
byte packets.
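The formula in the excerpt is easy to evaluate directly. The following is a small sketch; the function name is made up for illustration, and the two MSS values are the ones that appear in the handshake of the trace above (1460 from the client, 1430 from the server).

```python
def rfc3390_initial_window(mss):
    """Upper bound on the initial window in bytes, per formula (1) above."""
    return min(4 * mss, max(2 * mss, 4380))


for mss in (1460, 1430):
    iw_bytes = rfc3390_initial_window(mss)
    # Whole segments of size `mss` that fit in the initial window.
    print(f"MSS {mss}: IW {iw_bytes} bytes = {iw_bytes // mss} full segments")
```

Both values give a 4380-byte initial window, which is three full segments.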

www.google.com is advertising an MSS of 1430 (my client advertises 1460), and either value allows an IW of 3 segments according to the RFC. In our trace, they appear to be using an IW of at least 8, which allows them to shave off 2 round trips (~50ms) over an IW of 3 for this request. This raises the question: just how far will they go? Let's request a larger file and see what happens:

$ tcpdump -i en1 -ttttt host www.google.com

# 3-way handshake (RTT 22ms)
00:00:00.000000 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [S], seq 2589435808, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 949341091 ecr 0,sackOK,eol], length 0
00:00:00.022780 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [S.], seq 4085145017, ack 2589435809, win 5672, options [mss 1430,sackOK,TS val 990595778 ecr 949341091,nop,wscale 6], length 0
00:00:00.022913 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 1, win 65535, options [nop,nop,TS val 949341092 ecr 990595778], length 0

# client request and server ack
00:00:00.023699 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [P.], seq 1:193, ack 1, win 65535, options [nop,nop,TS val 949341092 ecr 990595778], length 192
00:00:00.048205 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], ack 193, win 106, options [nop,nop,TS val 990595802 ecr 949341092], length 0

# server sends 9 segments in 4ms (interspersed with client acks)
00:00:00.082766 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 1:1419, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.083077 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 1419:2837, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.083118 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 2837, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.083284 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 2837:3966, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1129
00:00:00.083318 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 3966, win 65441, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.085550 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 3966:5384, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.085875 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 5384:6802, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.085976 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 6802, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086045 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 6802:8062, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1260
00:00:00.086067 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 8062, win 65425, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086601 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 8062:9480, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.086709 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 9480:10898, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.086728 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 10898, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086820 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 10898:12158, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1260
00:00:00.086836 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 12158, win 65425, options [nop,nop,TS val 949341092 ecr 990595836], length 0

# 24ms after first client ack was sent, we get 2 more segments
00:00:00.107116 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 12158:13576, ack 193, win 106, options [nop,nop,TS val 990595860 ecr 949341092], length 1418
00:00:00.107403 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 13576:14651, ack 193, win 106, options [nop,nop,TS val 990595860 ecr 949341092], length 1075
00:00:00.107518 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 14651, win 65448, options [nop,nop,TS val 949341092 ecr 990595860], length 0

# connection close (RTT 25ms)
00:00:00.107938 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [F.], seq 193, ack 14651, win 65535, options [nop,nop,TS val 949341092 ecr 990595860], length 0
00:00:00.129947 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [F.], seq 14651, ack 194, win 106, options [nop,nop,TS val 990595884 ecr 949341092], length 0
00:00:00.130071 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 14652, win 65535, options [nop,nop,TS val 949341093 ecr 990595884], length 0

Interestingly, the server waits for ~1 RTT after sending 9 segments, indicating an IW of 9. This suggests that the value was tuned for the home page (or for the similarly-sized search results page).
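Counting segments by eye gets tedious, so here is a rough sketch of how the same deduction could be automated from tcpdump's text output. It is a heuristic, not a rigorous measurement: it counts the server's data segments from the first one until it sees a pause longer than `gap_seconds` (a hypothetical threshold; something like half the handshake RTT seems reasonable). The function names and the regular expression are illustrative and only cover the line format shown in the traces above.

```python
import re

# Matches lines such as:
# 00:00:00.082766 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], ... length 1418
LINE_RE = re.compile(
    r"^(\d+):(\d+):(\d+\.\d+) IP (\S+) > (\S+): .*length (\d+)"
)


def parse_line(line):
    """Return (seconds, src, dst, payload_length), or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, src, dst, length = match.groups()
    when = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return when, src, dst, int(length)


def estimate_initial_window(lines, server_prefix, gap_seconds):
    """Count server data segments sent before the first long pause."""
    count = 0
    last_time = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, src, _dst, length = parsed
        # Skip client packets and the server's pure acks (no payload).
        if not src.startswith(server_prefix) or length == 0:
            continue
        if last_time is not None and when - last_time > gap_seconds:
            break  # the server paused, presumably waiting for acks
        count += 1
        last_time = when
    return count
```

Fed the lines of the trace above with a server prefix of `74.125.227.50` and a gap of about 0.011 seconds, this should stop counting at the ~20ms pause before the last two segments. If the response fits entirely in the first burst, the result is only a lower bound on the IW.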

How Common is This?

So, is this common practice that I just never noticed before, or is Google the only one doing it? I thought I'd run traces against a few more sites and try to deduce their IWs. Here's what I found:

akamai:4
amazon:3
cisco:2
facebook:4
limelightnetworks:4
yahoo:3

It looks like goosing the IW to 4 is pretty common practice, but I was about to give up on finding anyone pushing as far as Google until, almost as an afterthought, I tried www.microsoft.com. You have to see it to believe it:

$ tcpdump -i en1 -ttttt host www.microsoft.com

# 3-way handshake (RTT 92ms)
00:00:00.000000 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [S], seq 2134062443, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 949406122 ecr 0,sackOK,eol], length 0
00:00:00.091960 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [S.], seq 2932567358, ack 2134062444, win 8190, options [mss 1460], length 0
00:00:00.092094 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 1, win 65535, length 0

# request from client and server ack
00:00:00.092909 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [P.], seq 1:179, ack 1, win 65535, length 178
00:00:00.189451 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 1:1461, ack 179, win 64034, length 1460

# server sends 43 segments without pause, for a total of almost 64KB!!! (the full client receive window size)
00:00:00.189780 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 1461:2921, ack 179, win 64034, length 1460
00:00:00.190009 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 2921:4381, ack 179, win 64034, length 1460
00:00:00.190055 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 4381, win 65535, length 0
00:00:00.190204 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 4381:5841, ack 179, win 64034, length 1460
00:00:00.190282 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 5841:7301, ack 179, win 64034, length 1460
00:00:00.190325 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 7301, win 65535, length 0
00:00:00.192433 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 7301:8761, ack 179, win 64034, length 1460
00:00:00.192742 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 8761:10221, ack 179, win 64034, length 1460
00:00:00.192834 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 10221, win 65535, length 0
00:00:00.192965 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 10221:11681, ack 179, win 64034, length 1460
00:00:00.193438 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 11681:13141, ack 179, win 64034, length 1460
00:00:00.193523 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 13141:14601, ack 179, win 64034, length 1460
00:00:00.193627 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 14601:16061, ack 179, win 64034, length 1460
00:00:00.193916 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 16061:17521, ack 179, win 64034, length 1460
00:00:00.194171 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 17521:18981, ack 179, win 64034, length 1460
00:00:00.194257 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 18981:20441, ack 179, win 64034, length 1460
00:00:00.194365 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 20441:21901, ack 179, win 64034, length 1460
00:00:00.199122 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 21901:23361, ack 179, win 64034, length 1460
00:00:00.199129 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 23361:24821, ack 179, win 64034, length 1460
00:00:00.199164 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 24821:26281, ack 179, win 64034, length 1460
00:00:00.199251 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 23361, win 65535, length 0
00:00:00.199255 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 26281:27741, ack 179, win 64034, length 1460
00:00:00.199810 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 27741:29201, ack 179, win 64034, length 1460
00:00:00.200126 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 29201:30661, ack 179, win 64034, length 1460
00:00:00.200135 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 30661:32121, ack 179, win 64034, length 1460
00:00:00.200403 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 32121:33581, ack 179, win 64034, length 1460
00:00:00.200503 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 33581:35041, ack 179, win 64034, length 1460
00:00:00.202268 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 35041:36501, ack 179, win 64034, length 1460
00:00:00.202301 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 36501, win 65535, length 0
00:00:00.202792 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 36501:37961, ack 179, win 64034, length 1460
00:00:00.203063 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 37961:39421, ack 179, win 64034, length 1460
00:00:00.203642 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 39421:40881, ack 179, win 64034, length 1460
00:00:00.205313 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 40881:42341, ack 179, win 64034, length 1460
00:00:00.205576 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 42341:43801, ack 179, win 64034, length 1460
00:00:00.205905 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 43801:45261, ack 179, win 64034, length 1460
00:00:00.206253 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 45261:46721, ack 179, win 64034, length 1460
00:00:00.206354 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 46721:48181, ack 179, win 64034, length 1460
00:00:00.206627 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 48181:49641, ack 179, win 64034, length 1460
00:00:00.206655 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 49641, win 65535, length 0
00:00:00.208561 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 49641:51101, ack 179, win 64034, length 1460
00:00:00.208883 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 51101:52561, ack 179, win 64034, length 1460
00:00:00.209052 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 52561:54021, ack 179, win 64034, length 1460
00:00:00.209290 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 54021:55481, ack 179, win 64034, length 1460
00:00:00.209373 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 55481:56941, ack 179, win 64034, length 1460
00:00:00.209677 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 56941:58401, ack 179, win 64034, length 1460
00:00:00.209758 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 58401:59861, ack 179, win 64034, length 1460
00:00:00.210097 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 59861:61321, ack 179, win 64034, length 1460
00:00:00.210188 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 61321:62781, ack 179, win 64034, length 1460
00:00:00.210216 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 62781, win 65535, length 0
00:00:00.210471 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 62781:64241, ack 179, win 64034, length 1460

# finally, the server waits for an ack before continuing
00:00:00.282291 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [P.], seq 64241:65701, ack 179, win 64034, length 1460
00:00:00.282420 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 65701, win 65535, length 0
00:00:00.284785 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 65701:67161, ack 179, win 64034, length 1460
00:00:00.285122 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 67161:68621, ack 179, win 64034, length 1460
00:00:00.287224 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 68621:70081, ack 179, win 64034, length 1460
00:00:00.287518 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 70081:71541, ack 179, win 64034, length 1460
00:00:00.287746 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [.], seq 71541:73001, ack 179, win 64034, length 1460
00:00:00.288043 IP wwwco1vip.microsoft.com.http > 192.168.1.21.52625: Flags [P.], seq 73001:74396, ack 179, win 64034, length 1395
00:00:00.288084 IP 192.168.1.21.52625 > wwwco1vip.microsoft.com.http: Flags [.], ack 74396, win 65535, length 0
...

Microsoft appears to be skipping slow-start altogether and setting the IW to the full client receive buffer size. Crazy.
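A quick sanity check of the "full receive window" reading, as a sketch using numbers taken from the trace: the server's SYN-ACK carries no wscale option, so window scaling is off and the client's advertised window is the literal 65535 bytes. The burst runs from seq 1 to seq 64241, counting the first data segment that arrives in place of a bare ack.

```python
RECEIVE_WINDOW = 65535  # client's advertised window; no scaling negotiated
MSS = 1460              # segment size used by the server in the trace
BURST_END_SEQ = 64241   # sequence number where the server first pauses

burst_bytes = BURST_END_SEQ - 1                 # relative seq starts at 1
burst_segments = burst_bytes // MSS             # full segments in the burst
max_segments_in_window = RECEIVE_WINDOW // MSS  # full segments that fit

print(f"burst: {burst_bytes} bytes in {burst_segments} segments")
print(f"full segments that fit in the window: {max_segments_in_window}")
print(f"one more segment would need {(burst_segments + 1) * MSS} bytes")
```

The burst is 44 full segments, which is exactly as many as fit in a 65535-byte window; a 45th would overflow it. That is consistent with the server being limited by the receive window alone rather than by any congestion window.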

Some Discussion

A search for "google tcp initial window" turns up a Google-authored research paper and Internet-Draft proposing a change to the slow-start algorithm to allow an IW of up to 10 segments (IW10). Interesting.

There's also a lively ongoing discussion on the IETF TMRG mailing list. I haven't read every post (there have been hundreds over the last few months), but it seems that most of the participants are approaching this as a theoretical problem, not as an issue that is actually occurring in the wild and needs to be addressed. The Google engineers on the mailing list have taken on a more frustrated tone recently, so it's possible that they decided the best way to make forward progress was to just turn it on and see whether the internet actually melts down or not. It's also possible that I happen to be part of an ongoing test that they're running.

I wasn't able to find any discussion relevant to what I saw in my Microsoft trace.

Conclusions

It's getting late, so I'll wrap this up with a few thoughts:
  1. Fast is good. I'm excited to see that sub-100ms page loads are possible, and it's a shame to not be able to take full advantage of modern networks because of protocol limitations (http being the limiting protocol, btw).
  2. Being non-standards-compliant in a way that privileges their flows relative to others seems more than a little hypocritical from a company that's making such a fuss about network neutrality.
  3. I'm not really qualified to render judgment on whether IW10 is a net positive, but after reading the discussion (and considering that the internet hasn't actually melted down), I'm inclined to think that it is.
  4. I'm pretty confident that turning off slow-start entirely, as Microsoft seems to be doing in my trace, is a very bad thing (maybe even a bug).
So, this leaves the question, what should I do in my app (and what should you do in yours)? Join the arms race or sit on the sidelines and let Google have all the page-load glory? I'll let you know what I decide (and if I do it, I'll be sure to let you know how it works out).

64 comments:

Dave said...

Any kernel-tuning tricks to play with slow start options for a linux-based web server?

Ben Strong said...

@Dave - Check out this patch:

http://www.amailbox.org/mailarchive/linux-netdev/2010/5/26/6278007

Dutchmaster said...

That's pretty funny. Makes you wonder how many hacks these companies have going on behind the scenes.

angsuman said...

I ran the same test on our site from a different server over internet and got the following result:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 55529 100 55529 0 0 16.2M 0 --:--:-- --:--:-- --:--:-- 50.4M

real 0m0.006s


That appears to be much faster than Google and reproducible. Am I missing something here?

Ben Strong said...

@angsuman - I should have been more clear about it, but the interesting part is that these tests were run over a home internet connection. My guess is that your test was from a server with a very short RTT to your site.

Ben Smith said...

Angsuman: compare/contrast your network latency before comparing yourself to the google!

Google's ping times average somewhere around 30-60ms for most Internet connections in the USA. Your Intranet will probably have 5-10 ms ping times since it's all local LAN in the same building. So you get another 20-50ms to respond before Google can even begin. And because of the short round trip, the "slow startup" algorithm picks up the pace much more quickly, as it was designed to do.

John McLear said...

This may be useful

http://www.isi.edu/lsam/publications/phttp_tcp_interactions/node2.html

Chris Riley said...

great article. thanks for posting.


I think on #1 you mean TCP, though?

Adam Rosien said...

Serendipity: http://www.mnot.net/blog/2010/11/27/htracr

Absolute-Z said...

Good post! Thanks!

Adam said...

FWIW, this is somewhat anecdotal, but I've always noticed that large downloads from MS (e.g. service packs) will totally dominate all other traffic on my connection. Everything else will slow to a crawl rather than sharing somewhat nicely as most other connections seem to. That would seem to make sense as that is partly what the slow start is intended to prevent.

Jim said...

Not implementing an optional variant is not cheating.

Simon said...

I'm not sure if this would apply to Google's case, but Linux caches the cwnd per src/dst flow in the route cache (see "ip route show cache"), and so the cwnd for a new connection can be inflated beyond the "initial cwnd", assuming the src:dst pair is in the route cache. This may affect your testing, and provide the results you are seeing.

Zygo said...

I think Google's net neutrality position is that people should be free to do things like this at the network edges, without having to worry about unexpected interactions with some traffic mangling system in between.

In other words, it's OK for Google to play with their own TCP flows (where they're an end point), but it's not OK for Comcast to play with Google's TCP flows as they go by. This sort of implementation variation is the only way that we can get any kind of forward progress on protocols used on live networks.

3*MSS is a number back from the days when an Ethernet packet was a non-trivial chunk of host device driver memory and could take seconds to transmit over slow links. These days, network interface hardware drivers don't want to bother with host CPU interrupt service overhead until there's a few dozen packets sitting in the receive buffer, because the precious microseconds required to process data one packet at a time are a cost too expensive to bear. DSL and cable modems can buffer hundreds of packets (even though they shouldn't, and dammit, almost all of them do now).

If you intend to have users who are still using ancient network hardware, you might want to consider the effects of this kind of tuning carefully. If I throw a naive TBF QoS filter (simulating the behavior of my 1990's-era cable modem, with no bursting or queueing, it simply drops packets that arrive too quickly) between a user and Google, I notice that Google's pages load extremely slowly compared to a standard-compliant web server. Sometimes they don't load at all, and every time I open a Google page, any other traffic through the filter stalls. No problems if the TBF allows bursting or queueing, though.

exploderator said...

In Google's case it's a reliably small size page. All 8K of that data is going to have to get transferred sooner or later. As long as an 8K burst isn't breaking buffers along the pipes, they are saving several rounds of back and forth protocol overhead by skipping the whole soft start. That means not only does their page load faster, but it is also less costly to everyone involved, in the long run.

As for Microsoft, I go with bug.

Dan Jacobs said...

The limiting protocol is HTTP, you say? Then it's no wonder Google is working on their own replacement for the HTTP protocol. Perhaps this gives us some insight as to why they may be doing this...

Ben Strong said...

@Chris - I actually did mean http, but should have provided some more explanation. The main reason that people are doing this is to work around the fact that browsers create large numbers of short-lived http connections, a problem that SPDY is designed to alleviate. At that point, slow-start will be less of an issue.

@Simon - Interesting. Something to look into for sure. Had I known how much interest this post was going to receive, I would have done a lot more research up front.

Sumit said...

Nice observation, Ben.
The other cases do not raise as much issue as Microsoft. It would be interesting to see how Microsoft responds.

David Rodecker said...

Great discovery.

Curious how long these sites have been doing this... is this perhaps the result of hiring the inventor of TCP/IP?


I wouldn't necessarily say that they "cheat", but that they've optimized for their page sizes and users. There are many cases where the RFC standard isn't optimal in specific cases. Nevertheless, I'd sure be happy if our webserver could take advantage of this easter-egg.

plugsukah said...

Hm, interesting... could make a tangible difference in high latency situations like most wireless connections.

parena said...

As Jim stated: Not implementing an optional variant is not cheating.

Jacob Taylor said...

Great article Ben. Thanks for taking the time to track it down and share it with us.

Justin Grant said...

It's very likely that all the web sites in question have an application delivery controller appliance (such as an F5 Big-IP), terminating all traffic and load balancing it to a server farm.

Given this, you can use two differently tweaked TCP stacks, one WAN side and one LAN side. On the WAN side you would tweak the slow start algorithm, so that short HTML pages could be served within 1-2 windows. Latency (not bandwidth) is the real killer of web applications, so serving a page with minimal round-trips is key.

On the LAN side, you can pretty much abandon slow start. You have <1ms latency, 1Gb or more bandwidth and no packet loss. A correctly configured application delivery controller will aggressively increase the TCP connection server side and cache/buffer the response to WAN based client.

I would agree that Microsoft have got it wrong and it would seem that they have applied a LAN type TCP profile to the WAN client. This is bad in that it wastes bandwidth for large TCP sessions where the client is likely to request retransmission for the bulk of the TCP segments (even if selective acks are enabled).

Andy said...

I was interested in your results, and tried running from my linux box.

I get different results, but I notice in the original handshake, my box is advertising an initial window of 5840 (4 packets), while yours is 64K. So it appears the buffer window is being observed, but not the slow start algorithm.

I would think that this isn't a major problem any more. Slow start is about gaining confidence in the network infrastructure and to me looks aimed at a world at 9600 baud modems. It's been around for a lot longer than RFC 3390:
http://tools.ietf.org/html/rfc2001

I'm not sure what the stats would be, but I think it's got to be incredibly rare to encounter connections below 56 Kb today, and I suspect 512 Kb ADSL is now a bare minimum connection for people trying to access modern websites.

Google's approach of an IW of 10 seems a decent balance. 15K might take an appreciable amount of time to transfer over a modem but it shouldn't cause a problem, and could improve performance for the other 95% of the internet.

GWBasic said...

Take a look at Nagle's algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm. If you want very high performance, make sure it's off, and make sure that you send everything to the socket at once so you don't have tiny packets floating around. (IE, be careful with streaming / buffering APIs.) Also, make sure your library supports HTTP keep-alive.
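
A minimal sketch of that advice using Python's standard socket module. The helper names and the header set are hypothetical and only for illustration; a real server would take these from its HTTP library.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm switched off."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 disables Nagle's algorithm for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def build_response(body: bytes) -> bytes:
    """Join headers and body into one buffer so a single send covers both."""
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"Connection: keep-alive\r\n"
        b"\r\n"
    )
    return headers + body
```

Passing the result of build_response() to a single sendall() call hands the kernel the whole response at once, rather than writing headers and body separately and risking small packets.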

costan said...

Hey,

Please remember that the Linux kernel, and probably other OSes, cache performance data for IP addresses they already "know".

So it's quite likely that your IP address was already known from the perspective of the Google webserver (or, to be more precise, the load balancer in front of the webserver farm).
It may belong to multiple users if a NAT is in place at your home or ISP, or you may simply have already requested something from the very same server (load balancer), so it got to "know" you and the performance of the network in between.

If you issue an "ip route list cache" on a Linux box you can see what is already known in terms of MTU and MSS, but other values (like the window) are also kept inside the kernel.


Ciao,
Andrea

s9 said...

Um, RFC 3390 is an update to RFC 2581, which was obsoleted by RFC 5681. The maximum initial window specified by RFC 5681 is four.

arvid said...

The slow start algorithm doubles the window size each RTT, until it hits congestion or ssthresh. The algorithm you're describing is the normal congestion avoidance regime the TCP connection is put into after slow-start (AIMD).

The purpose of slow start (whose name is slightly misleading) is to find an appropriate window size as fast as possible, and it's done by exponentially increasing the window size.

It makes perfect sense to start with a larger window than the RFC's, since connections have a lot more bandwidth, and starting with 2 or 3 MSS is a quite pessimistic assumption about the available bandwidth.
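
To make the effect of the initial window concrete, here is a small idealized model of slow start. It assumes no loss, no ssthresh or receiver-window limit, a 1460-byte MSS, and a window that doubles every RTT; real stacks grow the window per ACK, so treat it as a sketch rather than a simulation.

```python
import math


def round_trips_to_send(payload_bytes: int, initial_window: int, mss: int = 1460) -> int:
    """Count the send bursts (one per RTT) needed under idealized slow start."""
    segments_left = math.ceil(payload_bytes / mss)
    cwnd = initial_window  # congestion window, in segments
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # send a full window this round
        cwnd *= 2              # exponential growth: double each RTT
        rounds += 1
    return rounds


if __name__ == "__main__":
    # 8885 bytes is the size of the Google response measured in the post.
    for iw in (2, 3, 10):
        print(f"IW={iw}: {round_trips_to_send(8885, iw)} round trip(s)")
```

In this model the 8885-byte page takes 3 bursts with an IW of 2, 2 bursts with an IW of 3, and a single burst with an IW of 10.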

heirofsalazar said...

This begs the question: just how far will they go?

You mean "raises the question". Begging the question is a type of logical fallacy, and using "begs the question" instead of "raises the question" is a quick way to display your ignorance.

Ma said...

Slightly confused why it's a violation of RFC. I mean how is it a violation if it's improved response time?

Ma Diga

Garp said...

@Ma:

I'm not sure why you equate performance with RFC specifications. The RFC sets out the standard for communication. Ben clearly states the relevant RFC and explains why it's a violation (shouldn't be sending more than 3 before waiting for an Ack).

The danger of violating RFCs is that device manufacturers and programmers rely on them to provide the standard to which they code. If you break RFC, you can't be sure how devices at the other end might handle it.
Microsoft is quite notorious for breaking RFCs, as are a number of major companies; for example Microsoft Exchange's version of SMTP is almost, but not quite, RFC compliant (though they may have fixed that now).
99.9% of the time that's not a problem, but that 0.1% of the time it would cause problems for any non-Exchange mail server, like those most service providers use.

Slightly Ugly Analogy Time:
It's like speed limits on roads, you can break them and get to your destination faster, but there are generally reasons behind why the speed limits are in place. They might be outdated ones, but they are still there, and everyone on the road generally expects you to follow them and might not react well to your speeding.

Bageshwar P Narain said...

Very informative post. Thanks.

Michael said...

"This may well be common knowledge in web development circles"

Depends on the circle. :-). It's pretty well-known among the top websites' architects. Microsoft and Google have led the way on this, but Amazon, Yahoo, Facebook, etc. are aware of it, have had discussions with the Microsoft and Google engineers, and have run tests. Look up "SPDY" which is related. (Chrome even has a way to detect whether the site used SPDY from JavaScript.)

The W3C and IETF are having trouble keeping up with the pace of innovation. WebTimings, for example, is long-overdue, addressing something the top sites have been doing for 5+ years (to the degree that the data can be calculated in JavaScript) and only addresses half the problem (gathering timers) -- there's still the challenge of reliably uploading that performance data and tying it to the original page.

There are some really exciting optimizations going on, but the space is so competitive that few will talk openly about it. Keep tracing, view source, and watch the patent applications to get some sense of it. :-)

Ben Strong said...

@Adam - Cool. I was thinking about writing a tool that would sniff the IW. Maybe I can add it to htracr.

@s9 - Not sure how you came to this conclusion. The best I can tell, RFC-5681 obsoletes 2581 and references 3390 as the authority on slow-start. The IW algorithm is the same in 5681 as in 3390. An IW of 4 is only allowed if the segment size is 1095 bytes or less, which it pretty much never is these days.
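
For reference, the RFC 3390 upper bound, min(4*MSS, max(2*MSS, 4380 bytes)), can be evaluated directly. The function names below are made up for this sketch.

```python
def rfc3390_initial_window(mss: int) -> int:
    """Upper bound on the initial window in bytes, per the RFC 3390 formula."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_segments(mss: int) -> int:
    """Number of whole MSS-sized segments that fit in the initial window."""
    return rfc3390_initial_window(mss) // mss


if __name__ == "__main__":
    for mss in (536, 1095, 1460, 2190):
        print(f"MSS={mss}: {rfc3390_initial_window(mss)} bytes, "
              f"{initial_window_segments(mss)} segments")
```

With the common Ethernet MSS of 1460 this gives 4380 bytes, i.e. 3 segments; four segments only fit once the MSS drops to 1095 bytes or below.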

@Michael - That is very interesting info on what's going on.

Michael said...

Thanks, yeah it's a fun time to be alive. I love my job! Xbox performance was hard and fun, but in many ways website latency is even harder.

Anyone here who finds this stuff exciting should come help us make websites faster. My employer (Amazon) is hiring, and so are all the other companies I mentioned.

Some additional resources on SPDY:
http://www.chromium.org/spdy/spdy-whitepaper
http://en.oreilly.com/velocityfall09/public/schedule/detail/11477
and Velocity has many good resources about website latency more generally.

gpshead said...

This work is not being done in the dark. The standard is up to be changed based on the data Google has gathered: http://www.google.com/research/pubs/pub36640.html

Sean said...

If you have not read the vegas TCP stack spec I highly suggest it as it deals with some of the assumptions you folks are making.

These patches to let the application control TCP seem unwise as folks normally call that protocol UDP.. :)

Kris Hofmans said...

It will surely take you a long long time before you can release that app if you stop and look at everything on the way :)

jeffery said...

Seems that they want to make it a standard.....

http://tools.ietf.org/html/draft-ietf-tcpm-initcwnd-00

Glowing Face Man said...

I've always been thinking about how inefficient HTTP is. The standard LAMP setup is horribly wasteful: you wait for the entire http request before apache even considers the possibility of sending the first byte of a response... even though, with the uniformity of most sites, you could send half the damn html before even reading which url they're requesting...

Sheldon Hearn said...

@Glowing Face Man

Huh? Do you have a defined minimum set of headers that you can safely wait for before starting to reply, without waiting for the rest?

I'd be very surprised to hear of such a set.

Daryl said...

As an FYI for this article, there are tcp/ip stack implementation tweaks you can play with for both Windows and Linux to get similar behavior.

In Win 2k8 (or 2k3 with sp3 & KB949316) you can change to the CTCP TCP congestion provider via 'netsh interface tcp set global congestionprovider=ctcp'

In Linux since 2.6.19, there has been a sysctl setting 'net.ipv4.tcp_congestion_control' that allows you to choose between cubic and htcp tcp congestion control algorithms. By default it's 'cubic' afaik.

More info here: (Windows server)
http://smallvoid.com/article/winnt-ctcp-support.html

and Here: (Linux)
http://fasterdata.es.net/fasterdata/host-tuning/linux/
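
On the Linux side, the current setting can also be inspected without the sysctl tool, since sysctl values are exposed as files under /proc/sys. A minimal sketch, assuming a Linux host; the read_sysctl helper is hypothetical.

```python
from pathlib import Path

# Linux exposes net.ipv4.* sysctls as files in this directory.
PROC_DIR = Path("/proc/sys/net/ipv4")


def read_sysctl(name: str, base: Path = PROC_DIR) -> list:
    """Return the whitespace-separated values stored in a sysctl file."""
    return (base / name).read_text().split()


if __name__ == "__main__":
    # Algorithm in use for new connections, e.g. ['cubic'].
    print(read_sysctl("tcp_congestion_control"))
    # Algorithms currently available to choose from.
    print(read_sysctl("tcp_available_congestion_control"))
```

The second call shows which algorithms can be selected on that machine, which depends on the kernel modules that are loaded.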

Matt said...

Looks like IW10 is now default on Linux

McNate said...

On Linux, the following command should set the server's IW on the default route to 10*MSS, according to the man page for "ip".

ip route change default via 192.168.1.1 initcwnd 10

Similar for local networks:

ip route change 192.168.1.0/24 dev wlan0 proto kernel scope link src 192.168.1.42 metric 2 initcwnd 10

I've tested this using Apache on Linux as the server, Google Chrome on Windows 7 as the client, and a 12KiB text file as the requested page.

With the default setting for initcwnd, slow start looks normal.

With initcwnd set to 10, the entire 12KiB is sent in one burst.

neilhendry said...

Page loading speed is critical in achieving a good overall page quality score/page rank. Websites will be tested on desktop loading speeds, but more so on mobile loading speeds, given that c. 40% of searches are on mobile devices now.
Neil

Joe Dev said...

There is no violation. You missed the part of the standard that says your quoted equation is optional and even that, expressly, "a TCP MAY start with a larger initial window".

Scumola said...

I doubt that you were hitting Google at all. DNS was probably cached on your OS, local DNS server or ISP's DNS server. The Google homepage was probably in a transparent cache/proxy at your ISP or in a CDN like Akamai. I seriously doubt that you were hitting Google directly for a static page like the Google homepage. A better test would be to hit a dynamic site with "no-cache" headers set from AWS or some other site that doesn't cache web traffic. I'll bet that what you have detected is the absence of slow-start at your ISP or Akamai.

Bernard Mckeever said...

Looks like this post from google confirms your suspicions :) http://googlecode.blogspot.ie/2012/01/lets-make-tcp-faster.html

hemcoined said...

Could make a tangible difference in high latency situations like most wireless connections.

Malik Gupta said...

FWIW, this is somewhat anecdotal, but I've always noticed that large downloads from MS (e.g. service packs) will totally dominate all other traffic on my connection. Everything else will slow to a crawl rather than sharing somewhat nicely as most other connections seem to. That would seem to make sense as that is partly what the slow start is intended to prevent.


tony zoe-divine said...

Exactly what type of app are you interested in building?

mash said...

In Google's case it's a reliably small page. All 8K of that data is going to have to get transferred sooner or later. As long as an 8K burst isn't overflowing buffers along the path, they are saving several rounds of back-and-forth protocol overhead by skipping slow start entirely. That means not only does their page load faster, but it is also less costly to everyone involved in the long run.




cRonEk said...

I guess you can refer to it as cheating but not a surprise either way. Great post!


cRonEk said...

Great post with examples and proof to backup your claims. Google is a dictator sad to say.

