[Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Hornwitser
Fast Inserter
Posts: 205
Joined: Fri Oct 05, 2018 4:34 pm

Post by Hornwitser »

When UDP packets are received they are placed into a receive queue by the network layer. The size of this queue varies by system. On Windows 7 this is controlled by a network layer called AFD (Ancillary Function Driver for Winsock) and defaults to 8192 bytes. When packets are read from the socket they are taken off the receive queue. The queue size for an individual socket can be overridden with the SO_RCVBUF option.
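To see what default a given system assigns, the value can be read back from a fresh socket. This is a minimal sketch using Python's standard socket module; the helper name is made up for illustration, and the number it reports depends on the OS and its configuration.

```python
import socket


def default_udp_receive_buffer() -> int:
    """Return the SO_RCVBUF value the OS assigns to a fresh UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # No setsockopt call has been made, so this is the system default.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print(f"default SO_RCVBUF: {default_udp_receive_buffer()} bytes")
```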

If packets are received but not immediately handled by the application, they start filling up the receive queue. If more data is received than fits in the receive queue, the additional packets get dropped. This behavior can be seen in the connection handshake to a server that has a particularly big handshake: approximately 16 fragments of the handshake get accepted, corresponding to about 8kb, and the rest get requested to be retransmitted, indicating that they were dropped.

The same dropping can be seen when a ServerToClientHeartbeat message reaches about 10kb. The receive buffer overflows and the last fragments of the message get dropped. There's no fragment-based retransmission logic for these packets (at least none that I could find), so instead of the lost fragments being requested again, the whole message is retransmitted, which of course will have the same last fragments dropped again. (This incidentally appears to be the root cause of the behaviour observed in [0.17.52] Long /silent-command lead to dropping players.)
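The failure mode can be illustrated with a toy model. This is a simplified sketch, not Factorio's code: it assumes 500-byte fragments, an 8192-byte queue that counts payload bytes only, and a client that reads nothing until the whole burst has arrived. All names are hypothetical.

```python
FRAGMENT_SIZE = 500    # assumed payload bytes per fragment
QUEUE_CAPACITY = 8192  # default receive queue size reported for Windows 7


def deliver_burst(fragment_ids, capacity=QUEUE_CAPACITY, size=FRAGMENT_SIZE):
    """Return the fragment ids that fit into an empty receive queue when the
    whole burst arrives before the application reads anything."""
    accepted, used = [], 0
    for frag in fragment_ids:
        if used + size > capacity:
            continue  # queue is full: this fragment is dropped
        accepted.append(frag)
        used += size
    return accepted


def resend_whole_message(total_fragments, attempts):
    """Resend every fragment on each attempt. Return True if any single
    attempt gets all fragments through."""
    for _ in range(attempts):
        if len(deliver_burst(range(total_fragments))) == total_fragments:
            return True
    return False


if __name__ == "__main__":
    print(len(deliver_burst(range(20))))   # fragments of a 10kb message accepted
    print(resend_whole_message(20, 100))   # whole-message resends never succeed
```

In this model only the first 16 of the 20 fragments are accepted on every attempt, so resending the whole message never makes progress.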

Steps to reproduce.
- Have the server send a ServerToClientHeartbeat message that's over 10kb to a client on Windows 7.

Increasing the default receive queue size stops this behavior for messages smaller than the new size. But this is not a real solution, as it only raises the threshold at which the problem occurs. (On Windows the default size can be changed by setting HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Afd\Parameters\DefaultReceiveWindow to, for example, 0x10000 for 64kb. Note that this value should be a multiple of 4096.)

kovarex
Factorio Staff
Posts: 8078
Joined: Wed Feb 06, 2013 12:00 am

Re: [0.17.54] UDP Receive queue may overflow on Windows

Post by kovarex »

Insightful read. I had no idea about this.

slippycheeze
Filter Inserter
Posts: 587
Joined: Sun Jun 09, 2019 10:40 pm

Re: [0.17.54] UDP Receive queue may overflow on Windows

Post by slippycheeze »

kovarex wrote:
Thu Jul 11, 2019 2:19 pm
Insightful read. I had no idea about this.

FWIW, just set the receive buffer size in software all the time. The defaults are the biggest pain in the neck for cross-platform and cross-version support. Just thank everything that you don't have to support 23 obscure Unix varieties with this, each with their own unique annoyances.

For TCP connections it really doesn't matter much, but you *might* get slightly better bandwidth on a high latency, otherwise perfect, connection with a higher receive buffer size there too. It usually has some amount of influence on how big the window can get, hence letting more bytes be in flight, which helps with that.

That is, if you use TCP anywhere performance-critical, which I don't think you do.

Oxyd
Former Staff
Posts: 1428
Joined: Thu May 07, 2015 8:42 am

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Oxyd »

Well I added retransmission logic for those messages, so that should hopefully fix that.

slippycheeze
Filter Inserter
Posts: 587
Joined: Sun Jun 09, 2019 10:40 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by slippycheeze »

Oxyd wrote:
Tue Jul 16, 2019 1:40 pm
Well I added retransmission logic for those messages, so that should hopefully fix that.

I tried, like, three times to write this succinctly *and* with an explanation, and I don’t think I can. This part of the network API is full of horrible, historical stuff that makes it a pain in the neck to work with, and also very prone to “portability” issues even just from version changes on the same OS kernel.

The rule is very, very simple: if you use UDP you MUST set your receive socket buffer to a size that is >= the maximum packet size your application can ever generate. If you can generate a 65535 byte UDP packet, you must have at least 65535 bytes worth of receive buffer to work reliably.

Retransmission cannot fix that problem. It is just impossible (see horrible historical stuff) to ever get a UDP message larger than your socket receive buffer size without truncation, at the lowest levels. As in, where the kernel talks to the application to hand over network data.

You need to make sure something or someone calls setsockopt for SO_RCVBUF to set the size to handle those messages.

Painfully, the argument is still platform-specific, or at least it was, if my memory serves:
  • On Linux, pass a number >= your largest packet size. The kernel will double it, and getsockopt will return the doubled value.
  • On Windows, pass a number >= 2 * your largest packet size. It will not double the requested value, but it needs that extra space to work.
  • On macOS, no idea, and their setsockopt manpage does not say. Double the value just to be safe.

Your change will sometimes fix packet drops though: if the socket receive buffer is partly full, it’ll also truncate a message, but if the buffer has enough space when the retransmission gets there it’ll work.
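The setsockopt call and the per-platform sizing described in the list above can be sketched as follows. This is a rough illustration in Python: the helper name is hypothetical, and the multipliers simply follow the recollection in this post rather than verified platform documentation. The OS may also clamp the request to a system-wide maximum, so the granted size is read back rather than assumed.

```python
import socket
import sys


def request_receive_buffer(sock: socket.socket, largest_message: int):
    """Ask for a receive buffer big enough for `largest_message` bytes.

    Returns (requested, granted) so the caller can check what the OS gave.
    """
    # Assumption taken from the post: Linux doubles the value itself,
    # other platforms are given twice the largest message size.
    factor = 1 if sys.platform.startswith("linux") else 2
    requested = largest_message * factor
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return requested, granted


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        requested, granted = request_receive_buffer(sock, 65535)
        print(f"requested {requested} bytes, OS reports {granted} bytes")
```

Comparing the two returned values is the main point: if the granted size is smaller than needed, the application can at least log it instead of silently losing data.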

Also, OS vendors and kernels have been known to change the default receive buffer size without warning, in what look like otherwise trivial updates. You cannot depend on any value, ever, on any platform, for the default buffer size. Where is your god now, BSD networking, where is your god now?

I have omitted the history and other details from this deliberately. I’m happy to explain why on any of the stuff that isn’t clear, if you ask, but you can’t fix this problem (assuming the original analysis is correct, which I believe it to be) without that call to setsockopt(fd, SO_RCVBUF, …).

Hornwitser
Fast Inserter
Posts: 205
Joined: Fri Oct 05, 2018 4:34 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Hornwitser »

slippycheeze, it's nice that you want to provide insightful information on the pitfalls of network programming, but I think you're missing a vital piece of information. The Factorio protocol fragments messages larger than about 500 bytes; no UDP packet greater than about 510 bytes is ever sent by either the server or the client. Since this maximum packet size is much smaller than the default receive queue size of any system, adding fragment-based retransmission logic (which I presume is what Oxyd did) will allow the lost fragments to get resent and the complete message to be reassembled. Large messages are also a rare edge case: more or less the only thing sent over the network is player inputs, and even with a hundred players 10kb messages are rare.
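Why fragment-based retransmission gets around a small receive queue can be shown with a toy model. This is a hypothetical sketch, not the actual Factorio protocol: it assumes 500-byte fragments, an 8192-byte queue counting payload bytes only, and a client that drains the queue completely between rounds.

```python
FRAGMENT_SIZE = 500    # assumed payload bytes per fragment
QUEUE_CAPACITY = 8192  # assumed receive queue size in bytes


def deliver_burst(fragment_ids, capacity=QUEUE_CAPACITY, size=FRAGMENT_SIZE):
    """Return the fragment ids that fit into an empty receive queue."""
    accepted, used = [], 0
    for frag in fragment_ids:
        if used + size > capacity:
            continue  # queue is full: this fragment is dropped
        accepted.append(frag)
        used += size
    return accepted


def receive_with_fragment_retransmit(total_fragments, max_rounds=10):
    """Request only the missing fragments each round.

    Returns (complete, rounds_used).
    """
    received = set()
    rounds = 0
    while len(received) < total_fragments and rounds < max_rounds:
        missing = [f for f in range(total_fragments) if f not in received]
        # Only the missing fragments are resent, so each round makes progress.
        received.update(deliver_burst(missing))
        rounds += 1
    return len(received) == total_fragments, rounds


if __name__ == "__main__":
    print(receive_with_fragment_retransmit(20))  # a 10kb message
```

Because each round only carries what is still missing, a 20-fragment message completes in two rounds in this model even though the queue holds only 16 fragments at a time.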

slippycheeze
Filter Inserter
Posts: 587
Joined: Sun Jun 09, 2019 10:40 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by slippycheeze »

Hornwitser wrote:
Fri Jul 19, 2019 1:53 pm
slippycheeze it's nice that you want to provide insightful information on the pitfalls of network programming, but I think you're missing a vital piece of information.

I am. Thank you. As I said, I made an assumption without spending the time to verify, and it was invalid. I just dislike the thought of some other poor developer running into the walls I did. I'd rather look a fool than watch someone else suffer. :)

I suppose the lesson learned is that I should figure out what the reasonable-to-use equivalents of [lds]trace and tcpdump/ngrep are on Windows.

Hornwitser
Fast Inserter
Posts: 205
Joined: Fri Oct 05, 2018 4:34 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Hornwitser »

This issue has not been fixed. When the server sends a 10kb ServerToClientHeartbeat message in 0.17.59, it still shows the same behavior: the client endlessly requests that it be resent in full, but the full message is never received by the client.

Hornwitser
Fast Inserter
Posts: 205
Joined: Fri Oct 05, 2018 4:34 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Hornwitser »

This issue has still not been fixed. A 10kb ServerToClientHeartbeat message sent to a Windows client in 0.17.69 still shows the same behavior of being endlessly resent but never received by the client.

Hornwitser
Fast Inserter
Posts: 205
Joined: Fri Oct 05, 2018 4:34 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Hornwitser »

This issue is still present in 0.18.6.

movax20h
Fast Inserter
Posts: 164
Joined: Fri Mar 08, 2019 7:07 pm

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by movax20h »

FYI: on Linux (at least on the distro and kernel I tested), the default receive buffer is reasonably large, and is automatically tuned by the kernel's networking stack based on available (free) memory and the load on each socket (the number of incoming packets and the speed at which the application handles them). This is the case for Linux 2.6.7 and newer. Older kernels did not autotune, but still had relatively high defaults (definitely more than the 10kb needed here).
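For anyone who wants to check their own Linux machine, the system-wide default and maximum socket receive buffer sizes are exposed as the net.core.rmem_default and net.core.rmem_max sysctls. The sketch below reads them from /proc; the function name is made up for illustration, and on other systems (or where the files are not visible) it just returns None for each entry.

```python
from pathlib import Path


def linux_receive_buffer_limits():
    """Read net.core.rmem_default and net.core.rmem_max from /proc.

    Each value is in bytes, or None if it cannot be read on this system.
    """
    values = {}
    for name in ("rmem_default", "rmem_max"):
        path = Path("/proc/sys/net/core") / name
        try:
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # not Linux, or the sysctl is not visible
    return values


if __name__ == "__main__":
    for name, value in linux_receive_buffer_limits().items():
        print(f"net.core.{name} = {value}")
```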

Rseding91
Factorio Staff
Posts: 13201
Joined: Wed Jun 11, 2014 5:23 am

Re: [Oxyd] [0.17.54] UDP Receive queue may overflow on Windows

Post by Rseding91 »

I changed the logic around sockets so that, as of the next release, it should set the buffer size to 256 KB. Let me know once that's released if it still happens.
If you want to get ahold of me I'm almost always on Discord.
