05-15-2010, 04:58 PM 
	
	
	
I've been struggling for the last couple of months with a server issue that I just couldn't get a handle on. Every so often while playing on my server, I would get a momentary stutter that caused a rubber-banding effect. It often happened as someone was joining the server, but at other times it seemed to happen at random.
I tried to pin it down with the usual monitoring tools, but could see no indications of what was happening. net_graph showed no spikes or dips or really any obvious changes at all when the problem occurred. I was getting some fps dips, but even those didn't directly correspond to the stutter.
From there I just ended up using the shotgun troubleshooting method. I tried several variations of server settings without satisfactory results, and then went on a kernel building binge. Still no success.
Then last week I came across an article about Linux networking, and UDP in particular. Early on I had dismissed the idea that the issue was network related, since the server sits in a datacenter connected to the Level 3 backbone, and I never had any issues sending or receiving files when I was setting it up. It's also a fairly powerful box with a quad-core Nehalem 2.56 GHz processor. I just couldn't believe that it would have a problem keeping up with the relatively slow transfer speed of a TF2 server, but sure enough, that's what the problem was.
Apparently, the default Linux UDP buffer sizing is just adequate for desktop network use, but wholly inadequate for a server. The article I read introduced me to the netstat command, which revealed some interesting information. Here's an example of the UDP statistics I was seeing with the default Debian Lenny UDP buffer sizing.
netstat -su
Udp:
758564 packets received
138 packets to unknown port received.
2477 packet receive errors
1068745 packets sent
RcvbufErrors: 2473 <-----
Now, some UDP receive errors are normal, but notice the "RcvbufErrors" count. This statistic shows the number of packets the kernel had to throw away because the socket's receive buffer was full, meaning srcds didn't pull them from the buffer fast enough. I also noticed that this count would increase every time someone joined the server, probably because the client sends a lot of information to the server at connect time.
This explains a lot. net_graph wouldn't show the packet loss since it was on the receive side at the server, and the stutter was probably due to the fact that the server and client would have to resync after the packets got thrown away.
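If you'd rather watch that counter from a script than eyeball netstat, here is a minimal sketch in Python. It assumes a Linux box where /proc/net/snmp exposes the same UDP counters that netstat -su prints. The function names are made up for illustration, and the sample row reuses the numbers from the output above (a real file has more columns).

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    # The file holds a header row and a value row, both starting with "Udp:".
    # "UdpLite:" rows do not match this prefix, so they are skipped.
    udp_rows = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Udp:")]
    if len(udp_rows) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_rows[0], udp_rows[1]
    # Drop the leading "Udp:" label, then pair each name with its number.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def read_udp_counters(path="/proc/net/snmp"):
    """Read the live counters (Linux only; the path is the assumed default)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


# Sample text built from the numbers in the post, not from a live system.
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors\n"
    "Udp: 758564 138 2477 1068745 2473\n"
)

counters = parse_udp_counters(SAMPLE)
print(counters["RcvbufErrors"])  # prints 2473
```

Calling read_udp_counters() twice, a few seconds apart, and comparing the RcvbufErrors values should show whether drops line up with players connecting.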
Now that the cause of the problem was identified, the next step was to fix it. Fortunately there's a lot of good info on the internet about tuning the Linux network stack. I added these lines to /etc/sysctl.conf.
net.core.rmem_max=1048575
net.core.wmem_max=1048575
net.core.rmem_default=1048575
net.core.wmem_default=1048575
net.core.optmem_max=1048575
net.core.netdev_max_backlog=10000
After doing this, the stutter issue disappeared, and netstat now shows:
Udp:
9396411 packets received
280 packets to unknown port received.
9 packet receive errors
12671863 packets sent
After running for two days, not a single packet has been thrown out due to buffer overflow.
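To check what buffer size a UDP socket actually gets after changing the sysctl values, something like the following Python sketch may help. It only inspects throwaway sockets it creates itself, so it says nothing about what srcds requests internally; the function names are hypothetical. On Linux the value reported by getsockopt is normally double what was requested with setsockopt, and requests are capped by net.core.rmem_max.

```python
import socket


def default_rcvbuf():
    """Return the receive buffer size a fresh UDP socket starts with.

    On Linux this reflects net.core.rmem_default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def granted_rcvbuf(requested_bytes):
    """Request a receive buffer size and return what the kernel reports back.

    The kernel may adjust the request (Linux doubles it and caps it at
    net.core.rmem_max), so the result can differ from requested_bytes.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


print("default receive buffer:", default_rcvbuf())
print("after requesting 1048575:", granted_rcvbuf(1048575))
```

If the default printed here does not change after editing /etc/sysctl.conf, the new values have probably not been loaded yet (running sysctl -p or rebooting applies them).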
I'm hoping that perhaps someone else who is having this problem will find this information useful, and won't have to go through all the trouble I did.
Cheers
	
	
	