Linux TCP Tuning

13 years ago

admin

To make persistent changes to the kernel settings described below, add the entries to the /etc/sysctl.conf file and then run `sysctl -p` to apply them.

As on most operating systems, the default maximum TCP buffer sizes on Linux are far too small. I suggest changing them to the following settings:

To increase the maximum TCP buffer size settable with setsockopt():

`net.core.rmem_max = 33554432`

`net.core.wmem_max = 33554432`

To raise the Linux autotuning TCP buffer limits (minimum, default, and maximum number of bytes to use), set the maximum to 16 MB for 1GE, and 32 MB or 54 MB for 10GE:

`net.ipv4.tcp_rmem = 4096 87380 33554432`

`net.ipv4.tcp_wmem = 4096 65536 33554432`
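Buffer maxima like these are usually sized from the bandwidth-delay product (BDP): a single connection needs roughly bandwidth × round-trip time of buffer to keep the path full. Here is a minimal sketch of that arithmetic in shell. The function name `bdp_bytes` and the example RTTs are illustrative assumptions, not values taken from the kernel documentation:

```shell
#!/bin/sh
# bdp_bytes MBIT_PER_S RTT_MS
# Prints the bandwidth-delay product in bytes.
# (Mbit/s * 1e6 / 8) bytes/s * (ms / 1000) s simplifies to Mbit/s * ms * 125.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Hypothetical paths: 1GE at 100 ms RTT and 10GE at 25 ms RTT.
bdp_bytes 1000 100    # 12500000 bytes, under the 16 MB (16777216) maximum
bdp_bytes 10000 25    # 31250000 bytes, under the 32 MB (33554432) maximum
```

If the BDP of your longest path comes out above the configured maximum, a single stream cannot fill that path however the other settings are tuned.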

You should also verify that the following are all set to the default value of 1:

sysctl net.ipv4.tcp_window_scaling

sysctl net.ipv4.tcp_timestamps

sysctl net.ipv4.tcp_sack
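The three checks above can be folded into one loop. This is a sketch that reads the values directly from /proc/sys rather than calling sysctl; the helper name `report_flag` is made up for illustration:

```shell
#!/bin/sh
# report_flag NAME VALUE
# Prints one line saying whether VALUE is the expected 1.
report_flag() {
    if [ "$2" = "1" ]; then
        echo "$1: ok"
    else
        echo "$1: expected 1, got $2"
    fi
}

# Read each setting from /proc/sys; skip any that this kernel does not expose.
for name in tcp_window_scaling tcp_timestamps tcp_sack; do
    file="/proc/sys/net/ipv4/$name"
    if [ -r "$file" ]; then
        report_flag "$name" "$(cat "$file")"
    fi
done
```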

Note: you should leave tcp_mem alone. The defaults are fine.

Another thing you can do to help increase TCP throughput with 1 Gbit/s NICs is to increase the size of the interface queue. For paths with more than 50 ms RTT, a value of 5000-10000 is recommended. To increase txqueuelen, do the following:

`[root@server1 ~]# ifconfig eth0 txqueuelen 5000`

You can achieve increases in bandwidth of up to 10x by doing this on some long, fast paths. This is only a good idea for Gigabit Ethernet connected hosts, and may have other side effects such as uneven sharing between multiple streams.

Other kernel settings that help with the overall server performance when it comes to network traffic are the following:

TCP_FIN_TIMEOUT – This setting determines how long an orphaned connection (one no longer referenced by any application) may stay in the FIN_WAIT_2 state before the kernel aborts it and reuses its resources. It is often described as the TIME_WAIT timeout, but on Linux the TIME_WAIT duration is fixed in the kernel at 60 seconds and this setting does not change it. By reducing the value, TCP/IP can release closed connections faster, making more resources available for new connections. Adjust this in the presence of many connections sitting in the FIN_WAIT_2 state:

`[root@server:~]# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout`

TCP_KEEPALIVE_INTERVAL – This determines the interval, in seconds, between successive keepalive probes. To set:

`[root@server:~]# echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl`

TCP_KEEPALIVE_PROBES – This determines the number of unanswered probes sent before the connection is timed out. To set:

`[root@server:~]# echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes`

TCP_TW_RECYCLE – This enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Use it with great caution: it is known to break connections from clients behind NAT devices and load balancers, and the setting was removed entirely in Linux 4.12.

`[root@server:~]# echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle`

TCP_TW_REUSE – This allows reusing sockets in TIME_WAIT state for new connections when it is safe from the protocol viewpoint. The default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle.

`[root@server:~]# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse`

Note: The tcp_tw_reuse setting is particularly useful in environments where numerous short connections are opened and left in TIME_WAIT state, such as web servers and load balancers. Reusing the sockets can be very effective in reducing server load.
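The `echo ... > /proc/sys/...` commands above take effect immediately but are lost at reboot. To keep them, the same values go into /etc/sysctl.conf under a dotted key. Here is a small sketch of that path-to-key mapping. The helper names `proc_to_sysctl` and `sysctl_conf_line` are made up for illustration, and the script only prints lines without modifying any file:

```shell
#!/bin/sh
# proc_to_sysctl PATH
# Converts a /proc/sys path into the dotted key used by sysctl and
# /etc/sysctl.conf: drop the /proc/sys/ prefix, then turn "/" into ".".
proc_to_sysctl() {
    printf '%s\n' "${1#/proc/sys/}" | tr '/' '.'
}

# sysctl_conf_line PATH VALUE
# Prints a line suitable for adding to /etc/sysctl.conf.
sysctl_conf_line() {
    echo "$(proc_to_sysctl "$1") = $2"
}

sysctl_conf_line /proc/sys/net/ipv4/tcp_keepalive_intvl 30
sysctl_conf_line /proc/sys/net/ipv4/tcp_keepalive_probes 5
```

After adding the printed lines to /etc/sysctl.conf, `sysctl -p` applies them as described at the top of this article.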

Starting in Linux 2.6.7 (and back-ported to 2.4.27), Linux includes alternative congestion control algorithms besides the traditional ‘reno’ algorithm. These are designed to recover quickly from packet loss on high-speed WANs.

There are a couple of additional sysctl settings for kernels 2.6 and newer:

To avoid caching ssthresh from previous connections:

`net.ipv4.tcp_no_metrics_save = 1`

To increase the queue of incoming packets waiting to be processed, for 10G NICs:

`net.core.netdev_max_backlog = 30000`

Starting with version 2.6.13, Linux supports pluggable congestion control algorithms. The congestion control algorithm used is set using the sysctl variable net.ipv4.tcp_congestion_control, which is set to bic/cubic or reno by default, depending on which version of the 2.6 kernel you are using.

To get a list of congestion control algorithms that are available in your kernel (if you are running 2.6.20 or higher), run:

`[root@server1 ~]# sysctl net.ipv4.tcp_available_congestion_control`

The set of congestion control options is chosen when you build the kernel. The following are some of the options available in the 2.6.23 kernel:

* reno: Traditional TCP used by almost all other OSes. (default)

* cubic: CUBIC-TCP (NOTE: There is a cubic bug in the Linux 2.6.18 kernel used by Red Hat Enterprise Linux 5.3 and Scientific Linux 5.3. Use 2.6.18.2 or higher!)

* bic: BIC-TCP

* htcp: Hamilton TCP

* vegas: TCP Vegas

* westwood: optimized for lossy networks

If cubic and/or htcp are not listed when you do ‘sysctl net.ipv4.tcp_available_congestion_control’, try the following, as most distributions include them as loadable kernel modules:

`[root@server1 ~]# /sbin/modprobe tcp_htcp`

`[root@server1 ~]# /sbin/modprobe tcp_cubic`

For long fast paths, I highly recommend using cubic or htcp. Cubic is the default for a number of Linux distributions, but if it is not the default on your system, you can do the following:

`[root@server1 ~]# sysctl -w net.ipv4.tcp_congestion_control=cubic`
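Before switching, it can help to confirm that the algorithm is actually in the available list. Here is a sketch of that check; the helper name `cc_available` is an illustrative assumption, and the script only reports what it finds without changing any setting:

```shell
#!/bin/sh
# cc_available "LIST" NAME
# Succeeds if NAME appears as a whole word in the space-separated LIST.
cc_available() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

wanted=cubic
list_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$list_file" ]; then
    if cc_available "$(cat "$list_file")" "$wanted"; then
        echo "$wanted is available"
    else
        echo "$wanted is not loaded; try: /sbin/modprobe tcp_$wanted"
    fi
fi
```

Matching on whole words avoids a false positive when one algorithm name is a prefix of another.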

On systems supporting RPMs, you can also try using the ktune RPM, which sets many of these as well.

If you have a heavily loaded server with many closed connections still holding resources, you can lower tcp_fin_timeout further. The interval between closure and release is often described as the TIME_WAIT or twice-the-maximum-segment-lifetime (2MSL) state, but on Linux tcp_fin_timeout actually governs how long an orphaned connection stays in FIN_WAIT_2; the TIME_WAIT duration itself is fixed in the kernel at 60 seconds. By reducing the value, TCP/IP can release these connections faster, providing more resources for new connections. Adjust this parameter if the application opens and closes connections rapidly and throughput suffers because many sockets are waiting to be released:

`[root@host1 ~]# echo 5 > /proc/sys/net/ipv4/tcp_fin_timeout`

If you are often dealing with SYN floods, the following tuning can be helpful:

[root@host1 ~]# sysctl -w net.ipv4.tcp_max_syn_backlog="16384"

[root@host1 ~]# sysctl -w net.ipv4.tcp_synack_retries="1"

[root@host1 ~]# sysctl -w net.ipv4.tcp_max_orphans="400000"

tcp_max_syn_backlog is the maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client.
tcp_synack_retries determines the number of SYN+ACK packets sent before the kernel gives up on the connection. To open the other side of the connection, the kernel sends a SYN with a piggybacked ACK on it, to acknowledge the earlier received SYN. This is part 2 of the three-way handshake.
tcp_max_orphans is the maximum number of TCP sockets not attached to any user file handle that the system will hold. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple DoS attacks. You _must_ not rely on it or lower it artificially; rather, increase it (probably after adding memory) if network conditions require more than the default value, and tune network services to linger and kill such states more aggressively.

More information on tuning parameters and defaults for Linux 2.6 is available in the file ip-sysctl.txt, which is part of the 2.6 source distribution.

Warning on Large MTUs: If you have configured your Linux host to use 9K MTUs, but the connection is using 1500 byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact, some device drivers only allocate memory in power-of-two sizes, so you may even need 16/1.5 ≈ 11 times more buffer space!
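The two ratios in the warning above are a rounded-up division of the per-packet allocation by the packet size actually on the wire. Here is a quick sketch in shell; the function name `buffer_factor` is hypothetical, and 16384 bytes is assumed as the power-of-two allocation for a 9K MTU:

```shell
#!/bin/sh
# buffer_factor ALLOC_BYTES PACKET_BYTES
# Prints how many times more buffer space is needed when each packet of
# PACKET_BYTES occupies an allocation of ALLOC_BYTES, rounded up.
buffer_factor() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

buffer_factor 9000 1500     # 6: buffers sized for a 9K MTU
buffer_factor 16384 1500    # 11: allocation rounded up to a power of two (16 KiB)
```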

And finally a warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK.

Starting with Linux 2.4, Linux implemented a sender-side autotuning mechanism, so that setting the optimal buffer size on the sender is not needed. This assumes you have set large buffers on the receive side, as the sending buffer will not grow beyond the size of the receive buffer.

However, Linux 2.4 has some other strange behavior that one needs to be aware of. For example, the value of ssthresh for a given path is cached in the routing table. This means that if a connection has a retransmission and reduces its window, then all connections to that host for the next 10 minutes will use a reduced window size and not even try to increase it. The only way to disable this behavior is to do the following before all new connections (you must be root):

`[root@server1 ~]# sysctl -w net.ipv4.route.flush=1`

Lastly I would like to point out how important it is to have a sufficient number of available file descriptors, since pretty much everything on Linux is a file.

To check your current max and availability run the following:

`[root@host1 ~]# sysctl fs.file-nr`

`fs.file-nr = 197600 0 3624009`

The first value (197600) is the number of allocated file handles.
The second value (0) is the number of allocated but unused file handles.
The third value (3624009) is the system-wide maximum number of file handles. It can be increased by tuning the following kernel parameter:

`[root@host1 ~]# echo 10000000 > /proc/sys/fs/file-max`
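To turn the three fs.file-nr fields into a quick usage figure, something like the following sketch can help. The function name `file_nr_summary` and the output format are assumptions made for illustration:

```shell
#!/bin/sh
# file_nr_summary "ALLOCATED UNUSED MAX"
# Takes the three fs.file-nr fields and prints handles in use and percent of max.
file_nr_summary() {
    set -- $1                       # split the string into $1 $2 $3
    in_use=$(( $1 - $2 ))
    echo "in_use=$in_use max=$3 pct=$(( in_use * 100 / $3 ))"
}

# Sample values from the output above.
file_nr_summary "197600 0 3624009"    # in_use=197600 max=3624009 pct=5

# Live values, if the file is readable on this system.
if [ -r /proc/sys/fs/file-nr ]; then
    file_nr_summary "$(cat /proc/sys/fs/file-nr)"
fi
```

A percentage that keeps climbing toward 100 is the signal to raise fs.file-max or to look for a descriptor leak.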

To see how many file descriptors are being used by a process you can use one of the following:

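`[root@host1 ~]# lsof -a -p 28290`

`[root@host1 ~]# ls /proc/28290/fd | wc -l`

Here 28290 is the process ID. Use plain `ls` rather than `ls -l` when counting, since the `total` line printed by `ls -l` adds one to the result.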