The conn.Bind UDP sockets' send and receive buffers are now being sized
to 7MB, whereas they were previously inheriting the system defaults.
The system defaults are considerably small and can result in dropped
packets on high speed links. By increasing the size of these buffers we
are able to achieve higher throughput in the aforementioned case.
The iperf3 results below demonstrate the effect of this commit between
two Linux computers with 32-core Xeon Platinum CPUs @ 2.9Ghz. There is
roughly ~125us of round trip latency between them.
The first result is from commit 792b49c which uses the system defaults,
e.g. net.core.{r,w}mem_max = 212992. The TCP retransmits are correlated
with buffer full drops on both sides.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 4.74 GBytes 4.08 Gbits/sec 2742 285 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 4.74 GBytes 4.08 Gbits/sec 2742 sender
[ 5] 0.00-10.04 sec 4.74 GBytes 4.06 Gbits/sec receiver
The second result is after increasing SO_{SND,RCV}BUF to 7MB, i.e.
applying this commit.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 6.14 GBytes 5.27 Gbits/sec 0 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 6.14 GBytes 5.27 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 6.14 GBytes 5.25 Gbits/sec receiver
The specific value of 7MB is chosen as it is the max supported by a
default configuration of macOS. A value greater than 7MB may further
benefit throughput for environments with higher network latency and
lower CPU clocks, but will also increase latency under load
(bufferbloat). Some platforms will silently clamp the value to other
maximums. On Linux, we use SO_{SND,RCV}BUFFORCE in case 7MB is beyond
net.core.{r,w}mem_max.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Implement TCP offloading via TSO and GRO for the Linux tun.Device, which
is made possible by virtio extensions in the kernel's TUN driver.
Delete conn.LinuxSocketEndpoint in favor of a collapsed conn.StdNetBind.
conn.StdNetBind makes use of recvmmsg() and sendmmsg() on Linux. All
platforms now fall under conn.StdNetBind, except for Windows, which
remains in conn.WinRingBind, which still needs to be adjusted to handle
multiple packets.
Also refactor sticky sockets support to eventually be applicable on
platforms other than just Linux. However Linux remains the sole platform
that fully implements it for now.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Accept packet vectors for reading and writing in the tun.Device and
conn.Bind interfaces, so that the internal plumbing between these
interfaces now passes a vector of packets. Vectors move untouched
between these interfaces, i.e. if 128 packets are received from
conn.Bind.Read(), 128 packets are passed to tun.Device.Write(). There is
no internal buffering.
Currently, existing implementations are only adjusted to have vectors
of length one. Subsequent patches will improve that.
Also, as a related fixup, use the unix and windows packages rather than
the syscall package when possible.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This does bind_std only; other platforms remain.
The remaining alloc per iteration in the Throughput benchmark
comes from the tuntest package, and should not appear in regular use.
name old time/op new time/op delta
Latency-10 25.2µs ± 1% 25.0µs ± 0% -0.58% (p=0.006 n=10+10)
Throughput-10 2.44µs ± 3% 2.41µs ± 2% ~ (p=0.140 n=10+8)
name old alloc/op new alloc/op delta
Latency-10 854B ± 5% 741B ± 3% -13.22% (p=0.000 n=10+10)
Throughput-10 265B ±34% 267B ±39% ~ (p=0.670 n=10+10)
name old allocs/op new allocs/op delta
Latency-10 16.0 ± 0% 14.0 ± 0% -12.50% (p=0.000 n=10+10)
Throughput-10 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.000 n=10+10)
name old packet-loss new packet-loss delta
Throughput-10 0.01 ±82% 0.01 ±282% ~ (p=0.321 n=9+8)
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There are more places where we'll need to add it later, when Go 1.18
comes out with support for it in the "net" package. Also, allowedips
still uses slices internally, which might be suboptimal.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The -1 protection was removed and the wrong error was returned, causing
us to read from a bogus fd. As well, remove the useless closures that
aren't doing anything, since this is all synchronized anyway.
Fixes: 10533c3 ("all: make conn.Bind.Open return a slice of receive functions")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When retrying, if count is not 0, we forget to dequeue another request,
and so the ring fills up and errors out.
Reported-by: Sascha Dierberg <dierberg@dresearch-fe.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
If we receive a large UDP packet, don't return an error to receive.go,
which then terminates the receive loop. Instead, simply retry.
Considering Winsock's general finickiness, we might consider other
places where an attacker on the wire can generate error conditions like
this.
Reported-by: Sascha Dierberg <sascha.dierberg@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
By not comparing these with the modulo, the ring became nearly never
full, resulting in completion queue buffers filling up prematurely.
Reported-by: Joshua Sjoding <joshua.sjoding@scjalliance.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Instead of hard-coding exactly two sources from which
to receive packets (an IPv4 source and an IPv6 source),
allow the conn.Bind to specify a set of sources.
Beneficial consequences:
* If there's no IPv6 support on a system,
conn.Bind.Open can choose not to return a receive function for it,
which is simpler than tracking that state in the bind.
This simplification removes existing data races from both
conn.StdNetBind and bindtest.ChannelBind.
* If there are more than two sources on a system,
the conn.Bind no longer needs to add a separate muxing layer.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
This makes it clearer that they are fresh on each attempt,
and avoids the bookkeeping required to clearing them on failure.
Also, remove an unnecessary err != nil.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
Until we depend on Go 1.16 (which isn't released yet), alias our own
variable to the private member of the net package. This will allow an
easy find replace to make this go away when we eventually switch to
1.16.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Some users report seeing lines like:
> Routine: receive incoming IPv4 - stopped
Popping up unexpectedly. Let's sleep and try again before failing, and
also log the error, and perhaps we'll eventually understand this
situation better in future versions.
Because we have to distinguish between the socket being closed
explicitly and whatever error this is, we bump the module to require Go
1.16.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
If Close is called after ReceiveIPvX, then ReceiveIPvX will block on an
invalid or potentially reused fd.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The sticky socket code stays in the device package for now,
as it reaches deeply into the peer list.
This is the first step in an effort to split some code out of
the very busy device package.
Signed-off-by: David Crawshaw <crawshaw@tailscale.com>