TCP currently neglects to allow sufficient space for its own headers
when allocating I/O buffers. This problem is masked by the fact that
the maximum link-layer header size (802.11) is substantially larger
than the common Ethernet link-layer header.
Fix by allowing sufficient space for any TCP headers, as well as the
network-layer and link-layer headers.
Reported-by: Scott K Logan <logans@cottsay.net>
Debugged-by: Scott K Logan <logans@cottsay.net>
Tested-by: Scott K Logan <logans@cottsay.net>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Update ts_recent whenever window is advanced
Commit 3f442d3 ("[tcp] Record ts_recent on first received packet")
failed to achieve its stated intention.
Fix this (and reduce the code size) by moving the ts_recent update to
tcp_rx_seq(). This is the code responsible for advancing the window,
called by both tcp_rx_syn() and tcp_rx_data(), and so the window check
is now redundant.
Reported-by: Frank Weed <zorbustheknight@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 6861304 ("[tcp] Handle out-of-order received packets")
introduced a regression in which ts_recent would not be updated until
the first packet is received in the ESTABLISHED state, i.e. the
timestamp from the SYN+ACK packet would be ignored. This causes the
connection to be dropped by strictly-conforming TCP peers, such as
FreeBSD.
Fix by delaying the timestamp window check until after processing the
received SYN flag.
Reported-by: winders@sonnet.com
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There are several points in the iPXE codebase where
list_for_each_entry() is (ab)used to extract only the first entry from
a list. Add a macro list_first_entry() to make this code easier to
read.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[retry] Hold reference while timer is running and during expiry callback
Guarantee that a retry timer cannot go out of scope while the timer is
running, and provide a guarantee to the expiry callback that the timer
will remain in scope during the entire callback (similar to the
guarantee provided to interface methods).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Allow out-of-order receive queue to be discarded
Allow packets in the receive queue to be discarded in order to free up
memory. This avoids a potential deadlock condition in which the
missing packet can never be received because the receive queue is
occupying all of the memory available for further RX buffers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Maintain a queue of received packets, so that lost packets need not
result in retransmission of the entire TCP window.
Increase the TCP window to 8kB, in order that we can potentially
transmit enough duplicate ACKs to trigger Fast Retransmission at the
sender.
Using a 10MB HTTP download in qemu-kvm with an artificial drop rate of
1 in 64 packets, this reduces the download time from around 26s to
around 4s.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Treat ACKs as sent only when successfully transmitted
iPXE currently forces sending (i.e. sends a pure ACK even in the
absence of fresh data to send) only in response to packets that
consume sequence space or that lie outside of the receive window.
This ignores the possibility that a previous ACK was not actually sent
(due to, for example, the retransmission timer running).
This does not cause incorrect behaviour, but does cause unnecessary
retransmissions from our peer. For example:
1. Peer sends final data packet (ack 106 seq 521..523)
2. We send FIN (seq 106..107 ack 523)
3. Peer sends FIN (ack 106 seq 523..524)
4. We send nothing since retransmission timer is running for our FIN
5. Peer ACKs our FIN (ack 107 seq 524..524)
6. We send nothing since this packet consumes no sequence space
7. Peer retransmits FIN (ack 107 seq 523..524)
8. We ACK peer's FIN (seq 107..107 ack 524)
What should happen at step (6) is that we should ACK the peer's FIN,
since we can deduce that we have never sent this ACK.
Fix by maintaining an "ACK pending" flag that is set whenever we are
made aware that our peer needs an ACK (whether by consuming sequence
space or by sending a packet that appears out of order), and is
cleared only when the ACK packet has been transmitted.
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Use a dedicated timer for the TIME_WAIT state
iPXE currently repurposes the retransmission timer to hold the TCP
connection in the TIME_WAIT state (i.e. waiting for up to 2*MSL in
case we are required to re-ACK our peer's FIN due to a lost ACK).
However, the fact that this timer is running will prevent such an ACK
from ever being sent, since the logic in tcp_xmit() assumes that a
running timer indicates that we ourselves are waiting for an ACK and
so blocks the transmission. (We always wait for an ACK before sending
our next packet, to keep our transmit data path as simple as
possible.)
Fix by using an entirely separate timer for the TIME_WAIT state, so
that packets can still be sent.
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Every other scalar integer value in struct tcp_connection is in host
byte order; change the definition of local_port to match.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[interface] Convert all data-xfer interfaces to generic interfaces
Remove data-xfer as an interface type, and replace data-xfer
interfaces with generic interfaces supporting the data-xfer methods.
Filter interfaces (as used by the TLS layer) are handled using the
generic pass-through interface capability. A side-effect of this is
that deliver_raw() no longer exists as a data-xfer method. (In
practice this doesn't lose any efficiency, since there are no
instances within the current codebase where xfer_deliver_raw() is used
to pass data to an interface supporting the deliver_raw() method.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Standardise on using timer_init() to initialise an embedded retry
timer, to match the coding style used by other embedded objects.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Standardise on using ref_init() to initialise an embedded reference
count, to match the coding style used by other embedded objects.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Update received sequence number before delivering received data
iPXE currently updates the TCP sequence number after delivering the
data to the application via xfer_deliver_iob(). If the application
responds to the received data by transmitting more data, this would
result in a stale ACK number appearing in the transmitted packet,
which potentially causes retransmissions and also gives the
undesirable appearance of violating causality (by sending a response
to a message that we claim not to have yet received).
Reported-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Access to the gpxe.org and etherboot.org domains and associated
resources has been revoked by the registrant of the domain. Work
around this problem by renaming project from gPXE to iPXE, and
updating URLs to match.
Also update README, LOG and COPYRIGHTS to remove obsolete information.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Avoid printf format warnings on some compilers
In several places, we currently use size_t to represent a difference
between TCP sequence numbers. This can cause compiler warnings
relating to printf format specifiers, since the result of
(uint32_t+size_t) may be an unsigned long on some compilers.
Fix by using uint32_t for all variables that represent a difference
between TCP sequence numbers.
Tested-by: Joshua Oreman <oremanj@xenon.get-linux.org>
[tcp] Avoid rewinding sequence numbers on receiving old duplicate ACKs
Commit 558c1a4 ("[tcp] Improve robustness in the presence of duplicated
received packets") introduced a regression in that an old duplicate
ACK received while in the ESTABLISHED state would pass through normal
ACK processing, including updating tcp->snd_seq.
Fix by ensuring that ACK processing ignores all duplicate ACKs.
[tcp] Attempt to catch all possible error cases with debug messages
All TCP errors or unusual events should now generate a debugging
message at DBGLVL_LOG, with enough information (SEQ and ACK numbers)
to be able to identify the corresponding packet (or missing packet) in
a network trace from the remote end.
[tcp] Move high-frequency debug messages to DBGLVL_EXTRA
This makes it possible to leave TCP debugging enabled in order to see
interesting TCP events, without flooding the console with at least one
message per packet.
[tcp] Improve robustness in the presence of duplicated received packets
gPXE responds to duplicated ACKs with an immediate retransmission,
which can lead to a sorceror's apprentice syndrome. It also responds
to out-of-range (or old duplicate) ACKs with a RST, which can cause
valid connections to be dropped.
Fix the sorceror's apprentice syndrome by leaving the retransmission
timer running (and so inhibiting the immediate retransmission) when we
receive a potential duplicate ACK. This seems to match the behaviour
of Linux observed via wireshark traces.
Fix the RST issue by sending RST only on out-of-range ACKs that occur
before the connection is fully established, as per RFC 793.
These problems were exposed during development of the 802.11 wireless
link layer; the 802.11 protocol has a failure mode that can easily
cause duplicated packets. The fixes were tested in a controlled way
by faking large numbers of duplicated packets in the rtl8139 driver.
Originally-fixed-by: Joshua Oreman <oremanj@rwcr.net>
Some firewall devices seem to regard SYN,PSH as an invalid flag
combination and reject the packet. Fix by setting PSH only if SYN is
not set.
Reported-by: DSE Incorporated <dseinc@gmail.com>
[xfer] Make consistent assumptions that xfer metadata can never be NULL
The documentation in xfer.h and xfer.c does not say that the metadata
parameter is optional in calls such as xfer_deliver_iob_meta() and the
deliver_iob() method. However, some code in net/ is prepared to
accept a NULL pointer, and xfer_deliver_as_iob() passes a NULL pointer
directly to the deliver_iob() method.
Fix this mess of conflicting assumptions by making everything assume
that the metadata parameter is mandatory, and fixing
xfer_deliver_as_iob() to pass in a dummy metadata structure (as is
already done in xfer_deliver_iob()).
Apparently this can cause a major speedup on some iSCSI targets, which
will otherwise wait for a timer to expire before responding. It
doesn't seem to hurt other simple TCP test cases (e.g. HTTP
downloads).
Problem and solution identified by Shiva Shankar <802.11e@gmail.com>
[tcpip] Allow for transmission to multicast IPv4 addresses
When sending to a multicast address, it may be necessary to specify
the source address explicitly, since the multicast destination address
does not provide enough information to deduce the source address via
the miniroute table.
Allow the source address specified via the data-xfer metadata to be
passed down through the TCP/IP stack to the IPv4 layer, which can use
it as a default source address.
[i386] Change [u]int32_t to [unsigned] int, rather than [unsigned] long
This brings us in to line with Linux definitions, and also simplifies
adding x86_64 support since both platforms have 2-byte shorts, 4-byte
ints and 8-byte long longs.
Maintain state for the advertised window length, and only ever increase
it (instead of calculating it afresh on each transmit). This avoids
triggering "treason uncloaked" messages on Linux peers.
Respond to zero-length TCP keepalives (i.e. empty data packets
transmitted outside the window). Even if the peer wouldn't otherwise
expect an ACK (because its packet consumed no sequence space), force an
ACK if it was outside the window.
We don't yet generate TCP keepalives. It could be done, but it's unclear
what benefit this would have. (Linux, for example, doesn't start sending
keepalives until the connection has been idle for two hours.)