[ifmgmt] Do not sleep CPU while configuring network devices
iPXE currently calls cpu_nap() while performing DHCP, in order to
reduce CPU utilisation on virtual machines. Under mild broadcast load
(~100 packets per second), this can cause received packets to be
dropped because the receive descriptor ring is overrun before the next
18Hz timer interrupt wakes up the CPU. The result is that DHCP is
likely to intermittently fail on networks with appreciable amounts of
broadcast (or multicast) traffic.
This behaviour was introduced in the series of commits which
generalised the "dhcp" command to the "ifconf" command. The earlier
code (which did not handle IPv6 configuration) had no call to
cpu_nap() and so did not suffer from this problem.
Fix by removing the call to cpu_nap() in ifpoller_progress(). This
has the undesirable side effect that CPU utilisation will remain at
100% while waiting for DHCP to complete (which can take several
seconds, if we have to wait around for potential ProxyDHCP offers to
arrive).
Reported-by: Alex Davies <adavies@jumptrading.com>
Reported-by: Christoffer Stokbæk <christoffers@easyspeedy.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Allow for the PIC interrupt vector offset to be changed
Some external code (observed with FreeBSD's bootloader) will continue
to make INT 13 calls after reconfiguring the 8259 PIC to change the
vector offsets for IRQs. If an IRQ (e.g. the timer IRQ) subsequently
occurs while iPXE is in protected mode, this will cause a general
protection fault since the corresponding IDT entry is empty.
A general protection fault is INT 0x0d, which happens to overlap with
the original IRQ5. We therefore do have an ISR set up to handle a
general protection fault, but this ISR simply reflects the interrupt
down to the real-mode INT 0x0d and then attempts to return. Since our
ISR is expecting a hardware interrupt rather than a general protection
fault, it doesn't remove the error code from the stack before issuing
the iret instruction; it therefore attempts to return to a garbage
address. Since the segment part of this address is likely to be
invalid, a second general protection fault occurs. This cycle
continues until we run out of stack space and triple fault.
Fix by reflecting all INTs down to real mode. This actually reduces
the code size by four bytes (but increases the bss size by almost
2kB).
Reported-by: Brian Rak <dn@devicenull.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[ipv6] Avoid potentially copying from a NULL pointer in ipv6_tx()
If ipv6_tx() is called with a non-NULL network device, a NULL or
unspecified source address, and a destination address which does not
match any routing table entry, then it will attempt to copy the source
address from a NULL pointer.
I don't think that there is currently any code path which could
trigger this behaviour, but we should probably ensure that it can
never happen.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[ipv6] Include network device when transcribing multicast addresses
Destination multicast addresses require a sin6_scope_id, which should
therefore be transcribed to a network device name by ipv6_sock_ntoa().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The transmitting network device is specified via the destination
address, not the source address. There is no reason to set
sin6_scope_id on the source address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[dhcpv6] Do not set sin6_scope_id on the unspecified client socket address
Setting sin6_scope_id to a non-zero value will cause the check against
the "empty socket address" in udp_demux() to fail, and incoming DHCPv6
responses on interfaces other than net0 will be rejected with a
spurious "No UDP connection listening on port 546" error.
The transmitting network device is specified via the destination
address, not the source address. Fix by simply not setting
sin6_scope_id on the client socket address.
Reported-by: Anton D. Kachalov <mouse@yandex-team.ru>
Tested-by: Anton D. Kachalov <mouse@yandex-team.ru>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Fix an erroneous htonl() in the definition of IN6_IS_ADDR_LINKLOCAL(),
and add self-tests for the IN6_IS_ADDR_xxx() family of macros.
Reported-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[efi] Do not try to fetch loaded image device path protocol
Some UEFI systems (observed with a Mac Pro) do not provide a loaded
image device path protocol. We don't currently use the loaded image
device path protocol for anything beyond printing a debug message, so
simply remove the code which attempts to fetch it.
Reported-by: Matt Woodward <pxematt@woodwardcc.com>
Tested-by: Matt Woodward <pxematt@woodwardcc.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some UEFI systems (observed with a Mac Pro) do not provide
EFI_HII_DATABASE_PROTOCOL. We can continue to function without
providing access to network device settings via HII, so make this
protocol optional and fall back to simply not providing any HII
protocols.
Reported-by: Matt Woodward <pxematt@woodwardcc.com>
Tested-by: Matt Woodward <pxematt@woodwardcc.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[efi] Make EFI_DEVICE_PATH_TO_TEXT_PROTOCOL optional
Some UEFI systems (observed with a Mac Pro) do not provide
EFI_DEVICE_PATH_TO_TEXT_PROTOCOL. Since we use this protocol only for
debug messages, make it optional and fall back to printing the raw
device path bytes.
Reported-by: Matt Woodward <pxematt@woodwardcc.com>
Tested-by: Matt Woodward <pxematt@woodwardcc.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Get the NFS URI manipulation code out of nfs_open.c. The resulting
code is now much more readable.
Signed-off-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[libc] Prevent strndup() from reading beyond the end of the string
strndup() may be called on a string which is not NUL-terminated. Use
strnlen() instead of strlen() to ensure that we do not read beyond the
end of such a string.
Add self-tests for strndup(), including a test case with an
unterminated string.
Originally-fixed-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid generating syntactically invalid log messages by ensuring that
invalid characters are not present in the hostname. In particular,
ensure that any whitespace is stripped, since whitespace functions as
a field separator for syslog messages.
Reported-by: Alex Davies <adavies@jumptrading.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As of commit d28bb51 ("[tcp] Defer sending ACKs until all received
packets have been processed"), increasing the RX ring size will
increase the number of received packets per transmitted ACK (since
each poll will process up to one complete receive ring). Under KVM,
this can make a substantial (up to ~200%) difference to the overall
download speed, since transmissions are very expensive.
Increase the ring fill level from four to eight packets: this
increases the download speed by around 50% at a cost of around 8kB of
heap space. Further speedups are possible by increasing the ring size
further, but it would be preferable to find alternative methods which
do not use noticeable amounts of heap space.
Tested-by: Robin Smidsrød <robin@smidsrod.no>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[nfs] Fix an invalid free() when loading a regular (non-symlink) file
An invalid free() was ironically introduced by fixing another invalid
free in commit 7aa69c4 ("[nfs] Fix an invalid free() when loading a
symlink").
Signed-off-by: Marin Hannache <git@mareo.fr>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[lkrnprefix] Make real-mode setup code relocatable
The bzImage boot protocol allows the real-mode code to be loaded at
any segment within base memory. (The fact that both iPXE and recent
versions of Syslinux will load the real-mode code at 1000:0000 is a
coincidence; it is not guaranteed by the specification.)
Fix by making the code relocatable.
Reported-by: Andrew Stuart <andrew@shopcusa.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rework geniso and genliso to provide a single merged utility for
generating ISO images.
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The .lkrn prefix currently provides a zImage kernel with unused setup
sectors and the whole iPXE binary placed within the "protected mode
kernel" portion of the zImage.
The work carried out years ago to create the .mrom format provides a
mechanism allowing the iPXE binary to be split into a small real-mode
header and a larger payload. This neatly matches the way that a
bzImage is loaded: the "setup sectors" can contain the header and the
"protected mode kernel" can contain the payload.
This removes the size restrictions on an iPXE .lkrn image (and hence
on derived image formats such as .iso).
Also remove obsolete copyright information, since none of the original
code or functionality now remains.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[tcp] Defer sending ACKs until all received packets have been processed
When running inside a virtual machine (or when using the UNDI driver),
transmitting packets can be expensive. When we receive several
packets in one poll (e.g. because a slow BIOS timer interrupt routine
has caused us to fall behind in processing), we can safely send just a
single ACK to cover all of the received packets. This reduces the
time spent transmitting and allows us to clear the backlog much
faster.
Various RFCs (starting with RFC1122) state that there should be an ACK
for at least every second segment. We choose not to enforce this
rule. Under normal operation each poll should find at most one
received packet, and we will then not delay any ACKs. We delay
(i.e. omit) ACKs only when under sufficiently heavy load that we are
finding multiple packets per poll; under these conditions it is
important to clear the backlog quickly since any delay may lead to
dropped packets.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 8540300 ("[build] Disable ccache for all relevant build
targets") attempted to generalise the rule for $(BIN)/version.o to
$(BIN)/version.% in order to apply the dependency to all relevant
build targets (debug objects, assembly listings, etc).
This generalisation appears to work for the ccache override
directives, but seems to cause make (at least, GNU make 4.0) to simply
ignore the dependency upon the git index.
Since version.c contains only some string constants, there is unlikely
to be a substantive need for its debug objects, assembly listings,
etc. Restore the previous form of the dependency and accept that
hypothetical builds with e.g. DEBUG=version will not be handled
correctly.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[intel] Exclude time spent in hypervisor from profiling
When profiling, exclude any time spent inside the hypervisor
responding to our MMIO accesses. This substantially reduces the
variance accumulated on many other profilers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[profile] Allow interrupts to be excluded from profiling results
Interrupt processing adds noise to profiling results. Allow
interrupts (from within protected mode) to be profiled separately,
with time spent within the interrupt handler being excluded from any
other profiling currently in progress.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[undi] Place an upper limit on the number of PXENV_UNDI_ISR calls per poll
PXENV_UNDI_ISR calls may implicitly refill the underlying receive
ring, and so could continue to retrieve packets indefinitely. Place
an upper limit on the number of calls to PXENV_UNDI_ISR per call to
undinet_poll().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When making a call from real mode to protected mode, we save and
restore the global and interrupt descriptor table registers. The
restore currently takes place after returning to real mode, which
generates two EXCEPTION_NMIs and corresponding VM exits when running
under KVM on an Intel CPU.
Avoid the VM exits by restoring the descriptor table registers inside
prot_to_real, while still running in protected mode.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Speed up real-to-protected mode transition under KVM
Ensure that all segment registers have zero in the low two bits before
transitioning to protected mode. This allows the CPU state to
immediately be deemed to be "valid", and eliminates the need for any
further emulated instructions.
Load the protected-mode interrupt descriptor table after switching to
protected mode, since this avoids triggering an EXCEPTION_NMI and
corresponding VM exit.
This reduces the time taken by real_to_prot under KVM by around 50%.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Speed up protected-to-real mode transition under KVM
On an Intel CPU supporting VMX, KVM will emulate instructions while
the CPU state remains "invalid". In real mode, the CPU state is
defined to be "invalid" if any segment register has a base which is
not equal to (sreg<<4) or a limit which is not equal to 64kB.
We don't actually use the base stored in the REAL_DS descriptor for
any significant purpose. Change the base stored in this descriptor to
be equal to (REAL_DS<<4). A segment register loaded with REAL_DS is
then automatically valid in both real and protected modes. This
allows KVM to stop emulating instructions much sooner.
The only use of REAL_DS for memory accesses currently occurs in the
indirect ljmp within prot_to_real. Change this to a direct ljmp,
storing rm_cs in .text16 as part of the ljmp instruction. This
removes the only memory access via REAL_DS (thereby allowing for the
above descriptor base address hack), and also simplifies the ljmp
instruction (which will still have to be emulated).
Load the real-mode interrupt descriptor table register before
switching to real mode, since this avoids triggering an EXCEPTION_NMI
and corresponding VM exit.
This reduces the time taken by prot_to_real under KVM by around 65%.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The mode-transition code involves paths which switch back and forth
between the .text and .text16 sections. At present, only the start of
each function is labelled, which makes it difficult to decode
addresses within the parts of the function existing in a different
section.
Add explicit labels at the start of each section change, so that
addresses can be meaningfully decoded to the nearest label.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[pcbios] Do not switch to real mode to sleep the CPU
Now that we can handle interrupts while in protected mode, there is no
need to switch to real mode just to halt the CPU.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[pcbios] Do not switch to real mode to check for timer interrupt
The currticks() function is called at least once per TCP packet, and
so is performance-critical. Switching to real mode just to allow the
timer interrupt to fire is expensive when running inside a virtual
machine, and imposes a significant performance cost.
Fix by enabling interrupts without switching to real mode. This
results in an approximately 100% increase in download speed when
running under KVM.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We now have the ability to handle interrupts while in protected mode,
and so no longer need to set up a dedicated interrupt descriptor table
while running COM32 executables.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When running in a virtual machine, switching to real mode may be
expensive. Allow interrupts to be enabled while in protected mode and
reflected down to the real-mode interrupt handlers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for an explicit debug level of zero, which will enable
assertions and profiling (i.e. anything controlled by NDEBUG) without
generating any debug messages.
Signed-off-by: Michael Brown <mcb30@ipxe.org>