[librm] Preserve FPU, MMX and SSE state across calls to virt_call()
The IBM Tivoli Provisioning Manager for OS Deployment (also known as
TPMfOSD, Rembo-ia32, or Rembo Auto-Deploy) has a serious bug in some
older versions (observed with v5.1.1.0, apparently fixed by v7.1.1.0)
which can lead to arbitrary data corruption.
As mentioned in commit 87723a0 ("[libflat] Test A20 gate without
switching to flat real mode"), Tivoli's NBP sets up a VMM and makes
calls to the PXE stack in VM86 mode. This appears to be some kind of
attempt to run PXE API calls inside a sandbox. The VMM is fairly
sophisticated: for example, it handles our attempts to switch into
protected mode and patches our GDT so that our protected-mode code
runs in ring 1 instead of ring 0. However, it neglects to apply any
memory protections. In particular, it does not enable paging and
leaves us with 4GB segment limits. We can therefore trivially break
out of the sandbox by simply overwriting the GDT (or by modifying any
of Tivoli's VMM code or data structures).
When we attempt to execute privileged instructions (such as "lidt"),
the CPU raises an exception and control is passed to the Tivoli VMM.
This may result in a call to Tivoli's memcpy() function.
Tivoli's memcpy() function includes optimisations which use the SSE
registers %xmm0-%xmm3 to speed up aligned memory copies.
Unfortunately, the Tivoli VMM's exception handler does not save or
restore %xmm0-%xmm3. The net effect of this bug in the Tivoli VMM is
that any privileged instruction (such as "lidt") issued by iPXE may
result in unexpected corruption of the %xmm0-%xmm3 registers.
Even more unfortunately, this problem affects the code path taken in
response to a hardware interrupt from the NIC, since that code path
will call PXENV_UNDI_ISR. The net effect therefore becomes that any
NIC hardware interrupt (e.g. due to a received packet) may result in
unexpected corruption of the %xmm0-%xmm3 registers.
If a packet arrives while Tivoli is in the middle of using its
memcpy() function, then the unexpected corruption of the %xmm0-%xmm3
registers will result in unexpected corruption in the destination
buffer. The net effect therefore becomes that any received packet may
result in a 16-byte block of corruption somewhere in any data that
Tivoli copied using its memcpy() function.
We can work around this bug in the Tivoli VMM by saving and restoring
the %xmm0-%xmm3 registers across calls to virt_call(). To work around
the problem, we need to save registers before attempting to execute
any privileged instructions, and ensure that we attempt no further
privileged instructions after restoring the registers.
This is less simple than it may sound. We can use the "movups"
instruction to save and restore individual registers, but this will
itself generate an undefined opcode exception if SSE is not currently
enabled according to the flags in %cr0 and %cr4. We can't access %cr0
or %cr4 before attempting the "movups" instruction, because access a
control register is itself a privileged instruction (which may
therefore trigger corruption of the registers that we're trying to
save).
The best solution seems to be to use the "fxsave" and "fxrstor"
instructions. If SSE is not enabled, then these instructions may fail
to save and restore the SSE register contents, but will not generate
an undefined opcode exception. (If SSE is not enabled, then we don't
really care about preserving the SSE register contents anyway.)
The use of "fxsave" and "fxrstor" introduces an implicit assumption
that the CPU supports SSE instructions (even though we make no
assumption about whether or not SSE is currently enabled). SSE was
introduced in 1999 with the Pentium III (and added by AMD in 2001),
and is an architectural requirement for x86_64. Experimentation with
current versions of gcc suggest that it may generate SSE instructions
even when using "-m32", unless an explicit "-march=i386" or "-mno-sse"
is used to inhibit this. It therefore seems reasonable to assume that
SSE will be supported on any hardware that might realistically be used
with new iPXE builds.
As a side benefit of this change, the MMX register %mm0 will now be
preserved across virt_call() even in an i386 build of iPXE using a
driver that requires readq()/writeq(), and the SSE registers
%xmm0-%xmm5 will now be preserved across virt_call() even in an x86_64
build of iPXE using the Hyper-V netvsc driver.
Experimentation suggests that this change adds around 10% to the
number of cycles required for a do-nothing virt_call(), most of which
are due to the extra bytes copied using "rep movsb". Since the number
of bytes copied is a compile-time constant local to librm.S, we could
potentially reduce this impact by ensuring that we always copy a whole
number of dwords and so can use "rep movsl" instead of "rep movsb".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 86f96a4 ("[tg3] Remove x86-specific inline assembly")
introduced a regression in _tg3_flag() in 64-bit builds, since any
flags in the upper 32 bits of a 64-bit unsigned long would be
discarded when truncating to a 32-bit int.
Debugged-by: Shane Thompson <shane.thompson@aeontech.com.au>
Tested-by: Shane Thompson <shane.thompson@aeontech.com.au>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Reduce real-mode stack consumption in virt_call()
Some PXE NBPs are known to make PXE API calls with very little space
available on the real-mode stack. For example, the Rembo-ia32 NBP
from some versions of IBM's Tivoli Provisioning Manager for Operating
System Deployment (TPMfOSD) will issue calls with the real-mode stack
placed at 0000:03d2; this is at the end of the interrupt vector table
and leaves only 498 bytes of stack space available before overwriting
the hardware IRQ vectors. This limits the amount of state that we can
preserve before transitioning to protected mode.
Work around these challenging conditions by preserving everything
other than the initial register dump in a temporary static buffer
within our real-mode data segment, and copying the contents of this
buffer to the protected-mode stack.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[image] Skip misleading "format not recognised" error message
Return success (rather than failure) after an image format has been
correctly identified.
This has no practical effect, since the return value from
image_probe() is deliberately never used, but avoids a somewhat
surprising and misleading "format not recognised" error message when
debugging is enabled.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[libgcc] Provide symbol to handle gcc's implicit calls to memset()
On some architectures (such as ARM), gcc will insert implicit calls to
memset(). Handle these using the same mechanism as for the implicit
calls to memcpy() used by x86.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[ethernet] Make LACP support configurable at build time
Add a build configuration option NET_PROTO_LACP to control whether or
not LACP support is included for Ethernet devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This commit makes virtio-net support devices with VEN 0x1af4 and DEV
0x1041, which is how non-transitional (modern-only) virtio-net devices
are exposed on the PCI bus.
Transitional devices supporting both the old 0.9.5 and new 1.0 version
of the virtio spec are driven using the new protocol. Legacy devices
are driven using the old protocol, same as before this commit.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This commit adds support for driving virtio 1.0 PCI devices. In
addition to various helpers, a number of vpm_ functions are introduced
to be used instead of their legacy vp_ counterparts when accessing
virtio 1.0 (aka modern) devices.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[virtio] Add virtio 1.0 constants and data structures
Virtio 1.0 introduces new constants and data structures, common to all
devices as well as specific to virtio-net. This commit adds a subset
of these to be able to drive the virtio-net 1.0 network device.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
PCI devices may support more capabilities of the same type (for
example PCI_CAP_ID_VNDR) and there was no way to discover all of them.
This commit adds a new API pci_find_next_capability which provides
this functionality. It would typically be used like so:
for (pos = pci_find_capability(pci, PCI_CAP_ID_VNDR);
pos > 0;
pos = pci_find_next_capability(pci, pos, PCI_CAP_ID_VNDR)) {
...
}
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The EFI_HII_CONFIG_ACCESS_PROTOCOL's ExtractConfig() method is passed
a request string which includes the parameters being queried plus an
apparently meaningless blob of information (the ConfigHdr), and is
expected to include this same meaningless blob of information in the
results string.
Neither the specification nor the existing EDK2 code (including the
nominal reference implementation in the DriverSampleDxe driver)
provide any reason for the existence of this meaningless blob of
information. It appears to be consumed in its entirety by the
EFI_HII_CONFIG_ROUTING_PROTOCOL, and to contain zero bits of
information by the time it reaches an EFI_HII_CONFIG_ACCESS_PROTOCOL
instance. It would potentially allow for multiple configuration data
sets to be handled by a single EFI_HII_CONFIG_ACCESS_PROTOCOL
instance, in a style alien to the rest of the UEFI specification
(which implicitly assumes that the instance pointer is always
sufficient to uniquely identify the instance).
iPXE currently handles this by simply copying the ConfigHdr from the
request string to the results string, and otherwise ignoring it. This
approach is also used by some code in EDK2, such as OVMF's PlatformDxe
driver.
As of EDK2 commit 8a45f80 ("MdeModulePkg: Make HII configuration
settings available to OS runtime"), this causes an assertion failure
inside EDK2. The failure arises when iPXE is handled a NULL request
string, and responds (as per the specification) with a results string
including all settings. Since there is no meaningless blob to copy
from the request string, there is no corresponding meaningless blob in
the results string. This now causes an assertion failure in
HiiDatabaseDxe's HiiConfigRoutingExportConfig().
The same failure does not affect the OVMF PlatformDxe driver, which
simply passes the request string to the HII BlockToConfig() utility
function. The BlockToConfig() function returns EFI_INVALID_PARAMETER
when passed a null request string, and PlatformDxe propagates this
error directly to the caller.
Fix by matching the behaviour of OVMF's PlatformDxe driver: explicitly
return EFI_INVALID_PARAMETER if the request string is NULL or empty.
This violates the specification (insofar as it is feasible to
determine what the specification actually requires), but causes
correct behaviour with the EDK2 codebase.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[libc] Print "<NULL>" for wide-character NULL strings
The existing code intends to print NULL strings as "<NULL>" (for the
sake of debug messages), but the logic is incorrect when handling
wide-character strings. Fix the logic and add applicable unit tests.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no way for the hardware to give us an invalid length in the
LRH, since it must have parsed this length field in order to perform
header splitting. However, this is difficult to prove conclusively.
Add an unnecessary length check to explicitly reject any packets
larger than the posted receive I/O buffer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no way for the hardware to give us an invalid length in the
LRH, since it must have parsed this length field in order to perform
header splitting. However, this is difficult to prove conclusively.
Add an unnecessary length check to explicitly reject any packets
larger than the posted receive I/O buffer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
It is possible for the preloaded UNDI device to end up with no
specified bus type, since it may not be recognised as either a PCI or
an ISAPnP device. This will result in a bus type value of zero, which
currently results in NULL being treated as a string pointer by
netdev_fetch_bustype().
Fix by returning ENOENT if an unknown bus type is specified.
Reported-by: Todd Stansell <todd@stansell.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[crypto] Allow cross-certificate source to be configured at build time
Provide a build option CROSSCERT in config/crypto.h to allow the
default cross-signed certificate source to be configured at build
time. The ${crosscert} setting may still be used to reconfigure the
cross-signed certificate source at runtime.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some versions of gcc complain that "'__bswap_variable_32' is static
but used in inline function 'golan_check_rc_and_cmd_status' which is
not static".
Fix by making golan_check_rc_and_cmd_status() a static inline.
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[pxe] Implicitly open network device in PXENV_UDP_OPEN
Some end-user configurations have been observed in which the first NBP
(such as GRUB2) uses the UNDI API and then transfers control to a
second NBP (such as pxelinux) which uses the UDP API. The first NBP
closes the network device using PXENV_UNDI_CLOSE, which renders the
UDP API unable to transmit or receive packets.
The correct behaviour under these circumstances is (as often) simply
not documented by the PXE specification. Testing with the Intel PXE
stack suggests that PXENV_UDP_OPEN will implicitly reopen the network
device if necessary, so match this behaviour.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[int13] Allow default drive to be specified via "san-drive" setting
The DHCP option 175.189 has been defined (by us) since 2006 as
containing the drive number to be used for a SAN boot, but has never
been automatically used as such by iPXE.
Use this option (if specified) to override the default SAN drive
number.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[int13] Allow drive to be hooked using the natural drive number
Interpret the maximum drive number (0xff for hard disks, 0x7f for
floppy disks) as meaning "use natural drive number".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[build] Do not use "objcopy -O binary" for objects with relocation records
The mbr.bin and usbdisk.bin standalone blobs are currently generated
using "objcopy -O binary", which does not process relocation records.
For the i386 build, this does not matter since the section start
address is zero and so the ".rel" relocation records are effectively
no-ops anyway.
For the x86_64 build, the ".rela" relocation records are not no-ops,
since the addend is included as part of the relocation record (rather
than inline). Using "objcopy -O binary" will silently discard the
relocation records, with the result that all symbols are effectively
given a value of zero.
Fix by using "ld --oformat binary" instead of "objcopy -O binary" to
generate mbr.bin and usbdisk.bin.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Infiniband specification (volume 1, section 11.4.1.2 "Post Receive
Request") notes that for UD QPs, the GRH will be placed in the first
40 bytes of the receive buffer if present. (If no GRH is present,
which is normal, then the first 40 bytes of the receive buffer will be
unused.)
Mellanox hardware performs this placement automatically: other headers
will be stripped (and their values returned via the CQE), but the
first 40 bytes of the data buffer will be consumed by the (probably
non-existent) GRH.
This does not fit neatly into iPXE's internal abstraction, which
expects the data buffer to represent just the data payload with the
addresses from the GRH (if present) passed as additional parameters to
ib_complete_recv().
The end result of this discrepancy is that attempts to receive
full-sized 2048-byte IPoIB packets on Mellanox hardware will fail.
Fix by allocating a separate ring buffer to hold the received GRHs.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[crypto] Allow trusted certificates to be stored in non-volatile options
The intention of the existing code (as documented in its own comments)
is that it should be possible to override the list of trusted root
certificates using a "trust" setting held in non-volatile stored
options. However, the rootcert_init() function currently executes
before any devices have been probed, and so will not be able to
retrieve any such non-volatile stored options.
Fix by executing rootcert_init() only after devices have been probed.
Since startup functions may be executed multiple times (unlike
initialisation functions), add an explicit flag to preserve the
property that rootcert_init() should run only once.
As before, if an explicit root of trust is specified at build time,
then any runtime "trust" setting will be ignored.
Signed-off-by: Michael Brown <mcb30@ipxe.org>