[librm] Conditionalize the workaround for the Tivoli VMM's SSE garbling
Commit 71560d1 ("[librm] Preserve FPU, MMX and SSE state across calls
to virt_call()") added FXSAVE and FXRSTOR instructions to iPXE. In
KVM virtual machines, these instructions execute fine as long as the
host CPU supports the "unrestricted_guest" feature (that is, it can
virtualize big real mode natively). On older host CPUs however, KVM
has to emulate big real mode, and it currently doesn't implement
FXSAVE emulation.
Upstream QEMU rebuilt iPXE at commit 0418631 ("[thunderx] Fix
compilation with older versions of gcc") which is a descendant of
commit 71560d1 (see above).
This was done in QEMU commit ffdc5a2 ("ipxe: update submodule from
4e03af8ec to 041863191"). The resultant binaries were bundled with
the QEMU v2.7.0 release; see QEMU commit c52125a ("ipxe: update
prebuilt binaries").
This distributed the iPXE workaround for the Tivoli VMM bug to a
number of KVM users with old host CPUs, causing KVM emulation failures
(guest crashes) for them while netbooting.
Make the FXSAVE and FXRSTOR instructions conditional on a new feature
test macro called TIVOLI_VMM_WORKAROUND. Define the macro by default.
There is prior art for an assembly file including config/general.h:
see arch/x86/prefix/romprefix.S. Also, TIVOLI_VMM_WORKAROUND seems to
be a good fit for the "Obscure configuration options" section in
config/general.h.
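As a rough sketch (not the literal librm.S diff; "vc_fxsave_area" is a
hypothetical name for the save area), the guard has this shape:

    /* config/general.h, "Obscure configuration options" section */
    #define TIVOLI_VMM_WORKAROUND   /* Work around the Tivoli VMM bug */

    /* librm.S, which now includes config/general.h like romprefix.S */
    #ifdef TIVOLI_VMM_WORKAROUND
        fxsave  vc_fxsave_area      /* save FPU/MMX/SSE state */
    #endif
        /* ... mode transition and protected-mode call ... */
    #ifdef TIVOLI_VMM_WORKAROUND
        fxrstor vc_fxsave_area      /* restore FPU/MMX/SSE state */
    #endif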
Cc: Bandan Das <bsd@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Greg <rollenwiese@yahoo.com>
Cc: Michael Brown <mcb30@ipxe.org>
Cc: Michael Prokop <launchpad@michael-prokop.at>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Pickford <arch@netremedies.ca>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Ref: https://bugs.archlinux.org/task/50778
Ref: https://bugs.launchpad.net/qemu/+bug/1623276
Ref: https://bugzilla.proxmox.com/show_bug.cgi?id=1182
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1356762
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Preserve FPU, MMX and SSE state across calls to virt_call()
The IBM Tivoli Provisioning Manager for OS Deployment (also known as
TPMfOSD, Rembo-ia32, or Rembo Auto-Deploy) has a serious bug in some
older versions (observed with v5.1.1.0, apparently fixed by v7.1.1.0)
which can lead to arbitrary data corruption.
As mentioned in commit 87723a0 ("[libflat] Test A20 gate without
switching to flat real mode"), Tivoli's NBP sets up a VMM and makes
calls to the PXE stack in VM86 mode. This appears to be some kind of
attempt to run PXE API calls inside a sandbox. The VMM is fairly
sophisticated: for example, it handles our attempts to switch into
protected mode and patches our GDT so that our protected-mode code
runs in ring 1 instead of ring 0. However, it neglects to apply any
memory protections. In particular, it does not enable paging and
leaves us with 4GB segment limits. We can therefore trivially break
out of the sandbox by simply overwriting the GDT (or by modifying any
of Tivoli's VMM code or data structures).
When we attempt to execute privileged instructions (such as "lidt"),
the CPU raises an exception and control is passed to the Tivoli VMM.
This may result in a call to Tivoli's memcpy() function.
Tivoli's memcpy() function includes optimisations which use the SSE
registers %xmm0-%xmm3 to speed up aligned memory copies.
Unfortunately, the Tivoli VMM's exception handler does not save or
restore %xmm0-%xmm3. The net effect of this bug in the Tivoli VMM is
that any privileged instruction (such as "lidt") issued by iPXE may
result in unexpected corruption of the %xmm0-%xmm3 registers.
Even more unfortunately, this problem affects the code path taken in
response to a hardware interrupt from the NIC, since that code path
will call PXENV_UNDI_ISR. The net effect therefore becomes that any
NIC hardware interrupt (e.g. due to a received packet) may result in
unexpected corruption of the %xmm0-%xmm3 registers.
If a packet arrives while Tivoli is in the middle of using its
memcpy() function, then the unexpected corruption of the %xmm0-%xmm3
registers will result in unexpected corruption in the destination
buffer. The net effect therefore becomes that any received packet may
result in a 16-byte block of corruption somewhere in any data that
Tivoli copied using its memcpy() function.
We can work around this bug in the Tivoli VMM by saving and restoring
the %xmm0-%xmm3 registers across calls to virt_call(). To work around
the problem, we need to save registers before attempting to execute
any privileged instructions, and ensure that we attempt no further
privileged instructions after restoring the registers.
This is less simple than it may sound. We can use the "movups"
instruction to save and restore individual registers, but this will
itself generate an undefined opcode exception if SSE is not currently
enabled according to the flags in %cr0 and %cr4. We can't access %cr0
or %cr4 before attempting the "movups" instruction, because accessing a
control register is itself a privileged instruction (which may
therefore trigger corruption of the registers that we're trying to
save).
The best solution seems to be to use the "fxsave" and "fxrstor"
instructions. If SSE is not enabled, then these instructions may fail
to save and restore the SSE register contents, but will not generate
an undefined opcode exception. (If SSE is not enabled, then we don't
really care about preserving the SSE register contents anyway.)
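A minimal ordering sketch, assuming a hypothetical save area named
vc_fxsave_area (the real code keeps this state alongside virt_call()'s
other saved context):

        fxsave  vc_fxsave_area  /* before the first privileged
                                 * instruction; does not fault even
                                 * if SSE is disabled */
        lidt    prot_idtr       /* privileged: may trap into the
                                 * Tivoli VMM and garble %xmm0-%xmm3 */
        /* ... remaining transition, protected-mode call, return ... */
        fxrstor vc_fxsave_area  /* no privileged instructions may
                                 * follow this restore */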
The use of "fxsave" and "fxrstor" introduces an implicit assumption
that the CPU supports SSE instructions (even though we make no
assumption about whether or not SSE is currently enabled). SSE was
introduced in 1999 with the Pentium III (and added by AMD in 2001),
and is an architectural requirement for x86_64. Experimentation with
current versions of gcc suggests that it may generate SSE instructions
even when using "-m32", unless an explicit "-march=i386" or "-mno-sse"
is used to inhibit this. It therefore seems reasonable to assume that
SSE will be supported on any hardware that might realistically be used
with new iPXE builds.
As a side benefit of this change, the MMX register %mm0 will now be
preserved across virt_call() even in an i386 build of iPXE using a
driver that requires readq()/writeq(), and the SSE registers
%xmm0-%xmm5 will now be preserved across virt_call() even in an x86_64
build of iPXE using the Hyper-V netvsc driver.
Experimentation suggests that this change adds around 10% to the
number of cycles required for a do-nothing virt_call(), most of which
are due to the extra bytes copied using "rep movsb". Since the number
of bytes copied is a compile-time constant local to librm.S, we could
potentially reduce this impact by ensuring that we always copy a whole
number of dwords and so can use "rep movsl" instead of "rep movsb".
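That hypothetical optimisation (not implemented by this commit) would
look roughly like the following, with VC_TMP_SIZE standing in for the
compile-time constant, assumed padded to a multiple of four bytes:

        movl    $( VC_TMP_SIZE / 4 ), %ecx      /* count in dwords */
        rep movsl                               /* 4 bytes per
                                                 * iteration, not 1 */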
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Reduce real-mode stack consumption in virt_call()
Some PXE NBPs are known to make PXE API calls with very little space
available on the real-mode stack. For example, the Rembo-ia32 NBP
from some versions of IBM's Tivoli Provisioning Manager for Operating
System Deployment (TPMfOSD) will issue calls with the real-mode stack
placed at 0000:03d2; this is at the end of the interrupt vector table
and leaves only 498 bytes of stack space available before overwriting
the hardware IRQ vectors. This limits the amount of state that we can
preserve before transitioning to protected mode.
Work around these challenging conditions by preserving everything
other than the initial register dump in a temporary static buffer
within our real-mode data segment, and copying the contents of this
buffer to the protected-mode stack.
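In outline, with hypothetical names (the actual buffer layout lives in
librm.S):

        .section ".data16", "aw", @progbits
    vc_tmp_buf:             /* temporary buffer for everything other
                             * than the initial register dump */
        .space  VC_TMP_SIZE

        /* Real mode: save state to vc_tmp_buf, not the tiny stack */
        /* Protected mode: copy the buffer onto the PM stack */
        movl    $VC_TMP_SIZE, %ecx
        rep movsb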
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Do not unconditionally preserve flags across virt_call()
Commit 196f0f2 ("[librm] Convert prot_call() to a real-mode near
call") introduced a regression in which any deliberate modification to
the low 16 bits of the CPU flags (in struct i386_all_regs) would be
overwritten with the original flags value at the time of entry to
prot_call().
The regression arose because the alignment requirements of the
protected-mode stack necessitated the insertion of two bytes of
padding immediately below the prot_call() return address. The
solution chosen was to extend the existing "pushfl / popfl" pair to
"pushfw;pushfl / popfl;popfw". The extra "pushfw / popfw" appears at
first glance to be a no-op, but fails to take into account the fact
that the flags restored by popfl may have been deliberately modified
by the protected-mode function.
Fix by replacing "pushfw / popfw" with "pushw %ss / popw %ss". While
%ss does appear within struct i386_all_regs, any modification to the
stored value has always been ignored by prot_call() anyway.
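In outline:

        /* Before (buggy): popfw reloads the entry-time flags,
         * discarding any deliberate change made to the stored flags
         * in struct i386_all_regs */
        pushfw                  /* two bytes of alignment padding */
        pushfl
        /* ... protected-mode function call ... */
        popfl
        popfw                   /* clobbers the flags just restored */

        /* After (fixed): %ss provides the padding instead */
        pushw   %ss
        pushfl
        /* ... protected-mode function call ... */
        popfl
        popw    %ss             /* stored %ss value is ignored anyway */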
The most visible symptom of this regression was that SAN booting would
fail since every INT 13 call would be chained to the original INT 13
vector.
Reported-by: Vishvananda Ishaya <vishvananda@gmail.com>
Reported-by: Jamie Thompson <forum.ipxe@jamie-thompson.co.uk>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Add support for running in 64-bit long mode
Add support for running the BIOS version of iPXE in 64-bit long mode.
A 64-bit BIOS version of iPXE can be built using e.g.
make bin-x86_64-pcbios/ipxe.usb
make bin-x86_64-pcbios/8086100e.mrom
The 64-bit BIOS version should appear to function identically to the
normal 32-bit BIOS version. The physical memory layout is unaltered:
iPXE is still relocated to the top of the available 32-bit address
space. The code is linked to a virtual address of 0xffffffffeb000000
(in the negative 2GB as required by -mcmodel=kernel), with 4kB pages
created to cover the whole of .textdata. 2MB pages are created to
cover the whole of the 32-bit address space.
The 32-bit portions of the code run with VIRTUAL_CS and VIRTUAL_DS
configured such that truncating a 64-bit virtual address gives a
32-bit virtual address pointing to the same physical location.
The stack pointer remains as a physical address when running in long
mode (although the .stack section is accessible via the negative 2GB
virtual address); this is done in order to simplify the handling of
interrupts occurring while executing a portion of 32-bit code with
flat physical addressing via PHYS_CODE().
Interrupts may be enabled in either 64-bit long mode, 32-bit protected
mode with virtual addresses, 32-bit protected mode with physical
addresses, or 16-bit real mode. Interrupts occurring in any mode
other than real mode will be reflected down to real mode and handled
by whichever ISR is hooked into the BIOS interrupt vector table.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
No callers of prot_to_phys, phys_to_prot, or intr_to_prot require the
flags to be preserved. Remove the unnecessary pushfl/popfl pairs.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Add phys_call() wrapper for calling code with physical addressing
Add a phys_call() wrapper function (analogous to the existing
real_call() wrapper function) for calling code with flat physical
addressing, and use this wrapper within the PHYS_CODE() macro.
Move the relevant functionality inside librm.S, where it more
naturally belongs.
The COMBOOT code currently uses explicit calls to _virt_to_phys and
_phys_to_virt. These will need to be rewritten if our COMBOOT support
is ever generalised to be able to run in a 64-bit build.
Specifically:
- com32_exec_loop() should be restructured to use PHYS_CODE()
- com32_wrapper.S should be restructured to use an equivalent of
prot_call(), passing parameters via a struct i386_all_regs
- there appears to be no need for com32_wrapper.S to switch between
external and internal stacks; this could be omitted to simplify
the design.
For now, librm.S continues to expose _virt_to_phys and _phys_to_virt
for use by com32.c and com32_wrapper.S. Similarly, librm.S continues
to expose _intr_to_virt for use by gdbidt.S.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The bulk of the iPXE binary (the .textdata section) is physically
relocated at runtime to the top of the 32-bit address space in order
to allow space for an OS to be loaded. The relocation is achieved
with the assistance of segmentation: we adjust the code and data
segment bases so that the link-time addresses remain valid.
Segmentation is not available (for normal code and data segments) in
long mode. We choose to compile the C code with -mcmodel=kernel and
use a link-time address of 0xffffffffeb000000. This choice allows us
to identity-map the entirety of the 32-bit address space, and to alias
our chosen link-time address to the physical location of our .textdata
section. (This requires the .textdata section to always be aligned to
a page boundary.)
We simultaneously choose to set the 32-bit virtual address segment
bases such that the link-time addresses may simply be truncated to 32
bits in order to generate a valid 32-bit virtual address. This allows
symbols in .textdata to be trivially accessed by both 32-bit and
64-bit code.
There is no (sensible) way in 32-bit assembly code to generate the
required R_X86_64_32S relocation records for these truncated symbols.
However, subtracting the fixed constant 0xffffffff00000000 has the
same effect as truncation, and can be represented in a standard
R_X86_64_32 relocation record. We define the VIRTUAL() macro to
abstract away this truncation operation, and apply it to all
references by 32-bit (or 16-bit) assembly code to any symbols within
the .textdata section.
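An illustrative definition along these lines (the in-tree macro may
differ in detail):

    /* Subtracting the constant is equivalent to truncating to 32
     * bits, but can be expressed as an R_X86_64_32 relocation */
    #define VIRTUAL( address ) ( (address) - 0xffffffff00000000 )

        /* e.g. a 32-bit code reference to a symbol within .textdata */
        movl    VIRTUAL(virt_offset), %ebp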
We define "virt_offset" for a 64-bit build as "the value to be added
to an address within .textdata in order to obtain its physical
address". With this definition, the low 32 bits of "virt_offset" can
be treated by 32-bit code as functionally equivalent to "virt_offset"
in a 32-bit build.
We define "text16" and "data16" for a 64-bit build as the physical
addresses of the .text16 and .data16 sections. Since a physical
address within the 32-bit address space may be used directly as a
64-bit virtual address (thanks to the identity map), this definition
provides the most natural access to variables in .text16 and .data16.
Note that this requires a minor adjustment in prot_to_real(), which
accesses .text16 using 32-bit virtual addresses.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Transition to protected mode within init_librm()
Long-mode operation will require page tables, which are too large to
sensibly fit in our .data16 segment in base memory.
Add a portion of init_librm() running in 32-bit protected mode to
provide access to high memory. Use this portion of init_librm() to
initialise the .textdata variables "virt_offset", "text16", and
"data16", eliminating the redundant (re)initialisation currently
performed on every mode transition as part of real_to_prot().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Move most arch/i386 files to arch/x86, and adjust the contents of the
Makefiles and the include/bits/*.h headers to reflect the new
locations.
This patch makes no substantive code changes, as can be seen using a
rename-aware diff (e.g. "git show -M5").
This patch does not make the pcbios platform functional for x86_64; it
merely allows it to compile without errors.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[legal] Relicense files under GPL2_OR_LATER_OR_UBDL
These files cannot be automatically relicensed by util/relicense.pl
since they either contain unusual but trivial contributions (such as
the addition of __nonnull function attributes), or contain lines
dating back to the initial git revision (and so require manual
knowledge of the code's origin).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When making a call from real mode to protected mode, we save and
restore the global and interrupt descriptor table registers. The
restore currently takes place after returning to real mode, which
generates two EXCEPTION_NMIs and corresponding VM exits when running
under KVM on an Intel CPU.
Avoid the VM exits by restoring the descriptor table registers inside
prot_to_real, while still running in protected mode.
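Schematically, with illustrative label names:

        /* Still in protected mode within prot_to_real: restore the
         * external caller's descriptor table registers here, rather
         * than after the switch back to real mode */
        data32 lgdt saved_gdtr
        data32 lidt saved_idtr
        /* ... clear CR0.PE and ljmp back to real mode ... */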
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Speed up real-to-protected mode transition under KVM
Ensure that all segment registers have zero in the low two bits before
transitioning to protected mode. This allows the CPU state to
immediately be deemed to be "valid", and eliminates the need for any
further emulated instructions.
Load the protected-mode interrupt descriptor table after switching to
protected mode, since this avoids triggering an EXCEPTION_NMI and
corresponding VM exit.
This reduces the time taken by real_to_prot under KVM by around 50%.
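A sketch of the two changes (the real code must of course keep enough
segment state usable to complete the transition; label names are
illustrative):

        /* Clear the low two bits of the segment registers while
         * still in real mode, so that KVM deems the state "valid" */
        xorw    %ax, %ax
        movw    %ax, %ds
        movw    %ax, %es
        /* ... likewise %fs, %gs; set CR0.PE; ljmp to 32-bit code ... */
        /* Only now load the protected-mode IDT */
        lidt    idtr_prot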
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Speed up protected-to-real mode transition under KVM
On an Intel CPU supporting VMX, KVM will emulate instructions while
the CPU state remains "invalid". In real mode, the CPU state is
defined to be "invalid" if any segment register has a base which is
not equal to (sreg<<4) or a limit which is not equal to 64kB.
We don't actually use the base stored in the REAL_DS descriptor for
any significant purpose. Change the base stored in this descriptor to
be equal to (REAL_DS<<4). A segment register loaded with REAL_DS is
then automatically valid in both real and protected modes. This
allows KVM to stop emulating instructions much sooner.
The only use of REAL_DS for memory accesses currently occurs in the
indirect ljmp within prot_to_real. Change this to a direct ljmp,
storing rm_cs in .text16 as part of the ljmp instruction. This
removes the only memory access via REAL_DS (thereby allowing for the
above descriptor base address hack), and also simplifies the ljmp
instruction (which will still have to be emulated).
Load the real-mode interrupt descriptor table register before
switching to real mode, since this avoids triggering an EXCEPTION_NMI
and corresponding VM exit.
This reduces the time taken by prot_to_real under KVM by around 65%.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The mode-transition code involves paths which switch back and forth
between the .text and .text16 sections. At present, only the start of
each function is labelled, which makes it difficult to decode
addresses within the parts of the function existing in a different
section.
Add explicit labels at the start of each section change, so that
addresses can be meaningfully decoded to the nearest label.
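For example, with hypothetical label names:

        .section ".text16", "ax", @progbits
        .code16
    real_to_prot:
        /* ... 16-bit portion ... */

        .section ".text", "ax", @progbits
        .code32
    real_to_prot_pmode:     /* explicit label at the section change */
        /* ... 32-bit portion of the same function ... */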
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When running in a virtual machine, switching to real mode may be
expensive. Allow interrupts to be enabled while in protected mode and
reflected down to the real-mode interrupt handlers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Use genuine real mode to accelerate operation in virtual machines
We currently use flat real mode wherever real mode is required. This
guarantees that we will not surprise some unsuspecting external caller
which has carefully set up flat real mode by suddenly reducing the
segment limits to 64kB.
However, operating in flat real mode imposes a severe performance
penalty in some virtualisation environments, since some CPUs cannot
fully virtualise flat real mode and so the hypervisor must fall back
to emulation. In particular, operating under KVM on a pre-Westmere
Intel CPU will be at least an order of magnitude slower, to the point
that there is a visible teletype effect when printing anything to the
BIOS console. (Older versions of KVM used to cheat and ignore the
"flat" part of flat real mode, which masked the problem.)
Switch (back) to using genuine real mode with 64kB segment limits
instead of flat real mode. Hopefully this won't break anything.
Add an explicit switch to flat real mode before returning to the BIOS
from the ROM prefix, since we know that a PMM BIOS will call the ROM
initialisation point (and potentially the BEV) in flat real mode.
As noted in previous commit messages, it is not possible to restore
the real-mode segment limits after a transition to protected mode,
since there is no way to know which protected-mode segment descriptor
was originally used to initialise the limit portion of the segment
register.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Avoid (harmless) collisions with linker symbols
The symbol _text16 is defined globally by the linker. Use rm_text16
instead of _text16 for the local variable within librm.S to avoid
confusion when reading linker maps.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[librm] Use libflat to enable A20 line on each real-to-protected transition
Use the shared code in libflat to perform the A20 transitions
automatically on each transition from real to protected mode. This
allows us to remove all explicit calls to gateA20_set().
The old warnings about avoiding automatically enabling A20 are
essentially redundant; they date back to the time when we would always
start hammering the keyboard controller without first checking to see
if gate A20 was already enabled (which it almost always is).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When returning to real mode, set 4GB segment limits instead of 64kB
limits. This change improves our chances of successfully returning to
a PMM-capable BIOS after entering iPXE during POST; the BIOS will
have set up flat real mode before calling our initialisation point,
and may be disconcerted if we then return in genuine real mode.
This change is unlikely to break anything, since any code that might
potentially access beyond 64kB must use addr32 prefixes to do so; if
this is the case then it is almost certainly code written to expect
flat real mode anyway.
Note that it is not possible to restore the real-mode segment limits
to their original values, since it is not possible to know which
protected-mode segment descriptor was originally used to initialise
the limit portion of the segment register.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
[i386] Add data32 prefixes to all lgdt/lidt instructions
With a 16-bit operand, lgdt/lidt will load only a 24-bit base address,
ignoring the high-order bits. This meant that we could fail to fully
restore the GDT across a call into gPXE, if the GDT happened to be
located above the 16MB mark.
Not all of our lgdt/lidt instructions require a data32 prefix (for
example, reloading the real-mode IDT can never require a 32-bit base
address), but by adding them everywhere we will hopefully not forget
the necessary ones in future.
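That is (with "gdtr" as an illustrative label):

        lgdt    gdtr            /* 16-bit operand: loads only a 24-bit
                                 * base; the high-order byte of the
                                 * base address is ignored */
        data32 lgdt gdtr        /* 32-bit operand: loads the full
                                 * 32-bit base */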
Leave protected-mode return address on PM stack when issuing a
real_call(), rather than moving it to the RM stack and back again.
This allows the real-mode function to completely destroy the stack
contents, provided that it manages to return to real_call().