[infiniband] Respect hop pointer when building directed route SMP return path
The return path in directed route SMPs lists the egress ports in order
from SM to node, rather than from node to SM.
To write to the correct offset within the return path, we need to
parse the hop pointer. This is held within the class-specific data
portion of the MAD header, which was previously unused by us and
defined to be a uint16_t. Define this field to be a union type; this
requires some rearrangement of ib_mad.h and corresponding changes to
ipoib.c.
[romprefix] Use smaller PMM allocations if possible
The only way that PMM allows us to request a block in a region with
A20=0 is to ask for a block with an alignment of 2MB. Due to the PMM
API design, the only way we can do this is to ask for a block with a
size of 2MB.
Unfortunately, some BIOSes will hit problems if we allocate a 2MB
block. In particular, it may not be possible to enter the BIOS setup
screen; the BIOS setup code attempts a PMM allocation, fails, and
hangs the machine.
We now try allocating only as much as we need via PMM. If the
allocated block has A20=1, we free the allocated block, double the
allocation size, and try again. Repeat until either we obtain a block
with A20=0 or allocation fails. (This is guaranteed to terminate by
the time we reach an allocation size of 2MB.)
[linda] Add support for QLogic 7220-based Infiniband HCAs
These cards very nearly support our current IB Verbs model. There is
one minor difference: multicast packets will always be delivered by
the hardware to QP0, so the driver has to redirect them to the
appropriate QP. This means that QP owners may see receive completions
for buffers that they never posted. Nothing in our current codebase
will break because of this.
[infiniband] Add raw packet parser and constructor
This can be used with cards that require the driver to construct and
parse packet headers manually. Headers are optionally handled
out-of-line from the packet payload, since some such cards will split
received headers into a separate ring buffer.
Some Infiniband cards will not be as accommodating as the Arbel and
Hermon cards in providing enough space for us to push a fake extra
header at the start of the received packet. We must therefore make do
with squeezing enough information to identify source and destination
addresses into the two bytes of padding within a genuine IPoIB
link-layer header.
[infiniband] Split subnet management agent client out into ib_smc.c
Not all Infiniband cards have embedded subnet management agents.
Split out the code that communicates with such an embedded SMA into a
separate ib_smc.c file, and have drivers call ib_smc_update()
explicitly when they suspect that the answers given by the embedded
SMA may have changed.
[infiniband] Pass address vector in receive completions
Receive completion handlers now get passed an address vector
containing the information extracted from the packet headers
(including the GRH, if present), and only the payload remains in the
I/O buffer.
This breaks the symmetry between transmit and receive completions, so
remove the ib_completer_t type and use an ib_completion_queue_operations
structure instead.
Rename the "destination QPN" and "destination LID" fields in struct
ib_address_vector to reflect its new dual usage.
Since the ib_completion structure now contains only an IB status code,
("syndrome") replace it with a generic gPXE integer status code.
[infiniband] Flush uncompleted work queue entries at QP teardown
Avoid leaking I/O buffers in ib_destroy_qp() by completing any
outstanding work queue entries with a generic error code. This
requires the completion handlers to be available to ib_destroy_qp(),
which is done by making them static configuration parameters of the CQ
(set by ib_create_cq()) rather than being provided on each call to
ib_poll_cq().
This mimics the functionality of netdev_{tx,rx}_flush(). The netdev
flush functions would previously have been catching any I/O buffers
leaked by the IPoIB data queue (though not by the IPoIB metadata
queue).
Add the simplified ne2k_isa driver. It is just a selective copy+paste
of the relevant parts from ns8390.c plus a little trivial hacking to
make it actually work.
It is true that the code is pretty ugly, but:
a) ns8390.c is worse
b) It is only 372 lines and no #ifdefs
c) It works both in qemu/bochs and in real hardware
and we all know it is easier to cleanup working code
Hope someone will find the time to rewrite this driver properly,
but until then at least for me this is an ok solution.
Signed-off-by: Pantelis Koukousoulas <pktoss@gmail.com>
[netdevice] Retain and report detailed error breakdowns
netdev_rx_err() and netdev_tx_complete_err() get passed the error
code, but currently use it only in debug messages.
Retain error numbers and frequencey counts for up to
NETDEV_MAX_UNIQUE_ERRORS (4) different errors for each of TX and RX.
This allows the "ifstat" command to report the reasons for TX/RX
errors in most cases, even in non-debug builds.
Halting the PEGs breaks platforms where there is sideband access to
the NIC (e.g. HP machines using iLO). (We have to retain the
unhalting code because on some other platforms (e.g. IBM blades with
BOFM) the pre-PXE firmware must halt the PEGs to avoid issues with the
BIOS rereading via the expansion ROM BAR.)
[aoe] Start retry timer before potential temporary transmission failure
The retry timer needs to be running as soon as we know that we are
trying to transmit a command. If transmission fails because of a
temporary error condition, then the timer will allow us to retry the
transmission later.
[settings] Ensure fetch_string_setting() returns a NUL-terminated string
This fixes a regression introduced in commit 612f4e7:
[settings] Avoid returning uninitialised data on error in fetch_xxx_setting()
in which the memset() was moved from fetch_string_setting() to
fetch_setting(), in order that it would be useful for non-string
setting types. However, this neglects to take into account the fact
that fetch_string_setting() shrinks its buffer by one byte (to allow
for the NUL) before calling fetch_setting().
Restore the memset() in fetch_string_setting(), so that the
terminating NUL is guaranteed to actually be a NUL.
[i386] Add data32 prefixes to all lgdt/lidt instructions
With a 16-bit operand, lgdt/lidt will load only a 24-bit base address,
ignoring the high-order bits. This meant that we could fail to fully
restore the GDT across a call into gPXE, if the GDT happened to be
located above the 16MB mark.
Not all of our lgdt/lidt instructions require a data32 prefix (for
example, reloading the real-mode IDT can never require a 32-bit base
address), but by adding them everywhere we will hopefully not forget
the necessary ones in future.
[phantom] Allow for PXE boot to be enabled/disabled on a per-port basis
This is something of an ugly hack to accommodate an OEM requirement.
The NIC has only one expansion ROM BAR, rather than one per port. To
allow individual ports to be selectively enabled/disabled for PXE boot
(as required), we must therefore leave the expansion ROM always
enabled, and place the per-port enable/disable logic within the gPXE
driver.
[romprefix] Add vendor branding facilities and guidelines
Some hardware vendors have been known to remove all gPXE-related
branding from ROMs that they build. While this is not prohibited by
the GPL, it is a little impolite.
Add a facility for adding branding messages via two #defines
(PRODUCT_NAME and PRODUCT_SHORT_NAME) in config/general.h. This
should accommodate all known OEM-mandated branding requirements.
Vendors with branding requirements that cannot be satisfied by using
PRODUCT_NAME and/or PRODUCT_SHORT_NAME should contact us so that we
can extended this facility as necessary.
The Phantom firmware selectively disables PCI functions based on the
board type, with the end result that we see one PCI function for each
network port. This allows us to eliminate the code for reading from
flash and, more importantly, removes knowledge of the board type magic
number from the gPXE driver.
This function is a major kludge, but can be made slightly more
accurate by ignoring net devices that aren't open. Eventually it
needs to be removed entirely.
[settings] Add the notion of a "tag magic" to numbered settings
Settings can be constructed using a dotted-decimal notation, to allow
for access to unnamed settings. The default interpretation is as a
DHCP option number (with encapsulated options represented as
"<encapsulating option>.<encapsulated option>".
In several contexts (e.g. SMBIOS, Phantom CLP), it is useful to
interpret the dotted-decimal notation as referring to non-DHCP
options. In this case, it becomes necessary for these contexts to
ignore standard DHCP options, otherwise we end up trying to, for
example, retrieve the boot filename from SMBIOS.
Allow settings blocks to specify a "tag magic". When dotted-decimal
notation is used to construct a setting, the tag magic value of the
originating settings block will be ORed in to the tag number.
Store/fetch methods can then check for the magic number before
interpreting arbitrarily-numbered settings.
[romprefix] Further sanity checks for the PCI 3 runtime segment address
This extends the sanity checks on the runtime segment address provided
in %bx, first implemented in commit 5600955.
We now allow the ROM to be placed anywhere above a000:0000 (rather
than c000:0000, as before), since this is the region allowed by the
PCI 3 spec. If the BIOS asks us to place the runtime image such that
it would overlap with the init-time image (which is explicitly
prohibited by the PCI 3 spec), then we assume that the BIOS is faulty
and ignore the provided runtime segment address.
Testing on a SuperMicro BIOS providing overlapping segment addresses
shows that ignoring the provided runtime segment address is safe to do
in these circumstances.
This interface provides access to firmware settings (e.g. MAC address)
that will apply to all drivers loaded for the duration of the current
system boot.
[phantom] Unhalt/halt all PEGs during driver startup/shutdown
A hardware bug means that reads through the expansion ROM BAR can
return corrupted data if the PEGs are running. This breaks platforms
that re-read the expansion ROM after invoking gPXE code, such as IBM
blade servers.
Halt PEGs during driver shutdown, and unhalt PEGs during driver
startup if we detect that this is not the first startup since
power-on.
[uri] Avoid interpreting DOS-style path names as opaque URIs
A DOS-style full path name such as "C:\Program Files\tftpboot\nbp.0"
satisfies the syntax requirements for a URI with a scheme of "C" and
an opaque portion of "\Program Files\tftpboot\nbp.0".
Add a check in parse_uri() to ignore schemes that are apparently only
a single character long; this avoids interpreting DOS-style paths in
this way, and shouldn't affect any practical URI scheme.
[phantom] Change register space abstraction to match other drivers
Most other Phantom drivers define a register space in terms of a 64M
virtual address space. While this doesn't map in any meaningful way
to the actual addresses used on the latest cards, it makes maintenance
easier if we do the same.
[pcbios] Guard against register corruption in INT 15,e820 implementations
Someone at Dell must have a full-time job designing ways to screw up
implementations of INT 15,e820. This latest gem is courtesy of a Dell
Xanadu system, which arbitrarily decides to obliterate the contents of
%esi.
Preserve %esi, %edi and %ebp across calls to INT 15,e820, in case
someone tries a variation on this trick in future.
[settings] Avoid returning uninitialised data on error in fetch_xxx_setting()
Callers (e.g. usr/autoboot.c) may not check the return values from
fetch_xxx_setting(), assuming that in error cases the returned setting
value will be "empty" (for some sensible value of "empty").
In particular, if the DHCP server did not specify a next-server
address, this would result in gPXE using uninitialised data for the
TFTP server IP address.
FreeBSD requires the object format to be specified as elf_i386_fbsd,
rather than elf_i386.
Based on a patch from Eygene Ryabinkin <rea-fbsd@codelabs.ru>
[romprefix] Sanity-check the runtime segment address for PCI 3
Some PCI 3 BIOSes seem to provide a garbage value in %bx, which should
contain the runtime segment address. Perform a basic sanity check: we
reject the segment if it is below the start of option ROM space. If
the sanity check fails, we assume that the BIOS was not expecting us
to be a PCI 3 ROM, and we just leave our image in situ.
[build] Use ".bss.*" names for uninitialised-data sections
The section name seems to have significance for some versions of
binutils.
There is no way to instruct gcc that sections such as .bss16 contain
uninitialised data; it will emit them with contents explicitly set to
zero. We therefore have to rely on the linker script to force these
sections to become uninitialised-data sections. We do this by marking
them as NOLOAD; this seems to be the closest semantic equivalent in the
linker script language.
However, this gets ignored by some versions of ld (including 2.17 as
shipped with Debian Etch), which mark the resulting sections with
(CONTENTS,ALLOC,LOAD,DATA). Combined with the fact that this version of
ld seems to ignore the specified LMA for these sections, this means that
they end up overlapping other sections, and so parts of .prefix (for
example) get obliterated by .data16's bss section.
Rename the .bss sections from .section_bss to .bss.section; this seems to
cause these versions of ld to treat them as uninitialised data.
Not fully understood, but it seems that the LMA of bss sections matters
for some newer binutils builds. Force all bss sections to have an LMA
at the end of the file, so that they don't interfere with other
sections.
The symptom was that objcopy -O binary -j .zinfo would extract the
.zinfo section from bin/xxx.tmp as a blob of the correct length, but
with zero contents. This would then cause the [ZBIN] stage of the
build to fail.
Also explicitly state that .zinfo(.*) sections have @progbits, in case
some future assembler or linker variant decides to omit them.
The virtnet_transmit() logic for waiting the packet to be transmitted is
reversed: we can't wait the packet to be transmitted if we didn't kick()
the ring yet. The vring_more_used() while loop logic is reversed also,
that explains why the code works today.
The current code risks trying to free a buffer from the used ring
when none was available, that will happen most times because KVM
doesn't handle the packet immediately on kick(). Luckily it was working
because it was unlikely to have a buffer still queued for transmit when
virtnet_transmit() was called.
Also, adds a BUG_ON() to vring_get_buf(), to catch cases where we try
to free a buffer from the used ring when there was none available.
Patch for Etherboot. gPXE has the same problem on the code, but I hadn't
a chance to test gPXE using virtio-net yet.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>
[netdevice] Change link-layer push() and pull() methods to take raw types
EFI requires us to be able to specify the source address for
individual transmitted packets, and to be able to extract the
destination address on received packets.
Take advantage of this to rationalise the push() and pull() methods so
that push() takes a (dest,source,proto) tuple and pull() returns a
(dest,source,proto) tuple.
[netdevice] Split multicast hashing out into an mc_hash method
Multicast hashing is an ugly overlap between network and link layers.
EFI requires us to provide access to this functionality, so move it
out of ipv4.c and expose it as a method of the link layer.
Some versions of ld choke on the "AT ( _xxx_lma )" in efi.lds with an
error saying "nonconstant expression for load base". Since these were
only explicitly setting the LMA to the address that it would have had
anyway, they can be safely omitted.
[efi] Add EFI image format and basic runtime environment
We have EFI APIs for CPU I/O, PCI I/O, timers, console I/O, user
access and user memory allocation.
EFI executables are created using the vanilla GNU toolchain, with the
EXE header handcrafted in assembly and relocations generated by a
custom efilink utility.
monojob_wait() was holding a reference to the completed job, meaning that
various objects would not be freed until the next job was plugged in to
the monojob interface.