This is a glossary of terms you will frequently encounter in operating systems (Linux only).

VFIO, KVM, and IRQ

There are legitimate reasons to dedicate a physical device to a specific VM: sometimes security, sometimes purely performance. Whatever the reason, device passthrough is important, and it is supported by KVM together with VFIO.

VFIO is the kernel driver that exposes to userspace the operations required for passing through a device. It takes over control of the device and registers its own interrupt handler for it (via the kernel API request_irq). It also lets userspace install an eventfd-based IRQ fd, so that interrupts raised by the physical device are relayed directly into the VM: userspace creates an irqfd in KVM that listens on a specific GSI (Global System Interrupt, roughly the interrupt line number as seen by the guest's interrupt controller) and hands the same eventfd to VFIO, which records it in vfio_pci_set_intx_trigger. This avoids bouncing the interrupt out to QEMU userspace for injection into the guest, though that path remains a fallback if KVM irqfd support is not available.

Thus, whenever the IRQ fires, vfio_intx_handler is invoked, and it calls vfio_send_intx_eventfd, which forwards the interrupt to the corresponding eventfd. KVM's irqfd registers a callback on that eventfd, and the callback runs when the eventfd is signaled.
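The eventfd itself is just a 64-bit counter behind a file descriptor, which is why it works as a relay. The userspace sketch below shows the signal/consume semantics only. In the kernel, VFIO signals the eventfd through an in-kernel API rather than write(), and signal_event and drain_events are hypothetical helper names.

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/eventfd.h>

/* Signal the eventfd once: adds 1 to its counter and wakes any waiter. */
int signal_event(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consume pending signals: returns the accumulated count and resets the
 * counter to zero. Returns 0 if nothing is pending (non-blocking fd) or
 * on error. */
uint64_t drain_events(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

A typical use is to create the fd with eventfd(0, EFD_NONBLOCK), call signal_event from the producer, and call drain_events from the consumer once poll() reports the fd as readable.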

To set up VFIO interrupts, userspace issues the VFIO_DEVICE_SET_IRQS ioctl on the VFIO device fd. The kernel header describes it as VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set):

// Set signaling, masking, and unmasking of interrupts.  Caller provides
// struct vfio_irq_set with all fields set.
// DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
struct vfio_irq_set {
	__u32	argsz;
	__u32	flags;
#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
	__u32	index;
	__u32	start;
	__u32	count;
	__u8	data[];
};
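As an illustration of how the structure is filled in, the sketch below builds a request that binds one eventfd as the trigger for the device's INTx interrupt. It assumes device_fd is a VFIO device file descriptor obtained elsewhere. make_intx_trigger and set_intx_trigger are hypothetical helper names; the flags, VFIO_PCI_INTX_IRQ_INDEX, and the ioctl number come from <linux/vfio.h>.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: allocate a vfio_irq_set carrying one eventfd.
 * The caller frees the returned buffer. */
struct vfio_irq_set *make_intx_trigger(int efd)
{
	int32_t fd = efd;
	size_t argsz = sizeof(struct vfio_irq_set) + sizeof(fd);
	struct vfio_irq_set *set = calloc(1, argsz);

	if (!set)
		return NULL;
	set->argsz = (uint32_t)argsz;            /* header plus payload */
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_INTX_IRQ_INDEX;    /* which interrupt type */
	set->start = 0;                          /* first sub-index */
	set->count = 1;                          /* one eventfd follows */
	memcpy(set->data, &fd, sizeof(fd));
	return set;
}

/* Hypothetical helper: send the request to the VFIO device fd.
 * Returns 0 on success, -1 with errno set on failure. */
int set_intx_trigger(int device_fd, int efd)
{
	struct vfio_irq_set *set = make_intx_trigger(efd);
	int ret, saved;

	if (!set)
		return -1;
	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
	saved = errno;
	free(set);
	errno = saved;                           /* keep the ioctl's errno */
	return ret;
}
```

The data[] payload is sized by count, so binding several MSI vectors at once would use the same layout with a larger array of eventfds.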

The NVIDIA driver registers its IRQ handler via request_threaded_irq. Traditionally, interrupt handling is split into a top half (the "hard" IRQ), which responds to the hardware interrupt, and a bottom half (the "soft" IRQ), which the top half schedules to do additional processing. (The same top/bottom split shows up in the GPU fault service routines, e.g. xxx_isr_bottom_half.) With a threaded IRQ, the top half only does a quick sanity check, acknowledges the interrupt, and then wakes a kernel thread that handles the rest, which shortens the time during which interrupts are blocked.

The function prototype looks like this:

extern int __must_check
request_threaded_irq(unsigned int irq, irq_handler_t handler,
		     irq_handler_t thread_fn,
		     unsigned long flags, const char *name, void *dev);
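The contract between handler and thread_fn can be modeled in plain userspace C. This is only a model, not kernel code: the enum mirrors the kernel's irqreturn_t values, while struct demo_dev, demo_handler, demo_thread_fn, and dispatch_irq are hypothetical names, and the real kernel wakes a dedicated kernel thread instead of calling thread_fn inline.

```c
#include <stddef.h>

/* Mirrors the kernel's irqreturn_t values; redefined for this model. */
enum irq_result { IRQ_NONE, IRQ_HANDLED, IRQ_WAKE_THREAD };

typedef enum irq_result (*irq_fn)(int irq, void *dev);

/* Hypothetical device state. */
struct demo_dev {
	int pending;    /* set by the "hardware" when it raises the IRQ */
	int work_done;  /* incremented by the threaded handler */
};

/* Top half: quick check, acknowledge, defer the heavy work. */
enum irq_result demo_handler(int irq, void *dev)
{
	struct demo_dev *d = dev;

	(void)irq;
	if (!d->pending)
		return IRQ_NONE;    /* not our interrupt (shared line) */
	d->pending = 0;             /* acknowledge */
	return IRQ_WAKE_THREAD;
}

/* Threaded handler: the slow part; in the kernel it is allowed to sleep. */
enum irq_result demo_thread_fn(int irq, void *dev)
{
	struct demo_dev *d = dev;

	(void)irq;
	d->work_done++;
	return IRQ_HANDLED;
}

/* Model of the dispatch: run thread_fn only if the top half asks for it. */
enum irq_result dispatch_irq(int irq, irq_fn handler, irq_fn thread_fn,
			     void *dev)
{
	enum irq_result r = handler(irq, dev);

	if (r == IRQ_WAKE_THREAD && thread_fn)
		r = thread_fn(irq, dev);
	return r;
}
```

In a real driver the two callbacks would be passed as the handler and thread_fn arguments of request_threaded_irq, with dev as the cookie handed back to both.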

MSI and MSI-X

With legacy pin-based interrupts (INTx) on x86, the I/O APIC has a limited number of input pins, so devices often have to share an IRQ line. MSI (Message Signaled Interrupts) instead lets a device signal an interrupt with a memory write that is delivered directly to the CPU's local APIC, bypassing the I/O APIC and reducing the path length and latency of interrupt handling. MSI-X extends this with a per-device table of vectors, each with its own address and data.

$ sudo lspci -vvv | grep NVIDIA -A20
# You'll see this
Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+ # MSI disabled "-"
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [60] Express (v2) Endpoint, MSI 00

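If MSI is disabled (Enable-), you can turn it on with the setpci command by setting bit 0 (MSI Enable) of the Message Control register, which sits at offset 2 inside the MSI capability (0x4a for the capability at [48] above). Normally the device driver does this itself.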
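The fields lspci prints on the MSI line are decoded from the 16-bit Message Control register. The sketch below shows that decoding using the bit layout from the PCI specification; struct msi_ctrl and decode_msi_ctrl are hypothetical names, and the value 0x0188 used in the comment is derived from the flags shown in the output above rather than read from real hardware.

```c
#include <stdint.h>

/* Decoded view of the MSI Message Control register. */
struct msi_ctrl {
	int enabled;         /* bit 0: MSI Enable */
	unsigned capable;    /* bits 3:1: vectors the device can request */
	unsigned allocated;  /* bits 6:4: vectors software has granted */
	int addr64;          /* bit 7: 64-bit address capable */
	int maskable;        /* bit 8: per-vector masking capable */
};

/* Hypothetical helper: split the raw register into its fields.
 * Vector counts are stored as log2 values, hence the shifts. */
struct msi_ctrl decode_msi_ctrl(uint16_t ctrl)
{
	struct msi_ctrl m;

	m.enabled = ctrl & 0x1;
	m.capable = 1u << ((ctrl >> 1) & 0x7);
	m.allocated = 1u << ((ctrl >> 4) & 0x7);
	m.addr64 = (ctrl >> 7) & 0x1;
	m.maskable = (ctrl >> 8) & 0x1;
	return m;
}

/* Example: "Enable- Count=1/16 Maskable+ 64bit+" corresponds to 0x0188,
 * i.e. disabled, 1 of 16 vectors allocated, maskable, 64-bit capable. */
```

Setting bit 0 of the same register, and nothing else, is what flips Enable- to Enable+.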

Linux specific: GNU Indirect Function (IFUNC)

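The notorious backdoor in xz-utils (CVE-2024-3094) abused this feature: an IFUNC resolver in liblzma ran attacker-controlled code during dynamic linking, which let the public key verification function be redirected to a maliciously crafted object file. A GNU indirect function (IFUNC) is a function whose implementation is chosen at runtime: the dynamic linker calls a resolver function and binds the symbol to whatever address the resolver returns.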

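/* Dispatching via the IFUNC ELF extension */
#include <stddef.h>

/* Candidate implementations; the resolver picks one at load time. */
void foo_c(unsigned *data, size_t len) { /* ... */ }
void foo_sse42(unsigned *data, size_t len) { /* ... */ }
void foo_avx2(unsigned *data, size_t len) { /* ... */ }

/* CPU feature probes, assumed to be defined elsewhere. */
extern int cpu_has_sse42(void);
extern int cpu_has_avx2(void);

/* Called by the dynamic linker when foo is resolved; returns the
 * implementation that foo should be bound to. */
static void (*resolve_foo(void))(unsigned *, size_t)
{
        if (cpu_has_avx2())
                return foo_avx2;
        else if (cpu_has_sse42())
                return foo_sse42;
        else
                return foo_c;
}

/* Every call to foo goes to whatever resolve_foo returned. */
void foo(unsigned *data, size_t len) __attribute__((ifunc ("resolve_foo")));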