Commit Graph

147921 Commits

Author SHA1 Message Date
Rusty Russell 659a0e6633 lguest: have example Launcher service all devices in separate threads
Currently lguest has three threads: the main Launcher thread, a Waker
thread, and a thread for the block device (because synchronous block
was simply too painful to bear).

The Waker selects() on all the input file descriptors (eg. stdin, net
devices, pipe to the block thread) and when one becomes readable it calls
into the kernel to kick the Launcher thread out into userspace, which
repeats the poll, services the device(s), and then tells the kernel to
release the Waker before re-entering the kernel to run the Guest.

Also, to make a slightly-decent network transmit routine, the Launcher
would suppress further network interrupts while it set a timer: that
signal handler would write to a pipe, which would rouse the Waker
which would prod the Launcher out of the kernel to check the network
device again.

Now we can convert all our virtqueues to separate threads: each one has
a separate eventfd for when the Guest pokes the device, and can trigger
interrupts in the Guest directly.

The linecount shows how much this simplifies, but to really bring it
home, here's an strace analysis of single Guest->Host ping before:

* Guest sends packet, notifies xmit vq, return control to Launcher
* Launcher clears notification flag on xmit ring
* Launcher writes packet to TUN device
	writev(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"\366\r\224`\2058\272m\224vf\274\10\0E\0\0T\0\0@\0@\1\265"..., 98}], 2) = 108
* Launcher sets up interrupt for Guest (xmit ring is empty)
	write(10, "\2\0\0\0\3\0\0\0", 8) = 0
* Launcher sets up timer for interrupt mitigation
	setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 505}}, NULL) = 0
* Launcher re-runs guest
	pread64(10, 0xbfa5f4d4, 4, 0) ...
* Waker notices reply packet in tun device (it was in select)
	select(12, [0 3 4 6 11], NULL, NULL, NULL) = 1 (in [4])
* Waker kicks Launcher out of guest:
	pwrite64(10, "\3\0\0\0\1\0\0\0", 8, 0) = 0
* Launcher returns from running guest:
	... = -1 EAGAIN (Resource temporarily unavailable)
* Launcher looks at input fds:
	select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 1 (in [4], left {0, 0})
* Launcher reads pong from tun device:
	readv(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"\272m\224vf\274\366\r\224`\2058\10\0E\0\0T\364\26\0\0@"..., 1518}], 2) = 108
* Launcher injects guest notification:
	write(10, "\2\0\0\0\2\0\0\0", 8) = 0
* Launcher rechecks fds:
	select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 0 (Timeout)
* Launcher clears Waker:
	pwrite64(10, "\3\0\0\0\0\0\0\0", 8, 0) = 0
* Launcher reruns Guest:
	pread64(10, 0xbfa5f4d4, 4, 0) = ? ERESTARTSYS (To be restarted)
* Signal comes in, uses pipe to wake up Launcher:
	--- SIGALRM (Alarm clock) @ 0 (0) ---
	write(8, "\0", 1)       = 1
	sigreturn()             = ? (mask now [])
* Waker sees write on pipe:
	select(12, [0 3 4 6 11], NULL, NULL, NULL) = 1 (in [6])
* Waker kicks Launcher out of Guest:
	pwrite64(10, "\3\0\0\0\1\0\0\0", 8, 0) = 0
* Launcher exits from kernel:
	pread64(10, 0xbfa5f4d4, 4, 0) = -1 EAGAIN (Resource temporarily unavailable)
* Launcher looks to see what fd woke it:
	select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 1 (in [6], left {0, 0})
* Launcher reads timeout fd, sets notification flag on xmit ring
	read(6, "\0", 32)       = 1
* Launcher rechecks fds:
	select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 0 (Timeout)
* Launcher clears Waker:
	pwrite64(10, "\3\0\0\0\0\0\0\0", 8, 0) = 0
* Launcher resumes Guest:
	pread64(10, "\0p\0\4", 4, 0) ....

strace analysis of single Guest->Host ping after:

* Guest sends packet, notifies xmit vq, creates event on eventfd.
* Network xmit thread wakes from read on eventfd:
	read(7, "\1\0\0\0\0\0\0\0", 8)          = 8
* Network xmit thread writes packet to TUN device
	writev(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"J\217\232FI\37j\27\375\276\0\304\10\0E\0\0T\0\0@\0@\1\265"..., 98}], 2) = 108
* Network recv thread wakes up from read on tunfd:
	readv(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"j\27\375\276\0\304J\217\232FI\37\10\0E\0\0TiO\0\0@\1\214"..., 1518}], 2) = 108
* Network recv thread sets up interrupt for the Guest
	write(6, "\2\0\0\0\2\0\0\0", 8) = 0
* Network recv thread goes back to reading tunfd
	13:39:42.460285 readv(4,  <unfinished ...>
* Network xmit thread sets up interrupt for Guest (xmit ring is empty)
	write(6, "\2\0\0\0\3\0\0\0", 8) = 0
* Network xmit thread goes back to reading from eventfd
	read(7, <unfinished ...>

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:10 +09:30
Rusty Russell df60aeef4f lguest: use eventfds for device notification
Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
an address: the main Launcher process returns with this address, and figures
out what device to run.

A far nicer model is to let processes bind an eventfd to an address: if we
find one, we simply signal the eventfd.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Davide Libenzi <davidel@xmailserver.org>
2009-06-12 22:27:10 +09:30
Rusty Russell 5718607bb6 eventfd: export eventfd_signal and eventfd_fget for lguest
lguest wants to attach eventfds to guest notifications, and lguest is
usually a module.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Davide Libenzi <davidel@xmailserver.org>
2009-06-12 22:27:09 +09:30
Rusty Russell 9f155a9b3d lguest: allow any process to send interrupts
We currently only allow the Launcher process to send interrupts, but it
as we already send interrupts from the hrtimer, it's a simple matter of
extracting that code into a common set_interrupt routine.

As we switch to a thread per virtqueue, this avoids a bottleneck through the
main Launcher process.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:09 +09:30
Rusty Russell 92b4d8df84 lguest: PAE fixes
1) j wasn't initialized in setup_pagetables, so they weren't set up for me
   causing immediate guest crashes.

2) gpte_addr should not re-read the pmd from the Guest.  Especially
   not BUG_ON() based on the value.  If we ever supported SMP guests,
   they could trigger that.  And the Launcher could also trigger it
   (tho currently root-only).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:08 +09:30
Matias Zabaljauregui acdd0b6292 lguest: PAE support
This version requires that host and guest have the same PAE status.
NX cap is not offered to the guest, yet.

Signed-off-by: Matias Zabaljauregui <zabaljauregui@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:08 +09:30
Matias Zabaljauregui cefcad1773 lguest: Add support for kvm_hypercall4()
Add support for kvm_hypercall4(); PAE wants it.

Signed-off-by: Matias Zabaljauregui <zabaljauregui at gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:07 +09:30
Matias Zabaljauregui ebe0ba84f5 lguest: replace hypercall name LHCALL_SET_PMD with LHCALL_SET_PGD
replace LHCALL_SET_PMD with LHCALL_SET_PGD hypercall name
(That's really what it is, and the confusion gets worse with PAE support)

Signed-off-by: Matias Zabaljauregui <zabaljauregui@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
2009-06-12 22:27:07 +09:30
Matias Zabaljauregui 90603d15fa lguest: use native_set_* macros, which properly handle 64-bit entries when PAE is activated
Some cleanups and replace direct assignment with native_set_* macros which properly handle 64-bit entries when PAE is activated

Signed-off-by: Matias Zabaljauregui <zabaljauregui@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:06 +09:30
Matias Zabaljauregui ed1dc77810 lguest: map switcher with executable page table entries
Map switcher with executable page table entries.
(This bug didn't matter before PAE and hence NX support -- RR)

Signed-off-by: Matias Zabaljauregui <zabaljauregui@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:06 +09:30
Rusty Russell 7b5c806c35 lguest: fix writev returning short on console output
I've never seen it here, but I can't find anywhere that says writev
will write everything.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:05 +09:30
Rusty Russell e606490c44 lguest: clean up length-used value in example launcher
The "len" field in the used ring for virtio indicates the number of
bytes *written* to the buffer.  This means the guest doesn't have to
zero the buffers in advance as it always knows the used length.

Erroneously, the console and network example code puts the length
*read* into that field.  The guest ignores it, but it's wrong.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:05 +09:30
Matias Zabaljauregui f086122bb6 lguest: Segment selectors are 16-bit long. Fix lg_cpu.ss1 definition.
If GDT_ENTRIES were every > 256, this could become a problem.

Signed-off-by: Matias Zabaljauregui <zabaljauregui at gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:04 +09:30
Roel Kluin 81b79b01d0 lguest: beyond ARRAY_SIZE of cpu->arch.gdt
Do not go beyond ARRAY_SIZE of cpu->arch.gdt

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:04 +09:30
Rusty Russell 2644f17d6c lguest: clean up example launcher compile flags.
18 months ago 5bbf89fc26 changed to loading
bzImages directly, and no longer manually ungzipping them, so we no longer
need libz.

Also, -m32 is useful for those on 64-bit platforms (and harmless on
32-bit).

Reported-by: Ron Minnich <rminnich@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:03 +09:30
Rusty Russell 61f4bc83fe lguest: optimize by coding restore_flags and irq_enable in assembler.
The downside of the last patch which made restore_flags and irq_enable
check interrupts is that they are now too big to be patched directly
into the callsites, so the C versions are always used.

But the C versions go via PV_CALLEE_SAVE_REGS_THUNK which saves all
the registers.  In fact, we don't need any registers in the fast path,
so we can do better than this if we actually code them in assembler.

The results are in the noise, but since it's about the same amount of
code, it's worth applying.

1GB Guest->Host: input(suppressed),output(suppressed)
Before:
	Seconds: 0:16.53
	Packets: 377268,753673
	Interrupts: 22461,24297
	Notifications: 1(5245),21303(732370)
	Net IRQs triggered: 377023(245),42578(711095)

After:
	Seconds: 0:16.48
	Packets: 377289,753673
	Interrupts: 22281,24465
	Notifications: 1(5245),21296(732377)
	Net IRQs triggered: 377060(229),42564(711109)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:03 +09:30
Rusty Russell a32a8813d0 lguest: improve interrupt handling, speed up stream networking
lguest never checked for pending interrupts when enabling interrupts, and
things still worked.  However, it makes a significant difference to TCP
performance, so it's time we fixed it by introducing a pending_irq flag
and checking it on irq_restore and irq_enable.

These two routines are now too big to patch into the 8/10 bytes
patch space, so we drop that code.

Note: The high latency on interrupt delivery had a very curious
effect: once everything else was optimized, networking without GSO was
faster than networking with GSO, since more interrupts were sent and
hence a greater chance of one getting through to the Guest!

Note2: (Almost) Closing the same loophole for iret doesn't have any
measurable effect, so I'm leaving that patch for the moment.

Before:
	1GB tcpblast Guest->Host:		30.7 seconds
	1GB tcpblast Guest->Host (no GSO):	76.0 seconds

After:
	1GB tcpblast Guest->Host:		6.8 seconds
	1GB tcpblast Guest->Host (no GSO):	27.8 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:03 +09:30
Rusty Russell abd41f037e lguest: fix race in halt code
When the Guest does the LHCALL_HALT hypercall, we go to sleep, expecting
that a timer or the Waker will wake_up_process() us.

But we do it in a stupid way, leaving a classic missing wakeup race.

So split maybe_do_interrupt() into interrupt_pending() and
try_deliver_interrupt(), and check maybe_do_interrupt() and the
"break_out" flag before calling schedule.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:02 +09:30
Rusty Russell ebf9a5a99c lguest: remove invalid interrupt forcing logic.
2088761152 (lguest: notify on empty) introduced
lguest support for the VIRTIO_F_NOTIFY_ON_EMPTY flag, but in fact it turned on
interrupts all the time.

Because we always process one buffer at a time, the inflight count is always 0
when call trigger_irq and so we always ignore VRING_AVAIL_F_NO_INTERRUPT from
the Guest.

It should be looking to see if there are more buffers in the Guest's queue:
if it's empty, then we force an interrupt.

This makes little difference, since we usually have an empty queue; but
that's the subject of another patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:02 +09:30
Rusty Russell a6c372de6e lguest: fix lguest wake on guest clock tick, or fd activity
The Launcher could be inside the Guest on another CPU; wake_up_process
will do nothing because it is "running".  kick_process will knock it
back into our kernel in this case, otherwise we'll miss it until the
next guest exit.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:01 +09:30
Rusty Russell b43e352139 sched: export kick_process
lguest needs kick_process: wake_up_process() does nothing if a process
is running, which isn't sufficient (we need it in the kernel).

And lguest support is usually modular.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-12 22:27:01 +09:30
Rusty Russell f7027c6387 lguest: get more serious about wmb() in example Launcher code
Since the Launcher process runs the Guest, it doesn't have to be very
serious about its barriers: the Guest isn't running while we are (Guest
is UP).

Before we change to use threads to service devices, we need to fix this.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:00 +09:30
Rusty Russell 1028375e93 lguest: clean up lguest_init_IRQ
Copy from arch/x86/kernel/irqinit_32.c: we don't use the vectors beyond
LGUEST_IRQS (if any), but we might as well set them all.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:27:00 +09:30
Rusty Russell 56739c802c lguest: cleanup passing of /dev/lguest fd around example launcher.
We hand the /dev/lguest fd everywhere; it's far neater to just make it
a global (it already is, in fact, hidden in the waker_fds struct).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:26:59 +09:30
Rusty Russell 713b15b378 lguest: be paranoid about guest playing with device descriptors.
We can't trust the values in the device descriptor table once the
guest has booted, so keep local copies.  They could set them to
strange values then cause us to segv (they're 8 bit values, so they
can't make our pointers go too wild).

This becomes more important with the following patches which read them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 22:26:59 +09:30