summaryrefslogtreecommitdiffstats
path: root/src/docs/specs/ivshmem_device_spec.txt
blob: d318d65c3259fb0d2bf0721abdf924ec1d388d2c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161

Device Specification for Inter-VM shared memory device
------------------------------------------------------

The Inter-VM shared memory device is designed to share a memory region (created
on the host via the POSIX shared memory API) between multiple QEMU processes
running different guests. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing said memory
to the guest as a PCI BAR.
The memory region does not belong to any guest, but is a POSIX memory object on
the host. The host can access this shared memory if needed.

The device also provides an optional communication mechanism between guests
sharing the same memory object. More details about that in the section 'Guest to
guest communication' section.


The Inter-VM PCI device
-----------------------

From the VM point of view, the ivshmem PCI device supports three BARs.

- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
  not used.
- BAR1 is used for MSI-X when it is enabled in the device.
- BAR2 is used to access the shared memory object.

It is your choice how to use the device but you must choose between two
behaviors :

- basically, if you only need the shared memory part, you will map BAR2.
  This way, you have access to the shared memory in guest and can use it as you
  see fit (memnic, for example, uses it in userland
  http://dpdk.org/browse/memnic).

- BAR0 and BAR1 are used to implement an optional communication mechanism
  through interrupts in the guests. If you need an event mechanism between the
  guests accessing the shared memory, you will most likely want to write a
  kernel driver that will handle interrupts. See details in the section 'Guest
  to guest communication' section.

The behavior is chosen when starting your QEMU processes:
- no communication mechanism needed, the first QEMU to start creates the shared
  memory on the host, subsequent QEMU processes will use it.

- communication mechanism needed, an ivshmem server must be started before any
  QEMU processes, then each QEMU process connects to the server unix socket.

For more details on the QEMU ivshmem parameters, see qemu-doc documentation.


Guest to guest communication
----------------------------

This section details the communication mechanism between the guests accessing
the ivhsmem shared memory.

*ivshmem server*

This server code is available in qemu.git/contrib/ivshmem-server.

The server must be started on the host before any guest.
It creates a shared memory object then waits for clients to connect on a unix
socket. All the messages are little-endian int64_t integer.

For each client (QEMU process) that connects to the server:
- the server sends a protocol version, if client does not support it, the client
  closes the communication,
- the server assigns an ID for this client and sends this ID to him as the first
  message,
- the server sends a fd to the shared memory object to this client,
- the server creates a new set of host eventfds associated to the new client and
  sends this set to all already connected clients,
- finally, the server sends all the eventfds sets for all clients to the new
  client.

The server signals all clients when one of them disconnects.

The client IDs are limited to 16 bits because of the current implementation (see
Doorbell register in 'PCI device registers' subsection). Hence only 65536
clients are supported.

All the file descriptors (fd to the shared memory, eventfds for each client)
are passed to clients using SCM_RIGHTS over the server unix socket.

Apart from the current ivshmem implementation in QEMU, an ivshmem client has
been provided in qemu.git/contrib/ivshmem-client for debug.

*QEMU as an ivshmem client*

At initialisation, when creating the ivshmem device, QEMU first receives a
protocol version and closes communication with server if it does not match.
Then, QEMU gets its ID from the server then makes it available through BAR0
IVPosition register for the VM to use (see 'PCI device registers' subsection).
QEMU then uses the fd to the shared memory to map it to BAR2.
eventfds for all other clients received from the server are stored to implement
BAR0 Doorbell register (see 'PCI device registers' subsection).
Finally, eventfds assigned to this QEMU process are used to send interrupts in
this VM.

*PCI device registers*

From the VM point of view, the ivshmem PCI device supports 4 registers of
32-bits each.

enum ivshmem_registers {
    IntrMask = 0,
    IntrStatus = 4,
    IVPosition = 8,
    Doorbell = 12
};

The first two registers are the interrupt mask and status registers.  Mask and
status are only used with pin-based interrupts.  They are unused with MSI
interrupts.

Status Register: The status register is set to 1 when an interrupt occurs.

Mask Register: The mask register is bitwise ANDed with the interrupt status
and the result will raise an interrupt if it is non-zero.  However, since 1 is
the only value the status will be set to, it is only the first bit of the mask
that has any effect.  Therefore interrupts can be masked by setting the first
bit to 0 and unmasked by setting the first bit to 1.

IVPosition Register: The IVPosition register is read-only and reports the
guest's ID number.  The guest IDs are non-negative integers.  When using the
server, since the server is a separate process, the VM ID will only be set when
the device is ready (shared memory is received from the server and accessible
via the device).  If the device is not ready, the IVPosition will return -1.
Applications should ensure that they have a valid VM ID before accessing the
shared memory.

Doorbell Register:  To interrupt another guest, a guest must write to the
Doorbell register.  The doorbell register is 32-bits, logically divided into
two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
16-bits are the interrupt vector to trigger.  The semantics of the value
written to the doorbell depends on whether the device is using MSI or a regular
pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
the status register.

Regular Interrupts

If regular interrupts are used (due to either a guest not supporting MSI or the
user specifying not to use them on startup) then the value written to the lower
16-bits of the Doorbell register results is arbitrary and will trigger an
interrupt in the destination guest.

Message Signalled Interrupts

An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
written to the Doorbell register must be between 0 and the maximum number of
vectors the guest supports.  The lower 16 bits written to the doorbell is the
MSI vector that will be raised in the destination guest.  The number of MSI
vectors is configurable but it is set when the VM is started.

The important thing to remember with MSI is that it is only a signal, no status
is set (since MSI interrupts are not shared).  All information other than the
interrupt itself should be communicated via the shared memory region.  Devices
supporting multiple MSI vectors can use different vectors to indicate different
events have occurred.  The semantics of interrupt vectors are left to the
user's discretion.
OpenPOWER on IntegriCloud