Discussion:
[PATCH RFC wayland-protocols] unstable/linux-dmabuf: add wp_linux_dmabuf_device_hint
Simon Ser
2018-11-01 16:44:58 UTC
On multi-GPU setups, multiple devices can be used for rendering. Clients need
hints about the device in use by the compositor. For instance, if they render
on another GPU, then they need to make sure the memory is accessible between
devices and that their buffers are not placed in hidden memory.

This commit introduces a new wp_linux_dmabuf_device_hints object. This object
advertizes a preferred device via a file descriptor and a set of preferred
formats/modifiers.

Each object is bound to a wl_surface and can dynamically update its hints. This
enables fine-grained per-surface optimizations. For instance, when a surface is
scanned out on a GPU the compositor isn't compositing with, the preferred
device can be set to this GPU to avoid unnecessary roundtrips.

Signed-off-by: Simon Ser <***@emersion.fr>
---

These additions are inspired by [1]. The goal here is to be able to get rid
of wl_drm, enabling more use-cases in the process.

I'm not a DRM/Mesa specialist, so let me know if I've made horrible mistakes.
As always, comments and questions are welcome.

[1]: https://gitlab.freedesktop.org/wayland/wayland/issues/59

.../linux-dmabuf/linux-dmabuf-unstable-v1.xml | 67 ++++++++++++++++++-
1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml b/unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml
index 154afe2..eafb559 100644
--- a/unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml
+++ b/unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml
@@ -24,7 +24,7 @@
DEALINGS IN THE SOFTWARE.
</copyright>

- <interface name="zwp_linux_dmabuf_v1" version="3">
+ <interface name="zwp_linux_dmabuf_v1" version="4">
<description summary="factory for creating dmabuf-based wl_buffers">
Following the interfaces from:
https://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_import.txt
@@ -35,6 +35,9 @@
the set of supported formats and format modifiers is sent with
'format' and 'modifier' events.

+ Clients can use the get_surface_device_hints request to get dmabuf hints
+ for a particular surface.
+
The following are required from clients:

- Clients must ensure that either all data in the dma-buf is
@@ -138,9 +141,19 @@
<arg name="modifier_lo" type="uint"
summary="low 32 bits of layout modifier"/>
</event>
+
+ <request name="get_surface_device_hints" since="4">
+ <description summary="get device hints for a surface">
+ This request creates a new wp_linux_dmabuf_device_hints object for the
+ specified wl_surface. This object will deliver hints about dmabuf
+ parameters to use for buffers attached to this surface.
+ </description>
+ <arg name="id" type="new_id" interface="zwp_linux_dmabuf_device_hints_v1"/>
+ <arg name="surface" type="object" interface="wl_surface"/>
+ </request>
</interface>

- <interface name="zwp_linux_buffer_params_v1" version="3">
+ <interface name="zwp_linux_buffer_params_v1" version="4">
<description summary="parameters for creating a dmabuf-based wl_buffer">
This temporary object is a collection of dmabufs and other
parameters that together form a single logical buffer. The temporary
@@ -345,4 +358,54 @@

</interface>

+ <interface name="zwp_linux_dmabuf_device_hints_v1" version="4">
+ <description summary="dmabuf device hints">
+ This object advertizes dmabuf hints for a surface. Such hints include the
+ primary device and the formats that are preferred for this surface.
+
+ These hints are sent once when this object is created and whenever they
+ change. The done event is always sent once after all hints have been sent.
+ </description>
+
+ <request name="destroy" type="destructor">
+ <description summary="destroy the device hints">
+ Using this request a client can tell the server that it is not going to
+ use the wp_linux_dmabuf_device_hints object anymore.
+ </description>
+ </request>
+
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
+ </description>
+ <arg name="fd" type="fd" summary="device file descriptor"/>
+ </event>
+
+ <event name="modifier">
+ <description summary="preferred buffer format modifier">
+ This event advertises the formats that the server prefers, along with
+ the modifiers preferred for each format.
+
+ For the definition of the format and modifier codes, see the
+ wp_linux_buffer_params::create request.
+ </description>
+ <arg name="format" type="uint" summary="DRM_FORMAT code"/>
+ <arg name="modifier_hi" type="uint"
+ summary="high 32 bits of layout modifier"/>
+ <arg name="modifier_lo" type="uint"
+ summary="low 32 bits of layout modifier"/>
+ </event>
+
+ <event name="done">
+ <description summary="all hints have been sent">
+ This event is sent after all properties of a
+ wp_linux_dmabuf_device_hints have been sent.
+
+ This allows changes to the wp_linux_dmabuf_device_hints properties to be
+ seen as atomic, even if they happen via multiple events.
+ </description>
+ </event>
+ </interface>
+
</protocol>
--
2.19.1
Daniel Stone
2018-11-01 17:04:51 UTC
Hi Simon,
Thanks a lot for taking this on! :)
Post by Simon Ser
This commit introduces a new wp_linux_dmabuf_device_hints object. This object
advertizes a preferred device via a file descriptor and a set of preferred
formats/modifiers.
s/advertizes/advertises/g (including in the XML doc)

I also think this would be better called
wp_linux_dmabuf_surface_hints, since the change over the dmabuf
protocol is that it's surface-specific.
Post by Simon Ser
+ <interface name="zwp_linux_dmabuf_device_hints_v1" version="4">
+ <description summary="dmabuf device hints">
+ This object advertizes dmabuf hints for a surface. Such hints include the
*advertises
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
+ </description>
+ <arg name="fd" type="fd" summary="device file descriptor"/>
+ </event>
I _think_ this might want to refer to separate objects.

When we receive an FD from the server, we don't know what device it
refers to, so we have to open the device to probe it. Opening the
device can be slow: if a device is in a low PCI power state, it can be
a couple of seconds to physically power up the device and then wait
for it to initialise before we can interrogate it.

One way around this would be to have a separate wp_linux_dmabuf_device
object, lazily sent as a new object in an event by the root
wp_linux_dmabuf object, with the per-surface hints then referring to a
previously-sent device. This would allow clients to only probe each
device once per EGLDisplay, rather than once per EGLSurface.
Post by Simon Ser
+ <event name="modifier">
+ <description summary="preferred buffer format modifier">
+ This event advertises the formats that the server prefers, along with
+ the modifiers preferred for each format.
+
+ For the definition of the format and modifier codes, see the
+ wp_linux_buffer_params::create request.
+ </description>
+ <arg name="format" type="uint" summary="DRM_FORMAT code"/>
+ <arg name="modifier_hi" type="uint"
+ summary="high 32 bits of layout modifier"/>
+ <arg name="modifier_lo" type="uint"
+ summary="low 32 bits of layout modifier"/>
+ </event>
I think we want another event here, to group sets of modifiers
together by preference.

For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.

DRI3 implements this by sending sets of modifiers in 'tranches', which
are arrays of arrays, which in this case would be:
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}

I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
would send:
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()

For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
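
To make the grouping concrete, here is a rough client-side sketch of
accumulating these events into tranches. The structures and handler names
are made up, and the actual Wayland listener wiring is omitted:

#include <stdint.h>
#include <stdlib.h>

struct fmt_mod {
    uint32_t format;
    uint64_t modifier;
};

struct tranche {
    struct fmt_mod *entries;
    size_t len;
};

struct hints_state {
    struct tranche *tranches;  /* ordered from most to least preferred */
    size_t n_tranches;
    struct tranche current;    /* entries received since the last barrier */
};

/* modifier event: append to the tranche currently being built
 * (error handling omitted for brevity) */
static void handle_modifier(struct hints_state *s, uint32_t format,
                            uint32_t mod_hi, uint32_t mod_lo)
{
    struct fmt_mod fm = {
        .format = format,
        .modifier = ((uint64_t)mod_hi << 32) | mod_lo,
    };

    s->current.entries = realloc(s->current.entries,
                                 (s->current.len + 1) * sizeof(fm));
    s->current.entries[s->current.len++] = fm;
}

/* barrier event: close the current tranche, start a less preferred one */
static void handle_barrier(struct hints_state *s)
{
    s->tranches = realloc(s->tranches,
                          (s->n_tranches + 1) * sizeof(*s->tranches));
    s->tranches[s->n_tranches++] = s->current;
    s->current = (struct tranche){0};
}

/* done event: the update is complete; the client can now pick the first
 * tranche that contains a format+modifier it can actually render with */
static void handle_done(struct hints_state *s)
{
    (void)s;
}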

What do you think?

Cheers,
Daniel
Pekka Paalanen
2018-11-02 08:53:28 UTC
On Thu, 1 Nov 2018 17:04:51 +0000
Post by Daniel Stone
Hi Simon,
Thanks a lot for taking this on! :)
Post by Simon Ser
This commit introduces a new wp_linux_dmabuf_device_hints object. This object
advertizes a preferred device via a file descriptor and a set of preferred
formats/modifiers.
s/advertizes/advertises/g (including in the XML doc)
I also think this would be better called
wp_linux_dmabuf_surface_hints, since the change over the dmabuf
protocol is that it's surface-specific.
Post by Simon Ser
+ <interface name="zwp_linux_dmabuf_device_hints_v1" version="4">
+ <description summary="dmabuf device hints">
+ This object advertizes dmabuf hints for a surface. Such hints include the
*advertises
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
+ </description>
+ <arg name="fd" type="fd" summary="device file descriptor"/>
+ </event>
I _think_ this might want to refer to separate objects.
When we receive an FD from the server, we don't know what device it
refers to, so we have to open the device to probe it. Opening the
device can be slow: if a device is in a low PCI power state, it can be
a couple of seconds to physically power up the device and then wait
for it to initialise before we can interrogate it.
Hi,

wouldn't drmGetDevice2() with flags=0 get us everything needed without
waking up a sleeping PCI device?

I just read it from Emil:
https://lists.freedesktop.org/archives/mesa-dev/2018-October/207447.html
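
A minimal sketch of that approach, assuming libdrm and that hint_fd is the
fd received in the primary_device event (the helper name is made up):

#include <stdbool.h>
#include <xf86drm.h>

/* flags=0 means the PCI revision is not read (no DRM_DEVICE_GET_PCI_REVISION),
 * which is the part that could otherwise wake a sleeping device */
static bool same_device(int hint_fd, int known_fd)
{
    drmDevicePtr a = NULL, b = NULL;
    bool same = false;

    if (drmGetDevice2(hint_fd, 0, &a) == 0 &&
        drmGetDevice2(known_fd, 0, &b) == 0)
        same = drmDevicesEqual(a, b);

    if (a)
        drmFreeDevice(&a);
    if (b)
        drmFreeDevice(&b);
    return same;
}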
Post by Daniel Stone
One way around this would be to have a separate wp_linux_dmabuf_device
object, lazily sent as a new object in an event by the root
wp_linux_dmabuf object, with the per-surface hints then referring to a
previously-sent device. This would allow clients to only probe each
device once per EGLDisplay, rather than once per EGLSurface.
This optimization does sound attractive to me in any case.
Post by Daniel Stone
Post by Simon Ser
+ <event name="modifier">
+ <description summary="preferred buffer format modifier">
+ This event advertises the formats that the server prefers, along with
+ the modifiers preferred for each format.
+
+ For the definition of the format and modifier codes, see the
+ wp_linux_buffer_params::create request.
+ </description>
+ <arg name="format" type="uint" summary="DRM_FORMAT code"/>
+ <arg name="modifier_hi" type="uint"
+ summary="high 32 bits of layout modifier"/>
+ <arg name="modifier_lo" type="uint"
+ summary="low 32 bits of layout modifier"/>
+ </event>
I think we want another event here, to group sets of modifiers
together by preference.
For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
Combination? I thought modifiers are never combined with other
modifiers?
Post by Daniel Stone
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.
DRI3 implements this by sending sets of modifiers in 'tranches', which
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}
I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()
Yeah. Another option is to send a wl_array of modifiers per format and
tranche.

I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
Post by Daniel Stone
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
At first I didn't understand this at all. I wonder if Simon is as
puzzled as I was. :-)

Is the idea of tranches such that within a tranche, a client will be able
to pick a modifier that is optimal for its rendering? This would convey
the knowledge that all modifiers within a tranche are equally good
for the compositor, so the client can pick what it can use best.

This is contrary to a flat preference list, where a client would pick
the first modifier it can use, even if it is less optimal than a later
modifier for its rendering, while for the compositor it would not make a
difference.

I'm also not sure I understand your tranche categories. Are you thinking
that, for instance, if a client uses same-GPU-composition modifiers
which exclude cross-GPU-composition that a compositor would start
copy-converting buffers if the composition no longer happens on the
same GPU, until the client adjusts to the new preference? That makes
sense, if I guessed right what you meant.

I'm wondering how the requirement "a compositor must always be able to
consume the buffer regardless of where it will be shown" is accounted
for here. Do we need a reminder about that in the spec?


Thanks,
pq
Simon Ser
2018-11-02 18:38:10 UTC
Post by Pekka Paalanen
Post by Daniel Stone
I think we want another event here, to group sets of modifiers
together by preference.
For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
Combination? I thought modifiers are never combined with other
modifiers?
I think Daniel refers to the format + modifier combination. Yes, modifiers
cannot be mixed with each other.
Post by Pekka Paalanen
Post by Daniel Stone
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.
DRI3 implements this by sending sets of modifiers in 'tranches', which
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}
I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?

I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".

It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
Post by Pekka Paalanen
Post by Daniel Stone
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
At first I didn't understand this at all. I wonder if Simon is as
puzzled as I was. :-)
Is the idea of tranches such that within a tranch, a client will be able
to pick a modifier that is optimal for its rendering? This would convey
the knowledge that all modifiers withing a tranch are equally good
for the compositor, so the client can pick what it can use the best.
This is contrary to a flat preference list, where a client would pick
the first modifier it can use, even if it is less optimal than a later
modifer for its rendering while for compositor it would not make a
difference.
Yeah, that's what I've understood too.
Post by Pekka Paalanen
I'm also not sure I understand your tranch categories. Are you thinking
that, for instance, if a client uses same-GPU-composition modifers
which exclude cross-GPU-composition that a compositor would start
copy-converting buffers if the composition no longer happens on the
same GPU, until the client adjusts to the new preference? That makes
sense, if I guessed right what you meant.
Right. I don't think we can do any better.
Post by Pekka Paalanen
I'm wondering how the requirement "a compositor must always be able to
consume the buffer regardless of where it will be shown" is accounted
for here. Do we need a reminder about that in the spec?
A reminder might be a good idea. The whole surface hints are just hints. The
client can choose to use another device or another format, and in the worst case
it'll just be more work and more copies on the compositor side.
Pekka Paalanen
2018-11-05 08:57:34 UTC
On Fri, 02 Nov 2018 18:38:10 +0000
Post by Simon Ser
Post by Pekka Paalanen
Post by Daniel Stone
I think we want another event here, to group sets of modifiers
together by preference.
For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
Combination? I thought modifiers are never combined with other
modifiers?
I think Daniel refers to the format + modifier combination. Yes, modifiers
cannot be mixed with each other.
Post by Pekka Paalanen
Post by Daniel Stone
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.
DRI3 implements this by sending sets of modifiers in 'tranches', which
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}
I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
For formats? Well, it is simpler without a wl_array, and there might be
a lot of formats.

Could there be a lot of modifiers per format? Would a wl_array make
anything easier? Just a thought.
Post by Simon Ser
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?
This interface is used to send clients hints after they are already
presenting, which means they already have a format chosen and probably
want to stick with it, just changing the modifiers to be more optimal.
Post by Simon Ser
I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".
Right, that's another use case.
Post by Simon Ser
It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
If the EGLConfig or GLXFBConfig or GLX visual lists are of any
indication... yes, they account for depth, stencil, aux, etc. but then
we will have modifiers.

We already advertise the list of everything supported of format+modifier
in the linux_dmabuf extension. Could we somehow minimize the number of
recommended format+modifiers in hints? Or maybe that's not a concern
for the protocol spec?
Post by Simon Ser
Post by Pekka Paalanen
Post by Daniel Stone
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
At first I didn't understand this at all. I wonder if Simon is as
puzzled as I was. :-)
Is the idea of tranches such that within a tranch, a client will be able
to pick a modifier that is optimal for its rendering? This would convey
the knowledge that all modifiers withing a tranch are equally good
for the compositor, so the client can pick what it can use the best.
This is contrary to a flat preference list, where a client would pick
the first modifier it can use, even if it is less optimal than a later
modifer for its rendering while for compositor it would not make a
difference.
Yeah, that's what I've understood too.
Post by Pekka Paalanen
I'm also not sure I understand your tranch categories. Are you thinking
that, for instance, if a client uses same-GPU-composition modifers
which exclude cross-GPU-composition that a compositor would start
copy-converting buffers if the composition no longer happens on the
same GPU, until the client adjusts to the new preference? That makes
sense, if I guessed right what you meant.
Right. I don't think we can do any better.
Post by Pekka Paalanen
I'm wondering how the requirement "a compositor must always be able to
consume the buffer regardless of where it will be shown" is accounted
for here. Do we need a reminder about that in the spec?
A reminder might be a good idea. The whole surface hints are just hints. The
client can choose to use another device or another format, and in the worst case
it'll just be more work and more copies on the compositor side.
Yeah. What I precisely mean is that even if a client chooses a
recommended format+modifier, the compositor will not be exempt from the
requirement that it must work always. I.e. a compositor cannot
advertise a format+modifier that would work only for scanout but not
for fallback composition, even if the surface is on scanout right now.


Thanks,
pq
Simon Ser
2018-11-10 13:34:31 UTC
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
For formats? Well, it is simpler without a wl_array, and there might be
a lot of formats.
Could there be a lot of modifiers per format? Would a wl_array make
anything easier? Just a thought.
It's true that for this list of formats sorted by preference, we'll probably
need to split modifiers anyway so I don't think we'd benefit a lot from this
approach.
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?
This interface is used to send clients hints after they are already
presenting, which means they already have a format chosen and probably
want to stick with it, just changing the modifiers to be more optimal.
If we only send the modifiers for the current format, how do clients tell the
difference between the initial hints (which don't have a "currently used
format") and the subsequent hints?
Post by Pekka Paalanen
Post by Simon Ser
I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".
Right, that's another use case.
Post by Simon Ser
It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
If the EGLConfig or GLXFBConfig or GLX visual lists are of any
indication... yes, they account for depth, stencil, aux, etc. but then
we will have modifiers.
We already advertise the list of everything supported of format+modifer
in the linux_dmabuf extension. Could we somehow minimize the number of
recommended format+modifiers in hints? Or maybe that's not a concern
for the protocol spec?
I'm not sure.

After this patch, I'm not even sure how the formats+modifiers advertised by the
global work. Are these formats+modifiers supported on the GPU the compositor
uses for rendering? Intersection or union of formats+modifiers supported on all
GPUs?
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
Post by Daniel Stone
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
At first I didn't understand this at all. I wonder if Simon is as
puzzled as I was. :-)
Is the idea of tranches such that within a tranch, a client will be able
to pick a modifier that is optimal for its rendering? This would convey
the knowledge that all modifiers withing a tranch are equally good
for the compositor, so the client can pick what it can use the best.
This is contrary to a flat preference list, where a client would pick
the first modifier it can use, even if it is less optimal than a later
modifer for its rendering while for compositor it would not make a
difference.
Yeah, that's what I've understood too.
Post by Pekka Paalanen
I'm also not sure I understand your tranch categories. Are you thinking
that, for instance, if a client uses same-GPU-composition modifers
which exclude cross-GPU-composition that a compositor would start
copy-converting buffers if the composition no longer happens on the
same GPU, until the client adjusts to the new preference? That makes
sense, if I guessed right what you meant.
Right. I don't think we can do any better.
Post by Pekka Paalanen
I'm wondering how the requirement "a compositor must always be able to
consume the buffer regardless of where it will be shown" is accounted
for here. Do we need a reminder about that in the spec?
A reminder might be a good idea. The whole surface hints are just hints. The
client can choose to use another device or another format, and in the worst case
it'll just be more work and more copies on the compositor side.
Yeah. What I precisely mean is that even if a client chooses a
recommended format+modifier, the compositor will not be exempt from the
requirement that it must work always. I.e. a compositor cannot
advertise a format+modifier that would work only for scanout but not
for fallback composition, even if the surface is on scanout right now.
Yeah, this makes sense.
Pekka Paalanen
2018-11-12 09:18:16 UTC
On Sat, 10 Nov 2018 13:34:31 +0000
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
For formats? Well, it is simpler without a wl_array, and there might be
a lot of formats.
Could there be a lot of modifiers per format? Would a wl_array make
anything easier? Just a thought.
It's true that for this list of formats sorted by preference, we'll probably
need to split modifiers anyway so I don't think we'd benefit a lot from this
approach.
Hi Simon,

just to be clear, I was thinking of something like:

event(uint format, wl_array(modifiers))

But I definitely do not insist on it if you don't see any obvious
benefits with it.

It seems you and I made very different assumptions on how the hints
would be sent, I only realized it just now. More about that below.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?
This interface is used to send clients hints after they are already
presenting, which means they already have a format chosen and probably
want to stick with it, just changing the modifiers to be more optimal.
If we only send the modifiers for the current format, how do clients tell the
difference between the initial hints (which don't have a "currently used
format") and the subsequent hints?
I'm not sure I understand why they would need to see the difference.
But yes, I was short-sighted here and didn't consider the
initialization when a surface is not mapped yet. I didn't expect that
hints can be calculated if the surface is not mapped, but of course a
compositor can provide some defaults. I suppose the initial default
hints would boil down to what is most efficient to composite.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".
Right, that's another use case.
Post by Simon Ser
It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
If the EGLConfig or GLXFBConfig or GLX visual lists are of any
indication... yes, they account for depth, stencil, aux, etc. but then
we will have modifiers.
We already advertise the list of everything supported of format+modifer
in the linux_dmabuf extension. Could we somehow minimize the number of
recommended format+modifiers in hints? Or maybe that's not a concern
for the protocol spec?
I'm not sure.
After this patch, I'm not even sure how the formats+modifiers advertised by the
global work. Are these formats+modifiers supported on the GPU the compositor
uses for rendering? Intersection or union of formats+modifiers supported on all
GPUs?
The format+modifier pairs advertised by the global before this patch are the
ones that can work at all, or the compositor is willing to make them
work at least in the worst fallback case. This patch must not change
that meaning. These formats also must always work regardless of which
GPU a client decides to use, but that is already implied by the
compositor being able to import a dmabuf. The compositor does not need
to try to factor in what other GPUs on the system might be able to
render or not, that is for the client to figure out when it knows the
formats the compositor can accept and is choosing a GPU to render with.
It is theoretically possible that a client tries to use a GPU that
cannot render any formats the compositor can use, but that is the
client's responsibility to figure out.

So clearly the formats from the global can be used by a client at any
time. The hint formats OTOH have no reason to list absolutely
everything the compositor supports, but a compositor can choose on its
own judgement to send only a sub-set it would prefer.

However, after a client has picked a format and used it, then there
should be hints with that format, at least if they can make any
difference.

I'm not sure. Not listing everything always was my intuitive
assumption, and I believe you perhaps assumed the opposite so that a
client has absolutely all the information to e.g. optimize the modifier
of a format that the compositor would not prefer at all even though it
does work.

It would be simpler to always send everything, but that will be much
more protocol traffic. Would it be too much? I don't know, could you
calculate some examples of how many bytes a typical hints update would
be if sending everything always?


Thanks,
pq
Simon Ser
2018-11-12 12:16:04 UTC
Post by Pekka Paalanen
On Sat, 10 Nov 2018 13:34:31 +0000
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
For formats? Well, it is simpler without a wl_array, and there might be
a lot of formats.
Could there be a lot of modifiers per format? Would a wl_array make
anything easier? Just a thought.
It's true that for this list of formats sorted by preference, we'll probably
need to split modifiers anyway so I don't think we'd benefit a lot from this
approach.
Hi Simon,
event(uint format, wl_array(modifiers))
But I definitely do not insist on it if you don't see any obvious
benefits with it.
Yeah. I think the benefits would not be substantial as we need to "split" these
to order them by preference. It would look like this:

event(format1, wl_array(modifiers))
barrier()
event(format1, wl_array(modifiers))
event(format2, wl_array(modifiers))
barrier()
event(format1, wl_array(modifiers))
barrier()

Also this is not consistent with the rest of the protocol. Maybe we can discuss
this again for linux-dmabuf-unstable-v2.
Post by Pekka Paalanen
It seems you and I made very different assumptions on how the hints
would be sent, I only realized it just now. More about that below.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?
This interface is used to send clients hints after they are already
presenting, which means they already have a format chosen and probably
want to stick with it, just changing the modifiers to be more optimal.
If we only send the modifiers for the current format, how do clients tell the
difference between the initial hints (which don't have a "currently used
format") and the subsequent hints?
I'm not sure I understand why they would need to see the difference.
But yes, I was short-sighted here and didn't consider the
initialization when a surface is not mapped yet. I didn't expect that
hints can be calculated if the surface is not mapped, but of course a
compositor can provide some defaults. I suppose the initial default
hints would boil down to what is most efficient to composite.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".
Right, that's another use case.
Post by Simon Ser
It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
If the EGLConfig or GLXFBConfig or GLX visual lists are of any
indication... yes, they account for depth, stencil, aux, etc. but then
we will have modifiers.
We already advertise the list of everything supported of format+modifer
in the linux_dmabuf extension. Could we somehow minimize the number of
recommended format+modifiers in hints? Or maybe that's not a concern
for the protocol spec?
I'm not sure.
After this patch, I'm not even sure how the formats+modifiers advertised by the
global work. Are these formats+modifiers supported on the GPU the compositor
uses for rendering? Intersection or union of formats+modifiers supported on all
GPUs?
The format+modifier advertised by the global before this patch are the
ones that can work at all, or the compositor is willing to make them
work at least in the worst fallback case. This patch must not change
that meaning. These formats also must always work regardless of which
GPU a client decides to use, but that is already implied by the
compositor being able to import a dmabuf. The compositor does not need
to try to factor in what other GPUs on the system might be able to
render or not, that is for the client to figure out when it knows the
formats the compositor can accept and is choosing a GPU to render with.
It is theoretically possible that a client tries to use a GPU that
cannot render any formats the compositor can use, but that is the
client's responsibility to figure out.
Okay, that makes sense. And if a GPU doesn't support direct scan-out for some
format+modifier, then it can always fall back to good ol' compositing.
Post by Pekka Paalanen
So clearly the formats from the global can be used by a client at any
time. The hint formats OTOH has no reason to list absolutely
everything the compositor supports, but a compositor can choose on its
own judgement to send only a sub-set it would prefer.
Yes, this makes sense.
Post by Pekka Paalanen
However, after a client has picked a format and used it, then there
should be hints with that format, at least if they can make any
difference.
Okay, I get it now. :)
Post by Pekka Paalanen
I'm not sure. Not listing everything always was my intuitive
assumption, and I believe you perhaps assumed the opposite so that a
client has absolutely all the information to e.g. optimize the modifier
of a format that the compositor would not prefer at all even though it
does work.
It would be simpler to always send everything, but that will be much
more protocol traffic. Would it be too much? I don't know, could you
calculate some examples of how many bytes a typical hints update would
be if sending everything always?
If I'm understanding the protocol marshalling right, each event would have:

* A 2*32-bit header for sender, size and opcode
* One 32-bit field for the format
* A 2*32-bit field for the modifier

Total 5*32 = 160 bits, ie. 20 bytes per event.

My current setup lists ~100 format+modifier combinations. So that means each
time a client binds to wp_linux_dmabuf, ~2KB of data is sent. That's a lot, your
concerns are correct.

(In a future version of the protocol, maybe we could use shared memory, just
like the wl_keyboard keymap.)

In the meantime, we could decide to do as you suggest. So the compositor would
always advertise a subset of the supported modifiers. When the hints object is
created, the compositor would send its preferred format+modifier pairs. When the
client submits a buffer with a new format, the compositor can decide to send the
preferred modifiers for this format. I wonder how we should phrase this in the
protocol (can/should/must?). Thoughts?

In the case where a client is scanned out and the GPU used for scan-out
supports more formats than the GPU used for compositing, as discussed before the
compositor won't be able to advertise these additional formats, because the
client could keep using them when not scanned out anymore.
Pekka Paalanen
2018-11-12 12:48:19 UTC
On Mon, 12 Nov 2018 12:16:04 +0000
Post by Simon Ser
Post by Pekka Paalanen
On Sat, 10 Nov 2018 13:34:31 +0000
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
Yeah. Another option is to send a wl_array of modifiers per format and
tranch.
True. Any reason why this hasn't been done in the global?
For formats? Well, it is simpler without a wl_array, and there might be
a lot of formats.
Could there be a lot of modifiers per format? Would a wl_array make
anything easier? Just a thought.
It's true that for this list of formats sorted by preference, we'll probably
need to split modifiers anyway so I don't think we'd benefit a lot from this
approach.
Hi Simon,
event(uint format, wl_array(modifiers))
But I definitely do not insist on it if you don't see any obvious
benefits with it.
Yeah. I think the benefits would not be substantial as we need to "split" these
event(format1, wl_array(modifiers))
barrier()
event(format1, wl_array(modifiers))
event(format2, wl_array(modifiers))
barrier()
event(format1, wl_array(modifiers))
barrier()
Also this is not consistent with the rest of the protocol. Maybe we can discuss
this again for linux-dmabuf-unstable-v2.
Post by Pekka Paalanen
It seems you and I made very different assumptions on how the hints
would be sent, I only realized it just now. More about that below.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
Post by Pekka Paalanen
I suppose it will be enough to send tranches for just the currently
used format? Otherwise it could be "a lot" of data.
What do you mean by "the currently used format"?
This interface is used to send clients hints after they are already
presenting, which means they already have a format chosen and probably
want to stick with it, just changing the modifiers to be more optimal.
If we only send the modifiers for the current format, how do clients tell the
difference between the initial hints (which don't have a "currently used
format") and the subsequent hints?
I'm not sure I understand why they would need to see the difference.
But yes, I was short-sighted here and didn't consider the
initialization when a surface is not mapped yet. I didn't expect that
hints can be calculated if the surface is not mapped, but of course a
compositor can provide some defaults. I suppose the initial default
hints would boil down to what is most efficient to composite.
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
I expect clients to bind to this interface and create a surface hints object
before the surface is mapped. In this case there's no "currently used format".
Right, that's another use case.
Post by Simon Ser
It will be a fair amount of data, yes. However it's just a list of integers.
When we send strings over the protocol (e.g. toplevel title in xdg-shell) it's
about the same amount of data I guess.
If the EGLConfig or GLXFBConfig or GLX visual lists are of any
indication... yes, they account for depth, stencil, aux, etc. but then
we will have modifiers.
We already advertise the list of everything supported of format+modifer
in the linux_dmabuf extension. Could we somehow minimize the number of
recommended format+modifiers in hints? Or maybe that's not a concern
for the protocol spec?
I'm not sure.
After this patch, I'm not even sure how the formats+modifiers advertised by the
global work. Are these formats+modifiers supported on the GPU the compositor
uses for rendering? Intersection or union of formats+modifiers supported on all
GPUs?
The format+modifier advertised by the global before this patch are the
ones that can work at all, or the compositor is willing to make them
work at least in the worst fallback case. This patch must not change
that meaning. These formats also must always work regardless of which
GPU a client decides to use, but that is already implied by the
compositor being able to import a dmabuf. The compositor does not need
to try to factor in what other GPUs on the system might be able to
render or not, that is for the client to figure out when it knows the
formats the compositor can accept and is choosing a GPU to render with.
It is theoretically possible that a client tries to use a GPU that
cannot render any formats the compositor can use, but that is the
client's responsibility to figure out.
Okay, that makes sense. And if a GPU doesn't support direct scan-out for some
format+modifier, then it can always fallback to good ol' compositing.
Post by Pekka Paalanen
So clearly the formats from the global can be used by a client at any
time. The hint formats OTOH has no reason to list absolutely
everything the compositor supports, but a compositor can choose on its
own judgement to send only a sub-set it would prefer.
Yes, this makes sense.
Post by Pekka Paalanen
However, after a client has picked a format and used it, then there
should be hints with that format, at least if they can make any
difference.
Okay, I get it now. :)
Post by Pekka Paalanen
I'm not sure. Not listing everything always was my intuitive
assumption, and I believe you perhaps assumed the opposite so that a
client has absolutely all the information to e.g. optimize the modifier
of a format that the compositor would not prefer at all even though it
does work.
It would be simpler to always send everything, but that will be much
more protocol traffic. Would it be too much? I don't know, could you
calculate some examples of how many bytes a typical hints update would
be if sending everything always?
* A 2*32-bit header for sender, size and opcode
* One 32-bit field for the format
* A 2*32-bit field for the modifier
Total 5*32 = 160 bits, ie. 20 bytes per event.
My current setup lists ~100 format+modifier combinations. So that means each
time a client binds to wp_linux_dmabuf, ~2KB of data is sent. That's a lot, your
concerns are correct.
(In a future version of the protocol, maybe we could use shared memory, just
like the wl_keyboard keymap.)
In the meantime, we could decide to do as you suggest. So the compositor would
always advertise a subset of the supported modifiers. When the hints object is
created, the compositor would send its preferred format+modifier pairs. When the
client submits a buffer with a new format, the compositor can decide to send the
preferred modifiers for this format. I wonder how we should phrase this in the
protocol (can/should/must?). Thoughts?
Yeah. It can be left to the compositor implementation to decide which
format+modifiers it suggests, while clients are free to pick any
supported format+modifier they want.

Quite likely we need to revisit this in any case. Using shared memory
feels complicated, but OTOH it *is* a relatively large amount of data. Even the
kernel UABI does not use a flat list of format+modifier but a fairly
"interesting" bitfield encoding. That's probably not appropriate for
Wayland though, so maybe we have to use shared memory for it.

I wonder if there could be a yet another option...
Post by Simon Ser
In the case where a client is scanned out and the GPU used for scan-out
supports more formats than the GPU used for compositing, as discussed before the
compositor won't be able to advertise these additional formats, because the
client could keep using them when not scanned out anymore.
Yes. It is actually unavoidable, because the compositor first makes the
decision to stop scanning out while it is already compositing, then
sends the new set of hints, and only afterwards can the client use the
new hints.


Thanks,
pq
Pekka Paalanen
2018-11-12 13:54:07 UTC
On Mon, 12 Nov 2018 14:48:19 +0200
Post by Pekka Paalanen
Quite likely we need to revisit this in any case. Using shared memory
feels complicated, but OTOH it *is* a relatively large amount of data. Even the
kernel UABI does not use a flat list of format+modifier but a fairly
"interesting" bitfield encoding. That's probably not appropriate for
Wayland though, so maybe we have to use shared memory for it.
Hi,

having thought about this, I have the feeling that Wayland handles tiny
bits of data well as protocol messages and large chunks of data as
shared memory file descriptors, but it seems we lack a good solution
for intermediate-sized data in the range of 1 kB - 8 kB, just to
throw some random numbers out there.

It is too risky to put these through the protocol messages inline, but
the trouble of setting up a shared memory file seems disproportionate
to the amount of data. Yet, setting up a shared memory file seems to be
the only solution, since sending the data inline is too risky.

I started wondering if we should have a generic shared memory
interface, something like the following sketch of a Wayland extension.

interface shm_factory
Is the global.

- request: create_shm_file(new shm_file, fd, size, seals, direction)
Creates a new shm_file object that refers to the memory backing
the fd, of the given size, and being sealed with the mentioned
seals. Direction means whether the server or the client will be
the writer, so this will be a one-way street but a re-usable
one.

(This is a good chance to get memfd and seals properly used.)

interface shm_file
Represents a piece of shared memory. Comes in two mutually
exclusive flavours:
- server-writable
- client-writable
Has a fixed size.

The usage pattern is that the writer signals the reader when it
needs to copy the data out. This is done by a custom protocol
message carrying a shm_file as an argument, which makes the
shm_file read-locked. The reader copies the data out of the
shared memory and sends client_read_done or server_read_done
ASAP, releasing the read-lock. While the shm_file is
read-locked, the writer may not write into it. While the
shm_file is not read-locked, the reader may not read it.

- request: client_read_done
Sent by the client when it has copied the data out. Releases
the read-lock.

- event: server_read_done
Sent by the server when it has copied the data out. Releases
the read-lock.


When e.g. zwp_linux_dmabuf provides the list of pixel formats and
modifiers, the server needs to first send the required shared memory
size to the client; the client then creates a server-writable shm_file and
sends it to the server. The server fills in the data and sends an event
with the shm_file as an argument that tells the client to read it (sets
the read-lock). The rest goes according to the generic protocol
above.

Why all the roundtripping to get the shm_file created?

Because I would prefer the memory allocation is always
accounted to the client, not the server. We should try to keep
server allocations on behalf of clients to a minimum so that
OOM killer etc. can find the right culprit.

Why so much copying?

Because the amount of data should be small enough that copying
it is insignificant. By assuming that readers maintain their
own copy, the protocol is simpler. No need to juggle multiple
shm_files like we do with wl_buffers.

Why unidirectional?

To keep it simple. Need bidirectional transfers? Make one
shm_file for each direction.

Isn't creating and tearing down shared memory relatively expensive?

Yes, but shm_file is meant to be repeatedly re-used. After
reader has read, the writer can write again. No need to tear it
down, if you expect repeated transfers.
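
As a rough sketch of the memfd/seals part, the client-side allocation could
look like this (the seal set is only a suggestion; F_SEAL_WRITE is left out
on purpose because the server is the writer for a server-writable shm_file):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int create_sealed_shm(size_t size)
{
    /* size would be the value announced by the server */
    int fd = memfd_create("dmabuf-hints", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, size) < 0 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0) {
        close(fd);
        return -1;
    }

    /* the fd then gets passed in create_shm_file() together with the size,
     * the seals and the direction */
    return fd;
}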


While writing this, I have a strong feeling I am reinventing the wheel
here...

Just throwing this idea out there, not sure if it was a good one.


Thanks,
pq
Simon Ser
2018-11-02 09:00:09 UTC
Hi,

Thanks for your review!
Post by Daniel Stone
Post by Simon Ser
This commit introduces a new wp_linux_dmabuf_device_hints object. This object
advertizes a preferred device via a file descriptor and a set of preferred
formats/modifiers.
s/advertizes/advertises/g (including in the XML doc)
Ah, it seems that for once British English and American English agree on the
spelling. Noted!
Post by Daniel Stone
I also think this would be better called
wp_linux_dmabuf_surface_hints, since the change over the dmabuf
protocol is that it's surface-specific.
Right. The intent was to be able to re-use this object for hints not bound to
surfaces in the future. But better not to try to think of all possible
extensions (which will probably have different requirements).

Updated to use wp_linux_dmabuf_surface_hints.
Post by Daniel Stone
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
+ </description>
+ <arg name="fd" type="fd" summary="device file descriptor"/>
+ </event>
I _think_ this might want to refer to separate objects.
When we receive an FD from the server, we don't know what device it
refers to, so we have to open the device to probe it. Opening the
device can be slow: if a device is in a low PCI power state, it can be
a couple of seconds to physically power up the device and then wait
for it to initialise before we can interrogate it.
One way around this would be to have a separate wp_linux_dmabuf_device
object, lazily sent as a new object in an event by the root
wp_linux_dmabuf object, with the per-surface hints then referring to a
previously-sent device. This would allow clients to only probe each
device once per EGLDisplay, rather than once per EGLSurface.
I see. One other way to fix this issue would be to keep the protocol as-is but
to make the client use stat(3p) to check if it doesn't already know about the
fd it received. From stat(3p) [1]:

    The st_ino and st_dev fields taken together uniquely identify the file within
    the system.

This would remove the overhead and complexity of server-allocated objects, which
are hard to tear down.
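
A minimal sketch of what that check could look like (the helper name is made
up, and it of course only matches when both fds refer to the same device
node):

#include <stdbool.h>
#include <sys/stat.h>

static bool refers_to_known_device(int hint_fd, int known_fd)
{
    struct stat a, b;

    if (fstat(hint_fd, &a) != 0 || fstat(known_fd, &b) != 0)
        return false;

    /* st_dev + st_ino together uniquely identify the device node */
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}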

But I'm maybe missing some use-cases here?

[1]: http://pubs.opengroup.org/onlinepubs/009696699/basedefs/sys/stat.h.html
Post by Daniel Stone
Post by Simon Ser
+ <event name="modifier">
+ <description summary="preferred buffer format modifier">
+ This event advertises the formats that the server prefers, along with
+ the modifiers preferred for each format.
+
+ For the definition of the format and modifier codes, see the
+ wp_linux_buffer_params::create request.
+ </description>
+ <arg name="format" type="uint" summary="DRM_FORMAT code"/>
+ <arg name="modifier_hi" type="uint"
+ summary="high 32 bits of layout modifier"/>
+ <arg name="modifier_lo" type="uint"
+ summary="low 32 bits of layout modifier"/>
+ </event>
I think we want another event here, to group sets of modifiers
together by preference.
For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.
DRI3 implements this by sending sets of modifiers in 'tranches', which
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}
I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
This seems like a good idea. Other solutions include having an enum for tranches
(preferred, fallback, etc) but that restricts the number of tranches. Using
tranche indexes makes the protocol more complicated. So your idea LGTM.

I'll also change the wording from "preferred" to "supported in order of
preference".

I have another question: what if the compositor doesn't know about the preferred
device? For instance if it's running nested in another Wayland compositor that
doesn't support this new protocol version. Maybe we should make all events
optional to let the compositor say "I have no idea"?
Philipp Zabel
2018-11-02 11:30:46 UTC
Post by Daniel Stone
Hi Simon,
Thanks a lot for taking this on! :)
Post by Simon Ser
This commit introduces a new wp_linux_dmabuf_device_hints object. This object
advertizes a preferred device via a file descriptor and a set of preferred
formats/modifiers.
s/advertizes/advertises/g (including in the XML doc)
I also think this would be better called
wp_linux_dmabuf_surface_hints, since the change over the dmabuf
protocol is that it's surface-specific.
Post by Simon Ser
+ <interface name="zwp_linux_dmabuf_device_hints_v1" version="4">
+ <description summary="dmabuf device hints">
+ This object advertizes dmabuf hints for a surface. Such hints include the
*advertises
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
Which device should this be if the scanout engine is separate from the
render engine (e.g. IPU/imx-drm and GPU/etnaviv on i.MX6)

[...]
Post by Daniel Stone
Post by Simon Ser
+ <event name="modifier">
+ <description summary="preferred buffer format modifier">
+ This event advertises the formats that the server prefers, along with
+ the modifiers preferred for each format.
+
+ For the definition of the format and modifier codes, see the
+ wp_linux_buffer_params::create request.
+ </description>
+ <arg name="format" type="uint" summary="DRM_FORMAT code"/>
+ <arg name="modifier_hi" type="uint"
+ summary="high 32 bits of layout modifier"/>
+ <arg name="modifier_lo" type="uint"
+ summary="low 32 bits of layout modifier"/>
+ </event>
I think we want another event here, to group sets of modifiers
together by preference.
For example, say the surface could be directly scanned out, but only
if it uses the linear or X-tiled modifiers. Our surface-preferred
modifiers would be LINEAR + X_TILED. However, the client may not be
able to produce that combination. If the GPU still supports Y_TILED,
then we want to indicate that the client _can_ use Y_TILED if it needs
to, but _should_ use LINEAR or X_TILED.
DRI3 implements this by sending sets of modifiers in 'tranches', which
group format+modifier pairs by order of preference, e.g.:
tranches = {
[0 /* optimal */] = {
{ .format = XRGB8888, .modifier = LINEAR }
{ .format = XRGB8888, .modifier = X_TILED }
},
[1 /* less optimal */] = {
{ .format = XRGB8888, .modifier = Y_TILED }
}
}
I imagine the best way to do it with Wayland events would be to add a
'marker' event to indicate the border between these tranches. So we
would send something like:
modifier(XRGB8888, LINEAR)
modifier(XRGB8888, X_TILED)
barrier()
modifier(XRGB8888, Y_TILED)
barrier()
done()
For a simple 'GPU composition or scanout' case, this would only be two
tranches, which are 'most optimal' and 'fallback'. For multiple GPUs
though, we could end up with three tranches: scanout-capable,
same-GPU-composition, or cross-GPU-composition. Similarly, if we take
media recording into account, we could end up with more than two
tranches.
What do you think?
What about contiguous vs non-contiguous memory?

On i.MX6QP (Vivante GC3000) we would probably want the client to always
render DRM_FORMAT_MOD_VIVANTE_SUPER_TILED, because this can be directly
read by both texture samplers (non-contiguous) and scanout (must be
contiguous).

On i.MX6Q (Vivante GC2000) we always want to use the most efficient
DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED, because neither of the
supported render formats can be sampled or scanned out directly.
Since the compositor has to resolve into DRM_FORMAT_MOD_VIVANTE_TILED
(non-contiguous) for texture sampling or DRM_FORMAT_MOD_LINEAR
(contiguous) for scanout, the client buffers can always be non-
contiguous.

On i.MX6S (Vivante GC880) the optimal render format for texture sampling
would be DRM_FORMAT_MOD_VIVANTE_TILED (non-contiguous) and for scanout
DRM_FORMAT_MOD_VIVANTE_SUPER_TILED (non-contiguous) which would be
resolved into DRM_FORMAT_MOD_LINEAR (contiguous) by the compositor.

All three could always handle DRM_FORMAT_MOD_LINEAR (contiguous) client
buffers for scanout directly, but those would be suboptimal if the
compositor decides to render on short notice, because the client would
have already resolved into linear and then the compositor would have to
resolve back into a texture sampler tiling format.

regards
Philipp
Simon Ser
2018-11-02 18:49:35 UTC
Permalink
Post by Philipp Zabel
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
Which device should this be if the scanout engine is separate from the
render engine (e.g. IPU/imx-drm and GPU/etnaviv on i.MX6)
When the surface hints are created, I expect the compositor to send the device
it uses for compositing as the primary device (assuming it's using only one
device).

When the surface becomes fullscreen on a different GPU (meaning it becomes
fullscreen on an output which is managed by another GPU), I'd expect the
compositor to change the primary device for this surface to this other GPU.

If the compositor uses multiple devices for compositing, it'll probably switch
the primary device when the surface is moved from one GPU to the other.

I'm not sure how i.MX6 works, but: even if the same GPU is used for compositing
and scanout, when the compositing preferred formats are different from the
scanout preferred formats, the compositor can update the preferred format
without changing the preferred device.

Is there an issue with this? Maybe something should be added to the protocol to
explain it better?
Post by Philipp Zabel
What about contiguous vs non-contiguous memory?
On i.MX6QP (Vivante GC3000) we would probably want the client to always
render DRM_FORMAT_MOD_VIVANTE_SUPER_TILED, because this can be directly
read by both texture samplers (non-contiguous) and scanout (must be
contiguous).
On i.MX6Q (Vivante GC2000) we always want to use the most efficient
DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED, because neither of the
supported render formats can be sampled or scanned out directly.
Since the compositor has to resolve into DRM_FORMAT_MOD_VIVANTE_TILED
(non-contiguous) for texture sampling or DRM_FORMAT_MOD_LINEAR
(contiguous) for scanout, the client buffers can always be non-
contiguous.
On i.MX6S (Vivante GC880) the optimal render format for texture sampling
would be DRM_FORMAT_MOD_VIVANTE_TILED (non-contiguous) and for scanout
DRM_FORMAT_MOD_VIVANTE_SUPER_TILED (non-contiguous) which would be
resolved into DRM_FORMAT_MOD_LINEAR (contiguous) by the compositor.
I think all of this works with Daniel's design.
Post by Philipp Zabel
All three could always handle DRM_FORMAT_MOD_LINEAR (contiguous) client
buffers for scanout directly, but those would be suboptimal if the
compositor decides to render on short notice, because the client would
have already resolved into linear and then the compositor would have to
resolve back into a texture sampler tiling format.
Is the concern here that switching between scanout and compositing is
non-optimal until the client chooses the preferred format?
Philipp Zabel
2018-11-12 17:43:38 UTC
Permalink
Hi Simon,
Post by Simon Ser
Post by Philipp Zabel
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
Which device should this be if the scanout engine is separate from the
render engine (e.g. IPU/imx-drm and GPU/etnaviv on i.MX6)
When the surface hints are created, I expect the compositor to send the device
it uses for compositing as the primary device (assuming it's using only one
device).
i.MX6 has a separate scanout device without any acceleration capabilities
except some hardware overlay planes, and a pure GPU render device without
any connection to the outside world. The compositor uses both devices for
compositing and output.
Post by Simon Ser
Post by Philipp Zabel
When the surface becomes fullscreen on a different GPU (meaning it becomes
fullscreen on an output which is managed by another GPU), I'd expect the
compositor to change the primary device for this surface to this other GPU.
If the compositor uses multiple devices for compositing, it'll probably switch
the primary device when the surface is moved from one GPU to the other.
I'm not sure how i.MX6 works, but: even if the same GPU is used for compositing
and scanout, but the compositing preferred formats are different from the
scanout preferred formats, the compositor can update the preferred format
without changing the preferred device.
Is there an issue with this? Maybe something should be added to the protocol to
explain it better?
It is not clear to me from the protocol description whether the primary
device means the scanout engine or the GPU, in case they are different.

What is the client process supposed to do with this fd? Is it expected
to be able to render on this device? Or use it to allocate the optimal
buffers?
Post by Simon Ser
Post by Philipp Zabel
What about contiguous vs non-contiguous memory?
On i.MX6QP (Vivante GC3000) we would probably want the client to always
render DRM_FORMAT_MOD_VIVANTE_SUPER_TILED, because this can be directly
read by both texture samplers (non-contiguous) and scanout (must be
contiguous).
On i.MX6Q (Vivante GC2000) we always want to use the most efficient
DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED, because neither of the
supported render formats can be sampled or scanned out directly.
Since the compositor has to resolve into DRM_FORMAT_MOD_VIVANTE_TILED
(non-contiguous) for texture sampling or DRM_FORMAT_MOD_LINEAR
(contiguous) for scanout, the client buffers can always be non-
contiguous.
On i.MX6S (Vivante GC880) the optimal render format for texture sampling
would be DRM_FORMAT_MOD_VIVANTE_TILED (non-contiguous) and for scanout
DRM_FORMAT_MOD_VIVANTE_SUPER_TILED (non-contiguous) which would be
resolved into DRM_FORMAT_MOD_LINEAR (contiguous) by the compositor.
I think all of this works with Daniel's design.
Post by Philipp Zabel
All three could always handle DRM_FORMAT_MOD_LINEAR (contiguous) client
buffers for scanout directly, but those would be suboptimal if the
compositor decides to render on short notice, because the client would
have already resolved into linear and then the compositor would have to
resolve back into a texture sampler tiling format.
Is the concern here that switching between scanout and compositing is
non-optimal until the client chooses the preferred format?
My point is just that whether or not the buffer must be contiguous in
physical memory is the essential piece of information on i.MX6QP,
whereas the optimal tiling modifier is the same for both GPU composition
and direct scanout cases.

If the client provides non-contiguous buffers, the "optimal" tiling
doesn't help one bit in the scanout case, as the scanout hardware can't
read from those.

regards
Philipp
Simon Ser
2018-11-13 18:19:29 UTC
Permalink
Post by Daniel Stone
Hi Simon,
Post by Simon Ser
Post by Philipp Zabel
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
Which device should this be if the scanout engine is separate from the
render engine (e.g. IPU/imx-drm and GPU/etnaviv on i.MX6)
When the surface hints are created, I expect the compositor to send the device
it uses for compositing as the primary device (assuming it's using only one
device).
i.MX6 has a separate scanout device without any acceleration capabilities
except some hardware overlay planes, and a pure GPU render device without
any connection to the outside world. The compositor uses both devices for
compositing and output.
But most of the time, client buffers will go through compositing. So the
primary device is still the render device.

The situation doesn't change a lot compared to wl_drm to be honest. The device
that is advertised via wl_drm will be the primary device advertised by this
protocol.

Maybe when the compositor decides to scan out a client, it can switch the
primary device to the scan-out device. Sorry, I don't know enough about these
particular devices to say for sure.
Post by Daniel Stone
Post by Simon Ser
Post by Philipp Zabel
When the surface becomes fullscreen on a different GPU (meaning it becomes
fullscreen on an output which is managed by another GPU), I'd expect the
compositor to change the primary device for this surface to this other GPU.
If the compositor uses multiple devices for compositing, it'll probably switch
the primary device when the surface is moved from one GPU to the other.
I'm not sure how i.MX6 works, but: even if the same GPU is used for compositing
and scanout, but the compositing preferred formats are different from the
scanout preferred formats, the compositor can update the preferred format
without changing the preferred device.
Is there an issue with this? Maybe something should be added to the protocol to
explain it better?
It is not clear to me from the protocol description whether the primary
device means the scanout engine or the GPU, in case they are different.
What is the client process supposed to do with this fd? Is it expected
to be able to render on this device? Or use it to allocate the optimal
buffers?
The client is expected to allocate its buffers there. I'm not sure about
rendering.
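
One generic thing a client can do with the FD (assuming libdrm) is map it back
to a render node it can open for itself, e.g.:

#include <fcntl.h>
#include <stdlib.h>
#include <xf86drm.h>

/* Open the render node matching the FD received in the primary_device event,
 * so the client can allocate (and possibly render) on the same device. */
static int open_render_node_for(int primary_device_fd)
{
	char *path = drmGetRenderDeviceNameFromFd(primary_device_fd);
	if (path == NULL)
		return -1;
	int fd = open(path, O_RDWR | O_CLOEXEC);
	free(path);
	return fd;
}
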
Post by Daniel Stone
Post by Simon Ser
Post by Philipp Zabel
What about contiguous vs non-contiguous memory?
On i.MX6QP (Vivante GC3000) we would probably want the client to always
render DRM_FORMAT_MOD_VIVANTE_SUPER_TILED, because this can be directly
read by both texture samplers (non-contiguous) and scanout (must be
contiguous).
On i.MX6Q (Vivante GC2000) we always want to use the most efficient
DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED, because neither of the
supported render formats can be sampled or scanned out directly.
Since the compositor has to resolve into DRM_FORMAT_MOD_VIVANTE_TILED
(non-contiguous) for texture sampling or DRM_FORMAT_MOD_LINEAR
(contiguous) for scanout, the client buffers can always be non-
contiguous.
On i.MX6S (Vivante GC880) the optimal render format for texture sampling
would be DRM_FORMAT_MOD_VIVANTE_TILED (non-contiguous) and for scanout
DRM_FORMAT_MOD_VIVANTE_SUPER_TILED (non-contiguous) which would be
resolved into DRM_FORMAT_MOD_LINEAR (contiguous) by the compositor.
I think all of this works with Daniel's design.
Post by Philipp Zabel
All three could always handle DRM_FORMAT_MOD_LINEAR (contiguous) client
buffers for scanout directly, but those would be suboptimal if the
compositor decides to render on short notice, because the client would
have already resolved into linear and then the compositor would have to
resolve back into a texture sampler tiling format.
Is the concern here that switching between scanout and compositing is
non-optimal until the client chooses the preferred format?
My point is just that whether or not the buffer must be contiguous in
physical memory is the essential piece of information on i.MX6QP,
whereas the optimal tiling modifier is the same for both GPU composition
and direct scanout cases.
If the client provides non-contiguous buffers, the "optimal" tiling
doesn't help one bit in the scanout case, as the scanout hardware can't
read from those.
Sorry, I don't get what you mean. Can you please try to explain again?
Pekka Paalanen
2018-11-14 09:03:57 UTC
Permalink
On Tue, 13 Nov 2018 18:19:29 +0000
Post by Simon Ser
Post by Daniel Stone
Hi Simon,
Post by Simon Ser
Post by Philipp Zabel
Post by Simon Ser
+ <event name="primary_device">
+ <description summary="preferred primary device">
+ This event advertizes the primary device that the server prefers. There
+ is exactly one primary device.
Which device should this be if the scanout engine is separate from the
render engine (e.g. IPU/imx-drm and GPU/etnaviv on i.MX6)
When the surface hints are created, I expect the compositor to send the device
it uses for compositing as the primary device (assuming it's using only one
device).
i.MX6 has a separate scanout device without any acceleration capabilities
except some hardware overlay planes, and a pure GPU render device without
any connection to the outside world. The compositor uses both devices for
compositing and output.
But most of the time, client buffers will go through compositing. So the
primary device is still the render device.
The situation doesn't change a lot compared to wl_drm to be honest. The device
that is advertised via wl_drm will be the primary device advertised by this
protocol.
Maybe when the compositor decides to scan-out a client, it can switch the
primary device to the scan-out device. Sorry, I don't know enough about these
particular devices to say for sure.
Hi,

I do see Philipp's point after thinking for a while. I'll explain below.
Post by Simon Ser
Post by Daniel Stone
Post by Simon Ser
Post by Philipp Zabel
When the surface becomes fullscreen on a different GPU (meaning it becomes
fullscreen on an output which is managed by another GPU), I'd expect the
compositor to change the primary device for this surface to this other GPU.
If the compositor uses multiple devices for compositing, it'll probably switch
the primary device when the surface is moved from one GPU to the other.
I'm not sure how i.MX6 works, but: even if the same GPU is used for compositing
and scanout, but the compositing preferred formats are different from the
scanout preferred formats, the compositor can update the preferred format
without changing the preferred device.
Is there an issue with this? Maybe something should be added to the protocol to
explain it better?
It is not clear to me from the protocol description whether the primary
device means the scanout engine or the GPU, in case they are different.
What is the client process supposed to do with this fd? Is it expected
to be able to render on this device? Or use it to allocate the optimal
buffers?
The client is expected to allocate its buffers there. I'm not sure about
rendering.
Well, actually...
Post by Simon Ser
Post by Daniel Stone
Post by Simon Ser
Post by Philipp Zabel
What about contiguous vs non-contiguous memory?
On i.MX6QP (Vivante GC3000) we would probably want the client to always
render DRM_FORMAT_MOD_VIVANTE_SUPER_TILED, because this can be directly
read by both texture samplers (non-contiguous) and scanout (must be
contiguous).
On i.MX6Q (Vivante GC2000) we always want to use the most efficient
DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED, because neither of the
supported render formats can be sampled or scanned out directly.
Since the compositor has to resolve into DRM_FORMAT_MOD_VIVANTE_TILED
(non-contiguous) for texture sampling or DRM_FORMAT_MOD_LINEAR
(contiguous) for scanout, the client buffers can always be non-
contiguous.
On i.MX6S (Vivante GC880) the optimal render format for texture sampling
would be DRM_FORMAT_MOD_VIVANTE_TILED (non-contiguous) and for scanout
DRM_FORMAT_MOD_VIVANTE_SUPER_TILED (non-contiguous) which would be
resolved into DRM_FORMAT_MOD_LINEAR (contiguous) by the compositor.
I think all of this works with Daniel's design.
Post by Philipp Zabel
All three could always handle DRM_FORMAT_MOD_LINEAR (contiguous) client
buffers for scanout directly, but those would be suboptimal if the
compositor decides to render on short notice, because the client would
have already resolved into linear and then the compositor would have to
resolve back into a texture sampler tiling format.
Is the concern here that switching between scanout and compositing is
non-optimal until the client chooses the preferred format?
My point is just that whether or not the buffer must be contiguous in
physical memory is the essential piece of information on i.MX6QP,
whereas the optimal tiling modifier is the same for both GPU composition
and direct scanout cases.
If the client provides non-contiguous buffers, the "optimal" tiling
doesn't help one bit in the scanout case, as the scanout hardware can't
read from those.
Sorry, I don't get what you mean. Can you please try to explain again?
The hints protocol we are discussing here is a subset of what
https://github.com/cubanismo/allocator aims to achieve. Originally we
only concentrated on getting the format and modifier more optimal, but
the question of where and how to allocate the buffers is valid too.
Whether it is in scope for this extension is the big question below.

Ideally, the protocol would do something like this:

- Tell the client which device and for which use case the device must
be able to access the buffer at minimum and always.

- Tell the client that if it could make the buffer suitable also for a
secondary device and a secondary use case, the compositor could do a
more optimal job (e.g. putting the buffer in direct scanout,
bypassing composition, or a hardware video encoder in case the output
is going to be streamed).

We don't have the vocabulary for use cases and there are tons of
different details to be taken into account, which is the whole point of
the allocator project. So we cannot do the complete solution here and
now, but we can do an approximate solution by negotiating pixel
formats and modifiers.

The primary device is what the compositor uses for the fallback path,
which is compositing with a GPU. Therefore at very minimum, clients
need to allocate buffers that can be used with the primary device. We
guarantee this in the zwp_linux_dmabuf protocol by having the
compositor test the buffer import into EGL (or equivalent) before it
accepts that the buffer even exists. The client does not absolutely
necessarily need the primary device for this, but it will have much
better chances of making usable buffers if it uses it for allocation at
least.
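
For reference, a stripped-down version of that import test might look roughly
like the following (assuming a single-plane buffer and the
EGL_EXT_image_dma_buf_import + modifiers extensions; a real compositor looks
the entry points up with eglGetProcAddress and handles all planes):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to import a single-plane dmabuf; if this fails, the compositor refuses
 * to create the wl_buffer. */
static bool test_dmabuf_import(EGLDisplay dpy,
                               PFNEGLCREATEIMAGEKHRPROC create_image,
                               PFNEGLDESTROYIMAGEKHRPROC destroy_image,
                               int fd, uint32_t width, uint32_t height,
                               uint32_t fourcc, uint64_t modifier,
                               uint32_t offset, uint32_t stride)
{
	const EGLint attribs[] = {
		EGL_WIDTH, (EGLint)width,
		EGL_HEIGHT, (EGLint)height,
		EGL_LINUX_DRM_FOURCC_EXT, (EGLint)fourcc,
		EGL_DMA_BUF_PLANE0_FD_EXT, fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, (EGLint)offset,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, (EGLint)stride,
		EGL_DMA_BUF_PLANE0_MODIFIER_LO_EXT, (EGLint)(modifier & 0xffffffff),
		EGL_DMA_BUF_PLANE0_MODIFIER_HI_EXT, (EGLint)(modifier >> 32),
		EGL_NONE,
	};

	EGLImageKHR img = create_image(dpy, EGL_NO_CONTEXT,
	                               EGL_LINUX_DMA_BUF_EXT, NULL, attribs);
	if (img == EGL_NO_IMAGE_KHR)
		return false;
	destroy_image(dpy, img);
	return true;
}
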

The primary device also has another very different meaning: the
compositor will likely be using the primary device anyway so it is kept
active and if clients use the same device instead of some other device,
it probably results in considerable power savings. IOW, the primary
device is the preferred rendering device as well. Or so I assume, these
two concepts could be decoupled as well.

A secondary device is optional. In systems where the GPU and display
devices are separate DRM devices, the GPU will be the primary device,
and the display device would be the secondary device. So there seems to
be a use case for sending the secondary device (or devices?) in
addition to the primary device.

AFAIK, the unix device memory allocator project does not yet have
anything we should be encoding as a Wayland extension, so all we seem
to be able to do is to deliver the device file descriptors and the
format+modifier sets.

Now the design question: do we want to communicate the secondary
devices in this extension? Quite likely we need a different extension
to be used with the allocator project.

Is communicating the display device fd useful already when it differs
from the rendering device? Is there a way for generic client userspace
to use it effectively, or would it rely on hardware-specific code in
clients rather than in e.g. Mesa drivers? Are there EGL or Vulkan APIs
to tell the driver it should make the buffer work on one device while
rendering on another?

My current opinion is that if there is no generic way for an
application to benefit from the secondary device fd, then we should not
add secondary devices in this extension yet.


Thanks,
pq
Philipp Zabel
2018-11-15 17:11:32 UTC
Permalink
Hi Pekka,

thank you for the explanation.

On Wed, 2018-11-14 at 11:03 +0200, Pekka Paalanen wrote:
[...]
Post by Pekka Paalanen
The hints protocol we are discussing here is a subset of what
https://github.com/cubanismo/allocator aims to achieve. Originally we
only concentrated on getting the format and modifier more optimal, but
the question of where and how to allocate the buffers is valid too.
Whether it is in scope for this extension is the big question below.
My guess is: probably not. Either way, I'd prefer the protocol docs to
be explicit about this.
Post by Pekka Paalanen
- Tell the client which device and for which use case the device must
be able to access the buffer at minimum and always.
- Tell the client that if it could make the buffer suitable also for a
secondary device and a secondary use case, the compositor could do a
more optimal job (e.g. putting the buffer in direct scanout,
bypassing composition, or a hardware video encoder in case the output
is going to be streamed).
We don't have the vocabulary for use cases and there are tons of
different details to be taken into account, which is the whole point of
the allocator project. So we cannot do the complete solution here and
now, but we can do an approximate solution by negotiating pixel
formats and modifiers.
The primary device is what the compositor uses for the fallback path,
which is compositing with a GPU.
Therefore at very minimum, clients
need to allocate buffers that can be used with the primary device. We
guarantee this in the zwp_linux_dmabuf protocol by having the
compositor test the buffer import into EGL (or equivalent) before it
accepts that the buffer even exists. The client does not absolutely
necessarily need the primary device for this, but it will have much
better chances of making usable buffers if it uses it for allocation at
least.
So the client must provide buffers that the primary device can import
and sample a texture from, ideally directly.
Can something like this be added to the interface description, to make
it clear what the primary device actually is supposed to be in this
context?
Post by Pekka Paalanen
The primary device also has another very different meaning: the
compositor will likely be using the primary device anyway so it is kept
active and if clients use the same device instead of some other device,
it probably results in considerable power savings. IOW, the primary
device is the preferred rendering device as well. Or so I assume, these
two concepts could be decoupled as well.
And the client should default to using the same primary device for
rendering for power savings.
Post by Pekka Paalanen
A secondary device is optional. In systems where the GPU and display
devices are separate DRM devices, the GPU will be the primary device,
and the display device would be the secondary device. So there seems to
be a use case for sending the secondary device (or devices?) in
addition to the primary device.
AFAIK, the unix device memory allocator project does not yet have
anything we should be encoding as a Wayland extension, so all we seem
to be able to do is to deliver the device file descriptors and the
format+modifier sets.
Ok.
Post by Pekka Paalanen
Now the design question: do we want to communicate the secondary
devices in this extension? Quite likely we need a different extension
to be used with the allocator project.
As long as the use case is not clear, I'd say leave it out.
A "secondary_device" event may be added later with a version update if
needed.
Post by Pekka Paalanen
Is communicating the display device fd useful already when it differs
from the rendering device? Is there a way for generic client userspace
to use it effectively, or would it rely on hardware-specific code in
clients rather than in e.g. Mesa drivers? Are there EGL or Vulkan APIs
to tell the driver it should make the buffer work on one device while
rendering on another?
I have not found anything specific about this in the Vulkan spec.

The VK_KHR_external_memory extension even states:

"However, only the same concrete physical device can be used when
sharing memory, [...]"

and:

"Note this does not attempt to address cross-device transitions, nor
transitions to engines on the same device which are not visible
within the Vulkan API.
    Both of these are beyond the scope of this extension."

in the issues. So even though sharing between different devices should be
possible with VK_EXT_external_memory_dma_buf and
VK_EXT_image_drm_format_modifier bolted on top, it is not the main focus of
these extensions.
Post by Pekka Paalanen
My current opinion is that if there is no generic way for an
application to benefit from the secondary device fd, then we should not
add secondary devices in this extension yet.
I agree.

regards
Philipp
Simon Ser
2018-11-18 14:08:57 UTC
Permalink
Post by Philipp Zabel
Hi Pekka,
thank you for the explanation.
Hi,

Thanks Pekka for clarifying.
Post by Philipp Zabel
[...]
Post by Pekka Paalanen
The hints protocol we are discussing here is a subset of what
https://github.com/cubanismo/allocator aims to achieve. Originally we
only concentrated on getting the format and modifier more optimal, but
the question of where and how to allocate the buffers is valid too.
Whether it is in scope for this extension is the big question below.
My guess is: probably not. Either way, I'd prefer the protocol docs to
be explicit about this.
Post by Pekka Paalanen
- Tell the client which device and for which use case the device must
be able to access the buffer at minimum and always.
- Tell the client that if it could make the buffer suitable also for a
secondary device and a secondary use case, the compositor could do a
more optimal job (e.g. putting the buffer in direct scanout,
bypassing composition, or a hardware video encoder in case the output
is going to be streamed).
We don't have the vocabulary for use cases and there are tons of
different details to be taken into account, which is the whole point of
the allocator project. So we cannot do the complete solution here and
now, but we can do an approximate solution by negotiating pixel
formats and modifiers.
The primary device is what the compositor uses for the fallback path,
which is compositing with a GPU.
Therefore at very minimum, clients
need to allocate buffers that can be used with the primary device. We
guarantee this in the zwp_linux_dmabuf protocol by having the
compositor test the buffer import into EGL (or equivalent) before it
accepts that the buffer even exists. The client does not absolutely
necessarily need the primary device for this, but it will have much
better chances of making usable buffers if it uses it for allocation at
least.
So the client must provide buffers that the primary device can import
and sample a texture from, ideally directly.
Can something like this be added to the interface description, to make
it clear what the primary device actually is supposed to be in this
context?
This seems sensible, I'll do that.
Post by Philipp Zabel
Post by Pekka Paalanen
The primary device also has another very different meaning: the
compositor will likely be using the primary device anyway so it is kept
active and if clients use the same device instead of some other device,
it probably results in considerable power savings. IOW, the primary
device is the preferred rendering device as well. Or so I assume, these
two concepts could be decoupled as well.
And the client should default to using the same primary device for
rendering for power savings.
Will be in the next version, but with "can" instead of "should", because some
clients (games with DRI_PRIME) might want to use another device to get better
performance.
Post by Philipp Zabel
Post by Pekka Paalanen
A secondary device is optional. In systems where the GPU and display
devices are separate DRM devices, the GPU will be the primary device,
and the display device would be the secondary device. So there seems to
be a use case for sending the secondary device (or devices?) in
addition to the primary device.
AFAIK, the unix device memory allocator project does not yet have
anything we should be encoding as a Wayland extension, so all we seem
to be able to do is to deliver the device file descriptors and the
format+modifier sets.
Ok.
Post by Pekka Paalanen
Now the design question: do we want to communicate the secondary
devices in this extension? Quite likely we need a different extension
to be used with the allocator project.
As long as the use case is not clear, I'd say leave it out.
A "secondary_device" event may be added later with a version update if
needed.
Yes, I agree, I'd prefer not having this in the protocol for now.
Post by Philipp Zabel
Post by Pekka Paalanen
My current opinion is that if there is no generic way for an
application to benefit from the secondary device fd, then we should not
add secondary devices in this extension yet.
I agree.
+1

Simon Ser
2018-11-10 13:54:19 UTC
Permalink
Just a general update about this: I tried to see how we could make Mesa use this
new protocol.

The bad news is that the DRM FD is per-EGLDisplay, and I think it would require
quite some changes to make it per-EGLSurface. I'm still new to the Mesa
codebase, so it'd probably make sense to only use the new protocol to get the
device FD, without relying on wl_drm anymore. We could talk about using the
protocol more efficiently in the future. I also think a lot of clients weren't
designed to support multiple device FDs, so it would be nice to have a smoother
upgrade path.

That leaves an issue: the whole protocol provides hints for a surface. When the
EGLDisplay is created we don't have a surface yet. I can think of a few possible
solutions:

* Create a wl_surface, get the hints, and destroy everything (without mapping
the surface)
* Allow the get_surface_hints to take a NULL surface
* Add a get_hints request without a wl_surface argument
* Forget about per-surface hints, make hints global
* (Someone else volunteers to patch Mesa to use per-surface FDs)

What do you think?
Pekka Paalanen
2018-11-12 09:14:13 UTC
Permalink
On Sat, 10 Nov 2018 13:54:19 +0000
Post by Simon Ser
Just a general update about this: I tried to see how we could make Mesa use this
new protocol.
The bad news is that the DRM FD is per-EGLDisplay, and I think it would require
quite some changes to make it per-EGLSurface. I'm still new to the Mesa
codebase, so it'd probably make sense to only use the new protocol to get the
device FD, without relying on wl_drm anymore. We could talk about using the
protocol more efficiently in the future. I also think a lot of clients weren't
designed to support multiple device FDs, so it would be nice to have a smoother
upgrade path.
Hi,

yeah, that sounds fine to me: use the new protocol, if available, to
only find the default device at EGLDisplay creation.

What can be done per surface later is only the changing of
format+modifier, within the limits of what EGLConfig the app is using,
so maybe it's the modifier alone. If EGL should do that automatically
and internally to begin with... it could change the modifier at least.
Post by Simon Ser
That leaves an issue: the whole protocol provides hints for a surface. When the
EGLDisplay is created we don't have a surface yet. I can think of a few possible
Indeed.
Post by Simon Ser
* Create a wl_surface, get the hints, and destroy everything (without mapping
the surface)
* Allow the get_surface_hints to take a NULL surface
* Add a get_hints request without a wl_surface argument
* Forget about per-surface hints, make hints global
* (Someone else volunteers to patch Mesa to use per-surface FDs)
What do you think?
I think maybe it would be best to make the device hint "global" in a
way, not tied to any surface, while leaving the format+modifier hints
per-surface. IOW, just move the primary_device event from
zwp_linux_dmabuf_device_hints_v1 into zwp_linux_dmabuf_v1 (or
equivalent).

Can anyone think of practical uses where the default device would need
to depend on the surface somehow?

I seem to recall we agreed that the primary device is the one the
compositor is compositing with. Using the compositing device as the
recommended default device makes sense from a power consumption point of
view: the compositor will be keeping that GPU awake anyway, so apps
that don't care much about performance but do want to use a GPU should
use it.

Your possible solutions are a valid list for another problem as well:
the initial/default format+modifier hints before a surface is mapped. I
think it should be either allowing get_surface_hints with a NULL surface
or adding a get_default_hints request that doesn't take a surface.
Technically the two are equivalent.

I do not like the temp wl_surface approach, and we really do want hints
to be per-surface because that's the whole point with the
format+modifier hints.


Thanks,
pq
Simon Ser
2018-11-12 10:13:39 UTC
Permalink
Post by Pekka Paalanen
Post by Simon Ser
* Create a wl_surface, get the hints, and destroy everything (without mapping
the surface)
* Allow the get_surface_hints to take a NULL surface
* Add a get_hints request without a wl_surface argument
* Forget about per-surface hints, make hints global
* (Someone else volunteers to patch Mesa to use per-surface FDs)
What do you think?
I think maybe it would be best to make the device hint "global" in a
way, not tied to any surface, while leaving the format+modifier hints
per-surface. IOW, just move the primary_device event from
zwp_linux_dmabuf_device_hints_v1 into zwp_linux_dmabuf_v1 (or
equivalent).
Can anyone think of practical uses where the default device would need
to depend on the surface somehow?
I seem to recall we agreed that the primary device is the one the
compositor is compositing with. Using the compositing device as the
recommended default device makes sense from a power consumption point of
view: the compositor will be keeping that GPU awake anyway, so apps
that don't care much about performance but do want to use a GPU should
use it.
In the case of compositing the surface, yes the primary device will be the one
used for compositing. However there are two cases in which a per-surface device
hint would be useful.

First, what happens if the surface isn't composited and is directly scanned out?
Let's say I have two GPUs, with one output each. The compositor is using one GPU
for compositing, and the surface is fullscreened on the other's output. If we
only have a global device hint, then the primary device will be the one used for
compositing. However this causes an unnecessary copy between the two GPUs: the
client will render on one, and then the compositor will copy the DMA-BUF to the
other one for scan-out. It would be better if the client can render directly on
the GPU it will be scanned out with.

Second, some compositors could support rendering with multiple GPUs. For
instance, if I have two GPUs with one output each, the compositor could use GPU
1 for compositing output 1 and GPU 2 for compositing output 2. In this case, it
would be better if the client could render using the GPU it will be composited
with, and this depends on the output the surface is displayed on.
Post by Pekka Paalanen
the initial/default format+modifier hints before a surface is mapped. I
think it should be either allowing get_surface_hints with a NULL surface
or adding a get_default_hints request that doesn't take a surface.
Technically the two are equivalent.
I think the cleanest solution would be to add get_default_hints, which would
create a wp_linux_dmabuf_hints object.
Post by Pekka Paalanen
I do not like the temp wl_surface approach, and we really do want hints
to be per-surface because that's the whole point with the
format+modifier hints.
Aye.
Pekka Paalanen
2018-11-12 12:13:09 UTC
Permalink
On Mon, 12 Nov 2018 10:13:39 +0000
Post by Simon Ser
Post by Pekka Paalanen
Post by Simon Ser
* Create a wl_surface, get the hints, and destroy everything (without mapping
the surface)
* Allow the get_surface_hints to take a NULL surface
* Add a get_hints request without a wl_surface argument
* Forget about per-surface hints, make hints global
* (Someone else volunteers to patch Mesa to use per-surface FDs)
What do you think?
I think maybe it would be best to make the device hint "global" in a
way, not tied to any surface, while leaving the format+modifier hints
per-surface. IOW, just move the primary_device event from
zwp_linux_dmabuf_device_hints_v1 into zwp_linux_dmabuf_v1 (or
equivalent).
Can anyone think of practical uses where the default device would need
to depend on the surface somehow?
I seem to recall we agreed that the primary device is the one the
compositor is compositing with. Using the compositing device as the
recommended default device makes sense from a power consumption point of
view: the compositor will be keeping that GPU awake anyway, so apps
that don't care much about performance but do want to use a GPU should
use it.
In the case of compositing the surface, yes the primary device will be the one
used for compositing. However there are two cases in which a per-surface device
hint would be useful.
First, what happens if the surface isn't composited and is directly scanned out?
Let's say I have two GPUs, with one output each. The compositor is using one GPU
for compositing, and the surface is fullscreened on the other's output. If we
only have a global device hint, then the primary device will be the one used for
compositing. However this causes an unnecessary copy between the two GPUs: the
client will render on one, and then the compositor will copy the DMA-BUF to the
other one for scan-out. It would be better if the client can render directly on
the GPU it will be scanned out with.
Theoretically yes. However, apps are not usually prepared to switch the
GPU they render with.

Rendering with and being scanned out on are somewhat orthogonal. In the
above case, the compositor could keep the default device as the
compositing GPU, but change the modifiers so that it would be possible
to import the dmabuf to the scanout GPU either for direct scanout or
having the scanout GPU make the copy. It's not always possible for
other reasons like an incompatible memory domain, I give you that.

If you envision that apps (toolkits) might be willing to implement GPU
switching sometimes, then I have no objections. It is again the
difference between initial default hints vs. optimization hints after
the surface is mapped.
Post by Simon Ser
Second, some compositors could support rendering with multiple GPUs. For
instance, if I have two GPUs with one output each, the compositor could use GPU
1 for compositing output 1 and GPU 2 for compositing output 2. In this case, it
would be better if the client could render using the GPU it will be composited
with, and this depends on the output the surface is displayed on.
From a protocol point of view, this does not differ from the first case.
Post by Simon Ser
Post by Pekka Paalanen
the initial/default format+modifier hints before a surface is mapped. I
think it should be either allowing get_surface_hints with a NULL surface
or adding a get_default_hints request that doesn't take a surface.
Technically the two are equivalent.
I think the cleanest solution would be to add get_default_hints, which would
create a wp_linux_dmabuf_hints object.
Right. And if we want the preferred device to also follow the split between
initial hints and optimized hints after mapping, you'd keep the device event
in zwp_linux_dmabuf_device_hints_v1.

Sounds fine to me.


Thanks,
pq