External DMA to gpu

On Tue, 11 Sep 2018 15:28:35 +0200

Post by Dirk Eibach
I have a grabber device on the PCIe bus that is able to transfer image
data to other PCIe devices.
I want to set up a Wayland client that reserves a buffer in GPU
memory. Then the grabber could DMA to the buffer address. After
finishing the transfer, the client could flip the buffer.
Is there already a concept for this in Weston? What might be a good
starting point?

Hi Dirk,

that would not involve Weston in any special way at all. Buffer
allocation is usually done in the client any way the client wants. To
ensure the buffer can be used by the compositor before you fill it with
data, you would export your buffer as a dmabuf and use
zwp_linux_dmabuf_v1 extension to send the buffer details to the Wayland
compositor. If that succeeds, all is good and you can fill the buffer.
After that, you have a wl_buffer you can attach to a wl_surface, and
the compositor will just process it, even put it on a DRM plane
bypassing compositing if possible.

If you want to process the buffer contents with the GPU inside your
client instead of showing it directly on screen, then you would not do
anything at all with Wayland. Once you have the dmabuf, you can try to
import it as an EGLImage and turn that into a GL texture.

How to do the non-Wayland things in the client is a good question.
Presumably your grabber card has a Linux kernel driver. You could have
the grabber device/driver allocate the buffer and export it as dmabuf
(requires implementation in the driver), but then there is a risk that
it is non-optimal or even unusable to the GPU and/or the display.

To allocate on a GPU device, you would need to go through EGL or GBM,
export the buffer as a dmabuf, import the dmabuf to your grabber driver
(needs implementation again) and hope the grabber device/driver is able
to write to that buffer. gbm_bo_create_with_modifiers() might be the
best bet.

Anyway, the gist is that the buffer handle in userspace is always a
dmabuf file descriptor, and the grabber card driver needs to be
prepared to use those. Physical addresses in userspace are a no-go.

Thanks,
pq

Dirk Eibach

2018-09-12 06:30:55 UTC

Hi Pekka,

Post by Pekka Paalanen
that would not involve Weston in any special way at all. Buffer
allocation is usually done in the client any way the client wants. To
ensure the buffer can be used by the compositor before you fill it with
data, you would export your buffer as a dmabuf and use
zwp_linux_dmabuf_v1 extension to send the buffer details to the Wayland
compositor. If that succeeds, all is good and you can fill the buffer.
After that, you have a wl_buffer you can attach to a wl_surface, and
the compositor will just process it, even put it on a DRM plane
bypassing compositing if possible.

Thank you so much, that is exactly the information I needed.
Is the simple-dmabuf-v4l client an implementation of this principle?
So v4l2 offers an interface for passing the dmabuf. A v4l2 driver
would probably be the right choice for my grabber anyway.

Cheers
Dirk

Pekka Paalanen

2018-09-12 09:21:22 UTC

On Wed, 12 Sep 2018 08:30:55 +0200

Post by Dirk Eibach
Hi Pekka,

Hi Dirk,

yes, simple-dmabuf-v4l does exactly what I wrote in the above quote.
However, it does not allocate from the GPU device or from the display
device. Instead, it allocates from the V4L2 device and hopes that the
compositor will be able to use the buffers. Quite likely the compositor
can use the buffers, but they might not be fit for direct scanout, which
would mean that composite bypass is not possible in the compositor.

FWIW, there is no general solution to the buffer allocation problem
that would ensure the buffer is usable for all purposes you would hope
it to be. There is work going on though:
https://github.com/cubanismo/allocator

Until that materializes, programs need to be smart about how to
allocate and hope it will work.

Thanks,
pq

Pekka Paalanen

2018-09-13 11:30:29 UTC

On Wed, 12 Sep 2018 11:51:39 +0200

Post by Dirk Eibach
Hi Pekka,

Post by Pekka Paalanen
yes, simple-dmabuf-v4l does exactly what I wrote in the above quote.
However, it does not allocate from the GPU device or from the display
device. Instead, it allocates from the V4L2 device and hopes that the
compositor will be able to use the buffers. Quite likely the compositor
can use the buffers, but they might not be fit for direct scanout, which
would mean that composite bypass is not possible in the compositor.

To be sure the buffer can be used for scanout, I could use
gbm_bo_create_with_modifiers() to allocate it, right?

Hi Dirk,

yeah, you query the KMS device for acceptable formats and modifiers by
reading the IN_FORMATS property (see Weston's
drm_plane_populate_formats()), and try to create a GBM bo with one of
those. Your grabber driver needs to use the same format and modifier as
well. Hopefully that gives you a buffer that is good for scanout and
texturing.
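
To make the matching step concrete, here is a minimal sketch in C of
picking a format/modifier pair that both the KMS plane and the grabber
accept. The struct fmt_mod type, the pick_common() helper and the
FOURCC/FMT_*/MOD_LINEAR macros are hypothetical names used only for
illustration. The macro values are assumed to mirror drm_fourcc.h, which
real code would include instead. Parsing IN_FORMATS and asking the
grabber driver what it can write are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch builds without libdrm. Real code would
 * take the equivalent definitions from <drm_fourcc.h>. */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))
#define FMT_XRGB8888 FOURCC('X', 'R', '2', '4')
#define FMT_ARGB8888 FOURCC('A', 'R', '2', '4')
#define MOD_LINEAR   UINT64_C(0)

/* One (format, modifier) pair, as advertised by a device. */
struct fmt_mod {
    uint32_t format;
    uint64_t modifier;
};

/* Return the first entry of `plane` (for example, a list parsed from the
 * IN_FORMATS property) that also appears in `grabber`, or NULL if the
 * two devices have nothing in common. */
static const struct fmt_mod *
pick_common(const struct fmt_mod *plane, size_t n_plane,
            const struct fmt_mod *grabber, size_t n_grabber)
{
    assert(plane != NULL && grabber != NULL);

    for (size_t i = 0; i < n_plane; i++) {
        for (size_t j = 0; j < n_grabber; j++) {
            if (plane[i].format == grabber[j].format &&
                plane[i].modifier == grabber[j].modifier)
                return &plane[i];
        }
    }
    return NULL;
}
```

The pair that comes back would be the one to offer to
gbm_bo_create_with_modifiers() and to configure in the grabber driver.
A NULL result means falling back to a buffer that is not guaranteed to
be scanout-capable.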

There are no strong guarantees, though. That would need the Unix device
memory allocator infrastructure.

This is the theory as I understand it, at least.

Thanks,
pq

Dirk Eibach

2018-10-02 13:06:14 UTC

Hi Pekka,

I finally got everything working. I am using gbm_bo_create() and
gbm_bo_get_fd() to get a buffer that is filled by my grabber. Then I
use eglCreateImageKHR() and glEGLImageTargetTexture2DOES() to display
it.
My only problem left is that glEGLImageTargetTexture2DOES() only
accepts ARGB8888 and not RGB888, which means I have to waste a lot of
PCIe bandwidth. Any ideas how to get around this? Or what would be a
more appropriate place to post this question?

Cheers
Dirk

Pekka Paalanen

2018-10-02 13:47:32 UTC

On Tue, 2 Oct 2018 15:06:14 +0200

Post by Dirk Eibach
Hi Pekka,
I finally got everything working. I am using gbm_bo_create() and
gbm_bo_get_fd() to get a buffer that is filled by my grabber. Then I
use eglCreateImageKHR() and glEGLImageTargetTexture2DOES() to display
it.

Hi Dirk,

nice to hear!

I suppose that means you still do a copy from the gbm_bo/dmabuf into a
window surface? If you used zwp_linux_dmabuf manually from your Wayland
client, you could avoid even that copy. It has the same caveat as below
though.

Post by Dirk Eibach
My only problem left is that glEGLImageTargetTexture2DOES() only
accepts ARGB8888 and not RGB888, which means I have to waste a lot of
PCIe bandwidth. Any ideas how to get around this? Or what would be a
more appropriate place to post this question?

Yeah, I suppose support for true 24-bit-storage formats is rare
nowadays.

The format list advertised via zwp_linux_dmabuf, visible via e.g.
weston-info, can tell you what you could use directly. After all, a
Wayland compositor does the same EGLImage import as you do in the
simple case.

You could probably use the GPU to convert from 24-bit to 32-bit format
though, by importing the image as R8 format instead of RGB888 and
pretend the width is 3x. Then you could use a fragment shader to sample
the real R, G and B separately and write out a 32-bit format image for
display.
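
As a rough illustration of what "pretend the width is 3x" means for the
import parameters, the following C sketch derives the R8 geometry from
the real RGB888 geometry. The struct plane_view type and the
rgb888_as_r8() helper are hypothetical names, not part of EGL or GBM.
The sketch assumes a single-plane, linear buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the geometry handed to the dmabuf import. */
struct plane_view {
    uint32_t width;   /* texels per row */
    uint32_t height;  /* rows */
    uint32_t stride;  /* bytes per row, including any padding */
};

/* Describe a packed 24-bit RGB buffer as a one-byte-per-texel image.
 * Every pixel turns into three texels. The height and the stride stay
 * the same because the bytes in memory do not move. */
static struct plane_view rgb888_as_r8(struct plane_view rgb)
{
    /* A row must be able to hold width * 3 bytes of pixel data. */
    assert(rgb.stride >= rgb.width * 3u);

    struct plane_view r8 = { rgb.width * 3u, rgb.height, rgb.stride };
    return r8;
}
```

For a 3840-pixel-wide image this yields an 11520-texel-wide R8 image,
so the GPU's maximum texture size (GL_MAX_TEXTURE_SIZE) has to be at
least that large for the import to be usable as a texture.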

Thanks,
pq

Pekka Paalanen

2018-10-03 07:03:20 UTC

On Tue, 2 Oct 2018 17:08:25 +0200

Post by Dirk Eibach
Hi Pekka,

Re-adding wayland-devel to cc, hope that's ok.

Post by Pekka Paalanen
I suppose that means you still do a copy from the gbm_bo/dmabuf into a
window surface? If you used zwp_linux_dmabuf manually from your Wayland
client, you could avoid even that copy. It has the same caveat as below
though.

I don't think so. The grabber does direct DMA to the VRAM, so making the
texture should be zero-copy. Or am I missing something?

Below you say you use glEGLImageTargetTexture2DOES(). That gets you a
GL texture. To actually get that GL texture on screen, you have to do a
GL drawing command to copy the pixels into an EGLSurface created from a
wl_surface. That's the copy I'm referring to and which would be
avoidable if you don't have to e.g. convert the color format in the app.

Or are you using some other tricks?

Once the pixels are on a wl_surface, the compositor will do one more
copy to get those into a framebuffer, unless the requirements for
scanning out directly from the client buffer are met. But I would guess
it is more important to optimize the grabber-to-VRAM path than the
wl_surface-to-scanout path which is likely just VRAM-to-VRAM so pretty
good already.

Yeah, I suppose support for true 24-bit-storage formats is rare
nowadays.
The format list advertised via zwp_linux_dmabuf, visible via e.g.
weston-info, can tell you what you could use directly. After all, a
Wayland compositor does the same EGLImage import as you do in the
simple case.
You could probably use the GPU to convert from 24-bit to 32-bit format
though, by importing the image as R8 format instead of RGB888 and
pretend the width is 3x. Then you could use a fragment shader to sample
the real R, G and B separately and write out a 32-bit format image for
display.

Is there any example code for a gl noob? I already did some research but
didn't find anything useful.

Nothing much comes to mind. Weston uses similar tricks to convert YUV
data to RGB by lying to EGL and GL that the incoming buffer is R8 or
RG88 and using a fragment shader to compute the proper RGB values. It
is really just about lying to EGL when you import the dmabuf: instead of
the actual pixel format, you use R8 and adjust the width/height/stride
to match so that you can sample each byte correctly. Then in the
fragment shader, you compute the correct texture coordinates to read
each of R, G and B values for an output pixel and then combine those
into an output color.
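
The per-pixel work the fragment shader has to do can be modelled on the
CPU. The C sketch below is only a reference for the byte arithmetic, not
GL code. The unpack_pixel() name is hypothetical, and it assumes the
grabber writes the bytes of each pixel in R, G, B order in memory. If
the grabber writes B first, the offsets 0 and 2 swap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build one 32-bit pixel (0xAARRGGBB, alpha forced to opaque) from the
 * three consecutive bytes that make up output pixel `x` in a packed
 * 24-bit row. In the shader, each of the three reads is one sample of
 * the R8 texture at texel 3 * x + channel. */
static uint32_t unpack_pixel(const uint8_t *row, uint32_t x)
{
    assert(row != NULL);

    uint32_t r = row[3u * (size_t)x + 0u];
    uint32_t g = row[3u * (size_t)x + 1u];
    uint32_t b = row[3u * (size_t)x + 2u];

    return 0xff000000u | (r << 16) | (g << 8) | b;
}
```

Calling this for every x in a row, and for every row at its stride
offset, produces the same image the shader writes out.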

Reading YUV is more tricky than reading 24-bit RGB, because YUV is
usually arranged in multiple planes, some of which are sub-sampled,
e.g. half resolution.

Thanks,
pq

Dirk Eibach

2018-10-04 08:15:52 UTC

Hi Pekka,

I don't think so. The grabber does direct DMA to the VRAM, so making the
texture should be zero-copy. Or am I missing something?

Below you say you use glEGLImageTargetTexture2DOES(). That gets you a
GL texture. To actually get that GL texture on screen, you have to do a
GL drawing command to copy the pixels into an EGLSurface created from a
wl_surface. That's the copy I'm referring to and which would be
avoidable if you don't have to e.g. convert the color format in the app.
Or are you using some other tricks?
Once the pixels are on a wl_surface, the compositor will do one more
copy to get those into a framebuffer, unless the requirements for
scanning out directly from the client buffer are met. But I would guess
it is more important to optimize the grabber-to-VRAM path than the
wl_surface-to-scanout path which is likely just VRAM-to-VRAM so pretty
good already.

If I have to use a shader for colorspace conversion I cannot use this
approach, right?

Yeah, I suppose support for true 24-bit-storage formats is rare
nowadays.
The format list advertised via zwp_linux_dmabuf, visible via e.g.
weston-info, can tell you what you could use directly. After all, a
Wayland compositor does the same EGLImage import as you do in the
simple case.
You could probably use the GPU to convert from 24-bit to 32-bit format
though, by importing the image as R8 format instead of RGB888 and
pretend the width is 3x. Then you could use a fragment shader to sample
the real R, G and B separately and write out a 32-bit format image for
display.

Is there any example code for a gl noob? I already did some research but
didn't find anything useful.

Nothing much comes to mind. Weston uses similar tricks to convert YUV
data to RGB by lying to EGL and GL that the incoming buffer is R8 or
RG88 and using a fragment shader to compute the proper RGB values. It
is really just about lying to EGL when you import the dmabuf: instead of
the actual pixel format, you use R8 and adjust the width/height/stride
to match so that you can sample each byte correctly. Then in the
fragment shader, you compute the correct texture coordinates to read
each of R, G and B values for an output pixel and then combine those
into an output color.
Reading YUV is more tricky than reading 24-bit RGB, because YUV is
usually arranged in multiple planes, some of which are sub-sampled,
e.g. half resolution.

Thanks, that was very helpful, as always. This is what we came up
with, and it works nicely:
float x_int = floor(3840.0 * vTexCoord.x) * 3.0;
float r = texture2D(uTexture,
    vec2((x_int + 0.0) / (3840.0 * 3.0), vTexCoord.y)).r;
float g = texture2D(uTexture,
    vec2((x_int + 1.0) / (3840.0 * 3.0), vTexCoord.y)).r;
float b = texture2D(uTexture,
    vec2((x_int + 2.0) / (3840.0 * 3.0), vTexCoord.y)).r;
gl_FragColor = vec4(r, g, b, 1.0); // add alpha component

We have to pass the horizontal resolution to the shader, I suppose
there is no way around this, right?

I was afraid that the unaligned access in the shader would have some
performance penalty. But in fact performance is better than the 32-bit
version. Thumbs up!

Cheers
Dirk

Pekka Paalanen

2018-10-04 08:52:03 UTC

On Thu, 4 Oct 2018 10:15:52 +0200

Post by Dirk Eibach
Hi Pekka,

I don't think so. The grabber does direct DMA to the VRAM, so making the
texture should be zero-copy. Or am I missing something?

Below you say you use glEGLImageTargetTexture2DOES(). That gets you a
GL texture. To actually get that GL texture on screen, you have to do a
GL drawing command to copy the pixels into an EGLSurface created from a
wl_surface. That's the copy I'm referring to and which would be
avoidable if you don't have to e.g. convert the color format in the app.
Or are you using some other tricks?
Once the pixels are on a wl_surface, the compositor will do one more
copy to get those into a framebuffer, unless the requirements for
scanning out directly from the client buffer are met. But I would guess
it is more important to optimize the grabber-to-VRAM path than the
wl_surface-to-scanout path which is likely just VRAM-to-VRAM so pretty
good already.

If I have to use a shader for colorspace conversion I cannot use this
approach, right?

Sorry, which "this"?

If you use a shader in your app, you are making a copy, and you need to
use glEGLImageTargetTexture2DOES() to make the grabbed buffer available
to the shader. So yes, in that case you don't use zwp_linux_dmabuf
directly but you rely on EGL instead to send your final image to the
compositor.

If instead you're asking about what the compositor does, then the
scanout-ability on the client side is determined by how the EGL
implementation chooses to allocate the buffer it will send to the
compositor. Often it is scanout-able, so you don't lose that
opportunity.

Yeah, I suppose support for true 24-bit-storage formats is rare
nowadays.
The format list advertised via zwp_linux_dmabuf, visible via e.g.
weston-info, can tell you what you could use directly. After all, a
Wayland compositor does the same EGLImage import as you do in the
simple case.
You could probably use the GPU to convert from 24-bit to 32-bit format
though, by importing the image as R8 format instead of RGB888 and
pretend the width is 3x. Then you could use a fragment shader to sample
the real R, G and B separately and write out a 32-bit format image for
display.

Is there any example code for a gl noob? I already did some research but
didn't find anything useful.

Nothing much comes to mind. Weston uses similar tricks to convert YUV
data to RGB by lying to EGL and GL that the incoming buffer is R8 or
RG88 and using a fragment shader to compute the proper RGB values. It
is really just about lying to EGL when you import the dmabuf: instead of
the actual pixel format, you use R8 and adjust the width/height/stride
to match so that you can sample each byte correctly. Then in the
fragment shader, you compute the correct texture coordinates to read
each of R, G and B values for an output pixel and then combine those
into an output color.
Reading YUV is more tricky than reading 24-bit RGB, because YUV is
usually arranged in multiple planes, some of which are sub-sampled,
e.g. half resolution.

Thanks, that was very helpful, as always. This is what we came up
with, and it works nicely:
float x_int = floor(3840.0 * vTexCoord.x) * 3.0;
float r = texture2D(uTexture,
    vec2((x_int + 0.0) / (3840.0 * 3.0), vTexCoord.y)).r;
float g = texture2D(uTexture,
    vec2((x_int + 1.0) / (3840.0 * 3.0), vTexCoord.y)).r;
float b = texture2D(uTexture,
    vec2((x_int + 2.0) / (3840.0 * 3.0), vTexCoord.y)).r;
gl_FragColor = vec4(r, g, b, 1.0); // add alpha component

We have to pass the horizontal resolution to the shader, I suppose
there is no way around this, right?

Correct.

I wonder if you should add 0.5 to x_int to hit the middle of the
texel, just to be sure NEAREST interpolation gives you the right texel.
I'd have to draw it on paper to see if that formula is exactly right,
so can't say off-hand.
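
The effect of the 0.5 offset can be checked with a small CPU model
written in C. The function names are hypothetical, and nearest_texel()
is a simplified model of NEAREST filtering (the texel whose span
contains the coordinate), not a call into GL. Without the offset, the
coordinate sits exactly on the left edge of the wanted texel, where a
small rounding error could select its neighbour. With the offset, it
sits in the middle.

```c
#include <assert.h>
#include <stdint.h>

/* Normalized coordinate of the centre of texel `index` in a texture
 * that is `texels` wide. */
static double texel_center(uint32_t index, uint32_t texels)
{
    assert(index < texels);
    return ((double)index + 0.5) / (double)texels;
}

/* Simplified NEAREST lookup: index of the texel whose span contains s.
 * The cast truncates, which equals floor() for non-negative values. */
static uint32_t nearest_texel(double s, uint32_t texels)
{
    assert(s >= 0.0 && s < 1.0);
    return (uint32_t)(s * (double)texels);
}

/* Coordinate to sample for one channel of output pixel `pixel_x` when
 * an image `width` pixels wide is viewed as 3 * width R8 texels.
 * channel: 0 = first byte, 1 = second byte, 2 = third byte. */
static double channel_coord(uint32_t pixel_x, uint32_t channel,
                            uint32_t width)
{
    assert(channel < 3u && pixel_x < width);
    return texel_center(3u * pixel_x + channel, 3u * width);
}
```

In shader terms this corresponds to sampling at
(x_int + channel + 0.5) / (width * 3.0) instead of
(x_int + channel) / (width * 3.0).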

Post by Dirk Eibach
I was afraid that the unaligned access in the shader would have some
performance penalty. But in fact performance is better than the 32-bit
version. Thumbs up!

Nice!

Thanks,
pq
