From VFXPedia

About OpenCL

Written by Daniel on 28th April 2010

You know how your workstation has 4 or 8 cores? Your graphics card has hundreds of cores, and really high memory bandwidth too. Be nice if you could use all that to make Defocus render faster, wouldn’t it? That’s what OpenCL is for. It’s a standard programming interface that lets Fusion use your graphics card for crunching numbers, and it’s also a language (similar to C) in which you can write your own mini-programs (“kernels”).

Before you can do anything with OpenCL, you’ll want a modern graphics card: an nVidia GeForce 8-series or Quadro x600, an AMD Radeon 5x00, or later. It’s Windows-only for now, and needs recent drivers (for AMD, the Stream SDK must also be installed). Alternatively, ATi’s Stream SDK includes OpenCL drivers for your CPU, though there are some restrictions on which OpenCL features are supported there (for instance, no image2d_t support).

First up, a few links, coz I ain’t gonna teach you everything about writing OpenCL kernels:

   * OpenCL home page at Khronos.org, with a good overview
   * Introductory Tutorial
   * Online Reference and Specification (PDF)

What I will do is show you how to use Fusion’s OpenCL framework in your fuses. This is a great way of quickly knocking together a simple (or not-so-simple) tool. There are a few examples in Fusion:Fuses\OpenCL, so let’s quickly go through the OpenCL sample fuse there.

You’ll note that it’s a tool fuse like any other, with a FuRegisterClass() call, a Create() function to add controls in, and a Process() function to spit out pixels. One notable difference is the big ‘clsource’ string: that’s the OpenCL source code for two small kernels, gradient() and circle(), and it gets compiled on the fly into a fast, natively executed program that can run on your graphics card (or even your CPU). The compilation is done at the bottom of Create():

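    mgr = OCLManager()
    if mgr then
        -- debug.getinfo(1).source is "@" followed by this fuse's file path
        path = debug.getinfo(1).source
        -- string.sub(path, 2) strips the leading "@" to leave the plain path
        prog = mgr:BuildCachedProgram("OpenCLSample", string.sub(path, 2), clsource)
    end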

First we get the OCLManager object, then we call mgr:BuildCachedProgram(), passing it an arbitrary string ID, the path to the fuse itself (so that the manager can cache the compiled program, and compare its timestamp against the fuse’s to see if it needs recompiling), and finally the string with our OpenCL source code. Incidentally, compilation can currently be quite slow (nVidia’s drivers may take 10-15 seconds), but it happens in a background thread.
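How BuildCachedProgram() actually stores its cache isn’t covered here, so as a rough illustration of the timestamp check described above, here is a minimal C sketch. The CacheEntry struct and both functions are hypothetical; the sketch assumes the cache records the fuse file’s modification time alongside each compiled program.

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical cache entry: one compiled program per string ID. */
typedef struct {
    char id[64];          /* the arbitrary ID, e.g. "OpenCLSample" */
    time_t source_mtime;  /* fuse file's modification time at compile time */
    bool has_binary;      /* true once a compiled program has been stored */
} CacheEntry;

/* True if the program must be (re)compiled: nothing is cached yet, or the
   fuse on disk is newer than the one the cached program was built from. */
bool needs_rebuild(const CacheEntry *entry, time_t fuse_mtime)
{
    if (entry == NULL || !entry->has_binary)
        return true;
    return fuse_mtime > entry->source_mtime;
}

/* Records a fresh build so later calls can skip compilation. */
void record_build(CacheEntry *entry, const char *id, time_t fuse_mtime)
{
    strncpy(entry->id, id, sizeof entry->id - 1);
    entry->id[sizeof entry->id - 1] = '\0';
    entry->source_mtime = fuse_mtime;
    entry->has_binary = true;
}
```

In practice fuse_mtime would come from the file system (for example the st_mtime field filled in by stat()). A program would then only be rebuilt after the fuse has been edited, so the slow compile is skipped on later runs.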

In Process(), we fetch our control values in the usual way, then we check to see if we have our OpenCL program object. If we do, we create an OpenCL image to render into, a kernel object that we can call, and we start passing arguments to our kernel:

    local imgcl = prog:CreateImage(img, "readwrite")
    if imgcl then
        local kernel = prog:CreateKernel("gradient")
        if kernel >= 0 then
            prog:SetArg(kernel, 0, imgcl)
            prog:SetArg(kernel, 1, 1,1,1,1)

            prog:RunKernel(kernel)
            -- ... the rest of the sample continues here ...
        end
    end

prog:CreateImage() takes a Fusion image, and passes it to OpenCL. The second arg can be “read”, “write” or “readwrite” - we want the latter, because we’ll be writing to it with our first kernel, then reading from it with our second.
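To illustrate why “readwrite” is needed, the hypothetical C sketch below maps the three mode strings to read/write permission bits and checks them against what the two kernels require. It only models the rule stated above. The names are invented, and it says nothing about how Fusion validates images internally.

```c
#include <stdbool.h>
#include <string.h>

/* Access bits for a hypothetical image handle. */
enum { ACCESS_NONE = 0, ACCESS_READ = 1, ACCESS_WRITE = 2 };

/* Maps the mode strings accepted by prog:CreateImage() to access bits.
   Unknown strings yield ACCESS_NONE. */
int parse_mode(const char *mode)
{
    if (strcmp(mode, "read") == 0)      return ACCESS_READ;
    if (strcmp(mode, "write") == 0)     return ACCESS_WRITE;
    if (strcmp(mode, "readwrite") == 0) return ACCESS_READ | ACCESS_WRITE;
    return ACCESS_NONE;
}

/* True if an image created with `mode` permits every access in `needed`. */
bool mode_allows(const char *mode, int needed)
{
    return (parse_mode(mode) & needed) == needed;
}
```

Because gradient() writes the image and circle() then reads it, the required access is ACCESS_READ | ACCESS_WRITE, and only “readwrite” passes mode_allows() for that combination.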

prog:CreateKernel("gradient") gives us a handle to our gradient() function, so that we can give it some arguments and then invoke it. prog:SetArg() needs the kernel handle, a zero-based index to identify which of gradient()’s arguments we’re setting, and a value. Looking at the OpenCL source code, our gradient() function requires an image object and a float4 colour value, so we need two SetArg() calls to give it the image we created earlier, and a (1.0,1.0,1.0,1.0) white/solid colour.

Finally, prog:RunKernel() actually executes our gradient() function, once for every pixel in the image:

    int2 ipos = (int2)(get_global_id(1), get_global_id(0));
    int2 dstsize = get_image_dim(dst);
    float2 pos = convert_float2(ipos) / convert_float2(dstsize);
    
    col *= (float4)(pos.x, pos.y, 0, 1.0f);
    FuWriteImagef(dst, ipos, dstsize, col.zyxw); // image, int2 pixel pos, int2 image size, float4 colour (swizzled)

Briefly, get_global_id() is used to get the coordinates of the pixel into ipos, get_image_dim() gives us the overall size of the image, and the normalised image coordinates are calculated in pos. We use this to create a basic colour gradient, which is multiplied (gained) by our colour value argument, then written into the image.
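To make the per-pixel model concrete, here is a plain-C sketch that emulates gradient() on the CPU. The nested loops stand in for the one-work-item-per-pixel launch, and a row-major array of four-float pixels stands in for the OpenCL image. The Float4 type, the array layout and the function name are assumptions for illustration. The .zyxw swizzle is reproduced by swapping the first and third channels before each write.

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } Float4;

/* CPU emulation of the sample's gradient() kernel. dst is a row-major
   width*height array; col is the float4 colour argument (SetArg index 1). */
void gradient_cpu(Float4 *dst, int width, int height, Float4 col)
{
    for (int y = 0; y < height; y++) {        /* get_global_id(0) in the kernel */
        for (int x = 0; x < width; x++) {     /* get_global_id(1) in the kernel */
            /* Normalised position: pixel coordinate divided by image size. */
            float px = (float)x / (float)width;
            float py = (float)y / (float)height;

            /* col *= (float4)(pos.x, pos.y, 0, 1.0f); */
            Float4 c = { col.x * px, col.y * py, col.z * 0.0f, col.w * 1.0f };

            /* Write c.zyxw: the first and third channels trade places. */
            Float4 out = { c.z, c.y, c.x, c.w };
            dst[(size_t)y * width + x] = out;
        }
    }
}
```

With a white col, pixel (0,0) comes out as (0, 0, 0, 1), and neither ramp quite reaches 1.0, since pos tops out at (size - 1) / size.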

Now that we’ve filled out our image, we set up and run our second kernel, circle(): it reads from our image, warps it, and writes to a new image. Finally, prog:Download() copies the resulting OpenCL image’s pixels back into a Fusion image, and our OpenCL images are freed with Release() (kernel objects are handled automatically).
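The source of circle() isn’t reproduced here, so the following plain-C sketch uses a made-up warp to show the general shape of such a pass. It is an inverse mapping: each destination pixel computes where to read from in the source, reads that pixel with edge clamping, and writes the result to a separate image. The pinch-toward-centre formula, the Float4 type and the function names are all hypothetical.

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } Float4;

/* Nearest-pixel read with edge clamping from a row-major w*h image. */
static Float4 read_edge(const Float4 *img, int w, int h, int x, int y)
{
    if (x < 0) x = 0;
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return img[(size_t)y * w + x];
}

/* Hypothetical warp pass, NOT the sample's circle() kernel. Every output
   pixel reads a source pixel pulled toward the image centre by `strength`
   (0 = no change, 1 = everything reads the centre). src is only read and
   dst only written, so they must be different images. */
void warp_cpu(const Float4 *src, Float4 *dst, int w, int h, float strength)
{
    float cx = 0.5f * (float)(w - 1);
    float cy = 0.5f * (float)(h - 1);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float sx = cx + ((float)x - cx) * (1.0f - strength);
            float sy = cy + ((float)y - cy) * (1.0f - strength);
            /* Round half up (fine for the non-negative coordinates produced
               when 0 <= strength <= 1); read_edge clamps anything outside. */
            int ix = (int)(sx + 0.5f);
            int iy = (int)(sy + 0.5f);
            dst[(size_t)y * w + x] = read_edge(src, w, h, ix, iy);
        }
    }
}
```

Writing into a second image keeps every read seeing the original, unwarped pixels, which is why the pass needs a destination separate from its source.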

Some final notes about the Fusion-specific OpenCL bits:

   * FuReadImage_t is equivalent to “__read_only image2d_t”
   * FuWriteImage_t is equivalent to “__write_only image2d_t”
   * FuReadImagef() and FuWriteImagef() are used similarly to OpenCL’s read_imagef() and write_imagef(), taking integer pixel coordinates, with a black canvas (reads outside the image return black) and no subpixel filtering
   * FuReadImageCf() instead clamps out-of-bounds reads to the edge pixels (i.e. duplicate-edge behaviour)
   * FuSampleImageCf() takes normalised (0..1) float coordinates, and does edge clamping and subpixel filtering

These types and functions are optional; you may use image2d_t and read_imagef() instead. The advantage of the Fusion versions is that they still work on OpenCL devices that do not support image2d_t (i.e. CPUs and older ATi GPUs), by substituting an array of float4s.
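The three read behaviours listed above can be modelled on the CPU in plain C. This is a hedged sketch, not Fusion’s implementation: the Float4 type, the function names and the row-major float4 array layout are assumptions (chosen to echo the array-of-float4s fallback). “Black” is taken to mean all four channels zero, and the bilinear sampler assumes pixel centres at (i + 0.5) / size, the same convention OpenCL uses for normalised coordinates with linear filtering.

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } Float4;

/* Floor for floats without needing libm. */
static int ifloor(float f)
{
    int i = (int)f;
    return (f < (float)i) ? i - 1 : i;
}

/* Like FuReadImagef(): integer coordinates, black (all zeros) outside. */
Float4 read_black(const Float4 *img, int w, int h, int x, int y)
{
    if (x < 0 || y < 0 || x >= w || y >= h) {
        Float4 black = { 0.0f, 0.0f, 0.0f, 0.0f };
        return black;
    }
    return img[(size_t)y * w + x];
}

/* Like FuReadImageCf(): integer coordinates, clamped to the edge pixels. */
Float4 read_clamped(const Float4 *img, int w, int h, int x, int y)
{
    x = x < 0 ? 0 : (x >= w ? w - 1 : x);
    y = y < 0 ? 0 : (y >= h ? h - 1 : y);
    return img[(size_t)y * w + x];
}

static Float4 lerp4(Float4 a, Float4 b, float t)
{
    Float4 r = { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
                 a.z + (b.z - a.z) * t, a.w + (b.w - a.w) * t };
    return r;
}

/* Like FuSampleImageCf(): normalised (0..1) coordinates, bilinear blend of
   the four nearest pixels, with edge clamping. */
Float4 sample_bilinear(const Float4 *img, int w, int h, float u, float v)
{
    float fx = u * (float)w - 0.5f;   /* continuous pixel coordinates */
    float fy = v * (float)h - 0.5f;
    int x0 = ifloor(fx), y0 = ifloor(fy);
    float tx = fx - (float)x0, ty = fy - (float)y0;
    Float4 top = lerp4(read_clamped(img, w, h, x0, y0),
                       read_clamped(img, w, h, x0 + 1, y0), tx);
    Float4 bottom = lerp4(read_clamped(img, w, h, x0, y0 + 1),
                          read_clamped(img, w, h, x0 + 1, y0 + 1), tx);
    return lerp4(top, bottom, ty);
}
```

On a two-pixel image going from 0 to 1, sampling at u = 0.5 lands halfway between the pixel centres and blends to 0.5. Sampling at u = 0 or u = 1 clamps to the edge pixels rather than fading to black.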

Any OpenCL errors or messages are displayed in the Console view, if you have Verbose Console Messages turned on in Prefs/OpenCL.

A final note: current drivers are quite sensitive to OpenCL kernel execution errors. For example, pointer overruns can bork the whole OpenCL state, throwing persistent CL_OUT_OF_RESOURCES errors until you restart Fusion. Worse, a perfectly valid OpenCL kernel that runs too long (say, a few seconds) is deemed by the Windows driver watchdog to have locked up, and the entire graphics driver is killed and restarted (which will require a restart of Fusion, and possibly more besides). The only workarounds are to split the execution of complex kernels into smaller chunks, as the Noise and QJulia fuses do, or to disable the Windows watchdog timer altogether (not recommended).
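The chunking used by the Noise and QJulia fuses isn’t shown here. As a hedged sketch of the general idea, the C function below plans horizontal bands of rows so that each launch’s estimated runtime stays under a chosen budget. The per-row cost estimate, the budget and plan_bands() itself are hypothetical. Each band would then be run as a separate kernel launch, for instance by passing its first row to the kernel as an extra argument.

```c
/* Splits `height` rows into bands whose estimated runtime
   (rows * ms_per_row) stays within budget_ms. Writes each band's first row
   and row count, and returns the number of bands (at most max_bands). */
int plan_bands(int height, double ms_per_row, double budget_ms,
               int *first_row, int *row_count, int max_bands)
{
    int rows_per_band = height;            /* default: everything in one go */
    if (ms_per_row > 0.0) {
        double fit = budget_ms / ms_per_row;
        if (fit < (double)height)
            /* Always make progress, even if a single row exceeds the budget. */
            rows_per_band = fit < 1.0 ? 1 : (int)fit;
    }

    int n = 0;
    for (int first = 0; first < height && n < max_bands; first += rows_per_band) {
        first_row[n] = first;
        row_count[n] = (height - first < rows_per_band) ? height - first
                                                        : rows_per_band;
        n++;
    }
    return n;
}
```

For a 1080-row image at an estimated 5 ms per row and a 500 ms budget, this plans eleven launches: ten of 100 rows and a final one of 80. Picking a budget well below the watchdog timeout leaves headroom for cost estimates that turn out to be optimistic.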

More complete documentation on Fusion’s OpenCL API for fuses is available here.

Eyeon:Manual/Fusion 6/About OpenCL