Eyeon:Script/Reference/Applications/Fuse/OpenCL Fuse Reference Manual
From VFXPedia
OpenCL for Fuses
For pixel processing, OpenCL can speed up fuses dramatically, in some cases making them many times faster than native-compiled tools. Even when running on the CPU rather than the GPU, OpenCL's on-the-fly compilation can still provide a significant performance boost over regular interpreted fuses.
Using OpenCL with your existing tool fuses can be quite simple.
- One or more OpenCL "kernels" are written - functions that are called repeatedly, usually for each pixel. These are stored as a string constant, which is compiled with BuildCachedProgram() in Create().
- Source Images or buffers can be uploaded to the OpenCL host with CreateImage() or CreateBuffer().
- An empty Image or other buffer should be passed to the OpenCL host using CreateImage() or CreateBuffer() with the "write" flag. This is the destination for the processed data.
- CreateKernel() is called to identify the kernel to be used.
- Call SetArg() for each argument to pass to the kernel, including images/buffers.
- You may need to call SetSize() to specify the buffer dimensions (not needed for Images), and SetWorkgroupSize() can be used to tweak the workgroup dimensions.
- To start the processing, RunKernel() executes the kernel for each pixel or element in the destination Image or buffer.
- Download() retrieves the processed pixels/data.
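The execution model behind these steps can be mimicked without OpenCL. The following plain C sketch is an illustration only, not Fusion's API: `gain_kernel` and `run_gain` are hypothetical names, the `float4` struct stands in for OpenCL's built-in type, and the row-major buffer layout is an assumption. It shows a kernel body being invoked once per destination pixel, reading from a source buffer and writing to a separate destination buffer, which is roughly what RunKernel() arranges on the OpenCL host.

```c
#include <stddef.h>

/* Stand-in for OpenCL's built-in float4 (RGBA). */
typedef struct { float x, y, z, w; } float4;

/* Hypothetical kernel body: handles exactly one pixel.
   gx and gy play the role of get_global_id(0) and get_global_id(1). */
static void gain_kernel(const float4 *src, float4 *dst,
                        int width, float gain, int gx, int gy)
{
    size_t i = (size_t)gy * (size_t)width + (size_t)gx; /* assumed row-major */
    dst[i].x = src[i].x * gain;
    dst[i].y = src[i].y * gain;
    dst[i].z = src[i].z * gain;
    dst[i].w = src[i].w;            /* alpha is left untouched */
}

/* Serial emulation of one kernel run over a width x height job.
   Returns the number of kernel invocations. */
int run_gain(const float4 *src, float4 *dst, int width, int height, float gain)
{
    int calls = 0;
    for (int gy = 0; gy < height; ++gy) {
        for (int gx = 0; gx < width; ++gx) {
            gain_kernel(src, dst, width, gain, gx, gy);
            ++calls;
        }
    }
    return calls;
}
```

On a real OpenCL host the invocations run in parallel and in no guaranteed order, which is why each invocation in this sketch writes only its own destination pixel.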
A discussion of an example OpenCL fuse can be found here.
API
There are three main classes you’ll need to use when adding OpenCL support to your fuses:
- OCLManager
- OCLProgram
- OCLMemory
OCLManager
Functions
- mgr = OCLManager()
- Get the global manager
- OCLProgram = BuildProgram(string sourcecode [, number sourcelen [, string compileoptions]])
- Compiles the OpenCL program source, which may take some time (often 10-15 seconds). Compile errors are output to the Console view (error line numbers may be offset). If sourcelen is not specified, it is calculated automatically. Compileoptions are as per OpenCL spec. Note: returns immediately; compiling happens on a background thread.
- OCLProgram = BuildCachedProgram(string ID, string sourcefile [, string sourcecode [, number sourcelen [, string compileoptions]]])
- OCLProgram = BuildCachedProgram(Registry regnode [, string sourcecode [, number sourcelen [,string compileoptions]]])
- Similar to BuildProgram(), but caches the result to Fusion:OCLPrograms, and can be much quicker (though still a few seconds). ID should be a unique string. Sourcefile may be the path to a .cl file (which is loaded and compiled), or alternatively to a file that contains the source code within it, such as the fuse itself (if the latter, the sourcecode arg must be given). The timestamp on sourcefile is compared against the cached file, and the program is recompiled if the source is newer. If a Registry node is passed instead of ID and sourcefile, the sourcefile's path is assumed to be "Shaders:<regnode_ID>.cl", or "Fusion:Shaders\<regnode_ID>.cl". In either case, if no cache or source file can be found, sourcecode is compiled instead.
Variables
- SupportsImage
- True if the OpenCL image2d_t/image3d_t types are supported
- GLSharing
- True if sharing of OpenCL images and OpenGL textures is supported
- FP16
- True if the half (float16) type is supported
- FP64
- True if the double (float64) type is supported
OCLProgram
Functions
- kernel = CreateKernel(string kernelname)
- Waits until any background compiling is finished, then returns a handle to your kernel function.
- success = RunKernel(kernel [, boolean wait])
- Runs the kernel. If wait is true or not given, it blocks until the kernel has finished.
- SetSize(number width [, number height])
- Sets the dimensions of the next RunKernel() call. This determines how many times (width * height) the kernel will be executed, and thus the range of indices that get_global_id() will return. If height is not specified, the kernel is executed width times.
- SetWorkgroupSize(number x [, number y])
- Sets the dimensions of the workgroup used for the next RunKernel() call. Workgroups are groups of host processors that share __local memory, and jobs are divided into groups of x * y elements. Setting x and y too small or too large can cause under-utilisation, reducing performance on some hosts. Values that are an integer fraction of the job size and that match the host's architecture often work best. If not specified, Fusion will attempt to pick reasonable values.
- SetArg(kernel, number argindex, value [, value [, value [, value]]])
- Passes a value to an argument in your kernel. Argindex is zero-based, so SetArg(kernel, 0, 5.0) passes 5.0 for the first argument. All arguments must be set before running a kernel, and argument types must match the kernel’s args. Up to 4 scalar values can be passed for vector types, e.g. float4. Lua/Fusion types are matched to OpenCL types as follows:
Fusion type | OpenCL type |
---|---|
1, 2 or 4 numbers | float, float2, float4 |
table of 1, 2 or 4 numbers | float, float2, float4 |
Vector2 | float2 |
Vector4 | float4 |
Matrix4 | float16 |
OCLMemory (from CreateImage()) | image2d_t (or FuReadImage_t, FuWriteImage_t) |
OCLMemory (from CreateBuffer()) | __global pointer to any OpenCL-supported type |
- SetArgInt(kernel, number argindex, value [, value [, value [, value]]])
- Similar to SetArg(), but for setting int/uint arguments, as follows:
Fusion type | OpenCL type |
---|---|
1, 2 or 4 numbers | int, uint, int2, uint2, int4, uint4 |
table of 1, 2 or 4 numbers | int, uint, int2, uint2, int4, uint4 |
Vector2 | int2, uint2 |
Vector4 | int4, uint4 |
Matrix4 | int16, uint16 |
- SetArgBufferSize(kernel, number argindex, number bufsize)
- Used for specifying the size of __local buffers. Bufsize is the number of bytes to allocate.
- SetArgDataWindow(kernel, number argindex, Image)
- Takes the DataWindow from the given Fusion Image, and passes the left/bottom/width/height to a uint4.
- OCLMemory = CreateImage(Image [, string flags])
- Creates an OpenCL image2d_t object, and uploads the contents of the given Fusion Image to it. Flags may include any combination of “read”, “write” and “gl” (default is “read”). If “gl” is specified and OpenGL interoperability is supported, an OpenGL texture is created and cached with the Image, the Image contents are uploaded (if not cached) to an OpenGL texture, then shared with OpenCL (Note: some restrictions apply, for example GL_FLOAT GL_BGRA textures are not supported, and GL texture outputs are displayed but not always read back). Otherwise, the Image is uploaded directly each time, and no caching is done.
- OCLMemory = CreateBuffer(Image [, string flags])
- OCLMemory = CreateBuffer(table, string type [, string flags])
- Creates an OpenCL __global buffer of the Image’s pixel type (e.g. float images produce a __global float4 buffer). Flags can currently be “read” or “write”. If a table is specified, the table is uploaded to a __global buffer, as specified by the type string (default is bytes):
Type string | Buffer elements |
---|---|
“int8” | bytes |
“int16” | shorts |
“float16” | halfs |
“float32” | floats |
- Download(OCLMemory, Image)
- Download(OCLMemory, table, number size [, string type])
- Copies the contents of an OCLMemory image or buffer to a Fusion Image (size & depth must match), or from an OCLMemory buffer to a table of size bytes, as type of “int8”, “int16”, “float16” or “float32”.
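The SetWorkgroupSize() advice above, that values which are an integer fraction of the job size often work best, can be turned into a small helper. This is a sketch in plain C under stated assumptions, not part of Fusion: `largest_divisor_up_to` is a hypothetical name, and the upper limit passed in (16 in the usage note) is an arbitrary illustrative value rather than a figure from any host's documentation.

```c
/* Returns the largest d with 1 <= d <= limit such that job_size % d == 0.
   Falls back to 1 for non-positive inputs or when no larger divisor fits. */
int largest_divisor_up_to(int job_size, int limit)
{
    if (job_size < 1 || limit < 1) {
        return 1;
    }
    if (limit > job_size) {
        limit = job_size;          /* a group cannot exceed the job itself */
    }
    for (int d = limit; d > 1; --d) {
        if (job_size % d == 0) {
            return d;              /* first hit from the top is the largest */
        }
    }
    return 1;
}
```

For a 1920 x 1080 job with a limit of 16 per axis this gives 16 and 15. For awkward sizes such as a width of 1921 it falls back to 1, which is likely to under-utilise the host; in that case leaving the workgroup size unset, so that Fusion picks the values, is probably the better choice.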
OCLMemory
Functions
- Release()
- Releases the OpenCL image/buffer. This is called automatically when all references to a local buffer are released and the variable has gone out of scope. You can also use this function to release images/buffers sooner if desired.
Fusion Image Types
Types
As a portability convenience, Fusion also supplies the FuReadImage_t and FuWriteImage_t types to compiled kernels. These are equivalent to OpenCL's image2d_t type on supporting hosts, and otherwise to float4*. They can be helpful when running the same fuse on a local workstation with a GPU, and on GPU-less render farms. See the following table for precise type mapping:
Fusion type | OpenCL Image type | OpenCL non-Image type |
---|---|---|
FuReadImage_t | __read_only image2d_t | const __global float4 * |
FuWriteImage_t | __write_only image2d_t | __global float4 * |
Functions
Additionally, common image read/write functions that use these types are supplied, which compile to appropriate code for each type of host. Also, OCLProgram's CreateImage() will automatically call CreateBuffer() instead if the host does not support OpenCL images, so that the same fuse can run on either kind of host.
- float4 FuReadImagef(FuReadImage_t img, int2 coords, int2 size)
- Reads a pixel from img at integer pixel (0..w/h) coordinates coords, when the image has size dimensions. Returns the pixel as an RGBA float4 with normalised (0..1) colour values, and returns (0,0,0,0) when reading beyond the image bounds.
- float4 FuReadImageCf(FuReadImage_t img, int2 coords, int2 size)
- Reads a pixel from img at integer pixel (0..w/h) coordinates coords, when the image has size dimensions. Returns the pixel as an RGBA float4 with normalised (0..1) colour values, and clamps out-of-bound reads (i.e. returns the nearest image edge pixel instead).
- float4 FuSampleImagef(FuReadImage_t img, float2 coords, int2 size)
- Bilinear-samples a pixel from img at normalised image (0..1) coordinates coords, when the image has size dimensions. Returns the subpixel sample as an RGBA float4 with normalised (0..1) colour values, and returns (0,0,0,0) when reading beyond the image bounds.
- float4 FuSampleImageCf(FuReadImage_t img, float2 coords, int2 size)
- Bilinear-samples a pixel from img at normalised image (0..1) coordinates coords, when the image has size dimensions. Returns the subpixel sample as an RGBA float4 with normalised (0..1) colour values, and clamps out-of-bound reads (i.e. returns the nearest image edge pixel instead).
- FuWriteImagef(FuWriteImage_t img, int2 coords, int2 size, float4 pix)
- Writes the pixel pix to img at integer pixel (0..w/h) coordinates coords, when the image has size dimensions. Writing beyond the image bounds is an undefined operation, and may cause an error or even crash on some hosts.
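To make the two out-of-bounds behaviours concrete, here is a plain C sketch of integer-coordinate reads on a float4 buffer (the non-Image case). It is not Fusion's implementation: `read_border` and `read_edge` are hypothetical names modelled on the descriptions of FuReadImagef() and FuReadImageCf(), the struct stands in for OpenCL's float4, and a row-major layout is assumed.

```c
/* Stand-in for OpenCL's built-in float4 (RGBA). */
typedef struct { float x, y, z, w; } float4;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* FuReadImagef()-style read: (0,0,0,0) outside the image bounds. */
float4 read_border(const float4 *img, int x, int y, int w, int h)
{
    float4 zero = {0.0f, 0.0f, 0.0f, 0.0f};
    if (x < 0 || y < 0 || x >= w || y >= h) {
        return zero;
    }
    return img[y * w + x];         /* assumed row-major layout */
}

/* FuReadImageCf()-style read: out-of-bounds coordinates are clamped,
   so the nearest edge pixel is returned instead. */
float4 read_edge(const float4 *img, int x, int y, int w, int h)
{
    int cx = clampi(x, 0, w - 1);
    int cy = clampi(y, 0, h - 1);
    return img[cy * w + cx];
}
```

The clamped variant tends to suit neighbourhood filters such as blurs, where zero-valued reads beyond the border would darken the image edges.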
These functions are equivalent to native OpenCL functions with sampler_t flags as follows:
Fusion function | OpenCL function | sampler_t flags |
---|---|---|
FuReadImagef() | read_imagef() | CLK_NORMALIZED_COORDS_FALSE, CLK_ADDRESS_CLAMP, CLK_FILTER_NEAREST |
FuReadImageCf() | read_imagef() | CLK_NORMALIZED_COORDS_FALSE, CLK_ADDRESS_CLAMP_TO_EDGE, CLK_FILTER_NEAREST |
FuSampleImagef() | read_imagef() | CLK_NORMALIZED_COORDS_TRUE, CLK_ADDRESS_CLAMP, CLK_FILTER_LINEAR |
FuSampleImageCf() | read_imagef() | CLK_NORMALIZED_COORDS_TRUE, CLK_ADDRESS_CLAMP_TO_EDGE, CLK_FILTER_LINEAR |
FuWriteImagef() | write_imagef() | n/a |
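The two linear-filtered rows of this table can be illustrated with a plain C sketch of bilinear sampling at normalised coordinates. It follows the linear-filter rule given in the OpenCL specification (scale by the image size, subtract 0.5, then blend the four surrounding texels); whether Fusion's float4 buffer fallback computes exactly the same result is an assumption. `sample_linear` and its helpers are hypothetical names, and a row-major layout is assumed.

```c
/* Stand-in for OpenCL's built-in float4 (RGBA). */
typedef struct { float x, y, z, w; } float4;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* floor() for values well inside int range, without needing libm. */
static int floor_to_int(float v)
{
    int i = (int)v;                /* truncates towards zero */
    return ((float)i > v) ? i - 1 : i;
}

/* Fetches one texel, either clamped to the edge or with a zero border. */
static float4 texel(const float4 *img, int x, int y, int w, int h, int clamp_to_edge)
{
    float4 zero = {0.0f, 0.0f, 0.0f, 0.0f};
    if (clamp_to_edge) {
        x = clampi(x, 0, w - 1);
        y = clampi(y, 0, h - 1);
    } else if (x < 0 || y < 0 || x >= w || y >= h) {
        return zero;
    }
    return img[y * w + x];         /* assumed row-major layout */
}

static float4 lerp4(float4 p, float4 q, float t)
{
    float4 r = { p.x + (q.x - p.x) * t, p.y + (q.y - p.y) * t,
                 p.z + (q.z - p.z) * t, p.w + (q.w - p.w) * t };
    return r;
}

/* s and t are normalised (0..1) coordinates. clamp_to_edge = 0 mimics
   CLK_ADDRESS_CLAMP, non-zero mimics CLK_ADDRESS_CLAMP_TO_EDGE. */
float4 sample_linear(const float4 *img, float s, float t,
                     int w, int h, int clamp_to_edge)
{
    float u = s * (float)w - 0.5f;
    float v = t * (float)h - 0.5f;
    int i0 = floor_to_int(u);
    int j0 = floor_to_int(v);
    float a = u - (float)i0;       /* horizontal blend weight */
    float b = v - (float)j0;       /* vertical blend weight */
    float4 row0 = lerp4(texel(img, i0, j0, w, h, clamp_to_edge),
                        texel(img, i0 + 1, j0, w, h, clamp_to_edge), a);
    float4 row1 = lerp4(texel(img, i0, j0 + 1, w, h, clamp_to_edge),
                        texel(img, i0 + 1, j0 + 1, w, h, clamp_to_edge), a);
    return lerp4(row0, row1, b);
}
```

Note that pixel centres sit at half-pixel offsets under this rule, so sampling exactly at a coordinate of 1.0 already blends in the neighbour beyond the image. With the zero border the result fades towards black at the edges, while the clamped variant repeats the edge pixel.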
Support
Hosts that support the OpenCL image2d_t type include nVidia CUDA-enabled chips and ATi 5xxx+ chips (with the Stream SDK 2.1 or later). Hosts that typically do not support any image types include CPUs.
Note that hosts can vary greatly in characteristics. For example, some hosts cache global data (particularly images), speeding up repeated reads, while some hosts do not. Since memory bandwidth and access patterns are often the bottleneck when processing pixels in OpenCL, this can make a big difference to performance, though even slower hosts will likely still be faster than plain interpreted fuses.
The contents of this page are copyright by eyeon Software.