Pixel Bender Assembler

Pixel Bender is an Adobe technology that is now part of the Flash Player 10 API (where it's named Shader).

A Pixel Bender script can perform floating-point calculations in parallel, using available CPU instructions such as SIMD.

In short, it's a way to crunch numbers very efficiently, and it can also be used in Flash Player 10 to write pixel shaders for custom blend modes, filters or graphics fills.

The Pixel Bender Toolkit is freely available and you can use it to experiment with custom filters. It takes a PBK script source file and can export it to a PBJ binary file, which is a compiled version of a Pixel Bender script that can be loaded by the Flash Player.

The latest Haxe file format library contains complete support for reading and writing PBJ files, enabling you to write Pixel Bender assembler directly in Haxe, then compile it on-the-fly into PBJ bytes, which can then be saved on disk or loaded directly in Flash.

Installation

In order to use the Pixel Bender assembler, you need to install the format library :

haxelib install format

Reading PBJ

Here's a small example that uses the Haxe PBJ support to load a PBJ file and dump its content and assembler :

[pbjreader.swf]

The source code of the reader is quite simple. Here is the PBJReader.hx source file :

import flash.events.Event;
import flash.events.MouseEvent;

class PBJReader {

    static var tf : flash.text.TextField;

    static function main() {
        var root = flash.Lib.current;
        tf = new flash.text.TextField();
        tf.width = root.stage.stageWidth;
        tf.height = root.stage.stageHeight;
        tf.text = "Click to load PBJ file";
        root.addChild(tf);
        root.stage.addEventListener(MouseEvent.CLICK,onClick);
    }

    static function onClick(_) {
        var l = new flash.net.FileReference();
        l.addEventListener(Event.SELECT,function(_) l.load());
        l.addEventListener(Event.COMPLETE,function(_) dumpPBJ(l.data));
        l.browse([new flash.net.FileFilter("Pixel Bender File","*.pbj")]);
    }

    static function dumpPBJ( data : flash.utils.ByteArray ) {
        var bytes = haxe.io.Bytes.ofData(data);
        var input = new haxe.io.BytesInput(bytes);
        var reader = new format.pbj.Reader(input);
        var pbj = reader.read();
        tf.text = format.pbj.Tools.dump(pbj);
    }

}

It can be compiled with Haxe by using the following HXML file :

-swf pbjreader.swf
-swf-version 10
-main PBJReader
-lib format

By reading PBJ files, you can see how the Adobe Pixel Bender Toolkit compiles your PBK file into assembler.
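The same reading code is not tied to Flash. Here is a minimal sketch of a command-line variant, assuming a sys target such as Neko and that format.pbj.Reader behaves there as it does in Flash (the class name PBJDump and the helper dumpFile are made up for this example) :

```haxe
import sys.io.File;

class PBJDump {

    // decode a PBJ file and return its textual dump,
    // using the same Reader / Tools.dump calls as PBJReader above
    public static function dumpFile( path : String ) : String {
        var input = new haxe.io.BytesInput(File.getBytes(path));
        var pbj = new format.pbj.Reader(input).read();
        return format.pbj.Tools.dump(pbj);
    }

    static function main() {
        var args = Sys.args();
        if( args.length != 1 ) {
            Sys.println("Usage : PBJDump <file.pbj>");
            Sys.exit(1);
        }
        Sys.println(dumpFile(args[0]));
    }

}
```

It should compile with an HXML file containing -neko pbjdump.n, -main PBJDump and -lib format.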

Optimizations

If you look at the assembler produced by the Adobe Pixel Bender Toolkit, you'll notice that a lot of operations are poorly optimized. Take for example the following PBK file, which simply multiplies the pixels of two textures :

<languageVersion : 1.0;>

kernel Multiply
<   namespace : "";
    vendor : "";
    version : 0;
>
{
    input image4 background;
    input image4 foreground;
    output pixel4 dst;

    void evaluatePixel()
    {
        dst = sampleNearest(background,outCoord())
         * sampleNearest(foreground,outCoord());
    }
}

It will get compiled (with current Toolkit version 1.1) to the following assembler :

  sampleNearest f2, text0[f0.rg]
  sampleNearest f3, text1[f0.rg]
  mov f4, f2
  mul f4, f3
  mov f1, f4

Do you notice that there are two mov operations that could be avoided ? The optimized code would then be the following :

  sampleNearest f2, text0[f0.rg]
  sampleNearest f1, text1[f0.rg]
  mul f1, f2

That's one of the reasons it's quite nice to create PBJ directly from assembler : you know exactly how it will run, and you can optimize it further than the Toolkit compiler currently does.
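Why is the shorter sequence safe ? It samples the foreground straight into the output register f1 and multiplies it by the background, so the operands are swapped compared to the Toolkit output, but a channel-wise multiply gives the same result in either order. The following plain Haxe sketch models both sequences on 4-channel values to check this. It is only an illustration (registers are modelled as arrays, nothing here runs Pixel Bender code) :

```haxe
class MovElimination {

    // channel-wise multiply, like the mul opcode on a 4-channel register
    static function mul( a : Array<Float>, b : Array<Float> ) : Array<Float> {
        return [for( i in 0...4 ) a[i] * b[i]];
    }

    // sequence emitted by the Toolkit : two copies around the multiply
    public static function toolkit( bg : Array<Float>, fg : Array<Float> ) : Array<Float> {
        var f2 = bg;          // sampleNearest f2, text0[f0.rg]
        var f3 = fg;          // sampleNearest f3, text1[f0.rg]
        var f4 = f2.copy();   // mov f4, f2
        f4 = mul(f4, f3);     // mul f4, f3
        var f1 = f4.copy();   // mov f1, f4
        return f1;
    }

    // hand-written sequence : sample straight into the output register
    public static function optimized( bg : Array<Float>, fg : Array<Float> ) : Array<Float> {
        var f2 = bg;          // sampleNearest f2, text0[f0.rg]
        var f1 = fg;          // sampleNearest f1, text1[f0.rg]
        f1 = mul(f1, f2);     // mul f1, f2
        return f1;
    }

    static function main() {
        var bg = [1.0, 0.5, 0.25, 1.0];
        var fg = [0.5, 0.5, 0.5, 1.0];
        // both lines should show the same four values
        trace(toolkit(bg, fg));
        trace(optimized(bg, fg));
    }

}
```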

Creating PBJ

Let's see how we implement the previous PBJ assembler in Haxe for Flash Player 10 :

import format.pbj.Data;

class TestShader {

    static function main() {
        var bytes = initPBJ();
        // creates a shader from PBJ bytes
        var shader = new flash.display.Shader(bytes.getData());
        // add one shape
        var root = flash.Lib.current;
        var a = new flash.display.Shape();
        a.graphics.beginFill(0xFFFF00);
        a.graphics.drawCircle(50,50,100);
        root.addChild(a);
        // add a second shape
        var b = new flash.display.Shape();
        b.graphics.beginFill(0x00FFFF);
        b.graphics.drawCircle(120,120,100);
        root.addChild(b);
        // use our shader to perform Multiply blendMode
        b.blendMode = flash.display.BlendMode.SHADER;
        b.blendShader = shader;
    }

    static function initPBJ() {
        var pbj : PBJ = {
            version : 1,
            name : "Multiply",
            metadatas : [],
            // the parameters are the input/output of the shader
            // see PBJ Reference below for a full description
            parameters : [
                { name : "_OutCoord", p : Parameter(TFloat2,false,RFloat(0,[R,G])), metas : [] },
                { name : "background", p : Texture(4,0), metas : [] },
                { name : "foreground", p : Texture(4,1), metas : [] },
                { name : "dst", p : Parameter(TFloat4,true,RFloat(1)), metas : [] },
            ],
            // this is our assembler code for the shader, you can see it's similar
            // to what we have written in previous section
            code : [
                OpSampleNearest(RFloat(2),RFloat(0,[R,G]),0),
                OpSampleNearest(RFloat(1),RFloat(0,[R,G]),1),
                OpMul(RFloat(1),RFloat(2)),
            ],
        };
        var output = new haxe.io.BytesOutput();
        var writer = new format.pbj.Writer(output);
        writer.write(pbj);
        return output.getBytes();
    }

}

You can compile this example by using the following HXML file :

-swf shader.swf
-swf-version 10
-main TestShader
-lib format

As you will notice if you measure performance, our custom PBJ assembler is faster than the one produced by the Adobe Pixel Bender Toolkit, but slower than the native MULTIPLY blend mode. Why ? Because even though shaders are optimized and parallelized over multicore CPUs, they still need to convert RGBA to four float values and back again... for each pixel.

This confirms that shaders are good for "advanced" complex effects but for simple things, combining native blend modes is still faster.
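The bytes returned by the writer do not have to go straight into a Shader : they can also be saved on disk as a regular PBJ file. Here is a minimal sketch, assuming a sys target such as Neko. The "Copy" kernel, the class name SavePBJ and the file name copy.pbj are made up for this example; the kernel only copies its input texture to the output.

```haxe
import format.pbj.Data;
import sys.io.File;

class SavePBJ {

    // a minimal kernel that copies its input texture to the output
    public static function build() : PBJ {
        return {
            version : 1,
            name : "Copy",
            metadatas : [],
            parameters : [
                { name : "_OutCoord", p : Parameter(TFloat2,false,RFloat(0,[R,G])), metas : [] },
                { name : "src", p : Texture(4,0), metas : [] },
                { name : "dst", p : Parameter(TFloat4,true,RFloat(1)), metas : [] },
            ],
            code : [
                OpSampleNearest(RFloat(1),RFloat(0,[R,G]),0),
            ],
        };
    }

    // same Writer calls as in initPBJ above
    public static function encode( pbj : PBJ ) : haxe.io.Bytes {
        var output = new haxe.io.BytesOutput();
        new format.pbj.Writer(output).write(pbj);
        return output.getBytes();
    }

    static function main() {
        File.saveBytes("copy.pbj", encode(build()));
        // read the file back to check that it decodes
        var input = new haxe.io.BytesInput(File.getBytes("copy.pbj"));
        var pbj = new format.pbj.Reader(input).read();
        Sys.println(format.pbj.Tools.dump(pbj));
    }

}
```

It should compile with an HXML file containing -neko savepbj.n, -main SavePBJ and -lib format. The saved file can then be opened with the PBJ reader from the Reading PBJ section.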

PBJ ASM Reference

A PBJ data structure contains the following fields :

  • version : the version of the PBJ file (only 1 is accepted so far)
  • name : the name of the PBJ script
  • metadatas : the metadata specified in the PBJ script
  • parameters : the list of input and output parameters and textures (see below)
  • code : the assembler instructions that perform operations

Parameters

Each parameter has a name and a metas array of metadata, which can contain information such as minimum, maximum and default values. The p field tells which type of parameter it is :

  • Texture : a texture with a given number of color channels (4 in our example) and an index that is used in SampleNearest and SampleLinear assembler operations
  • Parameter : it takes three arguments which are, in that order : the type of the parameter (see below for a complete list), a boolean that tells if the parameter is an input (false) or an output (true), and a register which tells where the parameter will be accessible.

Please note that you can store several parameters in the same register by using different channels. For instance the _OutCoord parameter stores the (X,Y) pixel position in the R and G channels of the register 0. Additional parameters might use channels B and A which are available.
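As an illustration of sharing a register, here is a sketch of a kernel with one extra scalar input stored in the free B channel of register 0. The "amount" parameter, the "Scale" kernel and the class name SharedRegister are hypothetical, and the [B,B,B,B] swizzle used to replicate the scalar over the four channels is assumed to be accepted by the writer and by Flash :

```haxe
import format.pbj.Data;

class SharedRegister {

    public static function build() : PBJ {
        return {
            version : 1,
            name : "Scale",
            metadatas : [],
            parameters : [
                // (X,Y) pixel position in channels R,G of float register 0
                { name : "_OutCoord", p : Parameter(TFloat2,false,RFloat(0,[R,G])), metas : [] },
                // hypothetical scalar input stored in channel B of the same register
                { name : "amount", p : Parameter(TFloat,false,RFloat(0,[B])), metas : [] },
                { name : "src", p : Texture(4,0), metas : [] },
                { name : "dst", p : Parameter(TFloat4,true,RFloat(1)), metas : [] },
            ],
            code : [
                OpSampleNearest(RFloat(1),RFloat(0,[R,G]),0),
                // multiply every channel of the pixel by amount
                // (assumption : a repeated-channel swizzle is valid here)
                OpMul(RFloat(1),RFloat(0,[B,B,B,B])),
            ],
        };
    }

    static function main() {
        // list each parameter with its type and register
        for( p in build().parameters )
            trace(p.name + " : " + Std.string(p.p));
    }

}
```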

Registers

There are two types of registers : floating point registers RFloat and integer registers RInt (boolean registers are not yet supported in Flash). Each register has an index and a list of channels. For instance RFloat(3,[G,A]) means "the green and alpha channels of float register #3".

When writing assembler, you can swizzle input registers and mask output registers (for details about this, see Pixel Bender Toolkit help).
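In Haxe, both are written with the channel list of the register. A small sketch, where the register numbers are arbitrary and the class name SwizzleMask is made up for this example (which channel combinations are legal is defined by Pixel Bender, not by this page) :

```haxe
import format.pbj.Data;

class SwizzleMask {

    public static function build() {
        return [
            // f1.rg = f2.ba : only R and G of f1 are written (mask),
            // and they receive B and A of f2 (swizzle)
            OpMov(RFloat(1,[R,G]),RFloat(2,[B,A])),
            // f1.a = f2.r : copy a single channel
            OpMov(RFloat(1,[A]),RFloat(2,[R])),
        ];
    }

    static function main() {
        for( op in build() )
            trace(op);
    }

}
```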

Types

The following types are available :

    TFloat;
    TFloat2;
    TFloat3;
    TFloat4;
    TFloat2x2;
    TFloat3x3;
    TFloat4x4;
    TInt;
    TInt2;
    TInt3;
    TInt4;

The additional type TString is only used by metadata.

Opcodes

The following opcodes are available :

Basic Operations

  • OpNop : does nothing
  • OpMov(dst,src) : dst = src
  • OpAdd(dst,src) : dst = dst + src
  • OpSub(dst,src) : dst = dst - src
  • OpMul(dst,src) : dst = dst * src
  • OpRcp(dst,src) : dst = 1 / src
  • OpDiv(dst,src) : dst = dst / src
  • OpPow(dst,src) : dst = pow(dst,src)
  • OpMod(dst,src) : dst = dst % src
  • OpMin(dst,src) : dst = min(dst,src)
  • OpMax(dst,src) : dst = max(dst,src)
  • OpSqrt(dst,src) : dst = sqrt(src)
  • OpRSqrt(dst,src) : dst = 1 / sqrt(src)
  • OpAbs(dst,src) : dst = abs(src)
  • OpStep(dst,src) : dst = (dst < src) ? 1.0 : 0.0
  • OpSign(dst,src) : dst = (src == 0) ? 0 : ((src < 0) ? -1 : 1)
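Since most of these operations read dst and then overwrite it, an expression with more than two operands has to be built step by step in the destination register. Here is a sketch for d = a * b + c, assuming a, b and c are already in float registers 2, 3 and 4 and the result goes to register 1 (the register numbers and the class name MulAdd are arbitrary) :

```haxe
import format.pbj.Data;

class MulAdd {

    public static function build() {
        return [
            OpMov(RFloat(1),RFloat(2)), // f1 = a
            OpMul(RFloat(1),RFloat(3)), // f1 = f1 * b
            OpAdd(RFloat(1),RFloat(4)), // f1 = f1 + c
        ];
    }

    static function main() {
        for( op in build() )
            trace(op);
    }

}
```

If a is not needed afterwards, the mov can be dropped by multiplying and adding directly into register 2, which is the same kind of saving as in the Optimizations section.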

Trigonometry Operations

  • OpAtan2(dst,src) : dst = atan2(dst,src)
  • OpSin(dst,src) : dst = sin(src)
  • OpCos(dst,src) : dst = cos(src)
  • OpTan(dst,src) : dst = tan(src)
  • OpASin(dst,src) : dst = asin(src)
  • OpACos(dst,src) : dst = acos(src)
  • OpATan(dst,src) : dst = atan(src)
  • OpExp(dst,src) : dst = exp(src)
  • OpExp2(dst,src) : dst = exp2(src)
  • OpLog(dst,src) : dst = log(src)
  • OpLog2(dst,src) : dst = log2(src)

Int/Float operations

  • OpFract(dst,src) : dst = src - floor(src)
  • OpFloor(dst,src) : dst = floor(src)
  • OpCeil(dst,src) : dst = ceil(src)
  • OpFloatToInt(dst,src) : dst = int(src) , dst is an integer register, src is a float register
  • OpIntToFloat(dst,src) : dst = float(src) , dst is a float register, src is an integer register

Vector operations

  • OpNormalize(dst,src) : dst = normalize(src)
  • OpLength(dst,src) : dst = length(src)
  • OpDistance(dst,src) : dst = distance(dst,src)
  • OpDotProduct(dst,src) : dst = dst . src
  • OpCrossProduct(dst,src) : dst = dst x src

Logical operations

  • OpLogicalNot(dst,src) : dst = ~src
  • OpLogicalAnd(dst,src) : dst = dst & src
  • OpLogicalOr(dst,src) : dst = dst | src
  • OpLogicalXor(dst,src) : dst = dst ^ src

Other operations

  • OpSampleNearest(dst,src,texture) : dst = sampleNearest(src,texture) takes the nearest point in a texture
  • OpSampleLinear(dst,src,texture) : dst = sampleLinear(src,texture) calculates a bilinear interpolation of a point in a texture
  • OpLoadInt(dst,v) : dst = v (only one value at a time)
  • OpLoadFloat(dst,v) : dst = v (only one value at a time)
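Because a load only sets one value at a time, filling a whole register with a constant takes one OpLoadFloat per channel. A sketch that loads (1.0, 0.5, 0.25, 1.0) into float register 3 (the register number, the values and the class name LoadConstant are arbitrary) :

```haxe
import format.pbj.Data;

class LoadConstant {

    public static function build() {
        return [
            OpLoadFloat(RFloat(3,[R]),1.0),
            OpLoadFloat(RFloat(3,[G]),0.5),
            OpLoadFloat(RFloat(3,[B]),0.25),
            OpLoadFloat(RFloat(3,[A]),1.0),
        ];
    }

    static function main() {
        for( op in build() )
            trace(op);
    }

}
```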

If/Else

You can create if/else blocks by using the following operations :

  • OpIf(src) : if src is true, continue with the following bytecode, else jump to the corresponding else or endif operation
  • OpElse : branch here if the if condition fails
  • OpEndIf : ends the if/else block

Assembler example :

if r0.r
    r1.g = r0.r
else
    r1.g = 1.0
endif
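The same block can be written with Haxe opcodes. The assembler above does not say which kind of register r0 is, so this sketch makes an assumption : the condition is read from integer register 0 and the copied value from float register 0 (the class name IfElseOps is made up for this example) :

```haxe
import format.pbj.Data;

class IfElseOps {

    public static function build() {
        return [
            // assumption : the condition is held in an integer register
            OpIf(RInt(0,[R])),
            OpMov(RFloat(1,[G]),RFloat(0,[R])),
            OpElse,
            OpLoadFloat(RFloat(1,[G]),1.0),
            OpEndIf,
        ];
    }

    static function main() {
        for( op in build() )
            trace(op);
    }

}
```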

Matrix operations

A matrix consists of several registers with consecutive indexes. Instead of using [R,G,B,A] for channels, you have to use M2x2, M3x3 or M4x4 to select the matrix operation, for example :

OpAdd(RFloat(10,[M4x4]),RFloat(14,[M4x4]))

This adds the matrix in registers (14,15,16,17) to the matrix in registers (10,11,12,13) and stores the result in registers (10,11,12,13).

  • OpMatrixMatrixMult(dst,src) : matrix(dst) = matrix(dst) x matrix(src)
  • OpVectorMatrixMult(dst,src) : dst = dst x matrix(src)
  • OpMatrixVectorMult(dst,src) : dst = matrix(dst) x src

Credits

Nicolas Cannasse is responsible for this whole mess.
Thanks to Tinic Uro for releasing C++ code which helped to reverse engineer PBJ format.

version #299, modified 2010-02-09 09:49:23 by ncannasse