It's user error.

Debu.gs


Brainfind a Stranger in the Alps

In which is described an Inferno program that generates Linux/x86 executables, the design of a simple VM, and programs that generate programs (but all of it will be horribly disappointing).

“But that’s all fun stuff, how could it be disappointing?”

Glad you asked. I have itemized this list.

  1. The VM in question is for a language that, although its use is good mental exercise for programming, is not very useful in practice.
  2. The VM in question is for a language that may be the most-implemented language of all time. (At time of writing, I’ve implemented this language five times.)
  3. “Fun” is a highly subjective concept.
  4. AWIB does the heavy lifting.

“What language are you talking about?”

Glad you asked. I have itemized this list.

  1. BF
  2. BrainF***
  3. Urban Müller’s language
  4. oenvashpx
  5. ++++[>++<-]>[<++>-]<[>++++++>+++++++<<-]>++.>++.<-.++++++++.>----.<---.>+++++++.<---.>----------.

“Wait, why are you dancing around the F-word?”

Glad you asked. I have itemized this list.

  1. I have clients rather than a boss and, as you probably know, offending a boss is different from offending a potential client.
  2. I’m already offending with substance and don’t want to press my luck.
  3. Because of one of my favorite movies of all time.

Do you see what happens?

Let’s describe this fabulous computational device. Start with a Harvard Architecture. Add an array of cells for the memory. And we’ll throw in a pointer to the current cell. For the instruction memory, let’s say we have eight instructions. Those instructions:

It’s a pretty simple machine, but it is Turing-complete.

Do you see what happens, Larry?

Okay, we have a very simple language with a very simple instruction set and semantics that are easy enough for any programmer to understand. It’s no Nock, but it’ll do. Let’s take a first pass at implementing it in Limbo.

This is what happens.

A straightforward implementation would use eight instructions, so let’s start there. As a very minor optimization, we’ll add an EXIT instruction to stop the program. We’ll need to keep track of some state (the cells, the instruction pointer, and the memory pointer). Also, we need to…nope, that’s it. It’s a really simple language!

Here’s some Limbo for doing just that:

execute(code: array of int, arena: array of byte)
{
    pc := 0;
    p := 0;
    buf := array[1] of byte;
    for(;;) {
        case code[pc] {
        DEC => arena[p]--;
        INC => arena[p]++;
        DECP =>
            p--;
            if(p < 0)
                p = len arena - 1;
        INCP =>
            p = (p + 1) % len arena;
        READ =>
            sys->read(sys->fildes(0), buf, 1);
            arena[p] = buf[0];
        WRITE =>
            buf[0] = arena[p];
            sys->write(sys->fildes(1), buf, 1);
        JNZ =>
            if(arena[p] != byte 0)
                pc = code[pc + 1];
            else
                pc++;
        JZ =>
            if(arena[p] == byte 0)
                pc = code[pc + 1];
            else
                pc++;
        EXIT => return;
        }
        pc++;
    }
}

The little execute() funcion takes an array of integers representing our instructions, and an array of bytes representing the program’s memory. pc is the instruction pointer (“program counter”), p is the memory pointer, if you couldn’t guess by the names. Then a big loop with a switch in it, and one case for each of our instructions, plus an EXIT instruction. Very nearly as simple as it gets; it’s almost embarrassing to do any explanation. One item of note is that we keep the jump addresses inline with the instruction stream. (It does have the single dispatch problem, where we frustrate any attempt at branch prediction, but I’m certain we can deal with a somewhat less performant implementation of +++++++++++[->++++++<]>.++++. for now.) How do we get there from ++++++++++[->++++++++++>+++++++++++<<]>++++.---.>++++.<., though?

This is what happens, Larry.

Parsing a program is, in general, a non-trivial task. Know what’s nice about a language like ++++++++++[>++++++++++>>>+<<<<-]>[-<+>>+>+<<]>>>[<+<->>-]<[<<+>>-]<<<--.>++++.<-.++++++++.>----.[>>+>+<<<-]<---.>>+.<<---.++.>>>.+++++.>+.<-.<<<.-.>>++.? There’s nothing to parse. No tokenizer, no lexer, no AST. (Well, no real tokenizer or lexer.) A simple loop over the program that emits the appropriate instruction will do:

compile(p: string): array of int
{
    marks: list of int = nil;
    code := array[len p * 2 + 1] of { * => EXIT };
    pc := 0;
    for(i := 0; i < len p; i++) {
        case p[i] {
        '-' => code[pc++] = DEC;
        '+' => code[pc++] = INC;
        '<' => code[pc++] = DECP;
        '>' => code[pc++] = INCP;
        ',' => code[pc++] = READ;
        '.' => code[pc++] = WRITE;
        '[' =>
            code[pc++] = JZ;
            marks = pc++ :: marks;
        ']' =>
            if(marks == nil) {
                sys->fprint(sys->fildes(2), "bf: unmatched ']' at character %d.", pc);
                raise "fail:errors";
            }
            c := hd marks;
            marks = tl marks;
            code[pc++] = JNZ;
            code[c] = pc;
            code[pc++] = c;
        }
    }
    if(marks != nil) {
        sys->fprint(sys->fildes(2), "bf: unmatched '['.");
        raise "fail:errors";
    }
    return code[:pc+1];
}

Takes a program as its argument, returns an array of instructions. As noted before, we store the addresses for the jumps inline. We also keep a stack of addresses (marks, borrowed from Forth) When we hit a [ (JZ), we leave the address field blank and store the address’s address on the stack to be resolved when we hit the matching ] (JNZ). (If you’re reading carefully, you’ll spot a couple of ridiculous bits in the branching, but we’ll fix them shortly.)

So, there’s a VM and a compiler for it, and we’ve hit exactly 70 lines of code, with no black art. It’s a trivial runtime for a trivial language, though. (But really, we can read standard input and write standard output and compute in the interim, so you shouldn’t need anything else, right? Right?) Now what happens?

This is what happens when you find a stranger in the Alps.

Let’s talk a little about AWIB. It’s quite an achievement: it’s a polyglot program in Tcl, C, bash, and BF. On top of that, it’s a compiler for BF. On top of that, it emits optimized code. On top of that, it has five backends for code generation: Linux/386 machine code (a statically linked ELF with no symbols), Tcl, Ruby, Go, and C.

So, given that it’s written in the language that it executes (among other languages), you can use it to compile itself to machine code. Pretty cool, right? But it doesn’t have a frontend that works under Inferno, unfortunately. (Sort of; there’s an implementation of Tcl, but it doesn’t run AWIB as-is.) But if we use our little VM to run AWIB, then we can use it to compile itself. That’s a bit less than satisfying, though.

Cypher sent us to hell, but we’re going even deeper!

We’ve got a small set of instructions, and the code to execute them in Limbo is trivial. Let’s revise the code a bit, and then just go crazy with it. First, we’ll replace INC, DEC, INCP, and DECP with two other instructions: ADD and ADDP, and we’ll have them take integer arguments the same way that JZ and JNZ do, then we’ll adjust compile() and execute() accordingly:

compile(p: string): array of int
{
    marks: list of int = nil;
    code := array[len p * 2 + 1] of { * => EXIT };
    pc := 0;
    n: int;
    for(i := 0; i < len p; i++) {
        case p[i] {
        '-' or '+' =>
            n = 0;
            while(i < len p) {
                if(p[i] == '-')
                    n--;
                else if(p[i] == '+')
                    n++;
                else {
                    i--;
                    break;
                }
                i++;
            }
            if(n) {
                code[pc++] = ADD;
                code[pc++] = n;
            }
        '<' or '>' =>
            n = 0;
            while(i < len p) {
                if(p[i] == '<')
                    n--;
                else if(p[i] == '>')
                    n++;
                else {
                    i--;
                    break;
                }
                i++;
            }
            if(n) {
                code[pc++] = ADDP;
                code[pc++] = n;
            }
        ',' => code[pc++] = READ;
        '.' => code[pc++] = WRITE;
        '[' =>
            code[pc++] = JZ;
            marks = pc++ :: marks;
        ']' =>
            if(marks == nil) {
                sys->fprint(sys->fildes(2), "bf: unmatched ']' at character %d.", pc);
                raise "fail:errors";
            }
            c := hd marks;
            marks = tl marks;
            code[pc++] = JNZ;
            code[c] = pc;
            code[pc++] = c;
        }
    }
    if(marks != nil) {
        sys->fprint(sys->fildes(2), "bf: unmatched '['.");
        raise "fail:errors";
    }
    return code[:pc + 1];
}

So, we do one character of “look ahead”, and generate somewhat more compact code (depeinding; if you want to complicate things, you could re-add the four old instructions and have it emit, e.g., INCP when ADDP 1 would be emitted by this code).

We’ve also fixed up JZ and JNZ somewhat so that we don’t need to do math on the address we load into pc when we jump.

execute(code: array of int, arena: array of byte)
{
    pc := 0;
    p := 0;
    buf := array[1] of byte;
    stopreading := 0;
    for(;;) {
        case code[pc] {
        ADD =>
            arena[p] += byte code[++pc];
        ADDP =>
            p += code[++pc];
            while(p < 0)
                p += len arena;
            p = p % len arena;
        READ =>
            arena[p] = byte -1;
            if(!stopreading) {
                n := sys->read(sys->fildes(0), buf, 1);
                if(n < 1)
                    stopreading = 1;
                else
                    arena[p] = buf[0];
            }
        WRITE =>
            buf[0] = arena[p];
            sys->write(sys->fildes(1), buf, 1);
        JNZ =>
            if(arena[p] != byte 0)
                pc = code[pc + 1];
            else
                pc++;
        JZ =>
            if(arena[p] == byte 0)
                pc = code[pc + 1];
            else
                pc++;
        EXIT => return;
        }
        pc++;
    }
}

Yawn. Ready for the fun part? Instead of executing the code, why not emit some Limbo? It’s not hard, and then we have a BF to Limbo compiler.

bf2limbo(code: array of int): string
{
    indent := 1;
    s := "implement BfProg;\n" +
        "include \"sys.m\"; sys: Sys;\n" +
        "include \"draw.m\";\n" +
        "BfProg: module {\n" +
        "\tinit: fn(nil: ref Draw->Context, nil: list of string);\n" +
        "};\n" +
        "init(nil: ref Draw->Context, nil: list of string)\n{\n" +
        "\tsys = load Sys Sys->PATH;\n" +
        "\tp := 0;\n" +
        "\tstopreading := 0;\n" +
        "\tn := 0;\n" +
        "\tbuf := array[1] of byte;\n" +
        "\tarena := array[" + string ARENASZ + "] of { * => byte 0 };\n" +
        "\n";
    for(i := 0; i < len code && code[i] != EXIT; i++) {
        case code[i] {
        ADD => s += indents(indent) + "arena[p] += byte " + string code[++i] + ";\n";
        ADDP =>
            s += indents(indent) + "p += " + string code[++i] + ";\n" +
                indents(indent) + "while(p < 0)\n" +
                indents(indent + 1) + "p += len arena;\n" +
                indents(indent) + "p = p % len arena;\n";
        READ =>
            s += indents(indent) + "arena[p] = byte -1;\n" +
                indents(indent) + "if(!stopreading) {\n" +
                indents(indent + 1) + "n = sys->read(sys->fildes(0), buf, 1);\n" +
                indents(indent + 1) + "if(n < 1)\n" +
                indents(indent + 2) + "stopreading = 1;\n" +
                indents(indent + 1) + "else\n" +
                indents(indent + 2) + "arena[p] = buf[0];\n" +
                indents(indent) + "}\n";
        WRITE =>
            s += indents(indent) + "buf[0] = arena[p];\n" +
                indents(indent) + "sys->write(sys->fildes(1), buf, 1);\n";
        JNZ =>
            indent--;
            s += indents(indent) + "}\n";
            i++;
        JZ =>
            s += indents(indent) + "while(arena[p] != byte 0) {\n";
            indent++;
            i++;
        }
    }
    return s + "}\n";
}

Sure, that piece of code is a little tedious, but now we have a program that generates Limbo, and thus all the pieces to build a Limbo program that generates x86 code! Let’s give it a shot:

% bf -c awib-0.3.b > awib.b
% limbo -gw awib.b
% echo '@lang_tcl
+++++++++++[->++++++<]>.++++.' | awib 
#!/usr/bin/tclsh
fconfigure stdout -encoding binary
fconfigure stdin -encoding binary

proc run {in out} {
    set p 0
    set mem {0}
    for {set i 0} {$i<0xffff} {incr i} {lappend mem 0}
    lset mem $p [expr ([lindex $mem $p] +11) % 256]
    while {[lindex $mem $p] != 0} {
        lset mem $p [expr ([lindex $mem $p] -1) % 256]
        set p [expr $p+1]
        lset mem $p [expr ([lindex $mem $p] +6) % 256]
        set p [expr $p-1]
        }
    set p [expr $p+1]
    puts -nonewline $out [format %c [lindex $mem $p]]
    lset mem $p [expr ([lindex $mem $p] +4) % 256]
    puts -nonewline $out [format %c [lindex $mem $p]]
    return 0
}
run stdin stdout

Aha! And if you replace lang_tcl with 386_linux, we get machine code. Have a look:

% echo '@386_linux
+++++++++++[->++++++>+<<]>.++++.>-.' | awib > program

And then copying that to an x86 or x86-64 Linux machine (by some means) and chmod’ing it 755 (or 700 if you are embarrassed), you can verify that we have indeed written a Limbo program that we used to create another Limbo program from a BF program and then used that program to produce a valid executable:

$ ./program
BF

Oddly enough, I couldn’t convince GNU’s objdump to disassemble it, despite trying some options that seemed like they ought to work and then a few dozen other permutations of objdump’s options. ndisasm had no trouble and only required a glance at the man page, so if you’d like to disassemble it and you have nasm, ndisasm -e 84 -a ought to work.

“But why?”

I have a soft spot for esoteric languages and other forms of computational perversion. I’d been doing a little playing with Rosetta Code, and had been doing a few solutions in Limbo when I noticed the "Execute BrainF***" task, and couldn’t resist. I’ve done a few implementations of that language (including one in Pez) and I usually use AWIB to test them out.

“Where’s the code?”

The bf repo on my BitBucket account.

Tags: code inferno


<< Previous: "Dropping Feedburner"
Next: "Making Music with Computers: Two Unconventional Approaches" >>