I thought this series of posts was finished, but I caught myself thinking about it repeatedly :-) so here is what my “preposterous little brain” came up with in these last few days.

Revisions to the last two posts

I studied LLVM a bit, and decided that, if I were to actually implement the language (which I'm now calling Dream), LLVM would not be a good match. The advantages it offers on top of using the gcc framework are almost totally irrelevant for this project.

In fact, ideally, Dream should be written in itself, and have its own machine code generators; but it should also have a C generator, which pumps C into gcc to get machine code, for two reasons: first, shipping the generated C code for the compiler is a great way to bootstrap from source; second, it would help on platforms for which a machine code generator doesn't exist yet. This approach is not my original invention - I'm copying it wholesale from Pypy. (Although Pypy does have an LLVM backend as well.)

Also, upon later reflection, I decided the C-like syntax for message passing (the last code example in the previous post) is Evil™, and detracts from the stated goal of “regular syntax”; it's out.

The binary format

I mentioned Dream having its own binary format. Why, and what does it look like?

It's important to bear in mind that this is a “pure” object-oriented environment; the binary formats we have now (ELF and EXE) are designed around the needs of C, with long-running processes, lots of functions identified by name only, and variables which are blobs of binary data. This is simply not appropriate for the kind of information we want to store.

A Dream binary file has to be essentially a persisted object. Technically, it will actually be a tree of objects: the “main” object represented by the file, plus the other objects needed to reconstruct it (its attributes and message handlers, and their attributes and message handlers, and so on). Notably, this includes what's probably the most “interesting” part of the file: the machine code for any code objects contained in this tree.

Now, each Dream object consists of five pieces of information:

A link to another object may take a few different forms:

A somewhat important attribute of this format is that the machine code is stored in a separate section from the object data; code objects will have their source in the “byte blob” slot. The machine-dependent section of the file is treated as a cache; you may move the file to another machine of a different architecture, or different settings, and the source will be transparently recompiled. You should also be able to chop off the code section entirely, forcing it to be regenerated - useful for distribution.
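The "machine code as a cache" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `DreamFile`, `CodeCache`, `machine_code_for` and the `compile_source` callback are hypothetical names, and the real object layout is not modelled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CodeCache:
    """Machine-dependent section: compiled code for one target only."""
    target: str  # hypothetical architecture tag, e.g. "x86_64"
    machine_code: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class DreamFile:
    """Hypothetical in-memory view of a Dream file."""
    sources: Dict[str, str]            # code object name -> source text
    cache: Optional[CodeCache] = None  # None means the section was chopped off


def machine_code_for(f: DreamFile, name: str, target: str,
                     compile_source: Callable[[str, str], bytes]) -> bytes:
    """Return machine code for `name`, recompiling when the cache is unusable."""
    if f.cache is None or f.cache.target != target:
        # Missing section, or file moved to another architecture: start over.
        f.cache = CodeCache(target=target)
    if name not in f.cache.machine_code:
        f.cache.machine_code[name] = compile_source(f.sources[name], target)
    return f.cache.machine_code[name]
```

Since the source is always kept next to the object data, dropping the cache never loses information; it only costs a recompilation on first use.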

Binary and source

With this layout, a Dream file is not only a viable binary format, but also quite decent as the source form for development. There would be little sense in having the source for the code objects scattered in little text files, only to be collected into a Dream file by a tool. Rather, it's best to design tools to edit Dream files, tweaking the object links in it and editing the source for code objects; there could be a GUI interface, an emacs library, a set of command line utilities, you name it.

One consideration raised by this is that the format would then need to be amenable to revision control. There are two obvious ways to do that: either make the format text-based, or at least line-oriented (possibly XML - yuck); or provide a C library that manipulates the file format, and use it to implement plugins for extensible revision control systems (like bzr) that handle Dream files smartly. (Which in this context would mean a lot of things - not only tracking object links and code source, but also ignoring the machine-dependent section entirely.)

Dreamshells

This runtime model of “pure objects”, of a bunch of persistent objects to which you send messages, does not map well onto the OSes of today. At some point, there must be something with Unixish semantics - an executable which is loaded into a process, and which can be accessed from the shell or desktop environment.

This is the role of the Dreamshell: essentially, an object which has the ability to be saved as a “native” executable.

The Dreamshell interface would look more or less like this:

(string) config_name
return a filename to use when looking for a configuration file. On Unix, for example, if this message returns "bleh", we'd look for /etc/dreamconf/bleh, and ~/.dreamconf/bleh (if both exist, load both, in this order).
(sequence) config_data
return a sequence of config_data objects, each having a short name, a long name, help text, and some data on what to do with it if found (exact semantics to be defined; look at good command line libraries, such as Python's optparse, for inspiration).
(string) copyright
copyright info to display if requested on the command line.
(string) version
version info to display if requested on the command line.
(string) help
information to be displayed in --help output, after the list of arguments.
(integer) run

actually do whatever this Dreamshell is supposed to do (not called if the command line had --help, --version, or an error).

Gets arguments:

(string_sequence) raw_command_line
the unprocessed command line
(string_sequence) args
what was left of the command line after processing configuration
(string_mapping) environment
the environment
(sequence) files
the open file descriptors (wrapped in file objects)
(configuration) options
the data found on the config files and command line
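To make the interface a bit more concrete, here is a rough Python sketch of how a native launcher might drive a Dreamshell. The message names come from the list above; everything else (`config_paths`, `launch`, the way `args` is computed, the empty `options`) is a placeholder assumption, since the exact semantics are still to be defined.

```python
import os
import posixpath
import sys
from typing import Any, List, Mapping, Protocol, Sequence


class Dreamshell(Protocol):
    """The messages listed above, written as a Python protocol."""
    def config_name(self) -> str: ...
    def config_data(self) -> Sequence[Any]: ...
    def copyright(self) -> str: ...
    def version(self) -> str: ...
    def help(self) -> str: ...
    def run(self, raw_command_line: Sequence[str], args: Sequence[str],
            environment: Mapping[str, str], files: Sequence[Any],
            options: Mapping[str, Any]) -> int: ...


def config_paths(name: str, home: str) -> List[str]:
    """Unix config files in load order: system-wide first, then per-user."""
    return [posixpath.join("/etc/dreamconf", name),
            posixpath.join(home, ".dreamconf", name)]


def launch(shell: Dreamshell, argv: Sequence[str]) -> int:
    """Answer --help/--version directly; otherwise send `run`."""
    if "--help" in argv:
        print(shell.help())
        return 0
    if "--version" in argv:
        print(shell.version())
        return 0
    # Placeholder: real option parsing would be driven by config_data.
    args = [a for a in argv if not a.startswith("--")]
    files = [sys.stdin, sys.stdout, sys.stderr]
    options: dict = {}  # assumption: config files and options not parsed here
    return shell.run(list(argv), args, dict(os.environ), files, options)
```

With `config_name` returning "bleh" and a home directory of /home/me, `config_paths` yields /etc/dreamconf/bleh followed by /home/me/.dreamconf/bleh, matching the load order described above.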

An introspection call somewhere can dump a Dreamshell object into a native executable.

The system distribution would ship with a few useful Dreamshells; one to send a message by name to the object represented by a Dream file, one to listen on some network port and do distributed object brokering (probably using VIP), one to display a bunch of View (UI) objects on an X server (acting as a bridge between X and Dream until killed).

A hypothetical Windows distribution would rely on a different kind of object, which would bridge Dream with DCOM, rather than generate executable files.

Daydream

With all this said, it feels to me that it would be just too painful to write C code for code objects that need low-level logic or optimisation. Not only painful for the programmer, but also for the maintainers of the language itself, as the runtime would need to be able to compile C code.

Rather, I would again go the Pypy route, and implement a limited dialect of Dream, which would translate almost directly to machine code. I call it “Daydream”.

Here, a variable is not an object reference with an interface; it is a primitive type (integer, float, boolean, character, or a vector of one of these, or “object”), plus a (byte) size and a (vector) length, plus a reference to an object (usually “self”), plus an offset into the byte blob of that object, or just a stack address (for locals). If you need anything that can't be expressed with these primitives, you shouldn't be using Daydream ;-)

Note

The size property is the byte size of the individual elements, not of the total; the length is 1 if the variable is not a vector. So a single integer could be (integer, 32, 1), while a vector of 5 longs could be (integer, 64, 5). The sizes are explicit, so that the code is sanely portable between hardware platforms.
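As a sanity check on the (type, size, length) triple, here is a small Python model of it. `DaydreamVar` and `total_size` are hypothetical names; the sketch keeps `size` in whatever unit the tuples above use and does not model the object reference, offset, or stack address.

```python
from dataclasses import dataclass

# Primitive kinds named in the text.
PRIMITIVES = {"integer", "float", "boolean", "character", "object"}


@dataclass(frozen=True)
class DaydreamVar:
    kind: str        # one of PRIMITIVES
    size: int        # size of ONE element, not of the whole vector
    length: int = 1  # 1 when the variable is not a vector

    def __post_init__(self):
        if self.kind not in PRIMITIVES:
            raise ValueError(f"unknown primitive type: {self.kind}")
        if self.size <= 0 or self.length < 1:
            raise ValueError("size must be positive and length at least 1")

    @property
    def total_size(self) -> int:
        """Space taken by the whole variable: element size times length."""
        return self.size * self.length


single = DaydreamVar("integer", 32, 1)
longs = DaydreamVar("integer", 64, 5)
```

Keeping the per-element size separate from the length means the total is always derived, so the two can never disagree.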

An “object” variable only has a few operations. You can perform some basic introspection (check types, check for attributes or message bindings), you can read or set attributes, and you can send messages. Sending a message looks more or less like:

send (interface) my_object "message_name" argument argument.

(the usual rules about syntactic noise apply.)

Notably, the Daydream runtime would have a way to open a shared library, either by absolute filename or by using the system search path, and wrap it as an “object”; you then could introspect it to check for the presence of symbols (attributes), get attributes and cast them to a primitive type, and send messages (call functions), casting the result. The object would be read-only, so trying to modify it would fail.
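Python's ctypes module already works in roughly this way, so it makes a handy model. The wrapper below is a sketch, not the Daydream design: `SharedLibraryObject`, `has_symbol`, `get_attribute` and `send` are hypothetical names, and opening the current process with `ctypes.CDLL(None)` is assumed to be available (it is on POSIX systems, not on Windows).

```python
import ctypes


class SharedLibraryObject:
    """Read-only wrapper presenting a shared library as an object."""

    def __init__(self, path=None):
        # path=None opens the running process itself (POSIX only); a
        # filename loads that library through the system search rules.
        object.__setattr__(self, "_lib", ctypes.CDLL(path))

    def has_symbol(self, name: str) -> bool:
        """Introspection: is there a symbol with this name?"""
        return hasattr(self._lib, name)

    def get_attribute(self, name: str, ctype):
        """Read an exported variable, cast to the given primitive type."""
        return ctype.in_dll(self._lib, name).value

    def send(self, name: str, restype, *args):
        """Call an exported function, casting the result to `restype`."""
        func = getattr(self._lib, name)
        func.restype = restype
        return func(*args)

    def __setattr__(self, name, value):
        # The object is read-only, so any attempt to modify it fails.
        raise AttributeError("shared library objects are read-only")


libc = SharedLibraryObject()
result = libc.send("abs", ctypes.c_int, -5)
```

The caller has to supply the types, because a symbol table only records names; that is the same "cast the result" step described above.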

Maybe: smarter low-level data

Instead of an opaque byte blob, the “low-level” blob of an object could go all the way down to C, and be in the form of a symbol table, with name, primitive type, and value. I very much prefer not to do this, because it would encourage more complex logic to be done in Daydream, which is not the intent; but it would tie in better with the open-ended type system (as in, you could use two code objects that were written for different structures). The nag is, I'm not sure this is a good thing :-) I'd rather keep those “optimised” objects very simple, which means the Daydream code has to know exactly what low-level data it's messing with.

Syntax notes

A few ideas about syntax.

The syntax design revolves around three goals, in order:

One of the corollaries that sprout naturally from these goals is that syntax should try to model colloquial communication between humans, and not mathematical language. This is the rationale, for example, behind function calls not having the familiar obj.foo(arg1, arg2) syntax, which is borrowed from algebra, but feels consistently alien and weird to non-programmers who are looking at source code or learning to program. The way you think about this is something more similar to foo obj with arg1 arg2, which is valid Dream syntax if you only append a colon to with.

Another decision that follows from this is the use of the period (.) at the end of statements, rather than the semicolon inherited from C and Pascal. In colloquial communication, a message with a lot of semicolons is badly written; and as illustrated by this paragraph, statements separated by a semicolon are expected to be closely related. We have been using the period to end statements since we learned to write; let's keep it.

Returns

A code object returns the value of its last statement. This is mostly for the sake of very short code fragments; for longer code, an explicit return is recommended.

To return explicitly, begin a statement with an equal sign (therefore “assigning” the following value to the code output).

On the other hand, a message handler may be declared “asynchronous”, in which case it does not return. The compiler should probably issue an error, or at least a warning, if an explicit return is used in code associated with such a message.

(Asynchronous messages may be tagged as such using the slot for the return value interface.)

Typing

Variables and arguments are typed (by interface). This is done by prefixing the interface name, in parentheses, to the variable definition. The parentheses are for the sake of argument lists - since the comma is optional, you need to be able to tell easily what is an interface and what is the name being defined. (But it then allows me to support sequence unpacking, if I decide so.)

You can omit the interface, defaulting to (object).

You can redefine a variable; you might want to do that, for example, after determining that the value is, after all, of a certain type.

An expression of the form (expression :: interface) (parentheses not optional) is a cast. You shouldn't use it often, but sometimes you might want to send a message to an object using a different interface that you know it has. Pronounce it as “expression as interface” (e.g., “first_name as sequence”).

Reserved symbols

This is the complete list of reserved symbols up to now: {}()"=:. are meaningful, ,; are syntactic noise, # is a comment (to end of line, like in shell or Python).

The curly braces delimit a code block. Parentheses have a few uses - grouping expressions, declaring types, and casting (although casting is only a special case of grouping). Double quotes delimit a string; the string syntax is roughly similar to Python's unicode literals, sans the leading u, except that it can go multi-line.

The equal sign is used for assignment (and explicit return, which is a special case of assignment); it is special in that it's only reserved when by itself (delimited by whitespace), so ==, +=, etc. are not reserved. (This sounds dirty. Maybe use := instead?)

The colon is probably the most overloaded one. It's heavily used in the syntactic noise system - if a token after the second ends in colon, it's noise; if the first or second ends in colon, it's the message name. Doubled, it's used for casts. And as the first token in a code block, it starts an argument list (see next section).
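The positional colon rule is easy to get wrong, so here is a toy Python classifier for the tokens of one statement. `classify` is a hypothetical helper, and it leans on one assumption the text does not state: when neither of the first two tokens ends in a colon, the first token is taken as the message name (which matches the examples in this post). Casts and argument lists are not handled.

```python
def classify(tokens):
    """Split one statement's tokens into (message, noise, rest)."""
    message, noise, rest = None, [], []
    for index, token in enumerate(tokens):
        # A "keyword" is a word ending in a single colon (not "::" or ":").
        keyword = (len(token) > 1 and token.endswith(":")
                   and not token.endswith("::"))
        if keyword and index < 2 and message is None:
            message = token[:-1]   # first or second token: the message name
        elif keyword and index >= 2:
            noise.append(token)    # after the second token: syntactic noise
        else:
            rest.append(token)
    if message is None and rest:
        # ASSUMPTION: with no leading keyword, the first token is the message.
        message = rest.pop(0)
    return message, noise, rest


print(classify("foo obj with: arg1 arg2".split()))
print(classify("new_obj add_base: base".split()))
```

In the first statement `with:` is the third token, so it is dropped as noise; in the second, `add_base:` is the second token, so it names the message and `new_obj` is left over as the receiver.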

The period, finally, separates statements. Like the Pascal semicolon, it's a separator and not a terminator - meaning, it's optional on the last statement of a block. The period has to be followed by whitespace or end of input, to disambiguate from a decimal point or ellipsis (or other uses of the period character we may introduce later).

Syntactic noise is ignored, with one important caveat: a noise character only counts as a separator when it is followed by whitespace. The expression 1, 230 is two numbers (possibly two arguments to a message), while the expression 1,230 is a single number.
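A minimal Python sketch of that whitespace rule, assuming a naive whitespace-based tokenizer (`strip_noise` is a hypothetical name, and string literals containing spaces are deliberately not handled):

```python
NOISE = ",;"


def strip_noise(source: str):
    """Split on whitespace and drop noise characters that end a chunk."""
    tokens = []
    for chunk in source.split():
        # Noise is only a separator at the end of a chunk, i.e. right
        # before whitespace or end of input; elsewhere it stays in the token.
        token = chunk.rstrip(NOISE)
        if token:
            tokens.append(token)
    return tokens


print(strip_noise("1, 230"))   # two tokens
print(strip_noise("1,230"))    # one token
```

Because only trailing noise is removed, the comma inside 1,230 survives and the number stays in one piece.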

Arguments

Argument lists are defined by having a colon as the first token in a code block:

(code) add = { : (number) a (number) b : a + b }.

I'm still uncertain about keyword arguments; coming from Python, I certainly know their value, but also coming from Python, I recognise that they usually appear in overcomplex code that could be done better using more object-oriented techniques - or, for some uses (like the datetime constructor), multimethods.

You set up a default value, if you want one, inside the parenthesis:

(code) add = { : (number 2) a (number 2) b : a + b }.

For variable-length arguments, you use the ellipsis:

(code) add = { : (number) a ... :
  iterate_over ... with: { : (number) n : a += n }.
  = a
}.
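For readers coming from Python, the three argument-list forms above correspond roughly to the following definitions. This is only an analogy: the function names are made up for the comparison, and Python does not check the (number) interface.

```python
def add(a, b):
    """Plain argument list: { : (number) a (number) b : a + b }."""
    return a + b


def add_with_defaults(a=2, b=2):
    """Default values: { : (number 2) a (number 2) b : a + b }."""
    return a + b


def add_all(a, *rest):
    """Variable-length arguments: the ellipsis becomes *rest."""
    for n in rest:  # plays the role of iterate_over ... with: { ... }
        a += n
    return a        # plays the role of the explicit "= a"
```

The main difference is that the Dream ellipsis is anonymous, while Python makes you name the collected arguments.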

Here's a possible implementation of the new method:

: ... :
new_obj = create_empty object.
iterate_over ... with: { : base : new_obj add_base: base }.
# only initialise after adding all bases -
# because one base may care what other
# bases the object has
iterate_over ... with: { : base : base, initialize: new_obj }.
= new_obj

Scope

There is no built-in syntax for “give me attribute bar of object foo”, because we don't want to encourage code to mess with attributes of objects other than self (sorry, Python). Of course, you can get to these attributes, using introspection (something like: (foo :: object) get_attribute: "bar").

Temporary (local) variables in Dream (not Daydream) don't live on the stack, but in a global soup on the heap, and are garbage collected. This is for the sake of nested scopes; consider:

(integer) number = 1.
(code) silly_code = { number + 1 }.
number += 1.
= silly_code

Once this code has returned, number can't be “alive” in its stack frame anymore. But the code object returned at the end has a reference to it, so it must live somewhere. (Those who have ever worked on a Lisp implementation, or even Python, will be familiar with closure theory, of which I have just scratched the surface.)
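Python has the same problem and solves it in a similar spirit, which makes for a convenient illustration. The translation below is a sketch (`outer` is a made-up wrapper standing in for the enclosing code object); the captured variable is kept in a heap-allocated cell that the inner function references.

```python
def outer():
    number = 1

    def silly_code():
        # Reads `number` through a shared cell, not a copy made at definition.
        return number + 1

    number += 1
    return silly_code


silly = outer()
# outer() has returned, yet the variable is still reachable from the closure.
print(silly())
print(silly.__closure__[0].cell_contents)
```

The closure sees the value after the increment, which shows that it shares the variable itself rather than a snapshot taken when the inner code was defined.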

For consideration: what happens to number if I persist silly_code?

So here's how a name is looked up to resolve to a variable:

  1. reserved names. There are only a few of these: null, unknown, true, false, self, context.
  2. local variables from the current code object (including arguments).
  3. local variables from enclosing code objects (nested scopes).
  4. attributes of self.
  5. interface names registered in the interface manager; an interface doesn't have to be registered, but registering it makes it accessible in this fashion, therefore much more usable.
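The lookup order can be written down as a small Python function. This is a sketch with hypothetical names (`resolve`, `scopes`, `self_attributes`, `interfaces`), using plain dictionaries where the real runtime would have objects.

```python
RESERVED = {"null", "unknown", "true", "false", "self", "context"}


def resolve(name, scopes, self_attributes, interfaces):
    """Resolve `name`; `scopes` lists local dicts, innermost first."""
    if name in RESERVED:                       # 1. reserved names
        return ("reserved", name)
    for depth, scope in enumerate(scopes):     # 2. locals, 3. enclosing scopes
        if name in scope:
            kind = "local" if depth == 0 else "enclosing"
            return (kind, scope[name])
    if name in self_attributes:                # 4. attributes of self
        return ("attribute", self_attributes[name])
    if name in interfaces:                     # 5. registered interfaces
        return ("interface", interfaces[name])
    raise NameError(name)


scopes = [{"n": 1}, {"n": 99, "number": 2}]
attrs = {"number": 7, "title": "x"}
ifaces = {"sequence": "<sequence interface>"}
print(resolve("n", scopes, attrs, ifaces))
print(resolve("number", scopes, attrs, ifaces))
```

One consequence of this order is that a local variable silently shadows an attribute of self with the same name, as `number` does in the example.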