I thought this series of posts was finished, but I caught myself
thinking about it repeatedly :-) so here is what my “preposterous
little brain” came up with in these last few days.
Revisions to the last two posts
I studied LLVM a bit, and decided that, if I were to actually implement
the language (which I'm now calling Dream), LLVM is not a good match. The
advantages it offers on top of using the gcc framework are almost
totally irrelevant for this project.
In fact, ideally, Dream should be written in itself, and have its own
machine code generators; but it should also have a C generator, which
pumps C into gcc to get machine code, for two reasons: first, shipping
the generated C code for the compiler is a great way to bootstrap
from source; second, it would help on platforms for which a machine
code generator doesn't exist yet. This approach is not my original
invention - I'm copying it wholesale from Pypy. (Although Pypy does
have an LLVM backend as well.)
Also, upon later reflection, I decided the C-like syntax for message
passing (the last code example in the previous post) is Evil™, and
detracts from the stated goal of “regular syntax”; it's out.
The binary format
I mentioned Dream having its own binary format. Why, and what does it
look like?
It's important to bear in mind, this is a “pure” object-oriented
environment; the binary formats we have now (ELF and EXE) are designed
around the needs of C, with long-running processes, lots of functions
identified by name only, and variables which are blobs of binary
data. This is simply not appropriate for the kind of information we
want to store.
A Dream binary file has to be essentially a persisted object.
Technically, it will actually be a tree of objects; the “main” object
represented by the file, plus the other objects needed to reconstruct
it (its attributes and message handlers, and their attributes and
message handlers, and so on). Notably, this includes what's probably the
most “interesting” part of the file: the machine code for any code
objects contained in this tree.
Now, each Dream object consists of five pieces of information:
- links to bases (what Self calls the “parent” link)
- an opaque blob of bytes, with its size information (usually 0)
- a symbol table mapping (interface, name) pairs to other objects -
the attributes
- the message handler table, mapping (interface, message name) pairs to
other objects (usually code); this may include additional metadata
(such as the interface of the return value, if any)
- an index into the machine-dependent section of the file (normally
unused, except for code objects)
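To make the list above concrete, here is a rough Python sketch of one object record. It is only an illustration: the class and field names (DreamObject, Handler, code_index and so on) are made up for this sketch, and links are simplified to plain integer indexes into the same file.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Both tables are keyed by (interface, name) pairs.
Key = Tuple[str, str]


@dataclass
class Handler:
    """One entry of the message handler table (hypothetical layout)."""
    target: int                     # link to the handling object, usually code
    returns: Optional[str] = None   # interface of the return value, if any


@dataclass
class DreamObject:
    """The five pieces of information that make up an object."""
    bases: List[int] = field(default_factory=list)              # links to bases
    blob: bytes = b""                                           # opaque bytes, usually empty
    attributes: Dict[Key, int] = field(default_factory=dict)    # symbol table
    handlers: Dict[Key, Handler] = field(default_factory=dict)  # message handlers
    code_index: Optional[int] = None                            # machine-dependent section


# A code object keeps its source in the blob and points into the code section.
add_code = DreamObject(blob=b"{ : (number) a (number) b : a + b }", code_index=0)

# A plain object that answers ("number", "add") with object #1 of the file.
number_like = DreamObject(
    handlers={("number", "add"): Handler(target=1, returns="number")}
)
```

Most objects would leave the blob empty and code_index unset, which is why both default to "nothing" here.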
A link to another object may take a few different forms:
- an index to another object stored in the same file.
- a “magic number” pointing to a built-in “shortcutting” method
implementation - for example, the run message of code objects
will make a call into the machine code version, which will have been
stored elsewhere in the file (so that it can be loaded into a “code”
segment, on hardware that uses segmentation), while arithmetic
operations on a number object will resolve to a handful of machine code
instructions.
- a reference to an object in a different file. This is the delicate
part - there are issues to think through carefully: search paths,
keeping these files useful when you move them between machines (which
possibly have different directory layouts), and, most important of
all, ownership: if an object N is referenced as an attribute of both
A and B, N itself is not persisted in its own file, and you persist
both A and B to their own files, which one gets N? The answer to this
will probably have to do with weak references.
- a name (indirect reference) - mostly used for aliasing attributes,
and more usually, for referring to registered interfaces.
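The four link forms map naturally onto a tagged union. The sketch below is one way to picture it in Python; every class and field name is hypothetical, and the ExternalLink variant deliberately says nothing about search paths or about who owns a shared object, since those questions are still open.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LocalLink:
    index: int        # another object stored in the same file


@dataclass(frozen=True)
class BuiltinLink:
    magic: int        # "magic number" of a built-in shortcut implementation


@dataclass(frozen=True)
class ExternalLink:
    path: str         # file holding the object (lookup rules undecided)
    index: int = 0    # assumption: 0 stands for that file's "main" object


@dataclass(frozen=True)
class NameLink:
    name: str         # indirect reference, e.g. a registered interface


Link = Union[LocalLink, BuiltinLink, ExternalLink, NameLink]


def describe(link: Link) -> str:
    """Return a short human-readable description of a link."""
    if isinstance(link, LocalLink):
        return f"object #{link.index} in this file"
    if isinstance(link, BuiltinLink):
        return f"built-in implementation 0x{link.magic:04x}"
    if isinstance(link, ExternalLink):
        return f"object #{link.index} in {link.path}"
    return f"whatever is registered as {link.name!r}"
```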
A somewhat important attribute of this format is that the machine code
is stored in a separate section from the object data; code objects
will have their source in the “byte blob” slot. The
machine-dependent section of the file is treated as a cache; you may
move the file to another machine of a different architecture, or
different settings, and the source will be transparently recompiled.
You should also be able to chop off the code section entirely, forcing
it to be regenerated - useful for distribution.
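Treating the machine-dependent section as a cache could look roughly like the Python sketch below. The DreamFile class, its fields and the fake_compiler stand-in are all invented for illustration; in particular, keying the cache on a single architecture string is an assumption, since "different settings" would presumably need to be part of the key as well.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DreamFile:
    sources: Dict[int, str]        # code object index -> source text
    cache_arch: str = ""           # architecture the cache was built for
    machine_code: Dict[int, bytes] = field(default_factory=dict)

    def strip(self) -> None:
        """Chop off the machine-dependent section, e.g. before distribution."""
        self.cache_arch = ""
        self.machine_code.clear()

    def code_for(self, index: int, arch: str,
                 compile_source: Callable[[str, str], bytes]) -> bytes:
        """Return machine code for one code object, recompiling if needed."""
        if self.cache_arch != arch:
            # The cache belongs to another machine (or was stripped).
            self.machine_code.clear()
            self.cache_arch = arch
        if index not in self.machine_code:
            self.machine_code[index] = compile_source(self.sources[index], arch)
        return self.machine_code[index]


calls = []


def fake_compiler(source: str, arch: str) -> bytes:
    """Stand-in for a real code generator; records each invocation."""
    calls.append(arch)
    return f"{arch}:{source}".encode()


dream_file = DreamFile(sources={0: "{ a + b }"})
dream_file.code_for(0, "x86_64", fake_compiler)   # compiled
dream_file.code_for(0, "x86_64", fake_compiler)   # served from the cache
dream_file.code_for(0, "arm64", fake_compiler)    # other machine: recompiled
```

After these three calls the compiler has run twice, once per architecture, and the source itself was never touched.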
Binary and source
With this layout, a Dream file is not only a viable binary format, but
also quite decent as the source form for development. There would be
little sense in having the source for the code objects scattered in
little text files, only to be collected into a Dream file by a tool.
Rather, it's best to design tools to edit Dream files, tweaking the
object links in it and editing the source for code objects; there
could be a GUI interface, an emacs library, a set of command line
utilities, you name it.
One consideration raised by this is that the format would then need to
be amenable to revision control. There are two obvious ways to do
that: either make the format text-based, or at least line-oriented
(possibly XML - yuck), or write a C library that manipulates the file
format and use it to implement plugins for extensible revision control
systems (like bzr), so that they handle Dream files smartly. (Which in
this context would mean a lot of things - not only tracking object
links and code source, but also ignoring the machine-dependent section
entirely.)
Dreamshells
This runtime model of “pure objects”, of a bunch of persistent objects
to which you send messages, does not map well into the OSes of today.
At some point, there must be something with Unixish semantics - an
executable which is loaded into a process, and which can be accessed
from the shell or desktop environment.
This is the role of the Dreamshell: essentially, an object which has
the ability to be saved as a “native” executable.
The Dreamshell interface would look more or less like this:
- (string) config_name
- return a filename to use when looking for a configuration file. On
Unix, for example, if this message returns "bleh", we'd look for
/etc/dreamconf/bleh, and ~/.dreamconf/bleh (if both exist,
load both, in this order).
- (sequence) config_data
- return a sequence of config_data objects, each having a short name,
a long name, help text, and some data on what to do with it if found
(exact semantics to be defined; look at good command line libraries,
such as Python's optparse, for inspiration).
- (string) copyright
- copyright info to display if requested on the command line.
- (string) version
- version info to display if requested on the command line.
- (string) help
- information to be displayed in --help output, after the list of
arguments.
- (integer) run
- actually do whatever this Dreamshell is supposed to do (not called
if the command line had --help, --version, or an error).
Gets arguments:
- (string_sequence) raw_command_line
- the unprocessed command line
- (string_sequence) args
- what was left of the command line after processing configuration
- (string_mapping) environment
- the environment
- (sequence) files
- the open file descriptors (wrapped in file objects)
- (configuration) options
- the data found on the config files and command line
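Here is a small Python sketch of how a launcher might drive such an object. It covers only part of the interface, and the names launch and config_paths are made up for this example. Option parsing is reduced to a placeholder, because the semantics of config_data are still to be defined.

```python
import posixpath
from typing import List, Mapping, Sequence


def config_paths(config_name: str, home: str) -> List[str]:
    """Config files to load, in order: system-wide first, then per-user."""
    return [
        posixpath.join("/etc/dreamconf", config_name),
        posixpath.join(home, ".dreamconf", config_name),
    ]


class Dreamshell:
    """Hypothetical base class with a subset of the messages listed above."""

    def config_name(self) -> str:
        return "bleh"

    def version(self) -> str:
        return "bleh 0.1"

    def help(self) -> str:
        return "Collects its arguments and exits."

    def run(self, raw_command_line, args, environment, files, options) -> int:
        self.seen = list(args)   # placeholder for the shell's real work
        return 0


def launch(shell: Dreamshell, argv: Sequence[str],
           environment: Mapping[str, str]) -> int:
    """Answer --help/--version directly; otherwise call run."""
    for flag, text in (("--help", shell.help), ("--version", shell.version)):
        if flag in argv:
            print(text())
            return 0             # run is not called in this case
    # Real option parsing would be driven by config_data; here anything
    # that does not look like an option is passed through as an argument.
    args = [a for a in argv if not a.startswith("--")]
    return shell.run(list(argv), args, dict(environment), [], {})
```

For a shell whose config_name is "bleh" and a home directory of /home/me, config_paths yields /etc/dreamconf/bleh followed by /home/me/.dreamconf/bleh, matching the load order described above.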
An introspection call somewhere can dump a Dreamshell object into a
native executable.
The system distribution would ship with a few useful Dreamshells; one
to send a message by name to the object represented by a Dream file,
one to listen on some network port and do distributed object brokering
(probably using VIP), one to display a bunch of View (UI) objects on
an X server (acting as a bridge between X and Dream until killed).
A hypothetical Windows distribution would rely on a different kind of
object, which would bridge Dream with DCOM, rather than generate
executable files.
Daydream
With all this said, it feels to me that it would be just too painful
to write C code for code objects that need low-level logic or
optimisation. Not only painful for the programmer, but also for the
maintainers of the language itself, as the runtime would need to be
able to compile C code.
Rather, I would again go the Pypy route, and implement a limited
dialect of Dream, which would translate almost directly to machine
code. I call it “Daydream”.
Here, a variable is not an object reference with an interface; it is a
primitive type (integer, float, boolean, character, or a vector of one
of these, or “object”), plus a (byte) size and a (vector) length, plus
a reference to an object (usually “self”), plus an offset into the
byte blob of that object, or just a stack address (for locals). If
you need anything that can't be expressed with these primitives, you
shouldn't be using Daydream ;-)
Note
The size property is the byte size of the individual elements, not
of the total; the length is 1 if the variable is not a vector. So a
single integer could be (integer, 32, 1), while a vector of 5 longs
could be (integer, 64, 5). The sizes are explicit, so that the code
is sanely portable between hardware platforms.
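A Daydream variable descriptor can be pictured as a small record. The Python sketch below makes two assumptions that are not settled in the text: the sizes are taken to be bit widths (which is how the 32 and 64 in the examples read), and values in the byte blob are taken to be little-endian. The names VarType and read_integer are hypothetical.

```python
from dataclasses import dataclass
from typing import List

PRIMITIVES = {"integer", "float", "boolean", "character", "object"}


@dataclass(frozen=True)
class VarType:
    primitive: str     # one of PRIMITIVES
    size: int          # size of ONE element (assumed to be in bits)
    length: int = 1    # number of elements; 1 means "not a vector"

    def __post_init__(self):
        if self.primitive not in PRIMITIVES:
            raise ValueError(f"unknown primitive {self.primitive!r}")
        if self.size <= 0 or self.length < 1:
            raise ValueError("size and length must be positive")

    @property
    def total_bits(self) -> int:
        return self.size * self.length


def read_integer(blob: bytes, offset: int, var: VarType) -> List[int]:
    """Read an integer variable stored at `offset` in an object's byte blob."""
    width = var.size // 8
    return [
        int.from_bytes(blob[offset + i * width: offset + (i + 1) * width],
                       "little", signed=True)
        for i in range(var.length)
    ]


single = VarType("integer", 32)         # (integer, 32, 1)
five_longs = VarType("integer", 64, 5)  # (integer, 64, 5)
```

Because the element size is part of the type, the same descriptor reads the same bytes on any platform, which is the portability point made in the note.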
An “object” variable only has a few operations. You can perform some
basic introspection (check types, check for attributes or message
bindings), you can read or set attributes, and you can send messages.
Sending a message looks more or less like:
send (interface) my_object "message_name" argument argument.
(the usual rules about syntactic noise apply.)
Notably, the Daydream runtime would have a way to open a shared
library, either by absolute filename or by using the system search
path, and wrap it as an “object”; you then could introspect it to
check for the presence of symbols (attributes), get attributes and
cast them to a primitive type, and send messages (call functions),
casting the result. The object would be read-only, so trying to
modify it would fail.
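Python's ctypes module already offers something close to this, so it makes a handy way to try the idea out. The wrapper below is only a sketch: the class name LibraryObject and its has/send methods are invented here, and the example assumes a POSIX system where the C library can be found through the system search path.

```python
import ctypes
import ctypes.util


class LibraryObject:
    """Read-only wrapper presenting a shared library as an "object"."""

    def __init__(self, name: str):
        # find_library consults the system search path. It may return None,
        # in which case CDLL(None) falls back to the running process (POSIX).
        lib = ctypes.CDLL(ctypes.util.find_library(name))
        object.__setattr__(self, "_lib", lib)

    def has(self, symbol: str) -> bool:
        """Introspection: is this symbol present in the library?"""
        return hasattr(self._lib, symbol)

    def send(self, symbol: str, restype, *args):
        """Call a function, casting the result to the given primitive type."""
        function = getattr(self._lib, symbol)
        function.restype = restype
        return function(*args)

    def __setattr__(self, key, value):
        raise AttributeError("library objects are read-only")


libc = LibraryObject("c")
absolute = libc.send("abs", ctypes.c_int, -5)
```

The explicit restype plays the role of the cast on the result; leaving it out would make ctypes assume a C int.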
Maybe: smarter low-level data
Instead of an opaque byte blob, the “low-level” blob of an object
could go all the way down to C, and be in the form of a symbol table,
with name, primitive type, and value. I very much prefer not to do
this, because it would encourage more complex logic to be done in
Daydream, which is not the intent; but it would tie in better with the
open-ended type system (as in, you could use two code objects that
were written for different structures). The snag is, I'm not sure this
is a good thing :-) I'd rather keep those “optimised” objects very
simple, which means the Daydream code has to know exactly what
low-level data it's messing with.
Syntax notes
A few ideas about syntax.
The syntax design revolves around three goals, in order:
- Readable. Code should be more or less self-documenting; looking
at a piece of code, you should easily infer intent. (Although no
effort would be made to make it impossible to write bad code. All
efforts in this direction that I've ever seen only end up creating
cumbersome constraints.)
- Learnable. This language is designed partially to attract new
programmers, so learning to program in it should be easy, not only
for people who are already programmers, but also (and almost more
importantly) for people picking it up as a first language.
- Writable. Once you have wrapped your brain around a few
important paradigms - passing messages to objects, and Dream's
concept of a variable - the syntax should be quite similar to how
you think about the problem in your head; it should feel natural to
write it down, it should feel like you're writing a message
expressing your thoughts, and not like using some arcane structured
secret code. This not only helps programmers be more productive
(write faster), it also helps on the first goal (make the code
readable and intent-expressing).
One of the corollaries that sprout naturally from these goals is that
syntax should try to model colloquial communication between humans,
and not mathematical language. This is the rationale, for example,
behind function calls not having the familiar obj.foo(arg1, arg2)
syntax, which is borrowed from algebra, but feels consistently alien
and weird to non-programmers who are looking at source code or
learning to program. The way you think about this is something more
similar to foo obj with arg1 arg2, which is valid Dream syntax if
you only append a colon to with.
Another decision that follows from this is the use of the period
(.) at the end of statements, rather than the semicolon inherited
from C and Pascal. In colloquial communication, a message with a lot
of semicolons is badly written; and as illustrated by this paragraph,
statements separated by a semicolon are expected to be closely
related. We have been using the period to end statements since we
learned to write; let's keep it.
Returns
A code object returns the value of its last statement. This is mostly
for the sake of very short code fragments; for longer code, an
explicit return is recommended.
To return explicitly, begin a statement with an equal sign (therefore
“assigning” the following value to the code output).
On the other hand, a message handler may be declared “asynchronous”,
in which case it does not return. The compiler should probably
issue an error or at least a warning, if an explicit return is used in
code associated with such a message.
(Asynchronous messages may be tagged as such using the slot for the
return value interface.)
Typing
Variables and arguments are typed (by interface). This is done by
prefixing the interface name in parentheses to the variable
definition. The parentheses are for the sake of argument lists -
since the comma is optional, you need to be able to tell easily what
is an interface and what is the name being defined. (But it then
allows me to support sequence unpacking, if I decide so.)
You can omit the interface, defaulting to (object).
You can redefine a variable; you might want to do that, for example,
after determining that the value is, after all, of a certain type.
An expression of the form (expression :: interface) (parentheses
not optional) is a cast. You shouldn't use it often, but sometimes
you might want to send a message to an object using a different
interface that you know it has. Pronounce it as “expression as
interface” (eg, “first_name as sequence”).
Reserved symbols
This is the complete list of reserved symbols up to now: {}()"=:.
are meaningful, ,; are syntactic noise, # is a comment (to end
of line, like in shell or Python).
The curly braces delimit a code block. Parentheses have a few uses -
grouping expressions, declaring types, and casting (although casting
is only a special case of grouping). Double quotes delimit a string;
the string syntax is roughly similar to Python's unicode literals,
sans the leading u, except that it can go multi-line.
The equal sign is used for assignment (and explicit return, which is a
special case of assignment); it is special in that it's only reserved
when by itself (delimited by whitespace), so ==, +=, etc are
not reserved. (This sounds dirty. Maybe use := instead?)
The colon is probably the most overloaded one. It's heavily used in
the syntactic noise system - if a token after the second ends in
colon, it's noise; if the first or second ends in colon, it's the
message name. Doubled, it's used for casts. And as the first token
in a code block, it starts an argument list (see next section).
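The colon rules are mechanical enough to write down as a classifier. The Python sketch below labels the tokens of a single, already-split statement; the function name and the role labels are made up for illustration, and it does nothing beyond the rules just stated.

```python
from typing import List


def classify(tokens: List[str]) -> List[str]:
    """Label each token of one statement according to the colon rules."""
    roles = []
    for position, token in enumerate(tokens):
        if token == "::":
            roles.append("cast")
        elif token == ":":
            roles.append("argument list delimiter")
        elif token.endswith(":"):
            # First or second token: message name. Anything later: noise.
            roles.append("message name" if position < 2 else "noise")
        else:
            roles.append("other")
    return roles


print(classify("new_obj add_base: base".split()))
print(classify("foo obj with: arg1 arg2".split()))
```

In the first statement add_base: is the second token, so it names the message; in the second, with: comes third and is therefore noise.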
The period, finally, separates statements. Like the Pascal semicolon,
it's a separator and not a terminator - meaning, it's optional on the
last statement of a block. The period has to be followed by
whitespace or end of input, to disambiguate from a decimal point or
ellipsis (or other uses of the period character we may introduce
later).
Syntactic noise is ignored, with one important caveat: if you're
using it as a separator, you have to follow it with whitespace. The
expression 1, 230 is two numbers (possibly two arguments to a
message), while the expression 1,230 is a single number.
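Both whitespace rules (for the period and for noise characters) fit in a couple of regular expressions. This Python sketch is deliberately naive: it ignores string literals and nested code blocks, and it assumes that a period directly following another period is part of an ellipsis. The function names are hypothetical.

```python
import re
from typing import List

# A period separates statements only when followed by whitespace or end of
# input. Assumption: a period right after another period is an ellipsis.
SEPARATOR = re.compile(r"(?<!\.)\.(?=\s|$)")

# Commas and semicolons count as noise only when followed by whitespace
# or end of input; otherwise they stay part of the token.
NOISE = re.compile(r"[,;](?=\s|$)")


def statements(source: str) -> List[str]:
    """Split source text into statements on separator periods."""
    parts = [part.strip() for part in SEPARATOR.split(source)]
    return [part for part in parts if part]


def tokens(statement: str) -> List[str]:
    """Drop syntactic noise, then split on whitespace."""
    return NOISE.sub(" ", statement).split()


print(statements("x = 1.5. y = x."))
print(tokens("add 1, 230"))
print(tokens("add 1,230"))
```

The decimal point in 1.5 survives because it is followed by a digit, and 1,230 stays a single token because its comma is not followed by whitespace.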
Arguments
Argument lists are defined by having a colon as the first token in a
code block:
(code) add = { : (number) a (number) b : a + b }.
I'm still uncertain about keyword arguments; coming from Python, I
certainly know their value, but also coming from Python, I recognise
that they usually appear in overcomplex code that could be done
better using more object-oriented techniques - or, for some uses (like
the datetime constructor), multimethods.
You set up a default value, if you want one, inside the parenthesis:
(code) add = { : (number 2) a (number 2) b : a + b }.
For variable-length arguments, you use the ellipsis:
(code) add = { : (number) a ... :
    iterate_over ... with: { : (number) n : a += n }.
    = a
}.
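For readers coming from Python, the three add definitions above correspond roughly to the functions below. This is only an analogy to help with reading the Dream syntax; the function names are made up, and Python's defaults and varargs are not claimed to behave identically to Dream's.

```python
def add(a, b):
    """Two required arguments."""
    return a + b


def add_with_defaults(a=2, b=2):
    """Both arguments default to 2, like (number 2) a (number 2) b."""
    return a + b


def add_all(a, *rest):
    """One required argument plus a variable-length tail, like the ellipsis."""
    for n in rest:
        a += n
    return a
```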
Here's a possible implementation of the new method:
: ... :
new_obj = create_empty object.
iterate_over ... with: { : base : new_obj add_base: base }.
# only initialise after adding all bases -
# because one base may care what other
# bases the object has
iterate_over ... with: { : base : base, initialize: new_obj }.
= new_obj
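The reason for the two passes is easier to see in a runnable form. The Python sketch below mirrors the Dream code line by line; Obj and Base are hypothetical stand-ins, and the log attribute exists only to show what each base could see at initialisation time.

```python
class Obj:
    """Hypothetical stand-in for an empty Dream object."""

    def __init__(self):
        self.bases = []
        self.log = []


class Base:
    def __init__(self, name):
        self.name = name

    def initialize(self, obj):
        # A base may care which other bases the object has.
        others = [base.name for base in obj.bases if base is not self]
        obj.log.append((self.name, others))


def new(*bases):
    new_obj = Obj()                   # create_empty object
    for base in bases:                # first pass: attach every base
        new_obj.bases.append(base)
    for base in bases:                # second pass: initialise
        base.initialize(new_obj)
    return new_obj


obj = new(Base("comparable"), Base("printable"))
```

With a single pass, the first base would be initialised before the second was attached and would see an incomplete object.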
Scope
There is no built-in syntax for “give me attribute bar of object foo”,
because we don't want to encourage code to mess with attributes of
objects other than self (sorry, Python). Of course, you can get
to these attributes, using introspection (something like:
(foo :: object) get_attribute: "bar").
Temporary (local) variables in Dream (not Daydream) don't live in
the stack, but on a global soup on the heap, and are garbage
collected. This is for the sake of nested scopes; consider:
(integer) number = 1.
(code) silly_code = { number + 1 }.
number += 1.
= silly_code
Since this code object has returned, number can't be “alive” in its
stack frame anymore. But the code object returned at the end has a
reference to it, so it must be somewhere. (Those who ever worked on
a Lisp implementation, or even Python, will be familiar with closure
theory, of which I just scratched the surface.)
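Python closures behave the same way, so the example can be tried out directly. This is a Python rendering of the snippet above, not Dream; the function name outer is invented for the sketch.

```python
def outer():
    number = 1
    silly_code = lambda: number + 1   # refers to the variable, not its value
    number += 1
    return silly_code


silly = outer()
result = silly()   # 3: the closure sees number after the increment
```

By the time silly runs, the frame of outer is gone, yet number (now 2) is still reachable through the closure, which is exactly why it cannot live on the stack.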
For consideration: what happens to number if I persist silly_code?
So here's how a name is looked up to resolve to a variable:
- reserved names. There are only a few of these: null, unknown, true,
false, self, context.
- local variables from the current code object (including arguments).
- local variables from enclosing code objects (nested scopes).
- attributes of self.
- interface names registered in the interface manager; an interface
doesn't have to be registered, but registering it makes it
accessible in this fashion, therefore much more usable.
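The lookup order above can be sketched as a simple chain of dictionaries. In this Python illustration the function name resolve, the shape of its arguments and the returned (kind, value) pairs are all assumptions made for the example.

```python
RESERVED = {"null", "unknown", "true", "false", "self", "context"}


def resolve(name, local_scopes, self_attributes, interfaces):
    """Return (kind, value) for `name`, searching in the order listed above.

    local_scopes is a list of dicts, innermost code object first.
    """
    if name in RESERVED:
        return ("reserved", name)
    for depth, scope in enumerate(local_scopes):
        if name in scope:
            return ("local" if depth == 0 else "enclosing", scope[name])
    if name in self_attributes:
        return ("attribute", self_attributes[name])
    if name in interfaces:
        return ("interface", interfaces[name])
    raise NameError(name)


scopes = [{"n": 1}, {"n": 99, "number": 2}]   # innermost first
attributes = {"number": 7, "title": "x"}
registered = {"sequence": "<interface sequence>"}
```

With these tables, n resolves to the innermost local, number to the enclosing scope (shadowing the attribute of the same name), title to an attribute of self, and sequence to a registered interface.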