The language of my dreams, part 3 - Runtime

blog entry posted by lalo (Lalo Martins) on 2006-03-14 13:59:10

I thought this series of posts was finished, but I caught myself thinking about it repeatedly :-) so here is what my “preposterous little brain” came up with in these last few days.

Revisions to the last two posts

I studied LLVM a bit, and decided that, if I were to actually implement the language (which I'm now calling Dream), LLVM would not be a good match. The advantages it offers on top of using the gcc framework are almost totally irrelevant for this project.

In fact, ideally, Dream should be written in itself, and have its own machine code generators; but it should also have a C generator, which pumps C into gcc to get machine code, for two reasons: first, shipping the generated C code for the compiler is a great way to bootstrap from source; second, it would help on platforms for which a machine code generator doesn't exist yet. This approach is not my original invention - I'm copying it wholesale from PyPy. (Although PyPy does have an LLVM backend as well.)

Also, upon later reflection, I decided the C-like syntax for message passing (the last code example in the previous post) is Evil™, and detracts from the stated goal of “regular syntax”; it's out.

The binary format

I mentioned Dream having its own binary format. Why, and what does it look like?

It's important to bear in mind, this is a “pure” object-oriented environment; the binary formats we have now (ELF and EXE) are designed around the needs of C, with long-running processes, lots of functions identified by name only, and variables which are blobs of binary data. This is simply not appropriate for the kind of information we want to store.

A Dream binary file has to be essentially a persisted object. Technically, it will actually be a tree of objects: the “main” object represented by the file, plus the other objects needed to reconstruct it (its attributes and message handlers, and their attributes and message handlers, and so on). Notably, this includes what's probably the most “interesting” part of the file: the machine code for any code objects contained in this tree.

Now, each Dream object consists of five pieces of information:

  • links to bases (what Self calls the “parent” link)
  • an opaque blob of bytes, with its size information (usually 0)
  • a symbol table mapping (interface, name) pairs to other objects - the attributes
  • the method handler table, mapping (interface, message name) pairs to other objects (usually code); this may include additional metadata (such as the interface of the return value, if any)
  • an index into the machine-dependent section of the file (normally unused, except for code objects)
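To make that list a bit more concrete, here is a rough sketch of the record in Python. The names (`DreamObject`, `bases`, `blob`, `attributes`, `handlers`, `code_index`, `find_handler`) are placeholders I made up for illustration, and the depth-first lookup through the bases is an assumption, not something the format defines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Both tables are keyed by an (interface, name) pair, as in the list above.
Key = Tuple[str, str]


@dataclass
class DreamObject:
    bases: List["DreamObject"] = field(default_factory=list)         # links to bases
    blob: bytes = b""                                                # opaque bytes, usually empty
    attributes: Dict[Key, "DreamObject"] = field(default_factory=dict)
    handlers: Dict[Key, "DreamObject"] = field(default_factory=dict)
    code_index: Optional[int] = None                                 # only set for code objects

    def find_handler(self, interface: str, message: str) -> Optional["DreamObject"]:
        """Look for a handler on this object, then depth-first through the bases.

        The lookup order is an assumption made for this sketch.
        """
        found = self.handlers.get((interface, message))
        if found is not None:
            return found
        for base in self.bases:
            found = base.find_handler(interface, message)
            if found is not None:
                return found
        return None
```

A code object would carry its source in `blob` and a `code_index`; a plain number would typically have an empty handler table of its own and find everything through its bases.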

A link to another object may take a few different forms:

  • an index to another object stored in the same file.
  • a “magic number” pointing to a built-in “shortcutting” method implementation - for example, the run message of code objects will make a call into the machine code version, which will have been stored elsewhere in the file (so that it can be loaded into a “code” segment, on hardware that uses segmentation), while arithmetic operations on a number object will resolve to a handful of machine code instructions.
  • a reference to an object in a different file. This is the delicate part - there are issues to be carefully thought about, about search paths, about keeping these files useful when you move them between machines (which possibly have different directory layouts), and the most important - if an object N is referenced as an attribute of both A and B, if N itself is not persisted in its own file, and you persist both A and B to their own files, which one gets N? The answer to this will probably have to do with weak references.
  • a name (indirect reference) - mostly used for aliasing attributes, and more usually, for referring to registered interfaces.
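The four link forms could be sketched as a small tagged union. Everything here is hypothetical: the class names, the `resolve` helper, and the idea that a loader is handed the local object table, a built-in table, a name registry and a file-loading callback. The hard questions about search paths and shared objects are deliberately left to `load_file`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LocalLink:
    index: int      # position of another object stored in the same file


@dataclass(frozen=True)
class BuiltinLink:
    magic: int      # "magic number" naming a built-in shortcut implementation


@dataclass(frozen=True)
class ExternalLink:
    path: str       # another Dream file; how it is searched for is an open question


@dataclass(frozen=True)
class NameLink:
    name: str       # indirect reference, e.g. a registered interface


Link = Union[LocalLink, BuiltinLink, ExternalLink, NameLink]


def resolve(link, objects, builtins, registry, load_file):
    """Turn a link into the thing it points to (sketch only)."""
    if isinstance(link, LocalLink):
        return objects[link.index]
    if isinstance(link, BuiltinLink):
        return builtins[link.magic]
    if isinstance(link, ExternalLink):
        # Assumed to return the "main" object of the other file.
        return load_file(link.path)
    if isinstance(link, NameLink):
        return registry[link.name]
    raise TypeError("unknown link form: %r" % (link,))
```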

A somewhat important attribute of this format is that the machine code is stored in a separate section from the object data; code objects will have their source in the “byte blob” slot. The machine-dependent section of the file is treated as a cache; you may move the file to another machine of a different architecture, or different settings, and the source will be transparently recompiled. You should also be able to chop off the code section entirely, forcing it to be regenerated - useful for distribution.
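Treating the machine-dependent section as a cache could look roughly like this. It is a minimal sketch: `CodeObject`, `machine_code_for`, `strip_for_distribution` and the `target` string are all invented for the example, and the compiler is passed in as a plain function because no such compiler exists yet.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CodeObject:
    source: str                                                    # lives in the byte blob slot
    machine_code: Dict[str, bytes] = field(default_factory=dict)   # cache, keyed by target


def machine_code_for(code: CodeObject, target: str,
                     compile_source: Callable[[str, str], bytes]) -> bytes:
    """Return cached machine code for `target`, recompiling from source on a miss."""
    cached = code.machine_code.get(target)
    if cached is None:
        cached = compile_source(code.source, target)
        code.machine_code[target] = cached
    return cached


def strip_for_distribution(code: CodeObject) -> None:
    """Chop off the machine-dependent part; it gets regenerated on first use."""
    code.machine_code.clear()
```

Moving the file to a different architecture is then just a cache miss: the source is still there, so the first call for the new target compiles it again.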

Binary and source

With this layout, a Dream file is not only a viable binary format, but also quite decent as the source form for development. There would be little sense in having the source for the code objects scattered in little text files, only to be collected into a Dream file by a tool. Rather, it's best to design tools to edit Dream files, tweaking the object links in it and editing the source for code objects; there could be a GUI interface, an emacs library, a set of command line utilities, you name it.

One consideration raised by this is that the format would then need to be amenable to revision control. There are two obvious ways to do that; either make the format text-based, or at least line-oriented (possibly XML - yuck), or, by having a C library that manipulates the file format, implement plugins for extensible revision control systems (like bzr) to handle Dream files smartly. (Which in this context would mean a lot of things - not only track object links and code source, but also, ignore the machine-dependent section entirely.)

Dreamshells

This runtime model of “pure objects”, of a bunch of persistent objects to which you send messages, does not map well into the OSes of today. At some point, there must be something with Unixish semantics - an executable which is loaded into a process, and which can be accessed from the shell or desktop environment.

This is the role of the Dreamshell: essentially, an object which has the ability to be saved as a “native” executable.

The Dreamshell interface would look more or less like this:

(string) config_name
  return a filename to use when looking for a configuration file. On Unix, for example, if this message returns "bleh", we'd look for /etc/dreamconf/bleh, and ~/.dreamconf/bleh (if both exist, load both, in this order).
(sequence) config_data
  return a sequence of config_data objects, each having a short name, a long name, help text, and some data on what to do with it if found (exact semantics to be defined; look at good command line libraries, such as Python's optparse, for inspiration).
(string) copyright
  copyright info to display if requested on the command line.
(string) version
  version info to display if requested on the command line.
(string) help
  information to be displayed in --help output, after the list of arguments.
(integer) run
  actually do whatever this Dreamshell is supposed to do (not called if the command line had --help, --version, or an error).

  Gets arguments:

  (string_sequence) raw_command_line
    the unprocessed command line
  (string_sequence) args
    what was left of the command line after processing configuration
  (string_mapping) environment
    the environment
  (sequence) files
    the open file descriptors (wrapped in file objects)
  (configuration) options
    the data found on the config files and command line
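For readers who think better in a concrete language, here is approximately the same interface as a Python abstract class, plus a toy launcher. It is only a sketch: `config_paths`, `launch` and the `write` callback are names I invented, option processing is skipped entirely (its semantics are still undefined above), and only the --help / --version short-circuit and the config file order come from the description.

```python
import posixpath
from abc import ABC, abstractmethod


class Dreamshell(ABC):
    config_name = "bleh"   # base name of the configuration file
    config_data = ()       # option descriptions; exact shape still to be defined
    copyright = ""
    version = ""
    help = ""

    @abstractmethod
    def run(self, raw_command_line, args, environment, files, options):
        """Do the actual work and return an integer exit status."""


def config_paths(name, home):
    """Unix lookup order: system-wide file first, then the per-user one."""
    return [posixpath.join("/etc/dreamconf", name),
            posixpath.join(home, ".dreamconf", name)]


def launch(shell, command_line, environment, write):
    """Hypothetical launcher: run() is not called for --help or --version."""
    if "--help" in command_line:
        write(shell.help)
        return 0
    if "--version" in command_line:
        write(shell.version)
        return 0
    args = [a for a in command_line if not a.startswith("--")]
    options = {}   # placeholder: config files and options are not parsed here
    files = []     # placeholder for the wrapped file descriptors
    return shell.run(command_line, args, environment, files, options)
```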

An introspection call somewhere can dump a Dreamshell object into a native executable.

The system distribution would ship with a few useful Dreamshells; one to send a message by name to the object represented by a Dream file, one to listen on some network port and do distributed object brokering (probably using VIP), one to display a bunch of View (UI) objects on an X server (acting as a bridge between X and Dream until killed).

A hypothetical Windows distribution would rely on a different kind of object, which would bridge Dream with DCOM, rather than generate executable files.

Daydream

With all this said, it feels to me that it would be just too painful to write C code for code objects that need low-level logic or optimisation. Not only painful for the programmer, but also for the maintainers of the language itself, as the runtime would need to be able to compile C code.

Rather, I would again go the PyPy route, and implement a limited dialect of Dream, which would translate almost directly to machine code. I call it “Daydream”.

Here, a variable is not an object reference with an interface; it is a primitive type (integer, float, boolean, character, or a vector of one of these, or “object”), plus a (byte) size and a (vector) length, plus a reference to an object (usually “self”), plus an offset into the byte blob of that object, or just a stack address (for locals). If you need anything that can't be expressed with these primitives, you shouldn't be using Daydream ;-)

Note

The size property is the byte size of the individual elements, not of the total; the length is 1 if the variable is not a vector. So a single integer could be (integer, 32, 1), while a vector of 5 longs could be (integer, 64, 5). The sizes are explicit, so that the code is sanely portable between hardware platforms.
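The (type, size, length) triple can be sketched as a tiny descriptor. The names `VarType`, `PRIMITIVES` and `total_size` are hypothetical, and the sketch simply keeps whatever unit the tuple uses: the examples above say 32 and 64, which read like bit widths, so treat the unit as an open detail.

```python
from dataclasses import dataclass

PRIMITIVES = {"integer", "float", "boolean", "character", "object"}


@dataclass(frozen=True)
class VarType:
    kind: str         # one of PRIMITIVES
    size: int         # size of ONE element, in the unit used by the examples above
    length: int = 1   # 1 means "not a vector"

    def __post_init__(self):
        if self.kind not in PRIMITIVES:
            raise ValueError("not a Daydream primitive: %s" % self.kind)
        if self.size <= 0 or self.length < 1:
            raise ValueError("size must be positive and length at least 1")

    @property
    def total_size(self) -> int:
        """Storage needed for the whole variable, element size times length."""
        return self.size * self.length


single_integer = VarType("integer", 32)      # (integer, 32, 1)
five_longs = VarType("integer", 64, 5)       # (integer, 64, 5)
```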

An “object” variable only has a few operations. You can perform some basic introspection (check types, check for attributes or message bindings), you can read or set attributes, and you can send messages. Sending a message looks more or less like:

send (interface) my_object "message_name" argument argument.

(the usual rules about syntactic noise apply.)

Notably, the Daydream runtime would have a way to open a shared library, either by absolute filename or by using the system search path, and wrap it as an “object”; you then could introspect it to check for the presence of symbols (attributes), get attributes and cast them to a primitive type, and send messages (call functions), casting the result. The object would be read-only, so trying to modify it would fail.

Maybe: smarter low-level data

Instead of an opaque byte blob, the “low-level” blob of an object could go all the way down to C, and be in the form of a symbol table, with name, primitive type, and value. I very much prefer not to do this, because it would encourage more complex logic to be done in Daydream, which is not the intent; but it would tie in better with the open-ended type system (as in, you could use two code objects that were written for different structures). The nag is, I'm not sure this is a good thing :-) I'd rather keep those “optimised” objects very simple, which means the Daydream code has to know exactly what low-level data it's messing with.

Syntax notes

A few ideas about syntax.

The syntax design revolves around three goals, in order:

  • Readable. Code should be more or less self-documenting; looking at a piece of code, you should easily infer intent. (Although no effort would be made to make it impossible to write bad code. All efforts in this direction that I've ever seen only end up creating cumbersome constraints.)
  • Learnable. This language is designed partially to attract new programmers, so it should be usable as a first language; learning to program in it should be easy, not only for people who are already programmers, but also (and almost more importantly) for complete beginners.
  • Writable. Once you have wrapped your brain around a few important paradigms - passing messages to objects, and Dream's concept of a variable - the syntax should be quite similar to how you think about the problem in your head; it should feel natural to write it down, it should feel like you're writing a message expressing your thoughts, and not like using some arcane structured secret code. This not only helps programmers be more productive (write faster), it also helps on the first goal (make the code readable and intent-expressing).

One of the corollaries that sprout naturally from these goals is that syntax should try to model colloquial communication between humans, and not mathematical language. This is the rationale, for example, behind function calls not having the familiar obj.foo(arg1, arg2) syntax, which is borrowed from algebra, but feels consistently alien and weird to non-programmers who are looking at source code or learning to program. The way you think about this is something more similar to foo obj with arg1 arg2, which is valid Dream syntax if you only append a colon to with.

Another decision that follows from this is the use of the period (.) at the end of statements, rather than the semicolon inherited from C and Pascal. In colloquial communication, a message with a lot of semicolons is badly written; and as illustrated by this paragraph, statements separated by a semicolon are expected to be closely related. We have been using the period to end statements since we learned to write; let's keep it.

Returns

A code object returns the value of its last statement. This is mostly for the sake of very short code fragments; for longer code, an explicit return is recommended.

To return explicitly, begin a statement with an equal sign (therefore “assigning” the following value to the code output).

On the other hand, a message handler may be declared “asynchronous”, in which case it does not return. The compiler should probably issue an error or at least a warning, if an explicit return is used in code associated with such a message.

(Asynchronous messages may be tagged as such using the slot for the return value interface.)

Typing

Variables and arguments are typed (by interface). This is done by prefixing the interface name in parentheses to the variable definition. The parentheses are for the sake of argument lists - since the comma is optional, you need to be able to tell easily what is an interface and what is the name being defined. (But it then allows me to support sequence unpacking, if I decide so.)

You can omit the interface, defaulting to (object).

You can redefine a variable; you might want to do that, for example, after determining that the value is, after all, of a certain type.

An expression of the form (expression :: interface) (parentheses not optional) is a cast. You shouldn't use it often, but sometimes you might want to send a message to an object using a different interface that you know it has. Pronounce it as “expression as interface” (eg, “first_name as sequence”).

Reserved symbols

This is the complete list of reserved symbols up to now: {}()"=:. are meaningful, ,; are syntactic noise, # is a comment (to end of line, like in shell or Python).

The curly braces delimit a code block. Parentheses have a few uses - grouping expressions, declaring types, and casting (although casting is only a special case of grouping). Double quotes delimit a string; the string syntax is roughly similar to Python's unicode literals, sans the leading u, except that it can go multi-line.

The equal sign is used for assignment (and explicit return, which is a special case of assignment); it is special in that it's only reserved when by itself (delimited by whitespace), so ==, +=, etc are not reserved. (This sounds dirty. Maybe use := instead?)

The colon is probably the most overloaded one. It's heavily used in the syntactic noise system - if a token after the second ends in colon, it's noise; if the first or second ends in colon, it's the message name. Doubled, it's used for casts. And as the first token in a code block, it starts an argument list (see next section).

The period, finally, separates statements. Like the Pascal semicolon, it's a separator and not a terminator - meaning, it's optional on the last statement of a block. The period has to be followed by whitespace or end of input, to disambiguate from a decimal point or ellipsis (or other uses of the period character we may introduce later).

Syntactic noise is ignored; this is important, because if you're using it as a separator, you have to follow it with whitespace. The expression 1, 230 is two numbers (possibly two arguments to a message), while the expression 1,230 is a single number.
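Here is a toy Python sketch of how the noise and colon rules could be applied to one flat statement. It is not a real parser: it splits on whitespace (so strings with spaces and nested code blocks are out of scope), and `tokenize`, `parse_message`, `Message` and the `OPERATOR_CHARS` set are all assumptions made for the example. The operator rule ("a non-alphanumeric second token is the message name") comes from part 2 of this series.

```python
from dataclasses import dataclass
from typing import List

OPERATOR_CHARS = set("+-*/<>=!?&|%^~")   # assumed set of operator characters


@dataclass
class Message:
    target: str
    name: str
    args: List[str]


def tokenize(statement: str) -> List[str]:
    """Split on whitespace and drop comma/semicolon noise.

    Noise only counts when followed by whitespace, so "1,230" survives
    as one token while "1, 230" becomes two.
    """
    text = statement.strip()
    if text.endswith(".") and not text.endswith("..."):
        text = text[:-1]                  # statement separator
    tokens = []
    for raw in text.split():
        token = raw.rstrip(",;")
        if token:
            tokens.append(token)
    return tokens


def is_operator(token: str) -> bool:
    return bool(token) and all(c in OPERATOR_CHARS for c in token)


def parse_message(statement: str) -> Message:
    tokens = tokenize(statement)
    if len(tokens) < 2:
        raise ValueError("need at least a message name and a target")
    first, second = tokens[0], tokens[1]
    if second.endswith(":") or is_operator(second):
        target, name = first, second.rstrip(":")    # object first
    else:
        target, name = second, first.rstrip(":")    # message first
    # Any later token ending in a colon is noise.
    args = [t for t in tokens[2:] if not t.endswith(":")]
    return Message(target, name, args)
```

With this, `foo obj with: arg1 arg2.` and `obj foo: arg1, arg2.` both come out as the message foo sent to obj with two arguments.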

Arguments

Argument lists are defined by having a colon as the first token in a code block:

(code) add = { : (number) a (number) b : a + b }.

I'm still uncertain about keyword arguments; coming from Python, I certainly know their value, but also coming from Python, I recognise that they usually appear in overcomplex code that could be done better using more object-oriented techniques - or, for some uses (like the datetime constructor), multimethods.

You set up a default value, if you want one, inside the parenthesis:

(code) add = { : (number 2) a (number 2) b : a + b }.

For variable-length arguments, you use the ellipsis:

(code) add = { : (number) a ... :
  iterate_over ... with: { : (number) n : a += n }.
  = a
}.
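For comparison, these are rough Python analogues of the three add definitions above. They only mirror the argument-list shapes (plain, defaults, variable-length); the function names are mine, and nothing here says how Dream would actually bind or type the arguments.

```python
def add(a, b):
    """Plain two-argument version."""
    return a + b


def add_with_defaults(a=2, b=2):
    """Both arguments default to 2, like (number 2) a (number 2) b."""
    return a + b


def add_all(a, *rest):
    """Variable-length version: fold every extra argument into a."""
    for n in rest:
        a += n
    return a   # the explicit "= a" of the Dream version
```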

Here's a possible implementation of the new method:

: ... :
new_obj = create_empty object.
iterate_over ... with: { : base : new_obj add_base: base }.
# only initialise after adding all bases -
# because one base may care what other
# bases the object has
iterate_over ... with: { : base : base, initialize: new_obj }.
= new_obj
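The reason for the two separate passes is easier to see in a small Python analogue. `Proto`, `Base` and the `log` list are hypothetical stand-ins, invented only to show that every base sees the complete list of bases when it initialises the object.

```python
class Proto:
    """Stand-in for an empty Dream object."""

    def __init__(self):
        self.bases = []
        self.log = []

    def add_base(self, base):
        self.bases.append(base)


class Base:
    def __init__(self, name):
        self.name = name

    def initialize(self, obj):
        # A base may care which other bases the object has.
        obj.log.append((self.name, [b.name for b in obj.bases]))


def new(*bases):
    new_obj = Proto()
    for base in bases:          # first pass: attach every base
        new_obj.add_base(base)
    for base in bases:          # second pass: only now initialise
        base.initialize(new_obj)
    return new_obj
```

Had the two loops been merged, the first base would have been initialised while the object still had only one base.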

Scope

There is no built-in syntax for “give me attribute bar of object foo”, because we don't want to encourage code to mess with attributes of objects other than self (sorry, Python). Of course, you can get to these attributes, using introspection (something like: (foo :: object) get_attribute: "bar").

Temporary (local) variables in Dream (not Daydream) don't live on the stack, but in a global soup on the heap, and are garbage collected. This is for the sake of nested scopes; consider:

(integer) number = 1.
(code) silly_code = { number + 1 }.
number += 1.
= silly_code

Since this code object has returned, number can't be “alive” in its stack frame anymore. But the code object returned at the end has a reference to it, so it must be somewhere. (Those who ever worked on a Lisp implementation, or even Python, will be familiar with closure theory, of which I just scratched the surface.)
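Python behaves the same way, which makes it a handy illustration: the inner function keeps a reference to the variable itself, not a copy of its value at definition time. The function names are mine, and whether Dream would match Python in every detail is an assumption.

```python
def outer():
    number = 1

    def silly_code():
        return number + 1

    number += 1          # happens after silly_code was defined
    return silly_code    # outer's frame is gone after this


silly = outer()
result = silly()
# The variable lives on in a heap-allocated "cell" attached to the closure.
captured = silly.__closure__[0].cell_contents
```

Here `result` is 3 and `captured` is 2: the closure sees the increment because it shares the variable, which is exactly why the variable can't live in the stack frame.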

For consideration: what happens to number if I persist silly_code?

So here's how a name is looked up to resolve to a variable:

  1. reserved names. There are only a few of these: null, unknown, true, false, self, context.
  2. local variables from the current code object (including arguments).
  3. local variables from enclosing code objects (nested scopes).
  4. attributes of self.
  5. interface names registered in the interface manager; an interface doesn't have to be registered, but registering it makes it accessible in this fashion, therefore much more usable.
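The lookup order above can be sketched as a single function. All the names and the shape of the arguments are assumptions: scopes are plain dicts, given innermost first, and the function returns where the name was found along with the value.

```python
def resolve_name(name, reserved, scopes, self_attributes, interfaces):
    """Resolve a name following the five steps, in order.

    scopes[0] is the current code object; the rest are enclosing ones.
    """
    if name in reserved:                          # 1. reserved names
        return "reserved", reserved[name]
    for depth, scope in enumerate(scopes):        # 2. locals, 3. enclosing locals
        if name in scope:
            return ("local" if depth == 0 else "enclosing"), scope[name]
    if name in self_attributes:                   # 4. attributes of self
        return "attribute", self_attributes[name]
    if name in interfaces:                        # 5. registered interfaces
        return "interface", interfaces[name]
    raise NameError(name)
```

One consequence worth noticing: a local variable shadows an attribute of self with the same name, and an attribute shadows a registered interface.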

Web lining?

blog entry posted by lalo (Lalo Martins) on 2006-03-12 16:32:30

The problem

One thing that bothers me on the web these days — and it must bother many other people, because I see a lot of “web 2.0” sites going to lengths to work around this — is that we use a lot of media (which implies a lot of stuff for the client to download), and significant loading and rendering time, to build what could be best described as “decorations” around our main content. Look at my blog; I have relatively little - a logo, the menu of previous posts, a banner ad, a small attribution button, and an optional frame around the post (it goes away if you click on it, in case you haven't found that out yet; click on the post title to get it back). Still, my historic menu requires a non-trivial chunk of data, and it would be a waste to get it from the server on every page load.

Alas, the blog — and the “web 2.0” sites I was talking about — don't really do page loads that often. When you click on a post on the menu, I do an “AJAX” request and load the new post in the frame, without having to rebuild the whole page. Even if you use one of the subject filters, by clicking on the small subject items near the posts, you still don't have a full page load.

But that comes with a price. The page that the URL above points to doesn't actually have any real content - which means my banner ads end up not being targeted, which means they get a lower CTR, and I get less money. Okay, I don't blog to make money, but if I can make some, why not?

And then, if I want search engines to index me, I have to go through wild contortions. (In the case of this blog, I just don't bother, since they already index the planet.) And finally, to have unique URLs that you can bookmark for each post, I have to subvert my URLs and use them in ways the Gods never meant them to be used.

On the other site I'm working on, I go with the opposite (and more common) approach; I just do the decorations on each page, and hope that if your computer and connection are fast enough, you won't notice the whole window being discarded and new decorations, identical to the old ones, being drawn on the same places. But this not only can look bad (when things are not fast enough), it's also rather wasteful, in terms of bandwidth, CPU/memory usage, and raw page size. I still believe a webpage should have only the actual content, and some very minimal metadata — if that metadata includes links to scripts, stylesheets and whatever else is needed to make the content purty, then that's fine by me.

Hmm. Whatever else? What about having the decorations themselves as one of those separate resources?

Yet a third approach, which I use on dotplan, is to serve just the content, and build the whole user interface bells and whistles with javascript. I'm aware of quite a few sites that go that route — Gmail for one. What I'm proposing is a variation of this approach — or maybe, rather, a formalisation.

Enter: Lining

Let's call it “lining”; it's a word as good as any other, it's a moderately good mnemonic for what we're actually doing, and I don't believe it's already in use as a (software) technical term.

So we'd have four “component categories” to a web page: actual content (xhtml, images, whatever else you need to show), scripts (define dynamic behaviour, including that of the lining), styling (defines what the content and lining actually look like - you can have alternate styles for the user to choose from), and lining (a set of widgets that put the page in the context of your site, and help the user make the best use of the site).

Like scripts and styles, lining can go as a separate resource (<link rel="lining-def" ...) or inline (using XML namespaces, like you would embed a piece of SVG).

There are a few obvious optimisations there; for one, lining would by default not be rendered at all, on print media. If you want a letterhead logo on the prints, maybe also an ad, you'd have to explicitly give it a display property in a @media print block.

A browser that supports lining would behave just like it does now, when displaying a page with no lining. The interesting thing would happen when you're seeing a page with lining, and you click on a link or submit a form (either on the content or lining), or do any other action that would normally cause the page to unload. In this case, the browser would proceed like this:

  • first, make the network request.
    • if it fails, check if the lining has an "onRequestFail" event handler. If it does, call it and stop.
  • if it succeeds, check the content-type. If it's something that can't possibly have lining (eg, a PNG), unload the current page and proceed as usual.
  • if it can have lining, load enough information to discover it. On xhtml (and that's the only format for which lining is defined as yet, though I can imagine SVG being useful too), that would be up to the end of the head element.
  • check if the set of lining-defs is identical. For linked lining, that would be a simple URL comparison; for inline, textual. If they differ, unload the current page and proceed as usual.
  • if the lining-def sets are identical, then unload the content, and replace it with the new one; don't touch the lining.
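The steps above boil down to one decision, sketched here in Python. The `Response` record, the `navigate` function and the returned strings are invented for the example; each lining-def is represented by a string (its URL when linked, its text when inline), and what happens on a failed request with no handler is not specified above, so "proceed as usual" is my assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Only xhtml has lining defined so far; the set is kept open for later formats.
CAN_HAVE_LINING = {"application/xhtml+xml"}


@dataclass
class Response:
    ok: bool
    content_type: str = ""
    lining_defs: FrozenSet[str] = frozenset()   # known once the head element is read


def navigate(current_defs: FrozenSet[str], response: Response,
             has_fail_handler: bool) -> str:
    """Decide what the browser does when leaving a page that has lining."""
    if not response.ok:
        if has_fail_handler:
            return "call onRequestFail"
        return "proceed as usual"               # assumption: unspecified in the text
    if response.content_type not in CAN_HAVE_LINING:
        return "full reload"                    # e.g. a PNG
    if response.lining_defs != current_defs:
        return "full reload"                    # different lining: start over
    return "swap content only"                  # same lining: keep it untouched
```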

Lining widgets are not considered to be in the body; rather, they are in the lining. So, if you happen to have a table element in your lining (I'd rather not, but you could), you can affect only lining tables or only content tables on your CSS, with something like:

lining table {
  border: 3px dotted pink;
}

body table {
  border: thin solid blue;
}

Lining definition and lining data

To make that even more useful, we could parametrise the lining. Essentially, this separates the lining information in two sub-categories; definition and data.

The lining definition just sets up all possible widgets that may be used anywhere in the site. (No information on what they look like or where they're drawn; this goes in the style.)

Then the lining data says which widgets will get used, and what information they will display.

Uh, that's too abstract, example please!

A typical example is a contextual navigation menu. Let's use this blog as an example. All blogs in this site would have the same lining definition, which defines a widget for the category description (used when a filter is set — I call it the “abstract”, if you check my source), one for the historic menu (parametrised by the actual historic, which is different for each blog), and the frame around the actual content, which is not parametrised at all. This could be linked from the top of the blog pages, with <link rel="lining-def" href="/media/lining/blog.wall" />.

Then the data for the abstract box would just not be supplied by the pages, which makes the widget not be shown; if you ever set a filter, the appropriate javascript would then provide this data, which would cause the widget to show up. If the filter is reset, the script would clear the widget's data, and it would disappear again.
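The show/hide rule can be written down in a few lines. This is my reading of the example rather than a spec: a parametrised widget is shown only while it has data, and an unparametrised one (like the frame) is always shown. `BLOG_LINING` and `visible_widgets` are hypothetical names.

```python
# Lining definition for the blog: widget name -> is it parametrised?
BLOG_LINING = {"abstract": True, "menu": True, "frame": False}


def visible_widgets(definition, data):
    """Return the widgets to draw, in definition order."""
    shown = []
    for name, parametrised in definition.items():
        if not parametrised or data.get(name) is not None:
            shown.append(name)
    return shown


# No filter set: the page supplies menu data only, so the abstract box is hidden.
without_filter = visible_widgets(BLOG_LINING, {"menu": ["post 1", "post 2"]})
# A script sets a filter and provides the abstract data, so the box appears.
with_filter = visible_widgets(
    BLOG_LINING, {"menu": ["post 1"], "abstract": "Posts about Dream"})
```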

(Actually, I'm half-lying here; my abstract box never disappears in the current implementation — if you have no filter set, it will say “All posts by lalo:”. But let's pretend it disappears, to make the example more interesting.)

The data for the menu itself would be the same for all pages in a given blog; so, the blog template could link to it on the header - something like <link rel="lining-data" href="/blog/$USER/menu-data.wall" />. One last piece of data could be provided inline: which entry on the menu should be highlighted as active. Or you may just have your javascript figure that out.

Backwards compatibility

Tricky, but not impossible.

If a page has lining, events like onLoad, onUnload etc — plus new events, something like onContentLoad / onContentUnload — would be sent to the lining element, never to the content body.

Then, an implementation of the lining system can be written completely in javascript. If the language for definition and data is XML, then it wouldn't be too hard.

Finally, you can do this on your pages (after loading the javascript lining library, of course): <body onload="setup_lining()">. And there you go; not complete functionality, but the basics.

Of course, you'd still get full page reloads; but at least, the site would be completely usable by any browsers that support JS and CSS. And for those that don't, you're probably better off not showing the lining at all, anyway.

(If it was possible to alter document.location without causing a reload, then it would even be possible to get complete functionality. Maybe it is possible, and I just haven't found how yet.)

Such a javascript implementation would not only help with compatibility, but also be the best way to prototype these ideas.

Caveats

  • Since the links to the different sections of your site would (presumably) be in the lining, they wouldn't be discovered by search engines (although if this gets popular, I wouldn't be too surprised if at least Google added support for it in their crawler). So you'd better put up some sitemap-style file on your site.

Alternatives

  • Rather than lining, layering. You can think of this one as a weird mutant child of frames. The idea is to allow multiple pages to be displayed at the same time, on top of one another. Communication between them would have to be in javascript, but clicking on a link in one could load a page in another, using the existing target= attribute. The location in the url bar would be that of the page containing the element where the focus is.

    This would do the trick, it's possibly easier to do, and doesn't require yet another new language. But I think it's conceptually just too ugly to be seriously considered. Besides, a new language is not necessarily bad; it's about time (x)html stopped being used for layout.

Conclusion

I'll try to build a site like that (which includes writing the loader library in javascript) and I'll let you know how it goes.

The language of my dreams, part 2

blog entry posted by lalo (Lalo Martins) on 2006-02-07 20:05:01

The latest rambling made some considerations about my hypothetical dream programming language. This time I'll follow up with some ideas that should be slightly more controversial.

Extremely regular syntax

My dream syntax would be extremely regular, with little or no special cases. With “extremely regular” I mean, if you're learning the language from the beginning, in the first 30 minutes or so you will know absolutely everything there is to know about the syntax.

For example, “control structures” (like if, for, case, or exception catching) would look exactly like everything else; this is a feature of Lisp and Smalltalk that keeps pulling me back - not only for the “purity” and simplicity inherent, but because it means you can easily write your own such constructs.

Code definition vs. code metadata

IMHO, one of the reasons (non-geek) people find programming languages hard to learn, is that they have to conflate code definition and code metadata. Actual code is just a series of statements, and that is not terribly hard to learn; but a language environment also needs ways to define classes, functions, methods, etc, and initialise static data. The obvious solution is to implement that as more syntax. Alas, I don't think that's the best solution.

Of course, since my language would have full reflection, it would be possible to do these things in code. But that's not what you're supposed to do. The paradigm is that code is a sequence of statements, with arguments and some “return” value; it forms a code object (you may want to call it a closure). Although there is a syntax for creating inline, anonymous code blocks, the basic unit for programming is one such block, typically a method.

This is similar to how Smalltalk environments such as Squeak handle it. You don't write a file describing a class; rather, you define the class on the class browser, and edit the individual methods.

Compiled (to machine code), yet dynamic

I don't really see a reason why dynamic languages can't be compiled to binary. All dynamic languages I know about run from bytecode, most of them compiling so transparently that they look like classical interpreted languages to the untrained eye (in fact, “interpreted language” has pretty much been redefined to mean a bytecode virtual machine, these days, except perhaps for shells). This creates a false impression that a dynamic language needs to be a virtual machine. I don't think it does; I think this is more likely because compiling technology in wide use is trailing way behind, due mostly to only needing to support a few languages (which are mostly hopelessly outdated themselves), and to being bound by the binary formats of the operating systems, which are sad relics from the 80s.

If I actually started writing this language today, I'd base it on LLVM, and compile to binary code. Since we're talking about a fully object-oriented environment, actually saving that to the filesystem would probably require a custom binary file format; but the code in it would be actual platform machine code, not bytecode.

However, the compiler would be entwined with the environment; you could compile a string into a code object from inside your code, then run it in a sandbox.

Also, the source code (or at least an abstract syntax tree) for each code object would be stored in the internal representation; communicating code objects between machines would send this rather than the machine code, so the other computer can do the appropriate processing on the other side.
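Python's built-in compile and eval give a feel for what "compile a string into a code object from inside your code" means, so here is a small sketch using them. `compile_code_object` is a made-up helper, and emptying `__builtins__` is NOT a real security sandbox in Python; it only stands in for the idea of running the code with a restricted set of names.

```python
def compile_code_object(source, arg_names):
    """Compile an expression string into a callable that remembers its source."""
    code = compile(source, "<dream-string>", "eval")

    def run(*args):
        namespace = dict(zip(arg_names, args))
        # Restricted globals: no built-in names are visible to the code.
        # This is an illustration, not a safe way to run untrusted code.
        return eval(code, {"__builtins__": {}}, namespace)

    # Keep the source next to the compiled form, so it is the source
    # (not the machine-specific result) that can be shipped elsewhere.
    run.source = source
    return run


add = compile_code_object("a + b", ["a", "b"])
total = add(1, 2)
```

Sending `add.source` to another machine and compiling it there is the Python-sized version of communicating source rather than machine code.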

Syntactic noise

The argument name syntax of Smalltalk and Objective C is really nice for readability; it also often gets in the way, or becomes ugly, not to mention it hurts interoperability with other languages, and it's one more thing to remember.

One way to handle this is to make commas (maybe semicolons even) between arguments into optional syntactic noise, and also any word ending with a colon. So, assuming the syntax for code blocks is curly braces like C, you could write:

if condition
  { set_value widget "foo" }
  { set_value widget "bar" }.

This sends the message if to the object at condition. (If condition is an expression, how do you figure out which interface you're dealing with? That's still uncertain, but will probably have something to do with what is returned by the methods used in the expression; return values will probably have interfaces, just like variables - which makes sense, as you usually return a variable.)

You could, however, find it prettier like this:

if condition
  { set_value widget "foo" },
  { set_value widget "bar" }.

or:

if condition
  then: { set_value widget "foo" }
  else: { set_value widget "bar" }.

Or you might prefer the object first, would you? You can:

condition
  if: { set_value widget "foo" }
  else: { set_value widget "bar" }.

In this case, since the second token has a colon, it's interpreted as the message name, rather than the message target. It would also be true if the second token is not alphanumeric - which is how you'd implement operators. So you can define an operator ? as a synonym for the if message in the boolean interface:

condition ? { set_value widget "foo" }, { set_value widget "bar" }.
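The rules so far can be sketched as a tiny normalizer, written in Python since the language itself doesn't exist. This is a toy under loud assumptions: it handles neither nested blocks nor parentheses, and the parse function and OPERATORS table are invented for the example. It only shows that all the spellings above reduce to the same message, target and arguments.

```python
import re

# A block in braces, a quoted string, a comma, or any other run of characters.
TOKEN = re.compile(r'\{[^{}]*\}|"[^"]*"|,|[^\s,{}"]+')
OPERATORS = {"?": "if"}  # operator -> message it is a synonym for (assumed)


def parse(statement):
    """Return (message, target, arguments) for one statement."""
    text = statement.strip()
    if text.endswith("."):
        text = text[:-1]  # statement terminator
    # Commas are noise; drop them right away.
    tokens = [t for t in TOKEN.findall(text) if t != ","]
    first, second, rest = tokens[0], tokens[1], tokens[2:]
    if second.endswith(":"):
        message, target = second[:-1], first        # condition if: ...
    elif not re.fullmatch(r'\w+|\{.*\}|".*"', second, re.S):
        message, target = OPERATORS[second], first  # condition ? ...
    else:
        message, target = first, second             # if condition ...
    # Any remaining word ending with a colon is noise too.
    arguments = [t for t in rest if not t.endswith(":")]
    return message, target, arguments


SPELLINGS = [
    'if condition { set_value widget "foo" } { set_value widget "bar" }.',
    'if condition { set_value widget "foo" }, { set_value widget "bar" }.',
    'if condition then: { set_value widget "foo" } else: { set_value widget "bar" }.',
    'condition if: { set_value widget "foo" } else: { set_value widget "bar" }.',
    'condition ? { set_value widget "foo" }, { set_value widget "bar" }.',
]
parsed = [parse(s) for s in SPELLINGS]
```

Every entry of parsed comes out identical, which is the whole point of treating commas and colon-words as noise.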

Comparing this to Python or C, you might get scared at the absence of parentheses. I find that necessary; the conflation of parentheses for both function calling and expression grouping is something I dislike personally, and which would quickly get confusing in a syntax this flexible.

Heck, define the method to return the return value of the block it executes, then you can just do:

set_value widget (if condition { "foo" } else: { "bar" }).

Then code it not to require code blocks - if an argument is a code block, it's executed, otherwise it's just used:

set_value widget (condition ? "foo" else: "bar").

Or write it pretty:

set_value in: widget to: (if condition then: "foo" else: "bar").

Looks too lisp-y to you? No problem, write it like C:

set_value widget if(condition, "foo", "bar").

An opening parenthesis right after an alphanumeric token, without a space, makes that token a message selector which applies to the first token inside the parentheses.

The language of my dreams, part 1

blog entry posted by lalo (Lalo Martins) on 2006-02-07 14:22:34

I have always flirted with the idea of creating a programming language. I don't think I actually will, but here are a few thoughts on what it would hypothetically look like.

First-class objects

This one is a no-brainer. I would hardly use a language where the “everything is an object” dogma is not one of the essential truths of life.

However, as much as I say it's a no-brainer, I look around and see very few languages where it's completely true! Python does it thoroughly, or at least it does since 2.2 and PEPs 252 and 253. Ruby seems to do it decently. Perl is beneath my contempt - it seems to be a language based on special cases. C++ is almost there, but functions/methods and classes are different from everything else. Same goes for Smalltalk and Java.

I'm neutral wrt whether objects are “classic” objects with classes and inheritance, or delegation-based objects (as per Self). I kind of lean towards the latter, though, if only because it's easier to emulate the former on it than the other way around.

Full introspection

Python's readable syntax is not its number one strength; it comes second to the fact that you can introspect any aspect of the virtual machine. Add new methods at run-time, manually check if an object has such and such attributes, change the class hierarchy. If I could go even further and not special-case built-in objects (which in Python means those defined in C), even better.

Sandboxed with capabilities

If I have to point to one thing in Python I don't like, it's the lack of a really safe restricted mode.

The way I'd implement a sandbox would be by wrapping the whole environment in one “language context” object, similar to Python's interpreter context (accessible from C). These contexts would then use capabilities - each message you define can declare to require some capabilities.

The root context has all capabilities, and is special in the sense that if a new capability is created anywhere, it's automatically granted to the root context. Most applications wouldn't need to know about contexts, simply running on the root context.

However, there is a simple way to create a new context, granting it a finite list of capabilities. If a context has a reference to another, it may grant capabilities it has.

This is not a full, complete capabilities system, but it's a very simple design that would probably work well enough.

Note: having a capability wouldn't be equivalent to having a reference to the capability object. This is important, because the language is fully introspective; so your code may need a reference to the capability, in order to check if it has the capability or not. Besides, capability references are probably not very hard to leak. So, the capability object would have a method with a name like “active” or maybe “granted”, which returns true if the current context has that capability.
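To make the design a bit more concrete, here is a minimal Python sketch of contexts and capabilities as described so far. Everything in it is an assumption made for illustration: the Context and Capability classes, the ROOT object, and the _stack list used to track which context is currently running.

```python
class Context:
    """A language context holding the set of capabilities it was granted."""

    def __init__(self):
        self.capabilities = set()

    def grant(self, other, capability):
        # A context may only pass on capabilities it holds itself.
        if capability not in self.capabilities:
            raise PermissionError(f"cannot grant {capability.name}")
        other.capabilities.add(capability)

    def run(self, func, *args):
        # Make this context the current one while func executes.
        _stack.append(self)
        try:
            return func(*args)
        finally:
            _stack.pop()


ROOT = Context()
_stack = [ROOT]  # the innermost running context is last


class Capability:
    all_capabilities = []  # the class itself references every capability

    def __init__(self, name):
        self.name = name
        Capability.all_capabilities.append(self)
        ROOT.capabilities.add(self)  # root automatically gets new capabilities

    def granted(self):
        # Holding a reference is not enough; the current context decides.
        return self in _stack[-1].capabilities


read_fs = Capability("read the filesystem")
use_network = Capability("use the network")

sandbox = Context()
ROOT.grant(sandbox, read_fs)
```

Code running inside sandbox.run(...) can hold a reference to use_network and ask use_network.granted(), but the answer is false, because the check looks at the current context rather than at who holds the reference.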

A global object somewhere has references to all capabilities in the system, across all contexts. Possibly, the “capability” class itself. (It might be useful for introspection for classes to be able to back-reference to all their instances. There is a little-known trick to do that in Python, using the gc system.)

The “standard library” would ship with a set of predefined capabilities, including actions like “reading the filesystem”, “writing to the filesystem”, “using the network”, or the dangerous “executing external programs” and “making system calls” (which may effectively circumvent the capability system, and therefore should probably never be granted to a restricted context).

Allowing restricted operations

Now here's the trick - if I want to create a context that is able to read and write to one given directory, and no other, how do I go about doing that? I'll give one example, there may be other ways. For a more concrete example, let's talk about something I'm actually working on: a “view” object, which when called, updates an HTML file (and potentially one or more images) with data about some related model objects in memory. My system doesn't need to read the file, but for the sake of example, let's assume the example system does - it looks in the output for markers, replacing the contents of the marked parts with the model data.

Personally, I don't like to have object-oriented code using the file concept, so I'd probably wrap it with some persistence abstraction. In this example, I'd have a view_output class, which wraps the actual HTML or image file.

I'd also create a new capability, representing the ability to read from and write to the output directory.

Then, I'd declare the view_output class to require this new capability. You can declare requirements for either methods or classes; declaring for a class means that all methods declared on that class require those capabilities (in practice, it may be just a shortcut to adding the declarations for all methods); the capabilities would also be required to create new instances. Also, for the method that actually reads the file, I could declare that instead of this capability, you can call it if you have the generic “read from the filesystem” capability, and conversely for the method that actually writes the file.

Now, the method that does the reading will need to use the “read from the filesystem” capability, and I don't intend to grant this to the view context. So, I have to grant the capability to the method; you can think of this operation as the capabilities equivalent of the setuid bit. Of course you can only grant a capability to a method if you have that capability. (Usually, methods will be defined in the root context, so that doesn't matter much, but there may be exceptions - this view system itself will of course need to define methods in restricted contexts, for example.)

Capabilities granted to code are granted to code, not to the slot or some higher level abstraction; the code is an opaque entity. So, in my example, I granted the “write to the filesystem” capability to one method of my view_output object; since the system is introspective, the view context could replace the code for this method with something malicious. However, since the capability is granted to the code object, the replacement code wouldn't have the capability, so it would be harmless.
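The "granted to the code, not to the slot" point is easy to get wrong, so here is a small Python sketch of it. The names are all hypothetical (has, context, grant_to_code, call, and capabilities represented as plain strings), and the sketch skips the output-directory capability check to stay short.

```python
import contextlib

ROOT_CAPS = frozenset({"read_fs", "write_fs", "use_output_dir"})
_active = [ROOT_CAPS]   # capability sets of the running contexts, innermost last
_code_grants = {}       # function object -> capabilities granted to that code


def has(capability):
    return capability in _active[-1]


@contextlib.contextmanager
def context(capabilities):
    _active.append(frozenset(capabilities))
    try:
        yield
    finally:
        _active.pop()


def grant_to_code(func, capability):
    # You can only grant a capability you hold yourself.
    if not has(capability):
        raise PermissionError(capability)
    _code_grants.setdefault(func, set()).add(capability)


def call(func, *args):
    # The code runs with the caller's capabilities plus its own grants.
    extra = _code_grants.get(func, set())
    with context(_active[-1] | extra):
        return func(*args)


def write_file(path):
    if not has("write_fs"):
        raise PermissionError("write_fs")
    return f"wrote {path}"


class ViewOutput:
    def save(self):
        return write_file("output.html")


def malicious_save(self):
    return write_file("/etc/passwd")


grant_to_code(ViewOutput.save, "write_fs")  # done in the root context
```

Inside a restricted context such as context({"use_output_dir"}), calling the original save through call(...) succeeds because the grant is keyed on that function object. Assigning malicious_save to ViewOutput.save swaps the slot but not the grant, so the replacement fails with PermissionError.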

Now, I can create the view context, and only grant it the capability to use the output directory (or possibly other unrelated capabilities needed to access the model data). There it is.

Maybe: interfaces as a primitive concept

An alternative that I have considered, and which incidentally provides a more “pure” capability system, is to make pervasive use of interfaces. This would provide some measure of “type-safety” for those wary of the untypedness of dynamic languages, possibly enough to allow multi-methods (which C++ has convinced me are a powerful concept, even if sometimes it's “powerful” as in “a powerful gun to shoot out your feet with”).

The idea is that a variable points to an object and to an interface; the assignment operator takes an object and an interface, and method argument definitions also have interfaces. Trying to store an object in a variable with an interface not implemented by that object would raise an exception, similar to Python's TypeError.

This breaks Python's neat concept that methods are just callable attributes; but I'm not married to this concept anyway. It's still possible to store the code for a method in an attribute of the class, then bind it to a message.

In this system, methods are bound to a message name and an interface; so sending the same message to two variables that point to the same object may invoke different methods, if the variables have different types.
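A short Python sketch of that last point follows. The Interface, Obj and Variable classes are invented for the example, and "implements" is assumed to mean "has at least one binding to the interface".

```python
class Interface:
    def __init__(self, name):
        self.name = name


class Obj:
    def __init__(self):
        self.methods = {}  # (interface, message name) -> code

    def bind(self, interface, message, code):
        self.methods[(interface, message)] = code


class Variable:
    """A variable points to an object and to an interface."""

    def __init__(self, interface, obj):
        if not any(bound is interface for bound, _ in obj.methods):
            raise TypeError(f"object does not implement {interface.name}")
        self.interface = interface
        self.obj = obj

    def send(self, message, *args):
        code = self.obj.methods[(self.interface, message)]
        return code(self.obj, *args)


printable = Interface("printable")
audible = Interface("audible")

duck = Obj()
duck.bind(printable, "describe", lambda self: "a duck")
duck.bind(audible, "describe", lambda self: "quack")

# Two variables, same object, different interfaces.
as_printable = Variable(printable, duck)
as_audible = Variable(audible, duck)
```

Sending describe through as_printable and through as_audible reaches two different pieces of code, even though both variables point to the same duck.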

The capabilities system would be implemented by a restricted_context or sandbox class; such an object, when created, wouldn't have capabilities for any interfaces, so it wouldn't be able to send messages to any variables. Then a context that has a capability can grant it to a restricted context. Pseudo-code (Python syntax):

view_context = sandbox()
view_context.grant(view_output_interface)
view_context.grant(my_model_interface)
def view_code_wrapper():
    view_object.update(the_changed_object)
view_context.run(view_code_wrapper)

The hypothetical grant method above could conceivably accept a container (a set) of interfaces in addition to a single one.

An interface may be declared as a refinement of another interface; for example, interface cat extends animal. This is much like traditional inheritance. So if a concrete object binds a method to the feed message of the animal interface, a variable that's bound to the cat interface of that object, and which sends the feed message, would still be invoking this method.

If you grant a capability for animal to a context, it wouldn't allow you to use messages defined in cat but not in animal; but if you grant cat, then you're also granting animal.

So you may have “abstract grouping” interfaces; instead of allowing the grant declaration to accept a set, you could define an interface my_model_interface as above, which extends all the interfaces you need to grant in order to allow the view to access model data, even if it's impossible to have an object that actually implements all those interfaces at the same time. Then you grant this abstract interface to the context, which grants all its base interfaces.
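Here is how refinement and granting could fit together, again as a hedged Python sketch. The Interface and Sandbox classes, the ancestors and may_use methods, and the account and invoice interfaces are all made up for the example.

```python
class Interface:
    def __init__(self, name, *bases):
        self.name = name
        self.bases = bases  # the interfaces this one extends

    def ancestors(self):
        """This interface plus everything it extends, transitively."""
        found = {self}
        for base in self.bases:
            found |= base.ancestors()
        return found


class Sandbox:
    def __init__(self):
        self.allowed = set()

    def grant(self, interface):
        # Granting an interface also grants all of its base interfaces.
        self.allowed |= interface.ancestors()

    def may_use(self, interface):
        return interface in self.allowed


animal = Interface("animal")
cat = Interface("cat", animal)

# An "abstract grouping" interface: nothing needs to implement all of it.
account = Interface("account")
invoice = Interface("invoice")
my_model_interface = Interface("my_model_interface", account, invoice)

view_context = Sandbox()
view_context.grant(my_model_interface)
```

After the single grant call, view_context may use account and invoice; a sandbox granted only animal would still be refused cat.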

Putting delegation-based objects together with interfaces

So if I go with the two “maybes” above, how do they work together?

Each object has the “parent slot”, as in Self. It points to a sequence of other objects, defaulting to the “root object”. Trying to read an attribute of an object first looks in that object, then goes through the parents. (There is, however, an introspection call that checks if the attribute is present on the object itself.)

Each object also has a “method table”, which binds pairs of (interface, message name) to closures (or code objects in Python parlance). Looking for a binding checks first on the object itself, then proceeds up on the parents, like checking for attributes.

You can assign an object X to a variable of interface T if X has any bindings to that interface. This is a “weak” interface system; it doesn't require the object to implement all the messages defined in the interface. This is a necessity of the design; since everything is introspectable, you could add new messages to an interface, thereby rendering existing objects incomplete. There could, however, be a method of the interface object that checks if an object fully implements the interface. (Conceivably, there could also be a per-context setting that turns all assignments into strong checks.)
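The lookup rules in this section can be sketched in Python as follows. This is only one possible reading: the Obj class, its method names, the INTERFACES table and the two check functions are assumptions, and interfaces are represented as plain strings to keep it short.

```python
class Obj:
    root = None  # set below; the default parent of every other object

    def __init__(self, *parents):
        if not parents and Obj.root is not None:
            parents = (Obj.root,)
        self.parents = parents   # the "parent slot": a sequence of objects
        self.attrs = {}
        self.methods = {}        # method table: (interface, message) -> code

    def has_own(self, name):
        # Introspection: is the attribute on this object itself?
        return name in self.attrs

    def get(self, name):
        # Look in the object first, then go through the parents.
        if name in self.attrs:
            return self.attrs[name]
        for parent in self.parents:
            try:
                return parent.get(name)
            except AttributeError:
                continue
        raise AttributeError(name)

    def lookup(self, interface, message):
        # Same search order as attributes, but in the method tables.
        code = self.methods.get((interface, message))
        if code is not None:
            return code
        for parent in self.parents:
            code = parent.lookup(interface, message)
            if code is not None:
                return code
        return None

    def interfaces(self):
        found = {interface for interface, _ in self.methods}
        for parent in self.parents:
            found |= parent.interfaces()
        return found


Obj.root = Obj()

# Hypothetical interface definition: name -> messages it declares.
INTERFACES = {"animal": {"feed", "pet"}}


def assignable(obj, interface):
    # Weak check: any binding to the interface is enough.
    return interface in obj.interfaces()


def fully_implements(obj, interface):
    # Strong check: every declared message must have a binding.
    return all(obj.lookup(interface, m) for m in INTERFACES[interface])


cat = Obj()
cat.attrs["legs"] = 4
cat.methods[("animal", "feed")] = lambda self: "purr"
tom = Obj(cat)
```

With these definitions tom passes the weak check for animal through its parent, but fails the strong one, since nothing binds the pet message.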

Let's call it a delegation system, rather than prototyping

I don't much like the fact that the instantiation operation in prototyping languages is a clone of a sample empty object. It makes sense in some languages, but the sample object might turn out to be a waste of memory in others.

With attributes handled in Python style, default values for them could be stored on the parent object instead.

You create a new instance by calling the new operation (where would this be? On the root object? Or the context object?), giving a sequence of parents. It would create an empty object, pointing the parent slots to the given parents.

An object may bind code to the initialize message of the object interface; this code would be called when the object is assigned to the parent slot of another object, with the “new” object as first argument, then the parent. There is no concept of constructor arguments; the model assumes new objects are all blank slates, and to do what in other languages we do with constructor arguments, you have to send messages to the object. This is IMHO cleaner, because it binds the initialization to a specific interface.
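A possible shape for new and initialize, sketched in Python. Where new actually lives is left open above, so here it is just a free function; the Obj class, the send helper (which only looks one level up, to keep the sketch short) and the point example are all assumptions.

```python
class Obj:
    def __init__(self):
        self.parents = ()
        self.attrs = {}
        self.methods = {}  # (interface, message) -> code


def send(obj, interface, message, *args):
    # Look on the object itself, then on its direct parents, in order.
    for candidate in (obj, *obj.parents):
        code = candidate.methods.get((interface, message))
        if code is not None:
            return code(obj, *args)
    raise LookupError(message)


def new(*parents):
    # Create a blank object and point its parent slot at the given parents.
    obj = Obj()
    obj.parents = parents
    for parent in parents:
        init = parent.methods.get(("object", "initialize"))
        if init is not None:
            init(obj, parent)  # the new object first, then the parent
    return obj


def point_initialize(new_obj, parent):
    new_obj.attrs["history"] = []  # per-instance state, set up on creation


def point_move_to(self, x, y):
    self.attrs["x"], self.attrs["y"] = x, y
    self.attrs["history"].append((x, y))


point = Obj()
point.methods[("object", "initialize")] = point_initialize
point.methods[("point", "move_to")] = point_move_to

p = new(point)                     # no constructor arguments
send(p, "point", "move_to", 3, 4)  # configure by sending a message
```

The coordinates that another language would pass to a constructor arrive here through the move_to message of the point interface.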

What's wrong with applications

blog entry posted by lalo (Lalo Martins) on 2005-10-25 06:04:00

One thing that bothers me in today's computers is that pretty much everyone is sold on object-oriented programming, but we're still doing it half-heartedly at best.

First peeve, of course, is that most projects are still using C, a terrible language which isn't really good at anything. They happily rant on about its "low-levelness" and speed, and use that as an excuse not to design a proper language that has these same properties. But this doesn't really matter for the topic at hand.

What really bugs me is not our languages, but our systems. 30 years later, we're stuck in systems based on the assumptions, metaphors and paradigms of the multics/unix family.

We're still organizing our data in artificially-hierarchical trees, with opaque byte-streams as leaves; consequently, we're still wasting hundreds of hours per project, figuring out ways to efficiently represent our data as byte-streams.

We're still running huge, monolithic "processes", with big startup costs, and which we have to keep running (wasting system resources) as long as we want to use certain features. (To be fair, this is not a multics/unix thing; unix processes were supposed to do atomic operations and be lightweight.)

I'm tired of the "application". I'd rather have classes or objects. I'd rather not have "processes" hanging on, "waiting"; instead, the "normal" state of the system is to have no "processes", or a very small number of them, and lightweight, short-lived "processes" are fired up to respond to events (methods?).

Another thing that bugs me is that some people write "oop" "apps" in C, others in C++, others in java, others in C#, others in python, and these "apps" are "oop" only internally; they interact with each other and with the user in a completely non-oop way. One can't call methods on (or send messages to) another. Things like corba, .net, and vos, come to fill that void, but still in a somewhat limited way.

This is all horribly inefficient, because we have to run loops around ourselves; we have to spend a good deal of time writing these "adapters" so that our reasonably modern applications can communicate using the paleozoic interfaces we insist on sticking to. It's like living in a world where everybody has a skycar, but we are forbidden by law to take them off the ground. I feel stuck in traffic.
