So, I’ve set this blog up as a way of recording the development process for my COMP390 directed study project, as one day in the future I will have to write a report detailing all of this. With that in mind, I will attempt to lay things out from a high-level perspective:
What:
I’m writing a distributed web application framework. That is, the environment and tools and technologies to allow people to easily write massively scalable dynamic web applications.
Why:
I think by now that it should be common knowledge that PHP sucks. Big time. If you don’t know this, then you probably have never been forced to use PHP for anything non-trivial. Despite this fact, people still write big applications in PHP. This is fine when it’s just you and your friends using the site, but once it becomes popular, you will have a problem. Now, considering that you’ve used PHP, you’ve probably also used MySQL. If you were really sensible, you might have written a simple database abstraction layer. You could perhaps add additional (or bigger) servers at the front-end (running your PHP application), but all that does is put additional strain on your database backend, and probably has very little in the way of tangible benefits.
Web applications very rarely deal with more than a very small subset of the possible operations within a relational database. The success of frameworks such as Ruby on Rails suggests that the majority of users really only need the basic CRUD operations (create, read, update, delete). The seperation between the web application and the database is usually a very artificial one within the context of web application development. It seems to be a tradition that persists because that’s the done thing. The datasets which applications deal with are rarely tied to a particular method of representation.
How:
So, in writing my framework, I have three major goals:
- Provide a suitable development platform for web applications that are bigger than PHP/MySQL and smaller than Oracle.
- Take good advantage of the increasingly parallel computing power available in commodity hardware (think: dual cores/cell processors) while providing the ability to also scale across hosts.
- Remove the artificial division between the application and the database. By combining the two, you can take advantage of complete control over data and logic to enact highly aggressive (read: reaching a theoretical maxima in terms of efficiency) caching schemes on very dynamic data.
A web application, in my experience, generally boils down to:
- A reasonably simple dataset (which may be large or small).
- A whole bunch of simple logic to take data and perform some sort of business logic, or to prepare it for display.
- A huge amount of display code, sometimes featuring a small amount of display logic.
Traditionally, the divisions between the first component and the second two is quite clear. The data is in the database, it gets queried out by the business logic and transformed for display. The second two pieces are more often than not munged together in some horrible way, or seperated via some basic templating mechanism which is usually very code-heavy and provides very little gain for the programmer (think: smarty).
Taken on its own, “a whole bunch of simple logic to take data and perform some sort of business logic” sounds much like a pure functional operation (think Haskell or ML functions, rather than C or PHP). A function f : x -> y without any side effects. In type theory, a displayUser function could take a User as a parameter and return a string which was the users’ details formatted as a nice HTML output, for example. Ignoring exactly where that User object comes from for now, this function is pure. It does nothing to modify the environment in which it runs. The program state S remains unaltered before, during and after execution. This means that given the same input, our displayUser function will always return the same output. This is a very useful piece of information from a caching perspective, and it forms the basis of both the aggressive caching and the ultra-scalability that takes place within this system.
If the User object were coming from a database, all the advantage we gained from knowing that the function is pure and effect-free is lost. A normal PHP application will have a bunch of inline database queries and then a bunch of code to perform some sort of computation or to prepare the result for display. Because the possibility of the database being modified exists (either by another part of your application or by an external user), the sorts of caching schemes which can be implemented are feeble at best, usually resorting to query-cache style caching.
Within my system, data objects are tied to functions explicitly. The domain of a function f will consist of a number of objects. An object may be part of any number of function domains, but it cannot exist independent of them. The domains of which a given object is a member of is determined programatically, and I envisage this information being able to be inferred statically at compile time, or at worst it will be defined by the user upon entering the data. By having explicit links between logic and data (and yet ensuring they still remain seperate), our caching scheme no longer has the uncertainty of external modification. If an object is modified, it is a simple matter of notify all upstream caches within the dependency tree, simply propagating upwards and invalidating the caches. Once again, the depedency tree can be computed statically at compile time, so the computational expense associated with this is minimal (and is far outweighed by the computational savings from the effectiveness of the caching scheme). Updates to objects can be done in a ‘mostly-functional’ manner. If a function attempts to take as input a given object from its domain and modify it, then its range must wholly intersect its domain, meaning that the result of the computation modifies the program state S, but the operation is still type-checked, and the appropriate cache-invalidation may take place. Update operations can be sequenced, which becomes important when we reach the topic of concurrency.
That’s all good and well, but what happens when you want to access an external datasource, for example, an external web service, or a file on the disk? The strategy I intend to employ is to have explicit ‘impure’ declarations for functions which can never be cached reliably. Once again, the pure/impure status of a function can be statically determined at compile time, and could potentially provide a basic mechansim for sandboxing and user security. By knowing where they external data sources lie, we do not waste computational effort on attempting to cache the uncacheable, which is yet another flaw of basic “dumb” caching schemes employed elsewhere.
So, assuming that a programmer has defined a number of objects (and corresponding custom types for these objects), a number of processes which can act on input from the caller and upon objects within their domain (and some which may modify the program state), then all that is required is a basic templating system which has the ability to output content verbatim, or to output the result of a given function (for example, displayPage(Request.id)), where Request.id is some user-provided input to the page. These templates could potentially be JIT compiled.
The benefit of enforcing this seperation of logic and display is that a) the usual issues associated with doing I/O in functional programming languages no longer apply, as the order of evaluation is not important, and b) all purely functional processes can be distributed and duplicated across as many distinct hosts or processors as you like. Exactly which processes can be distributed can be determined to within some infinitesimally small quantity away from optimum, because of the existence of a multi-headed dependency tree. The usual concurrency issues of deadlocks and thread starvation and the like are eliminated because program state is not being modified. And best of all, all of this is achievable without any extra work for the programmer!
Where:
Here, of course! And at the University of Waikato Computer Science department.
Anyway. I think I’ve written enough today. I do hope that it’s at least partially comprehensible when I come to try to read it back later. Perhaps tomorrow I will post some pretty diagrams explaining the process distribution method and some samples of what I think the programming and template languages should look like!