Regular Expressions, Lisp, SQL, Parsing, Domain Specific Languages

code, lisp, philosophy, programming, software engineering, unix 3 Comments »

I’ve been trying to code some more on Project Shelob (my web server) in my spare time. I’m to the point of needing a configuration file, so I can start up the server using different ports and directories for testing. Speaking of testing, I’m also to the point of needing automated test suites. I was refactoring some of the HTTP code, and when I got done, it was far more readable, and there was much rejoicing! Unfortunately, two days later I discovered I had introduced a subtle bug in keep-alive handling during a 404 event. Oops.

Anyway, I decided to use JSON as my configuration language. Simple, accommodated everything I needed, and later I would be able to easily write an AJAX GUI front end to configure the whole thing. Should be slick, right? Not as easy as it might sound. Though I have written parsers by hand, I’d rather not. Ok, so I’m using C++, surely someone has written an easy to use open source library that I can just stick in my rules and get out a nice data structure, right?

Well, kind of. There is Boost Spirit, which would do everything I want, but it would also require translating the EBNF grammar of JSON into Boost’s strange amalgamation of YACC and C++. Okay, well and good, but surely there is something better? After some more searching, I ran across ANTLR, which seems to be the spiritual successor to LEX and YACC/Bison. It even has a nice Java GUI, and someone had kindly done the ANTLR rules for JSON.

Still, the C++ backend wasn’t fully supported, required installing libraries, and was complicated. Not 100% what I needed or wanted. All of which got me thinking about domain specific languages. Most programmers don’t consider it, but SQL and Regular Expressions are good examples of Domain Specific Languages (DSLs), as are lex and yacc/bison. Up till now, I’ve frowned on the whole idea of DSLs in general. It had always seemed like bad software engineering practice to invent a new language for each problem. After all, did we really want to learn an entirely new programming language with each assignment? Who is going to maintain the code?

However, the fact is that you have to learn an entire API anyway, and the API is really just a layer over what you’re trying to do, written in a language that wasn’t expressive enough to do the job natively to begin with. Which of course leads me to Lisp, by way of Martin Fowler, who makes some good points here:

“One of the most obviously DSLy parts of the world is the Unix tradition of writing little languages. These are external DSL systems, that typically use Unix’s built in tools to help with translation. While at university I played a little with lex and yacc - similar tools are a regular part of the Unix tool-chain. These tools make it easy to write parsers and generate code (often in C) for little languages. Awk is a good example of this kind of mini-language.”

While I’ve been using SQL, regular expressions, awk, lex, and yacc for years, I’d never really classified them in my mind as DSLs. I’ve been well aware of the power of small specialized utilities aggregated together to perform a bigger task and why UNIX has been so successful at this, but I hadn’t made the leap to apply this to my programming. Fowler continues:

“Lisp is probably the strongest example of expressing DSLs directly in the language itself.. Symbolic processing is embedded into the name as well as practice of lispers. Doing this is helped by the facilities of lisp - minimalist syntax, closures, and macros present a heady cocktail of DSL tooling. Paul Graham writes a lot about this style of development. Smalltalk also has a strong tradition of this style of development.”

I’ve heard “grey-beards” and academics talk about the power of Lisp for years, and though I did some trivial functional programming in college, I’ve dismissed the rants of the Lisp guys as nothing more than rants. Today though, the ideas are crystallizing in my head, and I’m excited to explore this more.

Lisp Cells Moving to Python

code, lisp, python 1 Comment »

I ran across the NYC Lisp User Group’s description of a Google Summer of Code project to port Lisp Cells to Python. I hadn’t heard of Cells before this, but this seems like a potentially cool thing. I would like to try this, especially on Python.

I’m not a Lisp coder, but it would be fascinating to go to one of their meetings. One can imagine anyone in NYC passionate enough to show up to a Lisp user group would be an interesting character. :-)

Parsing, Priv Separation and chroot

code, http, internet, security No Comments »

I fixed up the parsing issues on Shelob so that it is somewhat respectable, instead of a bunch of hacks. It was obvious once I started looking at what the client was sending me (the LiveHTTP headers Firefox extension rocks) that I needed to break up each line and then separate it into a name and a value.

After rewriting the getHeaders() function to use STL hash tables, not only is the code more flexible, but it is also cleaner. For example:

[code]
log.writeLogLine(inet_ntoa(sock->client.sin_addr), request_line, 200, size, headermap["Referer"], headermap["User-Agent"]);
[/code]

Here, with the headermap, it is obvious what values I am passing. Before the rewrite, I just had a bunch of tokens[3], tokens[5], etc.
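The split into name and value can be sketched roughly as follows. This is not Shelob's actual getHeaders(), whose signature is not shown here; parseHeaders is a hypothetical stand-in, and std::map is an assumption, since the post does not say which container backs headermap.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for getHeaders(): takes the raw header block
// (everything after the request line) and returns a name -> value map.
// std::map is assumed here; the real container is not named in the post.
std::map<std::string, std::string> parseHeaders(const std::string& raw) {
    std::map<std::string, std::string> headers;
    std::istringstream in(raw);
    std::string line;
    while (std::getline(in, line)) {
        // HTTP lines end in CRLF; getline only strips the LF.
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        if (line.empty())
            break;  // a blank line ends the header block
        // Split on the FIRST colon only, so values such as URLs survive.
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;  // skip malformed lines
        std::string name = line.substr(0, colon);
        // Drop the optional whitespace between the colon and the value.
        std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
        std::string value =
            (start == std::string::npos) ? std::string() : line.substr(start);
        headers[name] = value;
    }
    return headers;
}
```

Note that HTTP header names are case-insensitive, while this sketch stores them exactly as the client sent them; a real server would probably normalize the case before the lookup.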

I’m also toying around with the idea of privilege separation and chroot jails. This follows from the previous post’s micro-kernel type approach, similar to how Postfix works. While it is more secure, the programming challenges are pretty high. I may leave that for a later version. I still have a bit of cleanup to do before a release.
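For the chroot half of that idea, the usual sequence looks something like the sketch below. It is not Shelob code: enterJail is a hypothetical helper, and it covers only the jail and the privilege drop, not the harder job of splitting the server into separate privileged and unprivileged processes. It assumes the process starts as root and has already bound its listening socket and opened its log files.

```cpp
#include <grp.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: confine the process to jailDir, then give up root.
// Returns false as soon as any step fails; the caller should then exit
// rather than keep serving with more privilege than intended.
bool enterJail(const char* jailDir, uid_t uid, gid_t gid) {
    if (chroot(jailDir) != 0) return false;     // restrict the filesystem view
    if (chdir("/") != 0) return false;          // no working dir outside the jail
    if (setgroups(1, &gid) != 0) return false;  // drop supplementary groups
    if (setgid(gid) != 0) return false;         // group first, while still root
    if (setuid(uid) != 0) return false;         // finally drop root itself
    return true;
}
```

The order matters: once setuid() has succeeded the process can no longer call chroot() or change its groups, so the user ID has to be dropped last.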

Aside:

Theo de Raadt gave a nice presentation on exploit mitigation techniques that OpenBSD is using, which relates to some of these ideas.

More Hacking Shelob

code, http, internet 1 Comment »

I fussed around more with logging today, which led me to the parseHeader() function. Parsing is one of the weakest areas right now. For simplicity, I had implemented it by simply tokenizing on “space”, shoving the tokens into a string vector and then iterating over that vector for the tokens I needed.
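That tokenizing approach amounts to something like the following sketch. The function name tokenize is an assumption, not Shelob's, and operator>> splits on any whitespace rather than on the space character alone.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line on whitespace into a vector of tokens.
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> tokens;
    std::istringstream in(line);
    std::string tok;
    while (in >> tok)
        tokens.push_back(tok);
    return tokens;
}
```

This is fine for a request line such as "GET /index.html HTTP/1.1", which always yields three tokens. It falls apart on headers whose values contain spaces: a User-Agent value like "Mozilla/5.0 (X11; U; Linux)" is scattered across several tokens, and the caller has to know which indexes to glue back together.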

So far, I’ve not peeked at anyone else’s source code; Shelob is a clean room implementation of a basic HTTP server. However, I really need to clean up the parser. I thought about going with a full lexer using flex or something, but that is probably overkill. Plus, I’d rather not add another dependency. More thought on this is needed, and maybe some research into how other people are doing this. This is very much an area where security can go wrong, so it needs to be done right.

The other thought I had while poking around is that I could make each component into its own server, sort of a mini-microkernel approach. I could imagine a swarm of different servers, all being able to communicate. You could have the log server running on one host, separate CGI servers for each user, as well as different backends. The only thing I’m not sure about is how much overhead this would be. A lot of the interprocess communication could happen over local UNIX sockets, FIFOs, or even shared memory, but it would be awesome if it all worked fast over a regular socket. Yet more thought needed here.
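As a rough illustration of the local UNIX socket option, the sketch below forks a toy "log server" child and hands it one line over a socket pair. Everything in it is hypothetical (the function name, the reply text, the single-message protocol); it only shows the plumbing, not a design for Shelob.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Hypothetical helper: fork a tiny "log server" child, send it one short
// line over a UNIX-domain socket pair, and return the child's reply.
// Returns an empty string on failure or if the line is empty.
std::string sendToLogServer(const std::string& line) {
    if (line.empty()) return "";  // nothing to send; avoid blocking both ends

    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return "";

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }

    if (pid == 0) {
        // Child: plays the role of the log server.
        close(fds[0]);
        char buf[512];
        ssize_t n = read(fds[1], buf, sizeof buf);
        if (n > 0) {
            std::string reply = "logged " + std::to_string(n) + " bytes";
            ssize_t w = write(fds[1], reply.data(), reply.size());
            (void)w;  // a real server would handle short or failed writes
        }
        close(fds[1]);
        _exit(0);
    }

    // Parent: plays the role of the web server.
    close(fds[1]);
    std::string reply;
    ssize_t sent = write(fds[0], line.data(), line.size());
    if (sent == static_cast<ssize_t>(line.size())) {
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n > 0) reply.assign(buf, static_cast<std::size_t>(n));
    }
    close(fds[0]);
    waitpid(pid, nullptr, 0);  // reap the child
    return reply;
}
```

Swapping the socket pair for a TCP connection would change how the two ends find each other but not the read and write calls, which is part of why the "regular socket" version is tempting. The overhead question still has to be answered by measurement.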

So far I’m having a blast playing with this program. It is nice to write something for yourself and make only the trade-offs you decide. I don’t have any customer or management trying to shoehorn this thing into something I don’t want. Even if I never release it, it is a good brain exercise.

Hacking Shelob

code, http No Comments »

Today I added support for NCSA/Apache style logs. It has been nearly 2 years since I last touched this code and closer to 3 since I first wrote it. Surprisingly, I’m able to make modifications pretty easily. To me, this indicates that the design is semi-clean. The odd thing about Shelob is that it is literally my first C++ program. I’ve never so much as done a Hello World in C++ before writing a web server. Granted, I had done a fair amount of C before this and I’m using C++ more for the STL and namespaces.
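For reference, a line in the NCSA combined format (host, identity, user, timestamp, request line, status, size, referer, user agent) can be assembled along these lines. This is a sketch, not Shelob's writeLogLine(): formatLogLine and dashIfEmpty are hypothetical names, and the timestamp is passed in already formatted to keep the example small.

```cpp
#include <sstream>
#include <string>

// Log formats conventionally print "-" for a missing field.
static std::string dashIfEmpty(const std::string& s) {
    return s.empty() ? std::string("-") : s;
}

// Hypothetical helper: build one NCSA combined-format log line.
// The identity and user fields are always "-" in this sketch.
std::string formatLogLine(const std::string& host,
                          const std::string& timestamp,
                          const std::string& requestLine,
                          int status,
                          long bytes,
                          const std::string& referer,
                          const std::string& userAgent) {
    std::ostringstream out;
    out << host << " - - [" << timestamp << "] \"" << requestLine << "\" "
        << status << ' ';
    if (bytes > 0)
        out << bytes;
    else
        out << '-';  // the common log format writes "-" when no body was sent
    out << " \"" << dashIfEmpty(referer) << "\" \""
        << dashIfEmpty(userAgent) << '"';
    return out.str();
}
```

A real implementation would also need to escape any double quotes that appear inside the request line or the headers, since both come straight from the client.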

It isn’t completely OOP, but C++ isn’t either. One of the big things that I was trying to do with Shelob was to use C++ strings exclusively, but I found out quickly that it is almost impossible not to drop down and use C style “strings” at some point, especially when dealing with sockets. Right now Shelob is very incomplete, but it does have the following features:

  • Compiles cleanly on Solaris/Sparc, OpenBSD/PPC, OSX/PPC, Linux/x86
  • Binary is less than 60K
  • Supports HTTP/1.1 Keep-Alive
  • Basic log file support
  • A filter class (currently supports adding a footer to every HTML page before serving)

Currently, it is forking, but I’m considering moving to a select model for speed. I would also like to be able to run it from Win32, but that is a much lower priority. It would be nice if Vista supported forking. I have some ideas for future features, but there are some areas that are a little rough in the current code that need refactoring. I also need to ponder what license to release under. I’m leaning towards BSD, but GPL is running a close second. I should probably look at other web servers and see what they are operating under.

Shelob Needs a New Name

code 1 Comment »

Several years ago, I implemented a partially compliant HTTP/1.1 web server in C++. It is named Shelob, after the Spider Beast in Lord of the Rings. It’s also an acronym: Server for HTTP environment and Logging Outgoing Bits (credit goes to Darren Morin for the name). I ported it to Automake/Autoconf a year ago, and I would like to update it some and release it as open source. However, I probably need to come up with a new name to avoid copyright infringement and to make it easier to find in search engines. Any ideas?
