Seattle Job Opportunities
Wes over at brokenbuild.com has posted two jobs that have opened at his company:
If you or someone you know is interested, send them his way. The jobs are located in Seattle, WA.
One goal in my day-to-day work is to quantify events in a systematic way. System administrators are in a unique position to view the network, servers, clients, software and the ways that they interact. While good software development depends on abstracting away as many things as you can, good system administration depends on understanding how the layers interact.
For example, a good developer will abstract away the type of database he is connecting to. There is a small shim that can be adjusted so that the program runs with no changes on Oracle or PostgreSQL. The Java language itself depends on abstracting away the entire computer by implementing a virtual machine that acts consistently over differing operating systems, or even different CPU architectures. A Java programmer doesn’t care whether he is running on Solaris SPARC, Linux MIPS or Windows x86, or whether the CPU is big-endian or little-endian.
However, a good system administrator does care, and should know the difference. System administration is about removing layers to solve problems that occur when the abstractions break down. Joel Spolsky refers to this as “The Law of Leaky Abstractions.”
All non-trivial abstractions, to some degree, are leaky.
Some have compared system admins to the plumbers of the IT world. Like plumbing, the effects of system administration disappear when everything is working. Only when things start to leak, and shit starts to hit the fan (literally or figuratively) does it become noticeable. There seems to be one breed of system administrator that thrives on fixing problems. Imagine the server going down, and the mayor frantically paging the heroic sysadmin with the Bat Signal.
Our hero drops into the storm with his combat boots and trusty Leatherman, typing arcane commands, drinking Mountain Dew and cursing at everyone around him. Suddenly, joyous shouts erupt as the users discover their work can continue. Everyone cheers the SysOp, while he struts back to his Bat Cave, until the next Bat Time, at the same Bat Channel.
How does one measure the performance of the lone rogue sysadmin troubleshooter against another who has carefully scheduled downtime, and whose system “just works”? Is the system with less downtime more reliable because of the work of the system administrator, or are they just lucky? How does one compensate the hero who fixes every problem, versus someone who never demonstrates this ability because the system never goes down?
What of the sysadmin who has unreliable hardware or buggy software forced on him by upper management or customer demand? A lot of companies want to measure metrics like uptime, but is it even possible to properly measure 99.99% uptime, and does that have any correlation to the person running the system?
99.9% uptime allows approximately 43 minutes of downtime in a 30-day month, but many of the tools used to measure the availability of a system have a minimum time resolution of 1 minute. For example, say you want to test that your website is up and available to your users, so you write a script that makes an HTTP request and returns the result. It sends you e-mail if it doesn’t get a response. However, the standard UNIX cron utility that schedules tasks can run a job at most once per minute.
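The arithmetic behind a downtime budget is easy to check. This is a minimal sketch that assumes a 30-day month; the function name is made up for illustration and does not come from any monitoring tool.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` days at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


# Compare a few common availability targets over a 30-day month.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.1f} minutes of downtime per 30 days")
```

At 99.99% the whole monthly budget is a little over four minutes, which is only a handful of once-per-minute checks.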
With a CPU running millions of instructions per second and servers typically having multiple processors, one minute is too long. But, if we magically invent a utility that can schedule and execute your script once per second, suddenly your server is overwhelmed by these requests and your script itself brings the system to a halt. What if you have a process that crashes and restores itself in less time than your monitoring tool checks? You wouldn’t consider a server that crashed every 30 seconds reliable, but most monitoring software can’t tell the difference.
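A toy simulation makes the sampling problem concrete. The numbers here are invented for illustration: a hypothetical service that is down for 5 seconds out of every 30, checked either every second or once per minute.

```python
def is_up(t: int, period: int = 30, down_start: int = 10, down_length: int = 5) -> bool:
    """Hypothetical service: down for `down_length` seconds in every `period` seconds."""
    phase = t % period
    return not (down_start <= phase < down_start + down_length)


def observed_availability(duration: int, step: int) -> float:
    """Fraction of checks that find the service up when polling every `step` seconds."""
    samples = range(0, duration, step)
    up = sum(1 for t in samples if is_up(t))
    return up / len(samples)


one_hour = 3600
print(f"checked every second: {observed_availability(one_hour, 1):.1%}")
print(f"checked every minute: {observed_availability(one_hour, 60):.1%}")
```

Because the once-per-minute checks always land on the same point in the 30-second crash cycle, they never see an outage, even though the service is down for a sixth of the hour.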
Recently, I upgraded our company’s e-mail server because it was crashing under an ever increasing load of spam. The new software was more efficient and no longer crashed, but this meant it was also more efficient at delivering spam. I was happy because I wasn’t getting pages to restart the mail server, but the average user actually saw more spam in their in-boxes. It is difficult to explain to the average person who just wants to read and send e-mail how complex the system is and how upgrading the software was the right thing to do.
Most people don’t understand that e-mail isn’t guaranteed instant delivery, and that a mail server will attempt redelivery if it can’t get through to another server. In our case, when the server was flooded by spammers, all the legitimate e-mail eventually got through while some spam probably didn’t (spammers typically won’t retry delivery when they can’t connect). Now, both spam and ham get through equally quickly. Of course, we are working on ways to reduce the spam, but it is an almost intractable problem when you have thousands of people around the world working day and night to devise clever ways to deliver their junk.
One thing that is important from a sysadmin point of view is to document and explain the problem both upwards to management and downwards to the clients and customers. To quantify the problem I’m using log analysis tools to graph the problem over time. Now that I have hard data, I can start to formalize the problem and test the validity of various hypotheses to solve it.
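As a rough illustration of the kind of counting that sits behind such a graph, here is a small sketch. It is not the actual tooling or log format used on this server: the `timestamp verdict=...` line layout and the sample entries are invented for the example.

```python
from collections import Counter

# Invented sample data in a hypothetical "timestamp verdict=<spam|ham>" format.
SAMPLE_LOG = """\
2000-01-01T09:01:10 verdict=spam
2000-01-01T09:14:55 verdict=ham
2000-01-01T09:40:03 verdict=spam
2000-01-01T10:02:47 verdict=spam
this line is malformed and gets skipped
2000-01-01T10:30:00 verdict=ham
"""


def count_by_hour(lines):
    """Count messages per (hour, verdict) pair, ignoring lines that don't parse."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("verdict="):
            continue
        hour = parts[0][:13]  # "YYYY-MM-DDTHH"
        verdict = parts[1].split("=", 1)[1]
        counts[(hour, verdict)] += 1
    return counts


counts = count_by_hour(SAMPLE_LOG.splitlines())
for (hour, verdict), n in sorted(counts.items()):
    # A crude text bar per bucket stands in for a real graph.
    print(f"{hour} {verdict:<4} {'#' * n}")
```

Per-hour buckets like these are the sort of thing that can be fed to a graphing tool and compared before and after a change.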
The challenge, as with uptime statistics, is to find numbers that are accurate without introducing a sort of Heisenberg effect from monitoring, and then to present those numbers so that the people who depend on the sysadmin to get their work done can evaluate whether that person is doing a good job. I’m not sure there is any magic bullet, but it is clear to me that applying some science to the art of system administration can aid in communication, diagnosis and ultimately problem resolution.
It is an area I will be expending more brain slices on in the future and on this blog.
I love snarky bug reports for some reason. It cracks me up that it took 8 years for Sun to add password prompting to Java. The users becoming increasingly irate in the bug reports is awesome. I wish the programmers had responded with a big flame war. I can only imagine what they were saying inside Sun. Good stuff.
Improved interactive console I/O (password prompting, line editing)
I’ve been fixing some things on my web server again, for fun. Last weekend, I refactored a lot of code and added dynamic MIME typing and CGI support! I was going to continue fleshing it out today, but I had to do even more refactoring to clean up some messy code paths.
I broke the Http::sendFile() method into several new ones and moved HTTP/1.1 keep-alive handling into a more central location instead of leaving it tacked on to the first place that it worked. One thing that is driving me slightly insane is that there appears to be a tiny memory leak. I can’t figure out where it is happening since I eliminated almost all dynamic allocations in the stack. As I wrote that sentence, I think I figured out where it could be coming from, but I’ll need to rewrite some more code to fix it. It is so small that you don’t start to notice it until you get at least 1,000 requests.
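For anyone curious what a centralized keep-alive decision looks like, here is a sketch of the rule the HTTP/1.1 spec describes. It is written in Python for brevity and is not the server's actual C++ code; the function name and the plain-dict header representation are assumptions made for the example.

```python
def should_keep_alive(http_version: str, headers: dict) -> bool:
    """Decide whether to keep the connection open after the response.

    HTTP/1.1 connections persist unless the client sends "Connection: close".
    HTTP/1.0 connections close unless the client sends "Connection: keep-alive".
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    # The Connection header is a comma-separated list of case-insensitive tokens.
    tokens = {
        token.strip().lower()
        for token in normalized.get("connection", "").split(",")
        if token.strip()
    }
    if "close" in tokens:
        return False
    if http_version == "HTTP/1.1":
        return True
    return "keep-alive" in tokens


print(should_keep_alive("HTTP/1.1", {}))
print(should_keep_alive("HTTP/1.0", {"Connection": "Keep-Alive"}))
```

Keeping this in one function means every code path that finishes a response asks the same question, rather than each one carrying its own copy of the logic.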
Overall, I’m happy with the design so far. This is my first C++ program and my first serious “server” program. I didn’t do any real upfront design except for drawing on past experience and my gut. I’ve added a number of features and it has been extendable. OOP purists would probably frown on it, but I’m using the subset of C++ and OOP in general that makes sense to me and is practical for what I’m doing. Is there a cleaner way? Likely *shrug*.
I’m getting to the point where a couple of patterns probably make sense. This program is growing organically, but under a tight enough constraint that it isn’t turning into a mess (at least not yet). The other thing that starts making sense is Unit Testing. I’ve been doing more refactoring than I have been adding new features and it would be awesome to be able to run a test suite and know that I haven’t broken anything. I’m not even really sure where to begin on that, but it is obvious that Shelob is becoming more of a “real” program and less of a toy.
A couple more good weekends and it would actually be semi-useful.