
ISconf version 3 was originally intended to be a 2002 rewrite in Perl, by Luke Kanies, of the shell scripts which make up version 2. Version 3 is completely unsupported, but still available for download here. I know of at most two production environments which have ever used it.

For version 3, Luke wanted to try some things, including having another go at making ISconf work on legacy machines -- we had last tried this in 1994, with version 1; we evolved out of that practice real fast back then -- at least now we have a better understanding of why it's a bad idea; see the 2002 Turing Equivalence (TE) paper by Lance Brown and myself.

Luke published a LISA 2003 paper critiquing ISconf version 3 and its fitness (or lack of it) for the job, and faulting several things he attributed to TE. Unfortunately, the paper didn't mention that Luke was talking about his own code and techniques; because he didn't mention version numbers, all ISconf versions got painted with an overly broad brush. You should find the paper and reach your own conclusions; a copy should be here, or Google for 'ISconf: Theory, Practice, and Beyond'. For the rest of this page I'll refer to this paper as 'Beyond'.

Some things we learned (or re-learned) from this round:

Don't try to do deterministic automated management of previously ad-hoc, manually maintained machines. We already knew this. You need to re-image your machines to get them to a known state first. If you don't want to do this, then don't use ISconf, please. Those of us who use ISconf need it for deterministic maintenance of known machines, where availability and reproducibility for disaster recovery are primary concerns. Cfengine is optimized for 'computer immunology' of machines in environments which expect drift due to manual or other out-of-band changes. Cfengine can be used to partially corral ad-hoc machines, but it doesn't work as well for guaranteeing bare-metal disaster recovery. (One tool that might split the difference well is Radmind -- it in effect re-images those parts of disk you care about each time it runs).

The Turing Equivalence theory that 'Beyond' faulted actually predicted the results of what happens if you don't re-image. Luke was attempting to use a congruent tool to manage a convergent infrastructure. According to TE, the results of doing that will be convergent at best -- things will break in production sometimes, as he describes. If you can't afford production breakage, then you need to first rebuild all of your machines so you know what's on them, then use a congruent tool to keep them that way.

Before publishing 'Beyond', Luke began experimenting more with cfengine, and found more traction there. This makes perfect sense: to manage an infrastructure containing ad-hoc machines, you really want a convergent tool like cfengine.

Toward the end of the paper, 'Beyond' discusses combining cfengine and ISconf version 3 in various ways. There are many things that can be done here -- the simplest is to use cfengine's file replication capability rather than rsync. I did this once at a NASA site, for political reasons (they explicitly wanted to use cfengine) -- and we quickly found scaling issues, which in turn resulted in Greg Smith's perl-cfd code, and later improvements to cfengine's file replication performance. Aside from simple file replication, the thing to remember is that combining a convergent and congruent tool can inadvertently result in a convergent infrastructure, not a congruent one. A little discussion of how to avoid this effect in cfengine's case is here.

Don't deploy untested changes to production machines. This one seems obvious, but few shops follow it. You need a test environment. Your test environment needs to support all of your hardware and O/S combinations -- sorry, no free lunch in software testing.

'Beyond' makes an attempt to invalidate TE by describing an example in which a hypothetical sysadmin intentionally deploys a command sequence which works in test and fails in production, perhaps due to a different IP address, hostname, or domain. This is an example of poor testing, not poor theory -- TE would indicate that a representative test environment would be identical to production in every way, right down to the details of host names, IP addresses, and domain names, and would catch our hypothetical sysadmin in the act. Because this isn't practical in the overwhelming majority of shops, we do have to use our wits to compensate for imperfect test environments. TE helps a great deal here, by offering a least-cost path with the lowest risk, giving human brains a better chance of catching the few remaining edge cases, like the example in 'Beyond', before they reach production. The cost of testing is already factored into this least-cost analysis, and TE already takes the inadequacy of testing into account.

ISconf 4 makes testing and managing forks much easier, and greatly lessens the financial commitment needed for test environments as well -- see the version 4 page.

Editing history is (still) bad. Much of what 'Beyond' responds to is a common desire among sysadmins: the wish to modify code that has already been used to build production machines -- code which might be called upon to rebuild those machines in case of disaster. This practice has come to be known as "editing history". The reasons for doing this are appealing -- being able to re-use existing code, with slight modifications, to support new hardware combinations, for instance (see Beyond's SCSI/IDE discussion for one example of this). In object-oriented programming terms, "editing history" would be equivalent to modifying a base class that is already in use throughout several legacy applications, in order to support a new application -- it's considered a risky practice. In accounting, "editing history" would be equivalent to altering or removing transactions which have already been entered and reported -- this violates generally accepted accounting principles (GAAP) because it alters audit trails. These class structures and audit trails are nothing other than human inventions intended to help us encapsulate and understand change and risk.

Modifying code that is known to correctly build existing machines is expensive as well as risky. Don't do it. Just like in OO programming, you'll have to re-test all of the legacy stuff all over again, and there is no guarantee that your testing will catch any new bugs you may have introduced. It's cheaper to fork the code and modify the new copy. In an OO programming world you would create a new subclass which modifies the behavior of the base class, and use that subclass for your new application instead of fiddling with the base class. Under GAAP, you would add an adjusting entry to the end of the journal. Disk is cheap -- cheaper than the labor involved in modifying and re-testing legacy code, cheaper than the time it can take to unravel the results of editing history, and much cheaper than the cost of downtime from a failed disaster recovery. 'Beyond' claims that leaving the legacy code on disk, untouched, encourages "software rot"; that the assumptions of the program become outdated. This is incorrect -- when using ISconf we should assume only that the legacy code can be used to rebuild the same host, on the same hardware, starting from the same base image. We can make no assumptions about new combinations -- they need to be tested.
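In practice, the fork is mechanical; a minimal sketch with hypothetical file names (ISconf doesn't dictate them):

    # Fork rather than edit history: the original stays byte-for-byte
    # identical, so your oldest hosts can still be rebuilt exactly.
    cp plans/webserver-scsi.mk plans/webserver-ide.mk
    # Make the IDE-specific changes in the new copy only; the copy
    # becomes the code path for the new hardware combination.
    $EDITOR plans/webserver-ide.mk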

You need to expect new combinations to break, and you need to expect to use your wet, mushy, trial-and-error brain to fix that breakage with new code and configurations. You need to expect to fork your disk images, configuration files, and binaries to support new hardware. Of all of the hardware and O/S combinations I've worked with, AIX and RS/6000 are the easiest to deal with in this regard, requiring the fewest forks. IBM knows how to do hardware microcode right. Solaris is further down the list, and HP/UX isn't even in the running. Luke was working with Solaris and HP/UX machines.

The question is frequently raised "But what if I deploy a change to all of my production machines which breaks the change tool, crashes machines, or deletes critical data? Can't I go back and edit history in that case?" Of course you can. But the only way you're going to find yourself in this situation is if you've already skipped some critical testing or code review. The whole point of ISconf is to reduce the cost of deployment, testing and review, make it easier to do right, and reduce the risk of this sort of thing happening to you.

The reason you want to avoid "editing history" in routine operations is because, every time you do it, you create a new host class, adding that much more to your testing load. As this reduces your ability to do complete testing, you will find more breakage in production, causing you to want to edit history again... The more you edit history, the more you'll find a need to edit history. In theory you can even produce a combinatorial explosion of required testing and review. In practice, what you'll find is that you are creating a slowly diverging infrastructure, with convergence in those places where you are currently focusing your attention. The behavior will be similar to that which Luke found. In that case you might as well just do as he did; chuck all this and go use cfengine -- it's designed for bringing this sort of mess under some semblance of control. But you can forget about easy disaster recovery -- the code path which you used to create your oldest hosts is gone; you've morphed it into something else.

Keep host management code small. The features added to ISconf 3 increased the size of the code by an order of magnitude. This concerned me -- any host management tool is self-modifying code that runs as root -- and the larger it is, the harder it is to audit. (ISconf version 2 is 4127 lines of code. Version 4, a cleaner implementation with much better ease-of-use, is more auditable, currently at 1233 lines. But version 3.1 weighs in at a hefty 17165 lines of code.)

In 'Beyond', Luke describes at length the usage and shortcomings of 'make', which ISconf versions 1-3 used as their state engine -- the paper says "ISconf is an interface to make, and not much else." But ISconf 4 does not use 'make', and was never intended to. He must have meant versions 1-3. See Luke's own description of version 4 at the bottom of the history page, written earlier.

Luke dropped support for ISconf 3 a few months after publishing 'Beyond'.

For the record, the rest of this page is what Luke wrote about ISconf, mostly concerning his thoughts as he worked on version 3:


ISconf is a framework for recording and playing back all sysadmin work done to a network of Unix machines. This is a relatively complicated statement, so let's break it down piece by piece:

Framework

ISconf is not itself a toolset; it is only a framework. In fact, it's not so much a framework as a methodology that currently only has one implementation. The framework has three key features -- Failure on Error, State Maintenance, and Deterministic Ordering -- and should have a fourth, Atomic Operation.

If these features are present, we are largely guaranteed the type of network we want; if they are absent, we may still have the network we want, but we cannot be sure of it. Because of that, it is reasonable to think of these not as features of the framework but rather as axioms without which our system could not function. If anything were to happen to call these axioms into question, then our system could break down, though it would not be guaranteed to.

Let's describe in more detail what each of these actually means:

Atomic Operation

This should be an axiom but is not, because it is actually a feature of the tools which are executed by ISconf, not so much a feature of ISconf itself.

Tools are atomic if they either succeed completely or fail without modifying the system at all. A simple example of an atomic tool is mkdir; it either succeeds at making the directory, or it fails completely and does not modify the system at all. An example of a non-atomic tool would be something which makes a backup of a file, begins reading the backup and writing lines to the real file, encounters an error, and dies, leaving the backup of the file and an incomplete installed file.

If all tools used within ISconf are completely atomic, then we can safely try tools which we think will work, without a lot of testing beforehand: if the tool succeeds, then we are fine and the work is done; if the tool fails, then we are also fine, because although the work is not done, we have also not modified the system in any way. This leaves us free to discover what works without requiring as much intelligence in either the admin or the tool.
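For a sketch of how a non-atomic installer like that one can be made atomic (this is a general shell idiom rather than ISconf code, and generate_services is a hypothetical command): build the complete new file off to the side, then rename it into place, since a rename within a single filesystem either happens entirely or not at all.

    # Build the complete replacement first, away from the live file.
    tmp=$(mktemp /etc/services.XXXXXX) || exit 1
    generate_services > "$tmp" || { rm -f "$tmp"; exit 1; }   # failure leaves the live file untouched
    # mv within one filesystem is a rename, so readers see either the
    # old file or the complete new one -- never a partial write.
    mv "$tmp" /etc/services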

Failure on Error

Our framework is responsible for executing a list of code chunks on a given system in a given order, so it is extremely important that our framework halt if any of those code chunks fails. If the framework were instead to continue moving on, then either (a) we would never find out the code failed, or (b) we would not know the actual execution order on our system, because that failed chunk would end up running later than its position in the list (see the Deterministic Ordering section for why this is bad).
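In shell terms the rule is just a relentless check of exit status; a minimal sketch, with hypothetical chunk names:

    # Execute the list in order; halt at the first nonzero exit so no
    # later chunk can run ahead of a failed earlier one.
    for chunk in 001.adduser 002.mkdir 003.chown; do
        "./chunks/$chunk" || { echo "halted at $chunk" >&2; exit 1; }
    done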

Deterministic Ordering

A Deterministically Ordered execution list is a list whose execution order can be determined exactly and is always the same. For instance, given code chunks A, B, and C, executed in that order once on a system, a deterministically ordered system will guarantee that those code chunks are again run in the order of A->B->C if that system is ever rebuilt or another one of those systems created.

Why is this important? Because work done to a system is not necessarily commutative. In this case, it means that just because you were able to successfully execute A->B->C does not mean that executing C->A->B will work. Once we find an execution order that works for us, we want to make sure that that order is always followed. That isn't to say that there are no other execution orders which could succeed, but we already have one that we know succeeds, so it's silly not to take advantage of that knowledge.
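A concrete (hypothetical) example of such order-dependence:

    useradd app            # A: create an account
    mkdir /srv/app         # B: create a directory for it
    chown app /srv/app     # C: give the account ownership
    # A->B->C succeeds, but C->A->B fails at the first step: chown
    # cannot reference a user or a directory that doesn't exist yet.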

State Maintenance

One of the main jobs of the ISconf framework is to keep track of where in the execution list a given host is. Because ISconf runs locally on each host, state is maintained on the host itself. Within the ISconf framework, state is used to guarantee that a given chunk of work is only done once, unless specified as a repeating chunk.

Usually the term state implies that the tool can fully describe what the system -- which in this case is a single machine -- looks like. However, because all work being done on the system is on a deterministically linear execution path, state merely means where we are along that path. A great deal of simplicity is derived from the fact that our tools only need to describe linear state rather than total machine state. This simplicity is one of the main benefits of ISconf.

Incidentally, the state of a system is nearly always "complete". That is, ISconf's job is to execute any scheduled work, and the majority of the time there is no scheduled work which has not been completed.
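A minimal sketch of that cursor in shell (the state file path and layout are illustrative, not ISconf's actual on-disk format):

    # Linear state: a count of completed steps, kept on the host itself.
    state=/var/isconf/cursor
    done_count=$(cat "$state" 2>/dev/null || echo 0)
    n=0
    while read -r chunk; do
        n=$((n + 1))
        [ "$n" -le "$done_count" ] && continue    # already executed on this host
        "./chunks/$chunk" || exit 1               # fail on error; cursor stays put
        echo "$n" > "$state"                      # advance the cursor past this step
    done < execution-path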

The Four-Axiom Gestalt

Taken all together, these four axioms give us a consistent linear execution path composed of atomic elements and a cursor to indicate progress along that path. The deterministic ordering guarantees both the consistency and linearity, the failure on error guarantees that no steps are skipped accidentally and that error conditions don't allow steps to be executed out of order, the atomic elements allow us to be cleanly before or after a given step, never in the middle, and the state engine always lets us know exactly where we are.

When used correctly, this system allows one to create an infrastructure which has full knowledge of every machine within it, and is capable of rebuilding each of its member machines from scratch, exactly the same. In theory.

Recording

Now that we know why ISconf does what it does, we need to know how. The recording and the playing of the execution path are very similar, but there's one key difference between the initial recording and any later playbacks. During the process of recording what work is done to a system, one often attempts various tasks or sets of tasks multiple times, even in multiple orders, until all necessary tasks succeed. It's only this final successful pathway that ends up recorded.

Let's use the rsynconfig tool currently available with ISconf as an example. This tool works in three phases: Create a config for a module to be added to the rsyncd.conf, add the name of that module to the list of modules to be added, and then rebuild the config file with all currently listed modules. If you add all three of these phases to your execution path for a given module, but then find that the step to create the config fails, you're fine because the later two steps were never executed; just keep working on the config step until it succeeds, at which point it's added to the permanent execution path.

If instead the config step succeeds, but you created the wrong config, it is too late. You cannot take those three steps off the execution path, and instead must add new steps to fix your mistake. In other words, as long as something returns a failure code, it is not recorded as being part of the execution path; but as soon as something succeeds, it is added to the execution path and cannot be removed, because the system has been modified.
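The recording rule itself is tiny; a sketch, with a hypothetical journal path and wrapper:

    # Only successes are recorded, and the journal is append-only.
    try() {
        "$@" || return 1                     # failed: nothing recorded, fix and retry freely
        echo "$@" >> /var/isconf/journal     # succeeded: now part of history for good
    }
    try mkdir /export/rsync/mymodule         # rerun until it finally succeeds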

The difficult problem here is the definition of "success"; it's the same old problem of the computer doing what you tell it to do instead of what you want it to do. If you tell the computer to do something that it cannot actually do, at least in the way you told it to, then it will throw an error and ISconf will halt and give you a chance to fix it. However, if you tell the computer to do something it can do quite easily -- such as overwriting your /etc/services file with gibberish -- then it will happily break your system and not throw an error. It's now up to you to add something else to your execution path to fix this problem.

I'd love to be able to say that ISconf hopes to solve this difference in the definition of success, but the only real way to do so is through testing, and lots of it. What ISconf does give you is a much easier way to build a test bed, because you can replay any system onto a test host and run your experimental code there.

Playing Back

As mentioned in the History section, ISconf 3 has added a stronger typing system and has separated host types from host instances. ISconf figures out what work to do to a system by finding the types associated with a given host in the hosts configuration file, retrieving the list of work to do for each type from the types configuration file, ordering that list according to timestamps found in the types file, and then finding each item in that list in one of the appropriate make files.

Once the list is retrieved and ordered, a temporary make file is created with each item in the correct order, and make is executed against that list. Make is then responsible for traversing the list, determining if the given stanza has already been executed, executing it if it has not, and failing if that stanza encounters an error.
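As a sketch of how such a generated make file can provide those run-once and halt-on-error semantics (target and stamp names hypothetical, and note that make recipes must be indented with tabs): each stanza depends on its predecessor's stamp file and touches its own stamp when it completes, so make skips already-finished stanzas and stops cold on any nonzero exit.

    # Generated make file: one stamp file per completed stanza.
    all: 003.chown.done

    001.adduser.done:
    	useradd app
    	touch $@

    002.mkdir.done: 001.adduser.done
    	mkdir /srv/app
    	touch $@

    003.chown.done: 002.mkdir.done
    	chown app /srv/app
    	touch $@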

The job make performs is now a relatively simple one, and in fact its limitations are easily run into, which is why the next version of ISconf will hopefully do away with make as the arbiter of the execution list.

Sysadmin Work

ISconf explicitly limits its scope to automating system administration work, for a few simple reasons. First, there is usually a strong division in responsibility between the sysadmins and the application admins, such that it is often difficult or impossible for the sysadmins to convince the application admins to use the provided automation tools. Second, many applications are programmed in such a way that they are difficult, and sometimes even impossible, to administer automatically.

However, in cases where it is technically and politically possible, applications should also be automated using ISconf. Artificial lines should not be drawn just because they are convenient, but one should also not let individual applications, which after all can just be dumped to tape, stop one from taking advantage of as much automation as possible.

Network of Unix Machines

ISconf has only been used to manage Unix machines, but there is actually no reason why it couldn't manage other types of machines. Any machine which can be controlled completely programmatically can be managed using ISconf.

Lies and Damn Lies

Unfortunately, the world we present isn't quite as rosy as it looks. There are a number of problems with the system we have come up with: