
ISconf version 3 was originally intended to be a 2002 rewrite in Perl, by Luke Kanies, of the shell scripts which make up version 2. Version 3 is completely unsupported, but still available for download here. I know of at most two production environments which have ever used it.

For version 3, Luke wanted to try some things, including having another go at making ISconf work on legacy machines -- we had last tried this in 1994, with version 1; we evolved out of that practice real fast back then -- at least now we have a better understanding of why it's a bad idea; see the 2002 Turing Equivalence (TE) paper by Lance Brown and myself.

Luke published a LISA 2003 paper critiquing ISconf version 3 and its fitness (or lack of it) for the job, and faulting several things he attributed to TE. Unfortunately, the paper didn't mention that Luke was talking about his own code and techniques; because he didn't mention version numbers, all ISconf versions got painted with an overly broad brush. You should find the paper and reach your own conclusions; a copy should be here, or Google for 'ISconf: Theory, Practice, and Beyond'. For the rest of this page I'll refer to this paper as 'Beyond'.

Some things we learned (or re-learned) from this round:

Don't try to do deterministic automated management of previously ad-hoc, manually maintained machines. We already knew this. You need to re-image your machines to get them to a known state first. If you don't want to do this, then don't use ISconf, please. Those of us who use ISconf need it for deterministic maintenance of known machines, where availability and reproducibility for disaster recovery are primary concerns. Cfengine is optimized for 'computer immunology' of machines in environments which expect drift due to manual or other out-of-band changes. Cfengine can be used to partially corral ad-hoc machines, but it doesn't work as well for guaranteeing bare-metal disaster recovery. (One tool that might split the difference well is Radmind -- it in effect re-images those parts of disk you care about each time it runs).

The Turing Equivalence theory that 'Beyond' faulted actually predicted the results of what happens if you don't re-image. Luke was attempting to use a congruent tool to manage a convergent infrastructure. According to TE, the results of doing that will be convergent at best -- things will break in production sometimes, as he describes. If you can't afford production breakage, then you need to first rebuild all of your machines so you know what's on them, then use a congruent tool to keep them that way.

Before publishing 'Beyond', Luke began experimenting more with cfengine, and found more traction there. This makes perfect sense: to manage an infrastructure containing ad-hoc machines, you really want a convergent tool like cfengine.

Toward the end of the paper, 'Beyond' discusses combining cfengine and ISconf version 3 in various ways. There are many things that can be done here -- the simplest is to use cfengine's file replication capability rather than rsync. I did this once at a NASA site, for political reasons (they explicitly wanted to use cfengine) -- and we quickly found scaling issues, which in turn resulted in Greg Smith's perl-cfd code, and later improvements to cfengine's file replication performance. Aside from simple file replication, the thing to remember is that combining a convergent and congruent tool can inadvertently result in a convergent infrastructure, not a congruent one. A little discussion of how to avoid this effect in cfengine's case is here.

Don't deploy untested changes to production machines. This one seems obvious, but few shops follow it. You need a test environment. Your test environment needs to support all of your hardware and O/S combinations -- sorry, no free lunch in software testing.

'Beyond' makes an attempt to invalidate TE by describing an example in which a hypothetical sysadmin intentionally deploys a command sequence which works in test and fails in production, perhaps due to a different IP address, hostname, or domain. This is an example of poor testing, not poor theory -- TE would indicate that a representative test environment would be identical to production in every way, right down to the details of host names, IP addresses, and domain names, and would catch our hypothetical sysadmin in the act. Because this isn't practical in the overwhelming majority of shops, we do have to use our wits to compensate for imperfect test environments. TE helps a great deal here, by offering a least-cost path with the lowest risk, giving human brains a better chance of catching the few remaining edge cases, like the example in 'Beyond', before they reach production. The cost of testing is already factored into this least-cost analysis, and TE already takes the inadequacy of testing into account.

ISconf 4 makes testing and managing forks much easier, and greatly lessens the financial commitment needed for test environments as well -- see the version 4 page.

Editing history is (still) bad. Much of what 'Beyond' responds to is a common desire among sysadmins: the wish to modify code that has already been used to build production machines -- code which might be called upon to rebuild those machines in case of disaster. This practice has come to be known as "editing history". The reasons for doing this are appealing -- being able to re-use existing code, with slight modifications, to support new hardware combinations, for instance (see Beyond's SCSI/IDE discussion for one example of this). In object-oriented programming terms, "editing history" would be equivalent to modifying a base class that is already in use throughout several legacy applications, in order to support a new application -- it's considered a risky practice. In accounting, "editing history" would be equivalent to altering or removing transactions which have already been entered and reported -- this violates generally accepted accounting principles (GAAP) because it alters audit trails. These class structures and audit trails are nothing other than human inventions intended to help us encapsulate and understand change and risk.

Modifying code that is known to correctly build existing machines is expensive as well as risky. Don't do it. Just like in OO programming, you'll have to re-test all of the legacy stuff all over again, and there is no guarantee that your testing will catch any new bugs you may have introduced. It's cheaper to fork the code and modify the new copy. In an OO programming world you would create a new subclass which modifies the behavior of the base class, and use that subclass for your new application instead of fiddling with the base class. Under GAAP, you would add an adjusting entry to the end of the journal. Disk is cheap -- cheaper than the labor involved in modifying and re-testing legacy code, cheaper than the time it can take to unravel the results of editing history, and much cheaper than the cost of downtime from a failed disaster recovery. 'Beyond' claims that leaving the legacy code on disk, untouched, encourages "software rot"; that the assumptions of the program become outdated. This is incorrect -- when using ISconf we should assume only that the legacy code can be used to rebuild the same host, on the same hardware, starting from the same base image. We can make no assumptions about new combinations -- they need to be tested.
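In practice, the fork is mechanical; a minimal sketch with hypothetical file names (ISconf doesn't dictate them):

    # Fork rather than edit history: the original stays byte-for-byte
    # identical, so your oldest hosts can still be rebuilt exactly.
    cp plans/webserver-scsi.mk plans/webserver-ide.mk
    # Make the IDE-specific changes in the new copy only; the copy
    # becomes the code path for the new hardware combination.
    $EDITOR plans/webserver-ide.mk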

You need to expect new combinations to break, and you need to expect to use your wet, mushy, trial-and-error brain to fix that breakage with new code and configurations. You need to expect to fork your disk images, configuration files, and binaries to support new hardware. Of all of the hardware and O/S combinations I've worked with, AIX and RS/6000 are the easiest to deal with in this regard, requiring the fewest forks. IBM knows how to do hardware microcode right. Solaris is further down the list, and HP/UX isn't even in the running. Luke was working with Solaris and HP/UX machines.

The question is frequently raised "But what if I deploy a change to all of my production machines which breaks the change tool, crashes machines, or deletes critical data? Can't I go back and edit history in that case?" Of course you can. But the only way you're going to find yourself in this situation is if you've already skipped some critical testing or code review. The whole point of ISconf is to reduce the cost of deployment, testing and review, make it easier to do right, and reduce the risk of this sort of thing happening to you.

The reason you want to avoid "editing history" in routine operations is because, every time you do it, you create a new host class, adding that much more to your testing load. As this reduces your ability to do complete testing, you will find more breakage in production, causing you to want to edit history again... The more you edit history, the more you'll find a need to edit history. In theory you can even produce a combinatorial explosion of required testing and review. In practice, what you'll find is that you are creating a slowly diverging infrastructure, with convergence in those places where you are currently focusing your attention. The behavior will be similar to that which Luke found. In that case you might as well just do as he did; chuck all this and go use cfengine -- it's designed for bringing this sort of mess under some semblance of control. But you can forget about easy disaster recovery -- the code path which you used to create your oldest hosts is gone; you've morphed it into something else.

Keep host management code small. The features added to ISconf 3 increased the size of the code by an order of magnitude. This concerned me -- any host management tool is self-modifying code that runs as root -- and the larger it is, the harder it is to audit. (ISconf version 2 is 4127 lines of code. Version 4, a cleaner implementation with much better ease-of-use, is more auditable, currently at 1233 lines. But version 3.1 weighs in at a hefty 17165 lines of code.)

In 'Beyond', Luke describes at length the usage and shortcomings of 'make', which ISconf versions 1-3 used as their state engine -- the paper says "ISconf is an interface to make, and not much else." But ISconf 4 does not use 'make', and was never intended to. He must have meant versions 1-3. See Luke's own description of version 4 at the bottom of the history page, written earlier.

Luke dropped support for ISconf 3 a few months after publishing 'Beyond'.

For the record, the rest of this page is what Luke wrote about ISconf, mostly concerning his thoughts as he worked on version 3:


ISconf is a framework for recording and playing back all sysadmin work done to a network of Unix machines. This is a relatively complicated statement, so let's break it down piece by piece:

Framework

ISconf is not itself a toolset; it is only a framework. In fact, it's not so much a framework as a methodology that currently only has one implementation. The framework has three key features -- Failure on Error, State Maintenance, and Deterministic Ordering -- and should have a fourth, Atomic Operation.

If these features are present, we are largely guaranteed the type of network we want; if they are absent, we may still have the network we want, but we cannot be sure of it. Because of that, it is reasonable to think of these not as features of the framework but rather as axioms without which our system could not function. If anything were to happen to call these axioms into question, then our system could break down, though it would not be guaranteed to.

Let's describe in more detail what each of these actually means:

Atomic Operation

This should be an axiom but is not, because it is actually a feature of the tools which are executed by ISconf, not so much a feature of ISconf itself.

Tools are atomic if they either succeed completely or fail without modifying the system at all. A simple example of an atomic tool is mkdir; it either succeeds at making the directory, or it fails completely and does not modify the system at all. An example of a non-atomic tool would be something which makes a backup of a file, begins reading the backup and writing lines to the real file, encounters an error, and dies, leaving the backup of the file and an incomplete installed file.

If all tools used within ISconf are completely atomic, then we can safely try tools which we think will work, without a lot of testing beforehand: if the tool succeeds, then we are fine and the work is done; if the tool fails, then we are also fine, because although the work is not done, we have also not modified the system in any way. This leaves us free to discover what works without requiring as much intelligence in either the admin or the tool.
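For a sketch of how a non-atomic installer like that one can be made atomic (this is a general shell idiom rather than ISconf code, and generate_services is a hypothetical command): build the complete new file off to the side, then rename it into place, since a rename within a single filesystem either happens entirely or not at all.

    # Build the complete replacement first, away from the live file.
    tmp=$(mktemp /etc/services.XXXXXX) || exit 1
    generate_services > "$tmp" || { rm -f "$tmp"; exit 1; }   # failure leaves the live file untouched
    # mv within one filesystem is a rename, so readers see either the
    # old file or the complete new one -- never a partial write.
    mv "$tmp" /etc/services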

Failure on Error

Our framework is responsible for executing a list of code chunks on a given system in a given order, so it is extremely important that our framework halt if any of those code chunks fails. If the framework were instead to continue moving on, then either (a) we would never find out the code failed, or (b) we would not know the actual execution order on our system, because that failed chunk would end up running later than its position in the list (see the Deterministic Ordering section for why this is bad).
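In shell terms the rule is just a relentless check of exit status; a minimal sketch, with hypothetical chunk names:

    # Execute the list in order; halt at the first nonzero exit so no
    # later chunk can run ahead of a failed earlier one.
    for chunk in 001.adduser 002.mkdir 003.chown; do
        "./chunks/$chunk" || { echo "halted at $chunk" >&2; exit 1; }
    done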

Deterministic Ordering

A Deterministically Ordered execution list is a list whose execution order can be determined exactly and is always the same. For instance, given code chunks A, B, and C, executed in that order once on a system, a deterministically ordered system will guarantee that those code chunks are again run in the order of A->B->C if that system is ever rebuilt or another one of those systems created.

Why is this important? Because work done to a system is not necessarily commutative. In this case, it means that just because you were able to successfully execute A->B->C does not mean that executing C->A->B will work. Once we find an execution order that works for us, we want to make sure that that order is always followed. That isn't to say that there are no other execution orders which could succeed, but we already have one that we know succeeds, so it's silly not to take advantage of that knowledge.
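A concrete (hypothetical) example of such order-dependence:

    useradd app            # A: create an account
    mkdir /srv/app         # B: create a directory for it
    chown app /srv/app     # C: give the account ownership
    # A->B->C succeeds, but C->A->B fails at the first step: chown
    # cannot reference a user or a directory that doesn't exist yet.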

State Maintenance

One of the main jobs of the ISconf framework is to keep track of where in the execution list a given host is. Because ISconf runs locally on each host, state is maintained on the host itself. Within the ISconf framework, state is used to guarantee that a given chunk of work is only done once, unless specified as a repeating chunk.

Usually the term state implies that the tool can fully describe what the system -- which in this case is a single machine -- looks like. However, because all work being done on the system is on a deterministically linear execution path, state merely means where we are along that path. A great deal of simplicity is derived from the fact that our tools only need to describe linear state rather than total machine state. This simplicity is one of the main benefits of ISconf.

Incidentally, the state of a system is nearly always "complete". That is, ISconf's job is to execute any scheduled work, and the majority of the time there is no scheduled work which has not been completed.
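A minimal sketch of that cursor in shell (the state file path and layout are illustrative, not ISconf's actual on-disk format):

    # Linear state: a count of completed steps, kept on the host itself.
    state=/var/isconf/cursor
    done_count=$(cat "$state" 2>/dev/null || echo 0)
    n=0
    while read -r chunk; do
        n=$((n + 1))
        [ "$n" -le "$done_count" ] && continue    # already executed on this host
        "./chunks/$chunk" || exit 1               # fail on error; cursor stays put
        echo "$n" > "$state"                      # advance the cursor past this step
    done < execution-path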

The Four-Axiom Gestalt

Taken all together, these four axioms give us a consistent linear execution path composed of atomic elements and a cursor to indicate progress along that path. The deterministic ordering guarantees both the consistency and linearity, the failure on error guarantees that no steps are skipped accidentally and that error conditions don't allow steps to be executed out of order, the atomic elements allow us to be cleanly before or after a given step, never in the middle, and the state engine always lets us know exactly where we are.

When used correctly, this system allows one to create an infrastructure which has full knowledge of every machine within it, and is capable of rebuilding each of its member machines from scratch, exactly the same. In theory.

Recording

Now that we know why ISconf does what it does, we need to know how. The recording and the playing of the execution path are very similar, but there's one key difference between the initial recording and any later playbacks. During the process of recording what work is done to a system, one often attempts various tasks or sets of tasks multiple times, even in multiple orders, until all necessary tasks succeed. It's only this final successful pathway that ends up recorded.

Let's use the rsynconfig tool currently available with ISconf as an example. This tool works in three phases: Create a config for a module to be added to the rsyncd.conf, add the name of that module to the list of modules to be added, and then rebuild the config file with all currently listed modules. If you add all three of these phases to your execution path for a given module, but then find that the step to create the config fails, you're fine because the later two steps were never executed; just keep working on the config step until it succeeds, at which point it's added to the permanent execution path.

If instead the config step succeeds, but you created the wrong config, it is too late. You cannot take those three steps off the execution path, and instead must add new steps to fix your mistake. In other words, as long as something returns a failure code, it is not recorded as being part of the execution path; but as soon as something succeeds, it is added to the execution path and cannot be removed, because the system has been modified.
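The recording rule itself is tiny; a sketch, with a hypothetical journal path and wrapper:

    # Only successes are recorded, and the journal is append-only.
    try() {
        "$@" || return 1                     # failed: nothing recorded, fix and retry freely
        echo "$@" >> /var/isconf/journal     # succeeded: now part of history for good
    }
    try mkdir /export/rsync/mymodule         # rerun until it finally succeeds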

The difficult problem here is the definition of "success"; it's the same old problem of the computer doing what you tell it to do instead of what you want it to do. If you tell the computer to do something that it cannot actually do, at least in the way you told it to, then it will throw an error and ISconf will halt and give you a chance to fix it. However, if you tell the computer to do something it can do quite easily -- such as overwriting your /etc/services file with gibberish -- then it will happily break your system and not throw an error. It's now up to you to add something else to your execution path to fix this problem.

I'd love to be able to say that ISconf hopes to solve this difference in the definition of success, but the only real way to do so is through testing, and lots of it. What ISconf does give you is a much easier way to build a test bed, because you can replay any system onto a test host and run your experimental code there.

Playing Back

As mentioned in the History section, ISconf 3 has added a stronger typing system and has separated host types from host instances. ISconf figures out what work to do to a system by finding the types associated with a given host in the hosts configuration file, retrieving the list of work to do for each type from the types configuration file, ordering that list according to timestamps found in the types file, and then finding each item in that list in one of the appropriate make files.

Once the list is retrieved and ordered, a temporary make file is created with each item in the correct order, and make is executed against that list. Make is then responsible for traversing the list, determining if the given stanza has already been executed, executing it if it has not, and failing if that stanza encounters an error.
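As a sketch of how such a generated make file can provide those run-once and halt-on-error semantics (target and stamp names hypothetical, and note that make recipes must be indented with tabs): each stanza depends on its predecessor's stamp file and touches its own stamp when it completes, so make skips already-finished stanzas and stops cold on any nonzero exit.

    # Generated make file: one stamp file per completed stanza.
    all: 003.chown.done

    001.adduser.done:
    	useradd app
    	touch $@

    002.mkdir.done: 001.adduser.done
    	mkdir /srv/app
    	touch $@

    003.chown.done: 002.mkdir.done
    	chown app /srv/app
    	touch $@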

The job make performs is now a relatively simple one, and in fact its limitations are easily run into, which is why the next version of ISconf will hopefully do away with make as the arbiter of the execution list.

Sysadmin Work

ISconf explicitly limits its scope to automating system administration work, for a few simple reasons. First, there is usually a strong division in responsibility between the sysadmins and the application admins, such that it is often difficult or impossible for the sysadmins to convince the application admins to use the provided automation tools. Second, many applications are programmed in such a way that they are difficult, and sometimes even impossible, to administer automatically.

However, in cases where it is technically and politically possible, applications should also be automated using ISconf. Artificial lines should not be drawn just because they are convenient, but one should also not let individual applications, which after all can just be dumped to tape, stop one from taking advantage of as much automation as possible.

Network of Unix Machines

ISconf has only been used to manage Unix machines, but there is actually no reason why it couldn't manage other types of machines. Any machine which can be controlled completely programmatically can be managed using ISconf.

Lies and Damn Lies

Unfortunately, the world we present isn't quite as rosy as it looks. There are a number of problems with the system we have come up with: