Crouton: corpus rule transformation notation

Crouton is a small but fairly complete functional programming language for querying and transforming corpora of parsed manuscripts, such as the PPCME. It is intended as an alternative to Corpus Search, based on a different philosophy. It is written in (and largely based on) the very nice functional programming language Haskell, using the Parsec library. Development is taking place on SourceForge, and distributions are made available there as they become ready.

June 2, 2006: I've released version 0.1.0. It's a fully functional release, just not tested or documented very well. You can download archives containing the source code and compiled executables for Linux and Windows from SourceForge. (It can run on Macs, too; I just don't have easy access to one where I can install GHC to compile it.)

Here's the basic idea: Suppose you are interested in charting some syntactic change over time, and you have a corpus of parsed manuscripts containing a variety of sentences, some of which use the old form, some of which use the new form, and some of which may be ambiguous. A typical Corpus Search approach would be to specify search conditions that describe the relevant sentences, use a preliminary query to pull those sentences out of the corpus, then use additional queries to split them into old, new, and ambiguous.

The difficulty is that changes of this type seem to be conditioned on all sorts of things. For instance, I'm interested in the loss of verb-second in Middle English, and the choice of whether to use V2 or not seems to depend on whether the subject is a pronoun or a full noun phrase, whether the preposed constituent is "now" or "then" or an object or a more general adverb, and so on. There are additional complexities that come from conjunctions, and from subjects that are omitted when two sentences are joined with "and". Furthermore, at the time I began this project, Corpus Search 1 was the available version, and it wasn't entirely clear what it was doing with conjunctions and negation.

So I'm developing Crouton with this plan: Instead of counting how many sentences fit a particular query, go through the corpus and summarize each sentence. That way, you're sure no sentence has been left out. Also, Crouton is supposed to be completely transparent about what it's doing, and there's no hocus-pocus about conjunctions and such.
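To make the contrast concrete, here is a minimal sketch of the summarize-every-sentence idea. It is written in Python, not in Crouton's own notation, and everything in it is an assumption made up for illustration: the Sentence record, its three fields, the feature values, and the toy corpus. A real script would work from the parse trees themselves.

```python
from collections import Counter
from typing import NamedTuple, Optional, Tuple


class Sentence(NamedTuple):
    # Hypothetical, pre-extracted features (not Crouton's data model).
    subject: str                          # "pronoun", "full_np", or "omitted"
    preposed: str                         # "now", "then", "object", "adverb", ...
    verb_before_subject: Optional[bool]   # None when the order cannot be determined


def summarize(s: Sentence) -> Tuple[str, str, str]:
    """Give every sentence exactly one summary; nothing is filtered out."""
    if s.subject == "omitted" or s.verb_before_subject is None:
        order = "ambiguous"
    elif s.verb_before_subject:
        order = "V2"
    else:
        order = "non-V2"
    return (s.subject, s.preposed, order)


def tabulate(corpus):
    """Count how many sentences fall under each summary."""
    return Counter(summarize(s) for s in corpus)


# Toy data, invented for this example.
corpus = [
    Sentence("pronoun", "then", True),
    Sentence("full_np", "then", True),
    Sentence("pronoun", "object", False),
    Sentence("omitted", "adverb", None),
    Sentence("pronoun", "then", True),
]

table = tabulate(corpus)

# Because each sentence gets a summary, the counts must add up to the corpus size.
assert sum(table.values()) == len(corpus)

for summary, count in sorted(table.items()):
    print(summary, count)
```

The point of the design is the check near the end: a set of separate queries can silently miss sentences that match none of them, whereas a per-sentence summary always accounts for the whole corpus, with the awkward cases landing in a visible "ambiguous" bucket.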

You might want to look at the tutorial that goes through a Crouton script in detail. (The tutorial is my second adventure in JavaScript, so not all browsers can render it. Mozilla Firefox can handle it fine.) Here's a PDF file for the tutorial as well.