Crouton: corpus rule transformation notation
Crouton is a small but fairly complete functional programming
language for querying and transforming parsed manuscripts, such
as the PPCME. It is intended as an alternative to Corpus Search,
based on a different philosophy. It is written in (and largely
based on) the very nice functional programming language Haskell,
using the Parsec library.
Development is taking place on SourceForge, and when a
distribution is ready, it will be available there.
June 2, 2006: I've released version 0.1.0. It's a
fully functional release, just not tested or documented very
well. You can download archives containing the source code and
compiled executables for Linux and Windows from SourceForge.
(It can run on Macs, too; I just don't have easy access to one
where I can install GHC to compile it.)
Here's the basic idea: Suppose you are interested in charting
some syntactic change over time, and you have a corpus of parsed
manuscripts containing a variety of sentences, some of which use
the old form, some of which use the new form, and some of which
may be ambiguous. A typical Corpus Search approach would be to
specify search conditions that describe the relevant sentences,
use a preliminary query to pick those sentences out, then use
additional queries to split them into old, new, and ambiguous.
The difficulty is that changes of this type seem to be
conditioned on all sorts of things. For instance, I'm interested
in the loss of verb-second in Middle English, and the choice of
whether to use V2 or not seems to depend on whether the subject is
a pronoun or a full noun phrase, whether the pre-posed constituent
is "now" or "then" or an object or a more general adverb, and so
on. There are additional complexities that come from conjunctions,
from subjects omitted when two sentences are joined with "and",
and so on. Furthermore, at the time I began this project, Corpus
Search 1 was the available version, and it wasn't entirely clear
what it was doing with conjunctions and negation.
So I'm developing Crouton with this plan: Instead of counting
how many sentences fit a particular query, go through the corpus
and summarize each sentence. That way, you're sure no sentence
has been left out. Also, Crouton is supposed to be completely
transparent about what it's doing, and there's no hocus-pocus
about conjunctions and such.
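As a rough illustration of the summarize-every-sentence plan, here
is a small sketch. It is written in Python, not in Crouton's own
notation, and everything in it is an assumption made up for the
example: the Sentence record, its feature names, and the rule that
labels a sentence "ambiguous". The point is only that each sentence
gets exactly one summary, so the table accounts for the whole corpus.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Sentence(NamedTuple):
    """Hypothetical, already-extracted features of one parsed sentence."""
    ident: str
    subject: str               # "pronoun", "full_np", or "omitted"
    preposed: str              # e.g. "then", "object", "adverb", "none"
    verb_before_subject: bool


def summarize(s: Sentence) -> Tuple[str, str, str]:
    """Reduce a sentence to (subject type, preposed type, word-order label)."""
    if s.preposed == "none" or s.subject == "omitted":
        # Assumed rule: with nothing preposed, or no overt subject,
        # the word order does not tell V2 from non-V2.
        order = "ambiguous"
    elif s.verb_before_subject:
        order = "V2"
    else:
        order = "non-V2"
    return (s.subject, s.preposed, order)


def tabulate(corpus: Iterable[Sentence]) -> Counter:
    """Count how many sentences share each summary."""
    return Counter(summarize(s) for s in corpus)


# Invented toy data, not drawn from any real corpus.
corpus = [
    Sentence("s1", "pronoun", "then", True),
    Sentence("s2", "full_np", "object", False),
    Sentence("s3", "omitted", "adverb", False),
    Sentence("s4", "pronoun", "then", True),
]

table = tabulate(corpus)

# Every sentence lands in exactly one cell, so nothing is left out.
assert sum(table.values()) == len(corpus)

for summary, count in sorted(table.items()):
    print(summary, count)
```

A filter-then-split approach would instead begin by discarding
whatever fails the first query; here the conditioning factors stay
visible as columns of the summary, and the ambiguous cases are
counted rather than dropped.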
You might want to look at the tutorial that goes through a Crouton
script in detail. Here's a PDF file for the tutorial. (It's my
second adventure in JavaScript, so not all browsers can render it.
Mozilla Firefox can handle it fine.)
This project is hosted on SourceForge.
This site Copyright 2005--2006 by W. Garrett Mitchener.
Last modified: Fri Jun 2 14:53:49 EDT 2006