Saturday, December 4, 2010

That whole MT project

OK, so the post-editing project I foolishly agreed to help with consisted of:
  • OCR with Able2Extract
  • MT with a mixture of (I think) Google Translate and Systran
  • First-pass proofreading
  • Second-pass post-editing
So let's talk about that. A far, far better workflow would have been:
  • OCR with whatever
  • Source-language spell checking and correction
  • Identification of key phrases and terminology as cues for MT
  • TRADOS or similar to avoid rework of existing sentences
  • MT with whatever
  • Target-language spell checking, feeding results back through MT until at least everything is English
  • First-pass post-editing
  • Second-pass proofreading
This workflow uses (or at least could use) the exact same tools as above, but without the introduction of errors at each step that make later steps impossible to manage. First-pass post-editing should be done by a bilingual translator, using specialized post-editing tools (not yet written) plus a normal translation memory (and of course the TM should also be used before passing text off to the MT stage). Systematic errors should be documented and recycled through the MT process.

One key insight: terminology research really starts to get a lot more important in this workflow than in normal CAT.

Thursday, December 2, 2010

More thoughts on a non-stupid text editor

I'm doing some post-editing for Portuguese today (I know, I know, never do MT post-editing, but this customer is a good one and I just couldn't say no). As usual with post-Systran work, there is a lot of dragging and dropping involved, and frankly? Word freaking sucks at dragging and dropping. Why should that be? Why can't I drag a word from the end of a punctuated sentence into its middle and have Word get the spacing right?

The mind boggles.

So it looks like I'm just going to have to break down and address non-stupid text editing again.

Tuesday, November 16, 2010

Another workflow with PDFs

I have a set of documents that consist of PDFs that have been highlighted and scanned. That is, each PDF consists of a set of documents. The text to be translated has been highlighted - with a physical marker, I mean - and the documents scanned. The PDFs were unfortunately not enabled for commenting. (Commenting rights in Adobe Reader are unfortunately granted by a digital signature, not by a flag in the PDF standard, and Adobe has not published the signing key, from what I'm reading - thus there is no tool in the world that can flag a PDF to allow comments from Adobe Reader except the full paid version of Adobe Acrobat.)

So my workflow is to go through the documents and use the snapshot tool to copy the highlighted bits. I put each bit into one column of a Word file, and the translation in the other column. It's nearly as good as comments in the PDF.

It seems to me that this would be a simple tool to implement: create the Word file, create the table, then every time I select something that's graphical, put it into the Word file for me and bring Word to the top. It's not a huge help, but it's the principle of the thing.

Saturday, November 6, 2010

Working with source text

There are open-source ways to break text into sentences and to find terms. Those need to be part of the toolkit.

The splitta library is a sentence boundary finder. I have to incorporate this, as segmentation is an extremely important function of any translation system; it should be Perl-ized here.
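Until splitta gets Perl-ized, a stopgap is easy to sketch. Below is a deliberately naive regex-based splitter - the sub name and the abbreviation list are my own invention, not splitta's API:

```perl
use strict;
use warnings;

# Deliberately naive sentence boundary finder: split on terminal
# punctuation followed by whitespace and a capital letter, but
# refuse to break after a known abbreviation.
sub split_sentences {
    my ($text) = @_;
    my %abbrev = map { $_ => 1 } qw(Dr Mr Mrs Ms St No vs etc);
    my @sentences;
    my $current = '';
    for my $chunk (split /(?<=[.!?])\s+(?=[A-Z])/, $text) {
        $current .= ($current ? ' ' : '') . $chunk;
        # Don't break if the chunk so far ends in an abbreviation.
        my ($last_word) = $current =~ /(\w+)\.$/;
        next if defined $last_word and $abbrev{$last_word};
        push @sentences, $current;
        $current = '';
    }
    push @sentences, $current if $current ne '';
    return @sentences;
}
```

It will fail on plenty of real text (that's what splitta's statistical model is for), but it covers the easy 90%.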

The Topia term extractor is the other thing I wanted to point out here.

Also worth noting: both of these libraries are in Python. An awful lot of natural language work ends up in Python. That's kind of interesting, actually.
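For the record, a rough Topia-style extraction is also simple to approximate in Perl. This sketch (function name and stopword list are mine, not Topia's) just counts repeated unigrams and bigrams:

```perl
use strict;
use warnings;

# Count repeated unigrams and bigrams, skipping stopwords; return
# candidates seen at least $min_count times, most frequent first.
sub extract_terms {
    my ($text, $min_count) = @_;
    $min_count ||= 2;
    my %stop = map { $_ => 1 }
        qw(the a an and or of to in for on with is are was were);
    my @tokens = map { lc } $text =~ /([A-Za-z]+)/g;
    my %count;
    for my $i (0 .. $#tokens) {
        $count{ $tokens[$i] }++ unless $stop{ $tokens[$i] };
        next if $i == $#tokens;
        my ($w1, $w2) = @tokens[ $i, $i + 1 ];
        $count{"$w1 $w2"}++ unless $stop{$w1} or $stop{$w2};
    }
    return grep { $count{$_} >= $min_count }
           sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}
```

Real term extractors weight by part of speech and corpus statistics; this is just frequency, but frequency alone already surfaces useful candidates.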

Friday, October 8, 2010


This has been done to death, of course, but I need to start thinking about a terminology engine, and also about specific terminology - right now I'd like a database of titles of industrial standards in various languages. They come up rather a lot.

Wednesday, October 6, 2010

PDF reading

There are a couple of workflows where PDFs are needed.

First is where a series of pages have been scanned and need to be translated starting from the graphics. OCR can come in handy here (if it works, which it usually doesn't), but I want to highlight the fact that (1) the pages are very often disjoint (think medical records) and (2) sometimes have Bates numbers (legal annotations identifying each individual page in a set of documents). This overall structure could do with some software support. I'm thinking something that takes individual document segments and ties them back into a structured overall document with, say, the Bates numbers.
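As a sketch of what that support might look like - assuming, hypothetically, Bates numbers of the form "SMITH000123" (short uppercase prefix plus zero-padded digits; real document sets vary) - reordering scanned pages by their Bates number is only a few lines:

```perl
use strict;
use warnings;

# Reorder OCRed pages by a Bates-style number found in each page's
# text. The prefix/digit pattern is an assumption for illustration.
sub order_by_bates {
    my @pages = @_;    # each page is a string of OCRed text
    my @tagged;
    for my $page (@pages) {
        my ($prefix, $num) = $page =~ /\b([A-Z]{2,5})(\d{4,8})\b/;
        # Pages without a recognizable number sort first.
        push @tagged, [ $prefix // '', defined $num ? $num + 0 : -1, $page ];
    }
    return map { $_->[2] }
           sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @tagged;
}
```

The same tagging step could feed a table of contents tying the disjoint segments back into one structured document.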

[With respect to that OCR: it would be nice to have a pre-OCR stage that finds and identifies pages that are similar - this could simplify finding letterhead, headings, and so on.]

The second workflow of interest is text PDFs. See, PDFs don't have document structure like Word documents. If a header appears on every page, well then it will be reproduced on every page in text. So it would be nice to be able to impose - to recognize - this sort of structure in order to take PDFs and translate them. (You could argue that a TM tool would do this for you - but I would prefer to abstract out the different document parts in order to translate them separately, when we start thinking about machine translation. The MT tool will need as much help as it can get.)
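Recognizing that per-page structure can start very simply: a line that shows up on nearly every page is a running header or footer, not body text. A sketch, with an arbitrary threshold parameter:

```perl
use strict;
use warnings;

# Find lines that recur on at least $threshold (0..1) of the pages;
# those are probably headers/footers rather than body text.
sub find_repeated_lines {
    my ($threshold, @pages) = @_;    # one string per page
    my %seen_on;
    for my $page (@pages) {
        my %on_this_page;
        for my $line (split /\n/, $page) {
            $line =~ s/^\s+|\s+$//g;          # normalize whitespace
            $on_this_page{$line} = 1 if $line ne '';
        }
        $seen_on{$_}++ for keys %on_this_page;
    }
    return grep { $seen_on{$_} / @pages >= $threshold } sort keys %seen_on;
}
```

Anything this finds can be translated once and abstracted out before the body text goes to the TM or MT stage.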

Anyway. Just a thought I'm too busy to follow up on right at the moment.

Friday, October 1, 2010


So I tried Systran on a new potential project in German and Italian (SOPs from the same company in both languages, for translation to English). I figured after the corporate charter, with its quite passable results, I'd try Systran on these as well.


Here's just the first sentence of the German:

The available SOP serves for the Sicherstellung of the requirements and conditions, which must fulfill the suppliers, so that them for the supply of a supplier sample to become certified to be able.

You can't edit that. All you can do is retranslate it - either directly from the German or from this intermediate not-German not-English near-gibberish. So maybe the corporate charter was a fluke, or maybe Systran performs better on French than on German (or Italian - the results were equally unreadable on my Italian sample). Either way, my initial hopes for being able to use Systran are pretty much shot. This does not speed me up, and it's clear that careful glossary work, while it might help a little, wouldn't be enough - Systran doesn't actually appear to understand or use syntax.

Thursday, September 30, 2010

Thoughts on practical use of machine translation

So since I haven't had the time to get OpenLogos running (I swear, just when I started, the work just came pouring in - I'm at 123,000 words for the month, phew) and given that I was far, far behind schedule on a large and boring corporate charter in French, I decided to try Systran.

(Oh, no, he didn't go there!)

I hadn't looked at Systran since 2005, when I had some work post-editing its abysmal output for an agency in Italy. I came to the conclusion then that it was normally just as easy to translate a given text myself as to try to decipher what Systran had come up with and whip it into something comprehensible to an English speaker, and that translating it myself paid five times as well. So: no-brainer, and I actually lost my Systran install.

But, well, it's been five years. Surely they could do a better job by now, right? And hey, it's only 100 bucks for the home version, which now includes a whole raft of languages - in fact, with the exception of Hungarian, all the languages I work with. So.

Here's the workflow I used: I ran Systran on my file, then aligned it with the original, and loaded it into my TM. Then I started down the file sentence by sentence in the normal manner, with the aligned segments coming up as I went.

This worked pretty damn well, actually. OK, there were some Systranisms - mille in a year (as in deux mille dix) is generally not translated "millet", and I'm not sure why that would be the default. I dealt with these by loading the TM transfer file in my editor of choice and doing global search-and-replace on them as I went. Then I'd import the edited segments back into the TM and proceed. So commonly mistranslated terms got better as I went. Since the file was 13,000 words, this approach had time to work.
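The search-and-replace pass itself is trivial; the value is in accumulating the fix list. A sketch, with an illustrative one-entry list (the entry and sub name are mine):

```perl
use strict;
use warnings;

# Illustrative fix list of Systranisms; the real list grows as the
# job progresses.
my %fixes = ( 'millet' => 'thousand' );

# Apply every known fix to one target segment.
sub fix_segment {
    my ($target, $fixes) = @_;
    for my $wrong (keys %$fixes) {
        $target =~ s/\b\Q$wrong\E\b/$fixes->{$wrong}/g;
    }
    return $target;
}
```

Run over every target segment in the transfer file before reimporting, this does mechanically what I was doing by hand in the editor.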

I should note that nearly every sentence needed modification. There were some real screamers in terms of Martian word order - so this should be considered kind of a rock-bottom minimum; what I wanted to know is whether it would accelerate my work even so.

My normal "fast" progress is 700 to 1000 words an hour. For this dreary text, I would probably have managed no more than 400 or 500 an hour. With this procedure, though, I managed a throughput of something between 1500 and 2500 words an hour. That ... that freaking works.

I think quality suffered somewhat, although as it was a corporate charter, I don't think I would have done fantastic quality anyway, so it's hard to say. I should continue to give this a try - certainly the preliminary results on this one job were entirely convincing and I now have much more confidence that machine translation should be part of my toolkit.

How would I improve things, you ask? Pretty much using the same tools I want to implement anyway:
  • Global search and replace for terms in a bilingual list. (This has two aspects: replacement should be sensitive to grammar in the target language, i.e. pluralizing correctly, but it should also be sensitive to the source phrase, sort of a "replace X with X' only if it's a translation for Y".)
  • Automation of simple TRADOS tasks (e.g. reloading the TM after I do a global search and replace.)
  • A database of rewording rules. This is slowly taking shape in my mind - it would be a valuable tool for any proofreader. It could also "translate" between American and British, if you see what I mean. Kind of a spellchecker on steroids, if you will.
  • Automation of Systran itself; the home version runs inside Word or with a standalone tool and they don't really want you to do things like automating it without giving them a lot of money for the Professional or Enterprise versions.
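The source-sensitive replacement from the first bullet could be sketched like this - the rule format is my own invention, just to make the "replace X with X' only if it's a translation for Y" idea concrete:

```perl
use strict;
use warnings;

# Replace the wrong target term with the right one only when the
# source segment actually contains the trigger term - so a correct
# "millet" elsewhere in the document survives.
sub conditional_replace {
    my ($source, $target, $rule) = @_;
    my ($y, $x, $x_new) = @$rule{qw(source_term wrong right)};
    return $target unless $source =~ /\b\Q$y\E\b/;
    $target =~ s/\b\Q$x\E\b/$x_new/g;
    return $target;
}
```

Grammar-sensitive replacement (correct pluralization and the like) is the genuinely hard half and isn't attempted here.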
Anyway, I wanted to post this while the job was fresh in my memory. Now it's back to work for me, this time without the Systran crutch.

The real takeaway for me was: even bad MT, if well managed, would augment my throughput, potentially by a lot. And the various accessories I would need for Systran work will also be applicable to work with OpenLogos, so it's not wasted work if I get around to writing some.

Wednesday, September 15, 2010


So I started a File::XLIFF module yesterday. XLIFF [spec] is an interesting format. Like many XML formats, it's overengineered to the point that I suspect nobody will ever use it to its fullest extent. It maps onto the much simpler TTX format only with a lot of folding, spindling, and mutilation.

The basic Xlat::File model of a file as a simple set of segments may turn out to be oversimplified when it comes to XLIFF. As the most obvious example, a single XLIFF file can contain multiple sections, each of which refers to content from a separate file, and thus each of which has its own header and its own body.

Under the assumption that an XLIFF file will usually correspond to a single source file, I'm going to define a "default section" (that being the first in the file) that will be the target for the API against the file object; using XLIFF-specific functions, I'll expose a way to get a list of sections and create a separate file object pointing to a section that's numbered 2 or above.
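A first cut at the section lister could look like this. A real File::XLIFF should use a proper XML parser, so treat this regex version purely as an illustration of the intended API shape:

```perl
use strict;
use warnings;

# Naive section lister: each <file> element in an XLIFF document is
# one "section"; the first is the default section.
sub xliff_sections {
    my ($xml) = @_;
    my @sections;
    while ($xml =~ /<file\b([^>]*)>/g) {
        my ($original) = $1 =~ /original="([^"]*)"/;
        push @sections, $original;
    }
    return @sections;    # list of source filenames, in order
}
```

The file object's default API would then operate on section 1, with an XLIFF-specific call constructing a second file object around any later section.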

Each of the File:: modules should probably have an Xlat::File superclass. I don't want to introduce needless dependencies, though; perhaps I can test for installation of Xlat::File before superclassing? Or maybe this is a plate of beans.

Tuesday, September 14, 2010

Patent language

Not software-related, but patent-related, I just wanted to link to this incredible example of clear exposition explaining the structure of patent claims.

Friday, September 3, 2010

OpenLogos on SourceForge

OpenLogos isn't really part of the xlat project, so I'll be transitioning its blog over to its own home on SourceForge. The real news being that it's now on SourceForge, with yours truly as maintainer.

Installing OpenLogos on Ubuntu 10.x (32-bit)

So, as noted below in the Fedora post, I'm giving up on 64-bit Fedora right now and falling back to an older machine, installing 32-bit Ubuntu on it so I can follow the instructions in Torsten Scheck's article without needing to work too hard. I'll post any discoveries here as I go; right now, I'm downloading the Ubuntu installer.

(10/17/2010) It's embarrassing, but I'm only now to this point. I spent some time getting Ubuntu installed on an old machine, but an update seems to have clobbered the boot sector or something - and frankly, that machine has been a problem for a while now. So this week I built a 32-bit Ubuntu virtual machine on my desktop box, and I'm chugging along.

After making my earlier changes again (the ones I made on Fedora), things are compiling well. I'm getting a lot of warnings from including logos_libs/ruleengine/rulebase.h of the form "format '%.2d' expects type 'int', but argument 3 has type 'long unsigned int'", but aside from those warnings, things worked fine. I'm going to have to look into those.

Ah. In lgsentity.cpp, "warning: deprecated conversion from string constant to 'char*'", and in a couple of other files, as well.

I ended up adding various headers to about ten files all in all. Not too bad.

The installation routine failed to create /usr/local/share/openlogos/bin for some reason - acting as though it wasn't running as sudo root. Strange, and something that should be examined.

But ... I seem to have installed OpenLogos at long last.

Friday, August 27, 2010

Tesseract OCR

Google's Tesseract seems to be just about the best OCR out there. It doesn't seem to play well with others yet (it's written on the assumption that it's a standalone utility, not a library) but given that it's Google, it'll probably get a lot better fast.

I should probably investigate. OCR is an important component of a lot of translation jobs, and all existing OCR sucks. Sigh. That's only partly hyperbole.

Thursday, August 19, 2010


So investigating some of the background for OpenLogos led me to the NooJ project, the brainchild of one Max Silberztein. Weirdly, it's in .NET, but aside from its choice of platform and the worryingly closed source, it appears to be manna from heaven and crack for my natural-language habit.

Lemme put it this way: 90% of the heavy lifting of the xlat project has now already been done. All that's left is integrating all this stuff into something like a coherent toolset. I fully intend to enjoy myself immensely. (While chafing at the closed source - but them's the breaks, kid.)

Compiling OpenLogos under Fedora Core 11

I've definitely gone down the rabbit hole with OpenLogos. It is truly a thing of utter archaic beauty. It's forty years old this year! Which, in terms of software, makes it one of the oldest existing codebases on the planet - quite possibly the oldest open-source codebase in existence.

I'm trying to get it running on my Fedora Core 11 box, very much in spare time. I'll continue to update this post as I get things running. There will be later posts on how to run the thing once it's built. Assuming I can; this is a 64-bit machine.

1. Dependencies: Java and unixODBC

Although most of the code is written in C++, there appear to be some Java components. I've had nothing like time even to make a cursory survey of the codebase yet, so I don't know what's using Java and what isn't, but Java is definitely a prerequisite for the build. Since Fedora ships with OpenJava, not Sun's Java, the first thing to do is to get the java-devel package installed (my runtime is 1.6.0, the latest as of this writing, so I obtained the matching devel):

yum install java-1.6.0-openjdk-devel

Once that's done, you'll specify the installation directory in your configure command to build the make environment for OpenLogos, like this:

./configure --with-java=/usr/lib/jvm/java-openjdk/

Don't run that yet, though, because the other compilation prerequisite is unixODBC. I tried installing it with yum, but it didn't work for me, so I fell back on the ancient technique of downloading and compiling it myself. I'm going to assume you can manage that (otherwise, trust me here, you're going to have a hell of a time with OpenLogos) - the download is where you expect it, so get that, unpack it, do the configure-make-make-install thing, and you're good to go.

2. gcc 4.3 header cleanup

Now you can run your configure. At this point, this worked fine for me. However, you're not quite done yet. Assuming you're using the DFKI distro 1.03, like I am, and gcc 4.4.1, you'll find that as of 4.3, the gcc headers have been cleaned up, and so there are dependencies missing. What compiled last time DFKI built (obviously gcc 4.2 or earlier) needs patching now. That is the status as I write this post; I'll update as I go, and provide a patch file at some point.

Update 8/23/2010:
The errors here take the form of:

error: 'xxxx' was not declared in this scope

And apply to the following functions:

strchr was assumed to come in via string.h, but now needs an explicit include of <cstring> (affects lgsstring.h).
atoi was assumed to come in via string.h, but actually needs <cstdlib> (affects lgsstring.h).

That might be it, actually.

The other bit of sloppy programming (not casting aspersions! I'm guilty of plenty of sloppiness, which is why I just gave up and decided to use Perl from now on in the first place) exposed by the move to gcc 4.3 is a duplicated parameter name in the declaration of rightTrim (two parameters named 's' - oops!). I renamed the const char * s to 't' to match the .cpp file, but man, that looks like something I would have done. Weird that earlier compiler versions didn't flag it.

3. 32-bit architectural assumptions

Those fixes complete (and it's still 8/23/2010), the next problem is:

error: cast from 'const char*' to 'int' loses precision

Whoops. Did I mention I'm compiling on a 64-bit architecture? Yeah. So int is a 32-bit value, and addresses are 64 bits now. The answer is to replace with intptr_t, a guaranteed right-sized integer value defined in stdint.h and mandated in the C99 standard, so really there was no excuse to be casting pointers to vanilla int in 2006 (not that I would have done differently, but I'm old and distracted and prefer Perl anyway, allowing the interpreter contributors to worry about this stuff). Anyway, this little gem affects the parser, which uses addresses throughout as integer hash lookups. That's gotta go, but that's probably going to take some more thorough investigation and I've got deadlines for tomorrow morning, so that's it for August 23.

I wish more of the individual modules had unit tests. I'm going to shoot myself in the foot fast with this stuff sooner or later. Perhaps I should write some (if I only knew what to test, that would probably work out great - and I have to admit, it would be a great way to start understanding internals).

Anyway, the int usage appears to be just in private members of the CParser class, but I worry that they're going to end up getting used to talk to PostgreSQL, and then where will I be? I should probably worry about that if and when it comes up.

Update 9/3/2010:
I've been too busy to keep up with the 64-bit conversion, so I'm repurposing an older box I have as a 32-bit Ubuntu box (by which I mean, I pulled it out of the storage room, where it was gathering dust for just such an occasion), just so I can get a fresh compile and see this thing run once in my life. I may or may not get back to compiling under FC11 on the 64-bit machine.

Saturday, August 14, 2010

tf-idf weights

Quoth Wikipedia, "The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining."

The idea is that you determine the weights of terms based on their frequency in both the current document and in your overall corpus. This lets you find documents based on terms they use that are less frequent overall, and thus that are likely to indicate what the document is about.
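In Perl terms the weight is straightforward. This sketch scores each term of one document against a corpus of documents, using the standard tf(t,d) * log(N / df(t)) formula:

```perl
use strict;
use warnings;

# Score each term of $doc against a corpus of document strings:
# term frequency times log of (corpus size / document frequency).
sub tf_idf {
    my ($doc, @corpus) = @_;
    my %tf;
    $tf{ lc $_ }++ for $doc =~ /(\w+)/g;
    my %score;
    for my $term (keys %tf) {
        my $df = grep { /\b\Q$term\E\b/i } @corpus;   # document frequency
        next unless $df;
        $score{$term} = $tf{$term} * log(@corpus / $df);
    }
    return \%score;
}
```

Words like "the" that appear in every document score zero, while rarer content words float to the top - exactly the property we want for terminology mining.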

Terminology mining is a technique by means of which "interesting" terms can be found in a document. The interesting terms can then be researched in advance of the translation process, so that the translation itself can be both consistent and quick.

There are lots of links I want to save that are tangentially related to this sort of textual analysis.
  • Gensim is a textual analysis library in Python.
  • An earlier paper on term weighting.
  • A tf-idf library in Python at Google Code.
  • And another at Github.

Monday, August 2, 2010


So my roadmap, or to-do list, or what have you, is kind of like this:
  1. Word client
    • Port Anaphraseus
      • Write OOo <=> Word Basic cross parser
    • Use an IP-based server connection to a TM of my own devising (below)
  2. TTX/Xliff client
    • Based on wxPerl and Wx::Declarative
    • Features can be taken largely from Xliff editor in Translation Workspace
    • Also talks to TM via IP-based protocol
  3. TM server with IP-based protocol
    • Basic database is easy
    • Fuzzy matching needs some examination
    • Also Wx::Declarative target
Here's how I expect to increase my productivity:
  • Simultaneous spell checking and terminology checking as I work; separate query window pops up queries unobtrusively after each segment committed
  • Decisions made in the query window are propagated back into the active document and any other documents in the same open project - this includes both terminology checks and spell checker dictionary additions. (Terminology and the spell checker will share a database.)
  • Frequent words are identified for accelerators; accelerators for terminology in the open segment are displayed in a cheat sheet window. Any repeated words in incoming segment translations are also identified as potential accelerators.
That's the first phase.

The second phase will probably start to incorporate some MT. Note OpenLogos especially in this regard; there's a library I could use with confidence. Post-editing will include the syntax-aware editor in some way.

Well - this has definitely been a late-night post; it's really more note-taking than anything.


Open-source machine translation. Open-source. Machine. Translation.

ATA Translation Tools overview


Useful catalog of translation-related software

Software for translation.

I thought I remembered something like Anaphraseus for Word-native Basic, but I was apparently suffering from hopeful memory. Anaphraseus uses OpenOffice.org Basic only; I'm wondering, though, whether I could port it.

I really want to use Word.

Wednesday, July 21, 2010

Syntactically savvy editors

So I just finished 15,000 words of a community impact study from Hungarian to English. It was a mind-expanding experience, as HU>EN always is, and I found myself thinking hard about how Word is insufficient for my needs when translating between languages with radically different sentence structure.

This doesn't bother me so much any more with German (I compensate automatically, looking ahead in the German sentence for the structure I know will end up at the start of the English sentence), but in the Romance languages, I tend to backtrack a lot to insert adjectives that I hadn't noticed before starting to type a phrase.

That much I could probably learn to use word navigation for (I've just never learned that because text editors don't have word navigation, so it's not built into my motor cortex like other navigation commands). But in Hungarian, things are freaking different.

In Hungarian, it's not at all unusual for me to need to construct a sentence painfully, phrase by phrase, realizing again and again that the words I'm finding at the end of what I thought was the full phrase actually need to go at the front in English. What I'd like in a situation like this is an editor that understands the syntax of what I'm doing. (Which, of course, is in general impossible - but sometimes, it could probably work.) Some way of keying into a separate tree mode for this sort of editing would really be very useful.

More thought is required; I just wanted to mark the idea now.

Monday, July 19, 2010

3 writing-quality metric scripts

I'm not sure how relevant this is to translation, but it's still an interesting approach to handling natural language automatically. I'm guilty of using weasel words, so I find this interesting.

I think some sort of quality metric facility would be a useful tool in the kit.

Friday, July 16, 2010

Link dump: interesting natural language modules from CPAN

Algorithm::WordLevelStatistics - finds keywords in generic text. This should be a useful analysis tool for terminology research.

Lingua::Stem - finds stems for a smallish set of languages.

Lingua::StarDict::Gen - generates StarDict dictionaries. (Might be useful.)

Lingua::StarDict itself (2004) and the StarDict project (2007) at SourceForge, hmm. (console version, dates to 2006) - This might be dead, but it's intriguing.

Lingua::YaTeA - extracts noun phrase candidates from a corpus. Definitely to be studied. Seems to have a lot of innards.

Lingua::WordNet - pure Perl WordNet. Apparently. Needs study.

Lingua::Translate - interface to a Web-accessible machine translator, e.g. Babelfish.

Lingua::Sentence - Hello, segmentation! Thanks, CPAN!

Text::Ngrams, Text::Ngramize, Algorithm::NGram - n-gram analysis of text. Oh, and Text::WordGrams, too. And maybe Text::Positional::Ngram.

Proposal: generic natural-language-smart string handling

... or something.

There are operations I'd like to do on these sentences that are weakened by the assumptions made for strings. For instance, I'd like not to do terminology checking on a case-insensitive basis, because there are words that are incorrect if not capitalized. But that simply means that everything that starts a sentence will register as misspelled, which is also wrong.

Spell checkers probably already take this into consideration, but I'd like to be able to point at a file, say definitively "this file contains textually encoded natural language", and do some smarter things than normal file I/O will allow. Even assumptions about character encoding are different if we know something is language.

Similarly, just some function to extract words and n-phrases correctly from a punctuated sentence would be a great help; this is one of the many things the current Xlat::Termbase is overly naive about.
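A minimal version of that word-and-n-phrase extraction could look like this (sub name mine; a real version would need to be smarter about punctuation and encodings):

```perl
use strict;
use warnings;

# Pull words out of a punctuated sentence, then build every n-word
# phrase. Hyphens and apostrophes stay inside words.
sub n_phrases {
    my ($sentence, $n) = @_;
    my @words = $sentence =~ /([\w'-]+)/g;
    return @words if $n == 1;
    return map { join ' ', @words[ $_ .. $_ + $n - 1 ] }
           0 .. $#words - $n + 1;
}
```

Even this much would make the current Xlat::Termbase considerably less naive about what counts as a term in running text.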

So that's my thought for the day.

Thursday, July 15, 2010

Configuration and the command line

OK, so I'll admit it - I've used computers for well over two decades now and I'm comfortable with Linux, and I still don't prefer the command line. Oh, for some stuff it's great (like, power tools) - but I just have no head for remembering commands and parameters. I sucked at the Rocky Horror Picture Show, too. Always inadvertently paraphrasing.

It makes me a good translator, actually: what is translation but reading a German sentence and paraphrasing it in English? But for command line manipulation, I'm not your man.

However, in this initial stage and for the foreseeable future, Xlat will be a set of command-line tools. (And those command-line tools are damned important, anyway, so they'll stick around.)

Here's an idea, though. When I'm in a directory, I want the entire Xlat suite to know some important things about that directory, i.e. the termbase I want to use there, the customer's name and ID perhaps, I don't know what all, but I want an open-ended scheme to set it all up.

And that scheme needs to cascade. If a value isn't found in the context for a directory, we should check the parent directory (i.e. if it's not in the project, I want it in the customer's main directory). And so on.

Moreover, I want to be able to override things on the command line if I want to use an alternative termbase or something.

In addition to this, I want a session context to be saved, i.e. the last file touched and things like that. The next time I do a termcheck, if I don't give it a file, it'll pull the last file I used. That kind of thing. Just a way to make this stuff easier to use, while preserving the power and convenience of command-line utilities.
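The cascading lookup is the core of the idea, and it's small. A sketch, assuming (my assumption, not a decision) a simple "key value" file named .xlat in each directory:

```perl
use strict;
use warnings;
use File::Spec;

# Cascading context lookup: read a "key value" file in $dir, then
# walk up toward the root until the key is found or we run out of
# parent directories.
sub context_lookup {
    my ($dir, $key) = @_;
    while (1) {
        my $file = File::Spec->catfile($dir, '.xlat');
        if (open my $fh, '<', $file) {
            while (my $line = <$fh>) {
                my ($k, $v) = $line =~ /^(\S+)\s+(.*)$/;
                return $v if defined $k and $k eq $key;
            }
        }
        my @parts = File::Spec->splitdir($dir);
        pop @parts;
        my $parent = File::Spec->catdir(@parts);
        last if $parent eq $dir or $parent eq '';
        $dir = $parent;
    }
    return undef;
}
```

Command-line overrides would then just be checked before this lookup ever runs.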

I'm not sure what to call the module. Config:: something, but both Cascading and Context have been taken for things that aren't entirely what I want. So it deserves some thought.

A note on character encodings

Here's a sticky wicket (as character encodings always are). By default, Notepad (my text editor of choice for simple files) represents umlauted vowels in the normal ISO eight-bit character set. Padre (my Perl IDE of choice) represents umlauted vowels within strings as UTF-8, which is much better.

Here's the problem: if I edit a German word in a text file, and the same German word in a string, they don't test as equivalent. This is a problem. It's a widespread enough problem that I'm going to have to come up with a principled, central way to deal with it. So watch this space, I guess.
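The principled fix is to decode every byte stream into Perl's internal character representation at the boundary, using the Encode module, and only compare after that:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same word, "grüßen", as raw bytes in two encodings:
my $latin1_bytes = "gr\xfc\xdfen";           # ISO-8859-1
my $utf8_bytes   = "gr\xc3\xbc\xc3\x9fen";   # UTF-8
# As byte strings these are NOT equal. Decoding both into Perl's
# internal character representation makes them comparable:
my $word_a = decode('iso-8859-1', $latin1_bytes);
my $word_b = decode('utf-8',      $utf8_bytes);
# Now $word_a eq $word_b holds, though the byte strings differ.
```

The central policy would then be: no Xlat module ever compares raw bytes; everything decodes on read and encodes on write.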

Terminology checking v0.01

So. I just posted v0.01 of a terminology checker script to the Wiki. It is painfully naive in its structure and coding, but it got the job done tonight for some terminology checking I wanted to do, and it illustrates just how simple these basic tools can be. The key of it is this:
foreach my $s ($ttx->segments()) {
    my $c = $t->check($s->source, $s->translated);
    if ($c) {
        foreach my $missing (keys %$c) {
            $terms->{$missing} = $c->{$missing};
            $bad->{$missing} = [] unless defined $bad->{$missing};
            push @{$bad->{$missing}}, $s;
        }
    }
}

Now, note that it's using a termbase module I haven't published yet (because it's even more terribly naive), but the key here is that this loop is really, really simple.
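For concreteness, here's a guess at the shape of that check() - my sketch, emphatically not the actual unpublished module:

```perl
use strict;
use warnings;

{
    package Xlat::Termbase::Naive;

    # Hypothetical stand-in for the unpublished termbase. check()
    # returns a hashref of source terms whose approved translation
    # is missing from the target, or undef if everything checks out.
    sub new {
        my ($class, %terms) = @_;    # source term => target term
        return bless { terms => \%terms }, $class;
    }

    sub check {
        my ($self, $source, $target) = @_;
        my %missing;
        for my $s (keys %{ $self->{terms} }) {
            my $t = $self->{terms}{$s};
            next unless $source =~ /\b\Q$s\E\b/i;
            $missing{$s} = $t unless $target =~ /\b\Q$t\E\b/i;
        }
        return %missing ? \%missing : undef;
    }
}
```

Plug something like this in for $t and the loop above does a whole file's terminology check in one pass.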

This is what translation tools should look like. I'm pretty happy with this.

Wednesday, July 14, 2010

Announcing File::TTX

File::TTX is the first fruit of the project. It's the early version of a Perl module that works with TRADOS TTX files. It will probably end up as a plugin for something like Xlat::Document, if all goes according to plan.

Saturday, July 10, 2010


The Xlat project is an open-source set of Perl tools and modules to facilitate translation. A bit of background might be useful: I'm a technical translator, mostly German to English. But before I ever started that career, I was a programmer, mostly C at the time. Lately I lean towards Perl; CPAN is just so useful.

At any rate, I tend to want to write tools to help me work. Until recently, I hadn't been organized enough to publish any of my scripts, and so every time I had a script need, I'd start from scratch. Now that I've started publishing on CPAN, I'm no longer losing quite so much ground.

This blog is where I organize my thoughts and plans, and announce new tools in the toolkit. If you're reading it, great! I would appreciate any and all feedback; as I'm sure you know, handling natural language is very hard indeed. My philosophy is to release early and raw, then iterate. That means that for any given project, these tools may well fail in egregious ways (character encodings are always a great way to get that to happen). Caveat emptor - a full refund is always available. (A little open-source joke.)

If you've got comments, you know what to do. I'd appreciate any feedback.