
Wednesday, July 21, 2010

Syntactically savvy editors

So I just finished 15,000 words of a community impact study from Hungarian to English. It was a mind-expanding experience, as HU>EN always is, and I found myself thinking hard about how Word is insufficient for my needs when translating between languages with radically different sentence structure.

This doesn't bother me so much any more with German (I compensate automatically, looking ahead in the German sentence for the structure I know will end up at the start of the English sentence), but in the Romance languages, I tend to backtrack a lot to insert adjectives that I hadn't noticed before starting to type a phrase.

That much I could probably learn to use word navigation for (I've just never learned it, because not every text editor has word navigation, so it's not built into my motor cortex the way other navigation commands are). But in Hungarian, things are freaking different.

In Hungarian, it's not at all unusual for me to need to construct a sentence painfully, phrase by phrase, realizing again and again that the words I'm finding at the end of what I thought was the full phrase actually need to go at the front in English. What I'd like in a situation like this is an editor that understands the syntax of what I'm doing. (Which, of course, is in general impossible - but sometimes, it could probably work.) Some way of keying into a separate tree mode for this sort of editing would really be very useful.

More thought is required; I just wanted to mark the idea now.

Monday, July 19, 2010

3 writing-quality metric scripts

I'm not sure how relevant these scripts are to translation, but they're an interesting approach to handling natural language automatically. I'm guilty of using weasel words myself, so the idea appeals to me.

I think some sort of quality metric facility would be a useful tool in the kit.
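
Just to show how small such a metric can be, here's a rough sketch of a weasel-word counter in Perl - the word list is a placeholder, not any kind of definitive set:

#!/usr/bin/perl
# Toy weasel-word metric: hedging words per 1000 words of input.
use strict;
use warnings;

my @weasels = qw(various fairly quite several relatively many most largely);
my $pattern = join '|', @weasels;

my $text  = join '', <>;                   # slurp stdin or the files on the command line
my $words = () = $text =~ /\b\w+\b/g;      # total word count
my $hits  = () = $text =~ /\b(?:$pattern)\b/gi;

printf "%d weasel words in %d words (%.1f per 1000)\n",
    $hits, $words, $words ? 1000 * $hits / $words : 0;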

Friday, July 16, 2010

Link dump: interesting natural language modules from CPAN

Algorithm::WordLevelStatistics - finds keywords in generic text. This should be a useful analysis tool for terminology research.

Lingua::Stem - finds stems for a smallish set of languages.

Lingua::StarDict::Gen - generates StarDict dictionaries. (Might be useful.)

Lingua::StarDict itself (2004) and the StarDict project at SourceForge (2007; the console version dates to 2006), hmm. This might be dead, but it's intriguing.

Lingua::YaTeA - extracts noun phrase candidates from a corpus. Definitely to be studied. Seems to have a lot of innards.

Lingua::WordNet - pure Perl WordNet. Apparently. Needs study.

Lingua::Translate - interface to a Web-accessible machine translator, e.g. Babelfish.

Lingua::Sentence - Hello, segmentation! Thanks, CPAN!

Text::Ngrams, Text::Ngramize, Algorithm::NGram - n-gram analysis of text. Oh, and Text::WordGrams, too. And maybe Text::Positional::Ngram.
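
As a taste of how little code some of these take, here's roughly what word-bigram extraction with Text::Ngrams looks like (written from memory of its documented interface, so treat the option names as approximate):

use strict;
use warnings;
use Text::Ngrams;

# Word bigrams from a snippet of text, most frequent first.
my $ng = Text::Ngrams->new( windowsize => 2, type => 'word' );
$ng->process_text('The community impact study was finished on time.');
print $ng->to_string( orderby => 'frequency', onlyfirst => 10 );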

Proposal: generic natural-language-smart string handling

... or something.

There are operations I'd like to do on these sentences that are undermined by the assumptions normal strings make. For instance, I'd rather not do terminology checking case-insensitively, because some words are simply incorrect if not capitalized. But a case-sensitive check means everything that starts a sentence will register as a mismatch, which is also wrong.
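
One possible compromise - purely a sketch, not anything Xlat::Termbase does yet - is to compare case-sensitively but forgive the capital a term picks up when it happens to open the sentence:

use strict;
use warnings;

# Hypothetical helper: case-sensitive comparison, except that a term is
# allowed an initial capital when the word opens a sentence.
sub term_matches {
    my ($term, $word, $sentence_initial) = @_;
    return 1 if $word eq $term;
    return 0 unless $sentence_initial;
    return $word eq ucfirst $term;
}

print term_matches('voltage', 'Voltage', 1) ? "ok\n" : "flagged\n";  # ok - sentence-initial
print term_matches('voltage', 'Voltage', 0) ? "ok\n" : "flagged\n";  # flagged - mid-sentence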

Spell checkers probably already take this into consideration, but I'd like to be able to point at a file, say definitively "this file contains textually encoded natural language", and do some smarter things than normal file I/O will allow. Even assumptions about character encoding are different if we know something is language.

Similarly, just some function to extract words and n-phrases correctly from a punctuated sentence would be a great help; this is one of the many things the current Xlat::Termbase is overly naive about.
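
Just to make the shape of that function concrete, something along these lines (very rough; real tokenization has far more edge cases than this):

use strict;
use warnings;

# Pull words out of a punctuated sentence, keeping internal hyphens and
# apostrophes, then build n-word phrases from the result.
sub words {
    my ($sentence) = @_;
    return $sentence =~ /[\p{L}\p{N}]+(?:['-][\p{L}\p{N}]+)*/g;
}

sub n_phrases {
    my ($n, @words) = @_;
    return map { join ' ', @words[$_ .. $_ + $n - 1] } 0 .. $#words - $n + 1;
}

my @w = words(q{The "community impact" study, finished yesterday, ran to 15,000 words.});
print join("\n", n_phrases(2, @w)), "\n";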

So that's my thought for the day.

Thursday, July 15, 2010

Configuration and the command line

OK, so I'll admit it - I've used computers for well over two decades now and I'm comfortable with Linux, and I still don't prefer the command line. Oh, for some stuff it's great (like, power tools) - but I just have no head for remembering commands and parameters. I sucked at the Rocky Horror Picture Show, too. Always inadvertently paraphrasing.

It makes me a good translator, actually: what is translation but reading a German sentence and paraphrasing it in English? But for command line manipulation, I'm not your man.

However, in this initial stage and for the foreseeable future, Xlat will be a set of command-line tools. (And those command-line tools are damned important, anyway, so they'll stick around.)

Here's an idea, though. When I'm in a directory, I want the entire Xlat suite to know some important things about that directory: the termbase I want to use there, perhaps the customer's name and ID, I don't know what all - but I want an open-ended scheme to set it all up.

And that scheme needs to cascade. If a value isn't found in the context for a directory, we should check the parent directory (e.g. if it's not in the project directory, I want it found in the customer's main directory). And so on.

Moreover, I want to be able to override things on the command line if I want to use an alternative termbase or something.

In addition to this, I want a session context to be saved - the last file touched and things like that. The next time I do a termcheck, if I don't give it a file, it'll pull the last file I used. That kind of thing: just a way to make this stuff easier to use, while preserving the power and convenience of command-line utilities.
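
To make the cascade concrete, here's a sketch - the .xlat file name and its key/value format are placeholders, not anything the suite actually uses yet:

use strict;
use warnings;
use Cwd qw(getcwd);
use File::Basename qw(dirname);

# Read one directory's settings from a (hypothetical) .xlat file of
# "key value" lines; a missing file just contributes nothing.
sub read_context {
    my ($dir) = @_;
    my %conf;
    open my $fh, '<', "$dir/.xlat" or return %conf;
    while (<$fh>) {
        next if /^\s*(?:#|$)/;
        my ($key, $value) = /^\s*(\S+)\s+(.*?)\s*$/ or next;
        $conf{$key} = $value;
    }
    return %conf;
}

# Merge contexts from the root down to the current directory, so closer
# directories override their parents; anything passed in (e.g. parsed from
# the command line) overrides all of them.
sub cascading_context {
    my (%overrides) = @_;
    my @dirs;
    my $dir = getcwd();
    while (1) {
        unshift @dirs, $dir;          # root first, current directory last
        my $parent = dirname($dir);
        last if $parent eq $dir;
        $dir = $parent;
    }
    my %conf;
    %conf = (%conf, read_context($_)) for @dirs;
    return (%conf, %overrides);
}

my %context = cascading_context(termbase => 'alternative.tdb');
print "termbase: ", $context{termbase} // '(none)', "\n";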

I'm not sure what to call the module. Config:: something, but both Cascading and Context have been taken for things that aren't entirely what I want. So it deserves some thought.

A note on character encodings

Here's a sticky wicket (as character encodings always are). By default, Notepad (my text editor of choice for simple files) represents umlauted vowels in the normal ISO eight-bit character set (Latin-1). Padre (my Perl IDE of choice) represents umlauted vowels within strings as UTF-8, which is much better.

Here's the problem: if I edit a German word in a text file and the same German word in a string, the two don't test as equivalent. It's a widespread enough problem that I'm going to have to come up with a principled, central way to deal with it. So watch this space, I guess.
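
For the record, the heart of the eventual fix will look something like this (hard-coded bytes standing in for the two files): decode everything to Perl's internal characters on the way in, and only compare after that:

use strict;
use warnings;
use Encode qw(decode);

# The same word as raw bytes: once as Latin-1 (Notepad's default for
# umlauts), once as UTF-8 (Padre's).
my $notepad_bytes = "M\xFCnchen";        # ISO-8859-1 "München"
my $padre_bytes   = "M\xC3\xBCnchen";    # UTF-8 "München"

print "raw bytes equal?  ", ($notepad_bytes eq $padre_bytes ? "yes" : "no"), "\n";

# Decode each according to its actual encoding, then compare characters.
my $from_notepad = decode('iso-8859-1', $notepad_bytes);
my $from_padre   = decode('utf-8',      $padre_bytes);
print "decoded equal?    ", ($from_notepad eq $from_padre ? "yes" : "no"), "\n";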

Terminology checking v0.01

So. I just posted v0.01 of a terminology checker script to the Wiki. It is painfully naive in its structure and coding, but it got the job done tonight for some terminology checking I wanted to do, and it illustrates just how simple these basic tools can be. The key of it is this:
foreach my $s ($ttx->segments()) {
    my $c = $t->check($s->source, $s->translated);    # terms missing from this segment's translation
    if ($c) {
        foreach my $missing (keys %$c) {
            $terms->{$missing} = $c->{$missing};
            $bad->{$missing} = [] unless defined $bad->{$missing};
            push @{$bad->{$missing}}, $s;              # remember which segments missed this term
        }
    }
}

Now, note that it's using a termbase module I haven't published yet (because it's even more terribly naive), but the key here is that this loop is really, really simple.

This is what translation tools should look like. I'm pretty happy with this.

Wednesday, July 14, 2010

Announcing File::TTX

File::TTX is the first fruit of the project. It's an early version of a Perl module that works with TRADOS TTX files. It will probably end up as a plugin for something like Xlat::Document, if all goes according to plan.
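
In use it looks something like this - the load() constructor here is from memory and may not match the released version exactly; segments(), source() and translated() are the calls from the termcheck loop above:

use strict;
use warnings;
use File::TTX;

# Dump source/target pairs from a TTX file.
my $ttx = File::TTX->load('example.ttx');
foreach my $s ($ttx->segments()) {
    printf "%s => %s\n", $s->source, $s->translated;
}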

Saturday, July 10, 2010

Introduction

The Xlat project is an open-source set of Perl tools and modules to facilitate translation. A bit of background might be useful: I'm a technical translator, mostly German to English. But before I ever started that career, I was a programmer, mostly C at the time. Lately I lean towards Perl; CPAN is just so useful.

At any rate, I tend to want to write tools to help me work. Until recently, I hadn't been organized enough to publish any of my scripts, so every time I needed a script, I'd start from scratch. Now that I've started publishing on CPAN, I'm no longer losing quite so much ground.

This blog is where I organize my thoughts and plans, and announce new tools in the toolkit. If you're reading it, great! I would appreciate any and all feedback; as I'm sure you know, handling natural language is very hard indeed. My philosophy is to release early and raw, then iterate. That means that for any given project, these tools may well fail in egregious ways (character encodings are always a great way to get that to happen). Caveat emptor - a full refund is always available. (A little open-source joke.)

If you've got comments, you know what to do.