The Xlat Project: 2012

Monday, December 10, 2012

File::TMX

I've just started scratching the surface of TMX files. I expect to end up with some useful, general-purpose tools for XML files along the way.

The European Medicines Agency publishes patient information about European-approved (not nationally approved) drugs in all the European languages. This would be a useful corpus for terminological (and syntactic) analysis.

Wednesday, October 17, 2012

OLIF

Open Lexicon Interchange Format (OLIF) is an XML terminology format that SDL Multiterm 2009 can import. (In other news, unlike the first time I bought them, TRADOS 2007 and Multiterm 2009 are now interoperable. Must have been an upgrade between then and when I bought this laptop. Bodes well for my everyday work!)

So ... building on OLIF and my new SQP tool, maybe it's time to consider writing that terminology database thing with a nice Perly wrapper.

By the way, OLIF was initially supported by SAP (and it's an SAP-related job I'm working on right now!) and the OLIF Consortium is like a who's-who of the big players in the translation industry. So it's probably worth grokking.

Tuesday, September 11, 2012

Translation of chemical names

Here's a pretty fascinating survey of chemical name translation (I've been doing a lot of pharma translation this month). Turns out it's pretty tricky - looking at it, I'm not 100% sure it's as tricky as people make it out to be, because it's typical of language people that they find software magical, and typical of programmers to find natural language unreasonably hairy. But still - I think there's probably a (small) market for this kind of tool.

Cross-posted.

Friday, June 15, 2012

Terminology from patent databases

It should be relatively easy to automate a crawl of any patent database and extract terminology from the abstracts and translations of abstracts.

Just a thought.

Thursday, March 8, 2012

Terminology identification

On term extraction: I checked Okapi; they're just doing the same up-to-n-gram extraction I've always done and found wanting. Their English tokenizer is also comparable to mine (better tested, to be sure). So all in all, maybe I'm capable of doing competent work here. I need to do better testing, though.

But anyway, I figured maybe NLTK might have something more suited to my needs, so I searched on "NLTK term extraction" and came upon this miscategorized post on the topic at nltk-dev. That post led me to the Wikipedia page for named-entity recognition (not so fantastically relevant but interesting nonetheless) and - gold - a suggestion to check Chapter 6 of Manning, Raghavan, and Schütze (which is one of the texts for the upcoming Stanford online NLP course, actually).

Chapter 6 addresses term weighting. Finding relevant terms for indexing documents is equivalent to finding interesting terms for terminology searches, and it turns out that the standard way to weight terms is by inverse document frequency. (Which makes sense; a term that clusters in a few documents tells you more about those documents than, say, "the", which appears in every document and so has the lowest possible inverse document frequency: a ratio of 1, or 0 once you take the logarithm.)

Long story short, a term in a document is interesting in proportion to the number of times it appears in the document and in inverse proportion to the number of documents in which it appears in a given relevant corpus. Given that my corpus is about five million words of translation memories, I have a good corpus, provided that I organize it into something document-like.
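Here's a minimal sketch of that weighting in Python, assuming the common textbook form (term count times the log of total documents over documents containing the term). The function name `tf_idf` and the toy documents are made up for illustration; this isn't any particular library's scheme.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score every term in every document by tf * log(N / df).

    `documents` is a list of token lists. Returns one {term: score}
    dict per document.
    """
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    scores = []
    for tokens in documents:
        tf = Counter(tokens)  # raw count of each term in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Toy corpus (invented): three tiny "documents".
docs = [
    "the tablet contains the active substance".split(),
    "the patient takes the tablet daily".split(),
    "the invoice is due".split(),
]
scores = tf_idf(docs)
```

In the first toy document, "substance" (found in one document) outscores "tablet" (found in two), and "the" (found in all three) scores zero no matter how often it occurs.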

I'm going to consider a "document" to be each group of entries in the TM clustered by date. Since not all my translation goes through one TM, I can pretty much guarantee that all my work will be easy to cluster; from that I can derive an overall set of terminology for the entire corpus and calculate inverse document frequency for each term. From that, I can score each term found in a new document. If it's a known term for which I already have a gloss, I'm happy. If it's a new term with high relevance for which I have no gloss, I can research it. And if it's a known term for which I have no gloss, I can study my own corpus to try to extract a likely candidate for translation.
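A rough sketch of the clustering step, in Python, assuming each TM entry carries a date and that a gap of more than a few days marks a new job. The name `cluster_by_date`, the three-day threshold, and the sample entries are all invented for the example.

```python
from datetime import date, timedelta

def cluster_by_date(entries, max_gap_days=3):
    """Group (date, segment) pairs into "documents".

    A new document starts whenever the gap to the previous entry
    exceeds max_gap_days (an arbitrary threshold for this sketch).
    """
    documents = []
    previous = None
    for day, segment in sorted(entries, key=lambda e: e[0]):
        if previous is None or day - previous > timedelta(days=max_gap_days):
            documents.append([])  # gap too large: open a new document
        documents[-1].append(segment)
        previous = day
    return documents

# Invented sample entries.
entries = [
    (date(2012, 3, 1), "Take one tablet daily."),
    (date(2012, 3, 2), "Store below 25 degrees."),
    (date(2012, 3, 20), "Payment is due in 30 days."),
]
documents = cluster_by_date(entries)
```

The first two segments land in one document and the third, eighteen days later, starts another. Whether a fixed gap is the right criterion for a real TM is an open question.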

(This leads to a terminology extraction tool, too - working on both target and source languages and trying to correlate presence in segments, I'll bet I can come up with some pretty good guesses at glosses. Make that service a free one and you ought to do well.)
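One way the segment correlation might look, sketched in Python. It ranks target-language words by the Dice coefficient between "source segment contains the term" and "target segment contains the word". The Dice measure, the name `guess_glosses`, and the sample pairs are my assumptions here, not something worked out in the plan.

```python
from collections import Counter

def guess_glosses(term, pairs, top=3):
    """Rank target words by Dice coefficient with `term` over aligned segments."""
    with_term = 0           # segments whose source contains the term
    target_df = Counter()   # segments containing each target word
    joint = Counter()       # segments containing both the term and the word
    for source, target in pairs:
        target_words = set(target.lower().split())
        target_df.update(target_words)
        # Plain substring match: crude, and it also hits inflected forms.
        if term in source.lower():
            with_term += 1
            joint.update(target_words)
    scores = {w: 2 * joint[w] / (with_term + target_df[w]) for w in joint}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

# Invented aligned segments (English source, German target).
pairs = [
    ("take one tablet", "eine tablette einnehmen"),
    ("the tablet is round", "die tablette ist rund"),
    ("take with water", "mit wasser einnehmen"),
]
best = guess_glosses("tablet", pairs)
```

On this toy data the top candidate for "tablet" is "tablette". A real corpus would need proper tokenization and multi-word targets.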

So I've got a pretty good plan for terminology identification at this point. Just have to find the time to implement it. Here's kind of the list of subtasks:

Tool to convert Trados TM to TMX without my getting my hands dirty. I love Win32 scripting anyway - and this can run on the laptop for a day or two to chew its way through my five million words.
Do that TMX module for Perl.
For all my TMs, cluster the segments into "documents".
Polish up a terminology extraction tool (e.g. the same n-grams between stop words strategy I've used in the past).
Run said tool on the five million. Might want to do some kind of proper database and index at this point so I never have to do this again.
Calculate inverse document frequencies for everything.
Take a target TTX, extract terms, and classify them. This is the actual task.
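The extraction and classification steps can be sketched like this, in Python. The extraction is the n-grams-between-stop-words strategy from the list, and the classification is the three cases from the plan above. The stop list, the function names, and the sample sentence are illustrative only, and sentence boundaries are ignored.

```python
import re

# A tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for", "with"}

def candidate_terms(text, max_n=3):
    """Return every n-gram (n <= max_n) from runs of words between stop words."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    # Split the token stream into runs wherever a stop word occurs.
    runs, current = [], []
    for word in words:
        if word in STOP_WORDS:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(word)
    if current:
        runs.append(current)
    # Emit all n-grams of each run, up to max_n words long.
    terms = set()
    for run in runs:
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                terms.add(" ".join(run[i:i + n]))
    return terms

def classify(term, known_terms, glossary):
    """Sort a term into the three cases: glossed, known but unglossed, new."""
    if term in glossary:
        return "known, glossed"
    if term in known_terms:
        return "known, no gloss: mine the corpus"
    return "new: research if it scores high"

terms = candidate_terms(
    "The active substance of the film-coated tablet is stored in a blister pack."
)
```

For the sample sentence this yields candidates such as "active substance", "film-coated tablet", and "blister pack", and never an n-gram that spans a stop word.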

Wednesday, February 29, 2012

Okapi

Okapi (Java) is a pretty comprehensive set of open-source tools to facilitate the translation process - including a simple workflow manager. (You can group sets of steps together to define your own processes, a technique I'm going to steal.)

One more thing to take note of.

Incidentally, its token types are rather similar to the ones I've proposed.

Monday, February 13, 2012

Task: concordance-to-glossary tool

I want to be able to look up one or more terms in a TM in the same way that concordances work now, then make a decision for a given document or customer, then have that decision checked globally. I'm most of the way to having this ready to go.

Task: find "actionable" terms in a given source

This is probably solved by NLTK somehow, but given a source text I want to be able to find probable glossary items to be researched and to be checked against a TM or glossary.

Tuesday, January 24, 2012

Task: Generalize File::XLIFF to work on zipped XLIFF

The files Lionbridge uses in their XLIFF editor are actually zipped XLIFF (with a .xlz extension) and include a "skeleton" file that seems to have some kind of information about placeables.

It would be nice to have a way of dealing with those for batch manipulation (global find-and-replace, etc.).
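A minimal sketch of such a batch find-and-replace, in Python with the standard `zipfile` module. I'm assuming the XLIFF member's name ends in `.xlf` and that it's UTF-8; the member names in the demo are hypothetical, and the plain string replace would touch markup as well as text, so a real tool should parse the XML.

```python
import io
import zipfile

def replace_in_xlz(data, old, new):
    """Return a copy of a zipped-XLIFF archive (as bytes) with `old`
    replaced by `new` in every member whose name ends in .xlf.
    All other members, such as the skeleton, are copied unchanged."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            content = src.read(info.filename)
            if info.filename.lower().endswith(".xlf"):
                text = content.decode("utf-8")
                content = text.replace(old, new).encode("utf-8")
            dst.writestr(info.filename, content)
    return out.getvalue()

# Build a toy archive in memory; the member names are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("content.xlf", "<target>colour</target>")
    z.writestr("skeleton.skl", "colour placeholder data")
fixed = replace_in_xlz(buf.getvalue(), "colour", "color")
```

Working on bytes in memory keeps the original file untouched until the result has been checked.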

Saturday, January 7, 2012

OpenTag, TMX, and translation memory manipulation

Here's an interesting thing: opentag.com, including the format definitions for TMX and a few other rather fascinating XML interchange formats (including one for segmentation rules!)

I'm off onto a new tangent: a TMX manipulation module. I still don't have a fantastic API for it, but you know, I think I'm going to dump the xmlapi for real now. It's been 12 years now and I think it's time to move on. So I'm going to rewrite File::TTX to work with a different XML library (probably XML::Reader/XML::Writer) and do the same with TMX. This will allow me to choose between loading the file into memory in toto, or just writing a stream processor to filter things out on the fly for really large files.
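To make the streaming option concrete, here's a sketch in Python (not the Perl modules themselves) using the standard library's `iterparse`. It reads a TMX file one translation unit at a time, relying on the TMX 1.4 layout of `tu`, `tuv` and `seg` elements with an `xml:lang` attribute. The function name and the sample data are invented.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree spells the xml:lang attribute with its full namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def stream_units(source):
    """Yield one {language: text} dict per <tu>, clearing each unit after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            unit = {}
            for tuv in elem.iter("tuv"):
                seg = tuv.find("seg")
                if seg is not None:
                    # itertext() also collects text around inline tags.
                    unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
            yield unit
            elem.clear()  # drop the children of the unit just processed

# Invented two-unit sample; a real header carries required attributes.
sample = b"""<?xml version="1.0"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="de"><seg>Hallo</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Thanks</seg></tuv><tuv xml:lang="de"><seg>Danke</seg></tuv></tu>
</body></tmx>"""

units = list(stream_units(io.BytesIO(sample)))
```

Loading the file in toto would be `ET.parse(source)` instead. The generator keeps the same per-unit interface either way, which is roughly the choice an overarching API would want to hide. Note that `clear()` leaves empty `tu` shells attached to the body, so a very large file would also need those removed from the parent.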

I envision an overarching Xlat::TM API that will work with File::TMX in specific, and perhaps with others if and when.
