Thursday, March 8, 2012

Terminology identification

On term extraction: I checked Okapi; they're just doing the same up-to-n-gram extraction I've always done and found wanting. Their English tokenizer is also comparable to mine (better tested, to be sure). So all in all, maybe I'm capable of doing competent work here. I need to do better testing, though.

But anyway, I figured NLTK might have something more suited to my needs, so I searched on "NLTK term extraction" and came upon this miscategorized post on the topic at nltk-dev. That post led me to the Wikipedia page for named-entity recognition (not so fantastically relevant but interesting nonetheless) and - gold - a suggestion to check Chapter 6 of Manning, Raghavan, and Schütze's Introduction to Information Retrieval (which is one of the texts for the upcoming Stanford online NLP course, actually).

Chapter 6 addresses term weighting. Finding relevant terms for indexing documents is equivalent to finding interesting terms for terminology searches, and it turns out that the best way to weight terms is by inverse document frequency: idf = log(N/df), where N is the number of documents in the corpus and df is the number of documents a term appears in. (Which makes sense; a term that clusters in a few documents carries more information than, say, "the", which shows up in virtually every document, so its N/df is about 1 and its idf is essentially zero.)

Long story short, a term in a document is interesting in proportion to the number of times it appears in that document and in inverse proportion to the number of documents it appears in across a relevant corpus. And I have a good corpus: about five million words of translation memories, provided I organize them into something document-like.
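
To pin down what that weighting looks like, here's a minimal sketch in plain Perl (no modules); %docs and its contents are made up, standing in for whatever I end up pulling out of the TM clusters:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %docs maps a document ID to the list of terms extracted from it.
    # (Made-up sample data; the real thing comes from the clustered TMs.)
    my %docs = (
        'tm1-2012-03-01' => [qw(valve pressure valve housing)],
        'tm1-2012-03-05' => [qw(invoice pressure payment)],
        'tm2-2012-02-20' => [qw(valve gasket housing)],
    );

    # Document frequency: how many documents does each term appear in?
    my %df;
    for my $terms (values %docs) {
        my %seen = map { $_ => 1 } @$terms;
        $df{$_}++ for keys %seen;
    }
    my $N = scalar keys %docs;

    # Score the terms of one document by tf * idf, with idf = log(N / df).
    sub tfidf {
        my ($terms) = @_;
        my %tf;
        $tf{$_}++ for @$terms;
        my %score;
        for my $t (keys %tf) {
            my $df = $df{$t} || 1;    # term unseen in the corpus: treat df as 1
            $score{$t} = $tf{$t} * log($N / $df);
        }
        return \%score;
    }

    my $scores = tfidf($docs{'tm1-2012-03-01'});
    printf "%-10s %.3f\n", $_, $scores->{$_}
        for sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;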

I'm going to consider a "document" to be each group of entries in a TM clustered by date. Since not all my translation goes through a single TM, I can pretty much guarantee that my work will be easy to cluster this way; from the clusters I can derive an overall set of terminology for the entire corpus and calculate an inverse document frequency for each term. With that in hand, I can score each term found in a new document. If it's a known term for which I already have a gloss, I'm happy. If it's a new term with high relevance for which I have no gloss, I can research it. And if it's a known term for which I have no gloss, I can study my own corpus to try to extract a likely translation candidate.
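
Concretely, that scoring-and-sorting step might look something like this; the idf table, glossary, threshold, and term frequencies below are all hypothetical placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical inputs: the idf table computed from the TM "documents",
    # my existing glossary, and term frequencies from the new source document.
    my $N     = 1200;                              # number of clustered "documents"
    my %idf   = (valve => 0.41, gasket => 1.10);   # term => log(N / df)
    my %gloss = (valve => 'Ventil');
    my %tf    = (valve => 4, gasket => 2, 'pressure housing' => 3);

    my $max_idf = log($N);    # what a term never seen in the corpus would get
    my $cutoff  = 5;          # made-up relevance threshold

    for my $term (sort keys %tf) {
        my $idf   = exists $idf{$term} ? $idf{$term} : $max_idf;
        my $score = $tf{$term} * $idf;

        if (exists $gloss{$term}) {
            print "$term: already glossed as '$gloss{$term}' - nothing to do\n";
        }
        elsif (exists $idf{$term}) {
            print "$term: known term, no gloss - mine the corpus for a candidate\n";
        }
        elsif ($score >= $cutoff) {
            printf "%s (score %.1f): new and relevant - research it\n", $term, $score;
        }
    }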

(This leads to a bilingual terminology extraction tool, too - by working on both source and target languages and correlating which terms show up in the same segments, I'll bet I can come up with some pretty good guesses at glosses. Make that service a free one and you ought to do well.)
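
I haven't worked out the details, but the naive version I'm imagining is plain co-occurrence counting: for each source term, count which target-side terms appear in the same translation units, then rank the candidates by something like the Dice coefficient. A sketch with made-up data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each unit is a [source_terms, target_terms] pair of extracted term lists.
    # (Made-up data; the real lists come from running extraction on both sides.)
    my @units = (
        [ ['valve', 'gasket'],   ['Ventil', 'Dichtung'] ],
        [ ['valve'],             ['Ventil'] ],
        [ ['pressure', 'valve'], ['Druck', 'Ventil'] ],
    );

    # Count per-side term frequencies and source/target co-occurrence.
    my (%src_n, %tgt_n, %co);
    for my $u (@units) {
        my ($src, $tgt) = @$u;
        $src_n{$_}++ for @$src;
        $tgt_n{$_}++ for @$tgt;
        for my $s (@$src) {
            $co{$s}{$_}++ for @$tgt;
        }
    }

    # Rank target candidates for a source term by the Dice coefficient:
    # 2 * co(s,t) / (n(s) + n(t)).
    sub gloss_candidates {
        my ($s) = @_;
        my %dice;
        for my $t (keys %{ $co{$s} || {} }) {
            $dice{$t} = 2 * $co{$s}{$t} / ($src_n{$s} + $tgt_n{$t});
        }
        return sort { $dice{$b} <=> $dice{$a} } keys %dice;
    }

    printf "valve -> %s\n", join(', ', gloss_candidates('valve'));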

So I've got a pretty good plan for terminology identification at this point; I just have to find the time to implement it. Here's a rough list of the subtasks:
  • Tool to convert Trados TM to TMX without my getting my hands dirty. I love Win32 scripting anyway - and this can run on the laptop for a day or two to chew its way through my five million words.
  • Do that TMX module for Perl (there's a rough parsing sketch after this list).
  • For all my TMs, cluster the segments into "documents".
  • Polish up a terminology extraction tool (e.g. the same n-grams-between-stop-words strategy I've used in the past; also sketched after this list).
  • Run said tool on the five million. Might want to do some kind of proper database and index at this point so I never have to do this again.
  • Calculate inverse document frequencies for everything.
  • Take a target TTX, extract terms, and classify them. This is the actual task.
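
For the Perl TMX step, I'll probably lean on XML::Twig so I don't have to hold five million words in memory at once. Roughly the shape I have in mind, assuming TMX 1.4-style tu/tuv/seg elements with xml:lang attributes (to be checked against what Trados actually exports):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $file = shift or die "usage: $0 file.tmx\n";

    # Stream through the TMX, handling one translation unit at a time.
    my @units;
    my $twig = XML::Twig->new(
        twig_handlers => {
            tu => sub {
                my ($t, $tu) = @_;
                my %seg;
                for my $tuv ($tu->children('tuv')) {
                    # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
                    my $lang = $tuv->att('xml:lang') // $tuv->att('lang') // '?';
                    my $seg  = $tuv->first_child('seg');
                    # ->text flattens any inline markup inside the segment,
                    # which is fine for term extraction.
                    $seg{$lang} = $seg ? $seg->text : '';
                }
                # creationdate is what the date clustering would key on later.
                push @units, { date => $tu->att('creationdate'), %seg };
                $t->purge;    # free the elements we've already handled
            },
        },
    );
    $twig->parsefile($file);

    printf "%d translation units read\n", scalar @units;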
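
And the n-grams-between-stop-words extractor is simple enough to sketch from memory; the stop list and the maximum n here are just placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %stop = map { $_ => 1 } qw(
        the a an and or of to in for from with on at by is are was were be this that
    );
    my $max_n = 3;    # longest term candidate to consider

    # Split a segment into candidate terms: take each maximal run of
    # non-stopwords and emit every n-gram of length 1..$max_n inside it.
    sub candidates {
        my ($text) = @_;
        my @tokens = grep { length } split /[^\w'-]+/, lc $text;

        my (@runs, @run);
        for my $tok (@tokens) {
            if ($stop{$tok}) { push @runs, [@run] if @run; @run = (); }
            else             { push @run, $tok; }
        }
        push @runs, [@run] if @run;

        my @cands;
        for my $run (@runs) {
            for my $n (1 .. $max_n) {
                for my $i (0 .. @$run - $n) {
                    push @cands, join ' ', @{$run}[$i .. $i + $n - 1];
                }
            }
        }
        return @cands;
    }

    print "$_\n" for candidates('Remove the pressure relief valve from the valve housing.');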