Friday, August 27, 2010

Tesseract OCR

Google's Tesseract seems to be just about the best OCR out there. It doesn't seem to play well with others yet (it's written on the assumption that it's a standalone utility, not a library) but given that it's Google, it'll probably get a lot better fast.

I should probably investigate. OCR is an important component of a lot of translation jobs, and all existing OCR sucks. Sigh. That's only partly hyperbole.

Thursday, August 19, 2010

NooJ

So investigating some of the background for OpenLogos led me to the NooJ project, the brainchild of one Max Silberztein. Weirdly, it's in .NET, but aside from its choice of platform and the worryingly closed source, it appears to be manna from heaven and crack for my natural-language habit.

Lemme put it this way: 90% of the heavy lifting of the xlat project has already been done. All that's left is integrating all this stuff into something like a coherent toolset. I fully intend to enjoy myself immensely. (While chafing at the closed source - but them's the breaks, kid.)

Compiling OpenLogos under Fedora Core 11

I've definitely gone down the rabbit hole with OpenLogos. It is truly a thing of utter archaic beauty. It's forty years old this year! In software terms, that makes it one of the oldest existing codebases on the planet - and quite possibly the oldest open-source codebase in existence.

I'm trying to get it running on my Fedora Core 11 box, very much in my spare time. I'll continue to update this post as I get things running. There will be later posts on how to run the thing once it's built. Assuming I can build it at all; this is a 64-bit machine.

1. Dependencies: Java and unixODBC

Although most of the code is written in C++, there appear to be some Java components. I haven't had time to make even a cursory survey of the codebase yet, so I don't know what's using Java and what isn't, but Java is definitely a prerequisite for the build. Since Fedora ships with OpenJDK, not Sun's Java, the first thing to do is to get the java-devel package installed (my runtime is 1.6.0, the latest as of this writing, so I grabbed the matching devel):

yum install java-1.6.0-openjdk-devel

Once that's done, you'll point configure at that JDK installation directory when you build the make environment for OpenLogos, like this:

./configure --with-java=/usr/lib/jvm/java-openjdk/

Don't run that yet, though, because the other compilation prerequisite is unixODBC. I tried installing it with yum, but that didn't work for me, so I fell back on the ancient technique of downloading and compiling it myself. I'm going to assume you can manage that (otherwise, trust me here, you're going to have a hell of a time with OpenLogos) - the download is where you expect it, so get it, unpack it, do the configure / make / make install dance, and you're good to go.

2. gcc 4.3 header cleanup

Now you can run your configure; that worked fine for me. However, you're not quite done yet. Assuming you're using the DFKI 1.03 distro, like I am, and gcc 4.4.1, you'll find that the gcc headers were cleaned up as of 4.3, so declarations the code used to get for free are now missing. What compiled the last time DFKI built it (obviously with gcc 4.2 or earlier) needs patching now. That is the status as I write this post; I'll update as I go, and provide a patch file at some point.

Update 8/23/2010:
The errors here take the form of:

error: 'xxxx' was not declared in this scope

They apply to the following functions:

strchr was assumed to come in via string.h, but now needs an explicit <cstring> include (affects lgsstring.h).
atoi was assumed to come in via string.h, but actually lives in <cstdlib> (affects lgsstring.h).

That might be it, actually.
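
Concretely, the fix is just adding the explicit includes at the top of lgsstring.h - a minimal sketch, assuming nothing else in the header is missing:

#include <cstring>   // strchr - no longer pulled in transitively as of gcc 4.3
#include <cstdlib>   // atoi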

The other bit of sloppy programming (not casting aspersions! I'm guilty of plenty of sloppiness, which is why I gave up and decided to use Perl from now on in the first place) exposed by the move to gcc 4.3 is a duplicated parameter name in the declaration of rightTrim (two parameters named 's' - oops). I renamed the const char * parameter to 't' to match the cpp file, but man, that looks like exactly the kind of thing I would have done. Weird that earlier compiler versions didn't flag it.
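
For illustration only (the actual rightTrim signature in lgsstring.h differs in detail), the change has this shape:

// Before: both parameters named 's', which gcc 4.3+ rejects in a declaration.
// char* rightTrim(char* s, const char* s);

// After: the const char* renamed to 't', matching the definition in the .cpp file.
char* rightTrim(char* s, const char* t);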

3. 32-bit architectural assumptions

Those fixes complete (and it's still 8/23/2010), the next problem is:

error: cast from 'const char*' to 'int' loses precision

Whoops. Did I mention I'm compiling on a 64-bit architecture? Yeah. So int is a 32-bit value, and addresses are 64 bits now. The answer is to replace the int with intptr_t, an integer type guaranteed to be pointer-sized, defined in stdint.h and mandated by the C99 standard - so really there was no excuse to be casting pointers to vanilla int in 2006 (not that I would have done differently, but I'm old and distracted and prefer Perl anyway, leaving the interpreter contributors to worry about this stuff). Anyway, this little gem affects the parser, which uses addresses throughout as integer hash lookups. That's gotta go, but that's probably going to take some more thorough investigation, and I've got deadlines for tomorrow morning, so that's it for August 23.
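
Boiled down to a toy example (this is just the shape of the change, not the actual CParser code):

#include <stdint.h>   // intptr_t (C99): an integer type wide enough to hold a pointer

// The old pattern was the equivalent of: int key = (int)p;
// which truncates the address on a 64-bit build and triggers the error above.
intptr_t hashKey(const char* p)
{
    return (intptr_t)p;   // pointer-sized on both 32-bit and 64-bit targets
}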

I wish more of the individual modules had unit tests. I'm going to shoot myself in the foot with this stuff sooner or later. Perhaps I should write some (if only I knew what to test, that would probably work out great - and I have to admit, it would be a great way to start understanding the internals).

Anyway, the int usage appears to be just in private members of the CParser class, but I worry that they're going to end up getting used to talk to PostgreSQL, and then where will I be? I should probably worry about that if and when it comes up.

Update 9/3/2010:
I've been too busy to keep up with the 64-bit conversion, so I'm repurposing an older box I have as a 32-bit Ubuntu box (by which I mean, I pulled it out of the storage room, where it was gathering dust for just such an occasion), just so I can get a fresh compile and see this thing run once in my life. I may or may not get back to compiling under FC11 on the 64-bit machine.

Saturday, August 14, 2010

tf-idf weights

Quoth Wikipedia, "The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining."

The idea is that you weight terms based on their frequency both in the current document and in your overall corpus. This lets you find documents based on the terms they use that are less frequent overall, and thus more likely to indicate what the document is actually about.
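
In the standard textbook formulation (nothing specific to any particular library), the weight of term t in document d is the term frequency times the log of the inverse document frequency:

weight(t, d) = tf(t, d) * log( N / df(t) )

where tf(t, d) is how often t occurs in d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. A term that shows up in every document gets a weight near zero; a term that's frequent in this document but rare elsewhere scores high.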

Terminology mining is a technique for finding "interesting" terms in a document. The interesting terms can then be researched in advance of the translation process, so that the translation itself can be both consistent and quick.

There are lots of links I want to save that are tangentially related to this sort of textual analysis.
  • Gensim is a textual analysis library in Python.
  • An earlier paper on term weighting.
  • tf-idf library in Python at Google Code.
  • And another at Github.

Monday, August 2, 2010

Roadmap

So my roadmap, or to-do list, or what have you, is kind of like this:
  1. Word client
    • Port Anaphraseus
      • Write an OOo Basic <=> Word Basic cross-parser
    • Use an IP-based server connection to a TM of my own devising (below)
  2. TTX/Xliff client
    • Based on wxPerl and Wx::Declarative
    • Features can be taken largely from the Xliff editor in Translation Workspace
    • Also talks to TM via IP-based protocol
  3. TM server with IP-based protocol
    • Basic database is easy
    • Fuzzy matching needs some examination (see the sketch after this list)
    • Also Wx::Declarative target
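
On the fuzzy-matching point: as far as I can tell, the usual approach in TM tools is edit distance between the stored segment and the new one, normalized to a similarity percentage. A minimal C++ sketch of that idea (my own toy code, not taken from any existing TM engine):

#include <algorithm>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance between two segments.
static size_t editDistance(const std::string& a, const std::string& b)
{
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min(std::min(prev[j] + 1, cur[j - 1] + 1), subst);
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Similarity as a percentage, the way TM tools usually report fuzzy matches.
int fuzzyMatchPercent(const std::string& source, const std::string& candidate)
{
    size_t longest = std::max(source.size(), candidate.size());
    if (longest == 0) return 100;
    return static_cast<int>(100.0 * (longest - editDistance(source, candidate)) / longest);
}

Real TM engines do this at the word level and weight formatting differently, but even this toy version shows where the cost is: it's O(n*m) per comparison, so the interesting question is how to avoid comparing against every segment in the database.
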
Here's how I expect to increase my productivity:
  • Simultaneous spell checking and terminology checking as I work; a separate query window pops up queries unobtrusively after each segment is committed
  • Decisions made in the query window are propagated back into the active document and any other documents in the same open project - this includes both terminology checks and spell checker dictionary additions. (Terminology and the spell checker will share a database.)
  • Frequent words are identified for accelerators; accelerators for terminology in the open segment are displayed in a cheat sheet window. Any repeated words in incoming segment translations are also identified as potential accelerators.
That's the first phase.

The second phase will probably start to incorporate some MT. Note OpenLogos especially in this regard; there's a library I could use with confidence. Post-editing will include the syntax-aware editor in some way.

Well - this has definitely been a late-night post; it's really more note-taking than anything.

OpenLogos

Open-source machine translation. Open-source. Machine. Translation.

ATA Translation Tools overview

Hmm.

Useful catalog of translation-related software

Software for translation.

I thought I remembered something like Anaphraseus for Word-native Basic, but I was apparently suffering from hopeful memory. Anaphraseus uses OpenOffice.org only; I'm wondering, though, whether I could port it.

I really want to use Word.