Monday, November 7, 2011

TRADOS XML noncompliant

So I'm working on a command-line utility for doing things with TTX files and ran into an unpleasantness: TTX files generated by the Word converter from Word documents with soft hyphens contain 0x1F characters - and those characters are illegal in XML. And when the XML standard says "illegal", it means that a parser accepting them isn't supposed to be called an XML parser at all.

This is really quite dismaying - and I can only imagine the discussions that must have gone on at TRADOS when they evidently overrode this restriction in their XML parser. It would have been far cleaner from an XML standpoint to have translated soft hyphens into a tag - but that would have made the editing experience far less clean. So they were stuck.

And now, I am too - I have to preprocess all TTX before passing it through the XML parser, which is a performance hit (which doesn't bother me too much) - but far worse, non-preprocessed TTX will infect the TM, so if I now make changes to the sanitized file and write it back out, it won't match the TM. This would be OK if we could be sure of sanitization before the TM were affected, but that's clearly too much to hope for in most real-world agency/freelancer workflows.
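
The preprocessing step can be sketched roughly as follows. This is a minimal illustration in Python rather than the utility's actual code; the function name `sanitize_ttx`, the sample fragment, and the choice of U+00AD (the Unicode soft hyphen) as a stand-in for 0x1F are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: U+00AD (SOFT HYPHEN) is an acceptable replacement for the
# 0x1F byte that the Word converter leaves behind. Pass "" to drop it instead.
SOFT_HYPHEN = "\u00ad"


def sanitize_ttx(text, replacement=SOFT_HYPHEN):
    """Return `text` with every 0x1F character replaced by `replacement`."""
    return text.replace("\x1f", replacement)


# Hypothetical fragment standing in for real TTX content.
raw = "<Raw>co\x1foperate</Raw>"
clean = sanitize_ttx(raw)

# A conforming parser can now read the sanitized text.
root = ET.fromstring(clean)
```

Note that the sanitized segment no longer equals the segment stored in an unsanitized TM, which is exactly the matching problem described above.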

It's also rather nasty that TMX dumps from an infected TM will also contain 0x1F characters - meaning non-TRADOS tools won't be able to parse those, either. And they are supposed to be interoperable.

I think as a matter of policy I'm just going to sanitize and not worry overmuch about the rather small interoperability hit - at least until some actual project requires me to worry about it. Then I'll cross that bridge.

Thursday, November 3, 2011


The future of open source spelling may be Enchant. There's no Perl binding. Yet.

Text::Aspell on Win32 - non-trivial

Aspell is the default open-source spell checking engine; its Perl binding is Text::Aspell. The problem is that both Aspell and Text::Aspell are developed on Unix, and Things Are Different under Windows and MINGW32. Not insuperably different, but different enough that if you're the first person to try something, you'll live to regret it.

OK. So, first things first; the W32 installation of Aspell is one release behind but very stable. It doesn't actually have the include and library files bundled, but they're readily available - the problem being that W32 Aspell is developed with MSVC, and Strawberry Perl (my Perl of choice) compiles with MINGW32. Joy. So the library files are useless; we have to build our own. But let's start by making include and lib directories under the Aspell installation.

Now, we set environment variables: CPATH should point (at least) to the Aspell include directory and LIBRARY_PATH to the Aspell lib directory. Don't forget that your PATH should also include Aspell's bin directory - which will make it easier to use Aspell's command line tools for your dictionary maintenance anyway. So do it!
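
As a rough sketch of that setup (in Python, purely for illustration), the three variables can be assembled like this. The helper name `aspell_build_env` and the assumption that include, lib and bin all sit directly under one Aspell root are mine.

```python
import os


def aspell_build_env(aspell_root, base_env=None):
    """Return a copy of the environment with the Aspell directories prepended.

    Assumes include/, lib/ and bin/ live directly under `aspell_root`.
    """
    env = dict(os.environ if base_env is None else base_env)

    def prepend(name, value):
        # Keep whatever was already set, but search our directory first.
        existing = env.get(name)
        env[name] = value if not existing else value + os.pathsep + existing

    prepend("CPATH", os.path.join(aspell_root, "include"))        # headers
    prepend("LIBRARY_PATH", os.path.join(aspell_root, "lib"))     # import library
    prepend("PATH", os.path.join(aspell_root, "bin"))             # DLL and tools
    return env
```

The resulting dictionary could then be handed to `subprocess.run(..., env=env)` when launching the module build.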

Figuring out those environment variables, by the way, cost me about three hours. The remainder of the day was occupied with the next step: building a .DEF file that dlltool likes (some help was had from this page in remembering how a .DEF file is supposed to work), and then finding the appropriate combination of dlltool parameters. Turns out this:
   dlltool -d libaspell.def --dllname aspell-15.dll --output-lib libaspell.a --kill-at
is the only incantation that will work. Leaving out the --dllname, even though it is specified in the .DEF file, will cause linkage failures at runtime. Not helpful ones, either. This took me four hours, ultimately culminating in this page, which at least mentions the --dllname parameter.
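
For reference, the general shape of a .DEF file is a LIBRARY line followed by an EXPORTS list. Here is a minimal Python sketch that writes one out; the helper name `make_def` is hypothetical, and the four export names are only an illustrative subset of the Aspell C API, not the full list the real file needs.

```python
def make_def(dll_name, exports):
    """Build the text of a module-definition (.DEF) file for dlltool."""
    lines = ["LIBRARY " + dll_name, "EXPORTS"]
    # One exported symbol per line, indented under EXPORTS.
    lines.extend("    " + name for name in exports)
    return "\n".join(lines) + "\n"


# Illustrative subset only; the real list must cover every function used.
sample_exports = [
    "new_aspell_config",
    "aspell_config_replace",
    "new_aspell_speller",
    "aspell_speller_check",
]

def_text = make_def("aspell-15.dll", sample_exports)
```

Even with the LIBRARY line present, the --dllname parameter on the dlltool command line was still required in my case.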

When dynamically linked, Aspell assumes the location of the DLL linked is either the root for dictionary searches or is in a 'bin' directory which is itself in the root for dictionary searches - in either case, the 'dict' directory of that root is where dictionaries should be. I had placed a local DLL in the Text::Aspell directory while flailing around; it took me half an hour to remember that.
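
That lookup rule, as I understand it from the behaviour described above, can be sketched in Python like this. It illustrates the rule only and is not Aspell's own code; the function name and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath


def aspell_dict_dir(dll_path):
    """Guess where Aspell will look for dictionaries, given the DLL it loaded.

    The DLL's directory is taken as the root, unless that directory is
    named 'bin', in which case the root is one level up.
    """
    dll_dir = PureWindowsPath(dll_path).parent
    root = dll_dir.parent if dll_dir.name.lower() == "bin" else dll_dir
    return root / "dict"
```

So a stray copy of the DLL sitting next to Text::Aspell sends the dictionary search to a 'dict' directory beside it, which doesn't exist.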

Anyway, I finally managed to get it running. Next step: extract words from a TTX to throw against it.

Tuesday, September 6, 2011

Blogs to follow

Here are a couple of blogs I'm going to be following:
It's really shocking how little I know about my adoptive industry.

Tuesday, June 14, 2011

General TTX utility

So File::TTX may be slipping ever closer to irrelevance, but I'm still using it for a number of things. The only problem is, it's a pain always having to write a special-purpose Perl script just to change, say, the source language of a TTX.

Obviously, a command-line program would be the first step towards usability. (And way easier than a GUI program, obviously.) Let this stand as my to-do for that command-line utility.

Also: I think it's time to admit that I'm going to write the UI portions of the Xlat project in Decl, not plain Perl. This will probably require the definition of a Xlat::Declarative module. (That's a good thing.)

Sunday, May 15, 2011


Another non-Xlat post!

Automotive terminology is kind of tricky and I'm finding it hard to find good references - although I'm seeing more demand. Here are a couple of links not to forget.
Second topic: I really want to mine the SAP help site for accounting terminology. Here's just a teaser link that's been open on my browser for a couple of weeks now - the technique is simple. Google " xxx" for a likely term, then replace the language in the link with "en". Then align your results. It works! A list of likely terms (from a tagger, perhaps) is the right place to start.
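
The link-rewriting step might look something like this in Python. It's a sketch under the assumption that the language appears as its own path segment in the URL; the function name and the example.com address are hypothetical, not the real help site.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_language(url, source_lang, target_lang="en"):
    """Replace a language path segment (e.g. 'de') with `target_lang`."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Only whole path segments are swapped, so 'de' inside 'index' is safe.
    swapped = [target_lang if seg == source_lang else seg for seg in segments]
    return urlunsplit(parts._replace(path="/".join(swapped)))


# Hypothetical URL, for illustration only.
english = swap_language("https://help.example.com/docs/de/accounting/index.htm", "de")
```

Fetching both versions of each page would then give the pair of texts to align.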

A generalized terminology research framework would be useful.

Saturday, March 12, 2011

File::TTX 0.03 released

I haven't been moving very fast on this project, have I?