Monday, November 7, 2011

TRADOS XML noncompliant

So I'm working on a command-line utility for doing things with TTX files and ran into an unpleasantness: TTX files generated by the Word converter from Word documents with soft hyphens contain raw 0x1F characters - but those characters are illegal in XML 1.0. And when the XML standard says "illegal", it means you're not supposed to call any parser that accepts them an XML parser.

This is really quite dismaying - and I can only imagine the discussions that must have gone on at TRADOS when they clearly bypassed this restriction in their XML parser. It would have been far cleaner from an XML standpoint to have translated soft hyphens into a tag - but that would have made the editing experience far less clean. So they were stuck.

And now, I am too - I have to preprocess all TTX before passing it through the XML parser, which is a performance hit (which doesn't bother me too much) - but far worse, non-preprocessed TTX will infect the TM, so if I now make changes to the sanitized file and write it back out, it won't match the TM. This would be OK if we could be sure of sanitization before the TM were affected, but that's clearly too much to hope for in most real-world agency/freelancer workflows.

It's also rather nasty that TMX dumps from an infected TM will also contain 0x1F characters - meaning non-TRADOS tools won't be able to parse those, either. And they are supposed to be interoperable.

I think as a matter of policy I'm just going to sanitize and not worry overmuch about the rather small interoperability hit - at least until some actual project requires me to worry about it. Then I'll cross that bridge.
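
For the record, the sanitizing pass could look something like the sketch below. It's in Python purely for illustration, and a few things in it are assumptions rather than anything TRADOS documents: the function name sanitize_ttx is made up, mapping 0x1F to the Unicode soft hyphen (U+00AD) is just one plausible choice, and the sample fragment is a toy, not a real TTX file. It also assumes the file has already been decoded to a string.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 forbids the C0 control characters other than tab, LF and CR.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Assumption: represent Word's 0x1F soft hyphen as U+00AD, which XML allows.
SOFT_HYPHEN = "\u00ad"


def sanitize_ttx(text: str) -> str:
    """Return text with 0x1F mapped to U+00AD and other illegal controls removed."""
    text = text.replace("\x1f", SOFT_HYPHEN)
    return _ILLEGAL_XML_CHARS.sub("", text)


# Toy fragment for illustration only; a real TTX file has far more structure.
raw = '<Tu><Tuv Lang="EN-US">co\x1foperate</Tuv></Tu>'

# Parsing `raw` directly raises ET.ParseError; the sanitized text parses.
root = ET.fromstring(sanitize_ttx(raw))
```

Whatever replacement is chosen, the catch described above still applies: segments written back from the sanitized file will no longer match TM entries that still contain the raw 0x1F.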

Thursday, November 3, 2011


The future of open source spelling may be Enchant. There's no Perl binding. Yet.

Text::Aspell on Win32 - non-trivial

Aspell is the default open-source spell checking engine; its Perl binding is Text::Aspell. The problem is that both Aspell and Text::Aspell are developed on Unix, and Things Are Different under Windows and MINGW32. Not insuperably different, but different enough that if you're the first person to try something, you'll live to regret it.

OK. So, first things first: the W32 installation of Aspell is a release behind but very stable. It doesn't actually have the include and library files bundled, but they're readily available - the problem being that W32 Aspell is developed with MSVC, and Strawberry Perl (my Perl of choice) compiles with MINGW32. Joy. So the library files are useless; we have to build our own. But let's make include and lib directories under Aspell.

Now, we set environment variables: CPATH should point (at least) to the Aspell include directory and LIBRARY_PATH to the Aspell lib directory. Don't forget that your PATH should also include Aspell's bin directory - which will make it easier to use Aspell's command line tools for your dictionary maintenance anyway. So do it!

Figuring out those environment variables, by the way, cost me about three hours. The remainder of the day was occupied with the next step: building a .DEF file that dlltool likes (some help was had from this page in remembering how a .DEF file is supposed to work), and then finding the appropriate combination of dlltool parameters. Turns out this:
   dlltool -d libaspell.def --dllname aspell-15.dll --output-lib libaspell.a --kill-at
is the only incantation that will work. Leaving out the --dllname, even though it is specified in the .DEF file, will cause linkage failures at runtime. Not helpful ones, either. This took me four hours, ultimately culminating in this page, which at least mentions the --dllname parameter.

When dynamically linked, Aspell assumes the location of the DLL linked is either the root for dictionary searches or is in a 'bin' directory which is itself in the root for dictionary searches - in either case, the 'dict' directory of that root is where dictionaries should be. I had placed a local DLL in the Text::Aspell directory while flailing around; it took me half an hour to remember that.
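
To make that lookup rule concrete, here is a small Python model of it. It encodes only the behaviour as I've described it above, not anything taken from Aspell's source; the function name assumed_dict_dir and both paths are hypothetical.

```python
from pathlib import PureWindowsPath


def assumed_dict_dir(dll_path: str) -> PureWindowsPath:
    """Model of the rule described above: where dictionaries are expected."""
    dll_dir = PureWindowsPath(dll_path).parent
    # If the DLL sits in a 'bin' directory, the root is one level up;
    # otherwise the DLL's own directory is treated as the root.
    root = dll_dir.parent if dll_dir.name.lower() == "bin" else dll_dir
    return root / "dict"


# Hypothetical install location: dictionaries are looked for in C:\Aspell\dict.
installed = assumed_dict_dir(r"C:\Aspell\bin\aspell-15.dll")

# Hypothetical stray copy next to the Perl module: the search moves with it.
stray = assumed_dict_dir(r"C:\strawberry\perl\site\lib\Text\aspell-15.dll")
```

The second case is the trap: a stray copy of the DLL quietly moves the place where dictionaries are expected.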

Anyway, I finally managed to get it running. Next step: extract words from a TTX to throw against it.