Monday, November 7, 2011

TRADOS XML noncompliant

So I'm working on a command-line utility for doing things with TTX files and ran into an unpleasantness: TTX files generated by the Word converter from Word documents with soft hyphens contain 0x1F characters - and those characters are illegal in XML 1.0. And when the XML standard says "illegal", it means you're not supposed to call any parser that accepts them an XML parser.
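To make the problem concrete, here is a minimal sketch in Python, using the standard library's expat-based parser as a stand-in for "any conforming XML parser". The `<Tu>` fragment and the `parses_as_xml` helper are made up for illustration; a real TTX file has far more structure around it.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if a conforming parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments: the second one carries the raw 0x1F character
# that the Word converter leaves behind for a soft hyphen.
clean = "<Tu>cooperate</Tu>"
dirty = "<Tu>co\x1foperate</Tu>"

print(parses_as_xml(clean))
print(parses_as_xml(dirty))
```

The first fragment parses and the second is rejected as not well-formed, which is exactly the wall a standard XML library runs into on these files.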

This is really quite dismaying - and I can only imagine the discussions that must have gone on at TRADOS when they clearly decided to waive this restriction in their XML parser. It would have been far cleaner from an XML standpoint to have translated soft hyphens into a tag - but that would have made the editing experience far less clean. So they were stuck.

And now, I am too - I have to preprocess every TTX file before passing it through the XML parser, which is a performance hit (which doesn't bother me too much) - but far worse, non-preprocessed TTX will infect the TM, so if I now make changes to the sanitized file and write it back out, its segments won't match the ones in the TM. This would be OK if we could be sure of sanitization before the TM was affected, but that's clearly too much to hope for in most real-world agency/freelancer workflows.
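The preprocessing step itself could look something like the following minimal sketch. It assumes the file has already been decoded to a string, and it assumes that mapping 0x1F to the Unicode soft hyphen U+00AD (which is legal in XML) is an acceptable substitute; simply dropping the character would be the other obvious choice. The function name `sanitize_ttx_text` is invented here.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
ILLEGAL_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_ttx_text(text, soft_hyphen="\u00ad"):
    """Make decoded TTX text safe for a conforming XML parser.

    Assumption: 0x1F stands for a Word soft hyphen and is replaced by
    `soft_hyphen` (pass "" to drop it). Any other forbidden control
    character is removed outright.
    """
    text = text.replace("\x1f", soft_hyphen)
    return ILLEGAL_XML_CONTROLS.sub("", text)


# Hypothetical fragment standing in for real TTX content.
dirty = "<Tu>co\x1foperate</Tu>"
element = ET.fromstring(sanitize_ttx_text(dirty))
print(repr(element.text))
```

Note that this only gets the file past the parser. The sanitized segment is no longer character-for-character identical to what an unsanitized TM holds, so the matching problem stays.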

It's also rather nasty that TMX dumps from an infected TM will also contain 0x1F characters - meaning non-TRADOS tools won't be able to parse those, either. And TMX files are supposed to be interoperable.

I think as a matter of policy I'm just going to sanitize and not worry overmuch about the rather small interoperability hit - at least until some actual project requires me to worry about it. Then I'll cross that bridge.
