Sunday, June 8, 2014

Compiling corpora

So there's this German news corpus, collected online between 1996 and 2000, that I intend to use for some of my NLP work, and it occurred to me that I could build a similar corpus (well, the monolingual side of it, anyway) by doing my own periodic retrievals.

To that end, here are the RSS feed pages for the Süddeutsche Zeitung, the Népszabadság Online, and the Népszava (published in New York for Hungarian-Americans).
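For what it's worth, a periodic retrieval could be sketched roughly like this in Python, using only the standard library. This is a minimal sketch, not a finished crawler: it assumes the feeds are plain RSS 2.0 (an Atom feed would need different element names), the function names `parse_rss_items` and `fetch_new_items` are made up for illustration, and the feed URL is whatever you pass in.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_rss_items(xml_text):
    """Extract title, link and publication date from an RSS 2.0 document.

    Assumes un-namespaced <item> elements, as in plain RSS 2.0.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


def fetch_new_items(feed_url, seen_links):
    """Download a feed and return only the items not seen on earlier runs.

    `seen_links` is a set of article URLs that is updated in place, so
    calling this repeatedly (say, from a daily cron job) only yields
    articles that are new since the last call.
    """
    with urlopen(feed_url, timeout=30) as response:
        xml_bytes = response.read()
    fresh = [it for it in parse_rss_items(xml_bytes)
             if it["link"] not in seen_links]
    seen_links.update(it["link"] for it in fresh)
    return fresh
```

The feed only gives headlines and links, so an actual corpus would still need a second step that fetches each linked article and strips the page down to its text.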

Analysis of chemical names

Turns out the linguistic structure of chemical names is non-trivial. Unfortunately, since the field is also quite profitable, most of the literature seems to be behind paywalls, but I'm visiting Bloomington this summer and will have the opportunity to spend some time in the library, so this is one of the things I hope to make some headway on.

In the meantime, here's a paywalled article from the promisingly named Journal of Chemical Information and Modeling, which describes an early version of Name>Struct, a closed-source interpreter for chemical names that strives to understand them the way a human chemist would - that is, it attempts to model actual usage, not just reflect the official definitions of usage. Descriptive, not prescriptive, chemical linguistics.

Anyway, the folks at CambridgeSoft who make Name>Struct have also highlighted some of the pitfalls of chemical linguistics here.

Ah - silly me. A search on "Name>Struct open source" quickly returns OPSIN, an open-source parser for chemical names that I could probably adapt pretty easily. It's hosted here at Bitbucket and written in (shudder) Java. Nifty Web interface here.