Sunday, September 14, 2008

Intersections of human and computer languages in our daily lives

One of the programs I'm writing at work has to glean some very simple information from general HTML documents (that should be general enough not to get me in any trouble, on the off chance that anyone ever reads this stuff anyway). The original idea, suggested by the guy that came up with the design for the larger project that this is part of, was to find a library to represent the data as a tree structure, and search through the tree for what I needed.

I want to be really clear what this all means. HTML is commonly thought of as a Markup Language, a way to communicate to a computer some useful facts about a body of text. In HTML, the markup, that is, the elements that aren't text data but are metadata, comes in the form of tags. These tags can enclose text. Some of the tags lay out the structure of the document (e.g. <h1>, for a top-level section heading) or give some other semantic information about the enclosed text (e.g. <samp>, for sample output from a computer program). Other tags tell how the author wants the text to be presented. Compare <em>, for text that should be emphasized, with <i>, for text that should be italicized; usually a web browser presents these the same way, but they mean different things. The Internet Police don't want you using <i> anymore, and HTML doesn't have tags that cover all the reasons that an author might specifically want something italicized, so people (like the people that wrote Blogger) write stupid shit like <span style="font-style:italic;">.

That's not the whole story on italics, even. People that want to write really “good” web pages will come up with classes of text they always want italicized (say, names of dinosaurs, or of planets), mark them up as instances of these classes (like, Everyone bitched and moaned when <span class="planet-name">Pluto</span> was banished from the Brotherhood of Planets, but where were they all when the majestic <span class="dino-name">Brontosaurus</span> had its name reduced to the utterly unpoetic <span class="dino-name">Apatosaurus</span>?), and define a rule in a stylesheet that text belonging to those classes should be italicized (I don't remember how to do this, it's been a while). And then anyone that wanted to find all the planet and dinosaur names in your document could do so easily by writing a simple program looking for these classes. This last part is the promise of the Semantic Web: that the content of the web will be understandable by humans and computers alike.
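For what it's worth, the "simple program looking for these classes" really can be simple. Here's a minimal sketch in Python using the standard library's html.parser. The class names come from the example above; the ClassTextFinder name and the shortened sample sentence are made up for illustration, and it only looks at <span> elements.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collect the text inside <span> elements that carry a wanted class."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.open_spans = []  # matched class (or None) for each open <span>
        self.found = []       # (class_name, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            match = next((c for c in classes if c in self.wanted), None)
            self.open_spans.append(match)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        # Credit the text to the innermost open span that has a wanted class.
        for match in reversed(self.open_spans):
            if match is not None:
                self.found.append((match, data))
                break


# Hypothetical sample input, trimmed from the sentence in the post.
doc = (
    '<p>Everyone moaned when <span class="planet-name">Pluto</span> was banished, '
    'but where were they when <span class="dino-name">Brontosaurus</span> became '
    '<span class="dino-name">Apatosaurus</span>?</p>'
)

finder = ClassTextFinder(["planet-name", "dino-name"])
finder.feed(doc)
finder.close()
print(finder.found)
```

Of course, this only works because the program already knows which class names to look for.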

The problem, though, is that HTML marks up text. That's not exactly true. I mentioned earlier that HTML is commonly thought of as a markup language; it's technically a hierarchical container language. That's because sections of text defined by tags are only allowed to overlap hierarchically. That is, if you were writing a “Before and After” puzzle for Geek Wheel of Fortune (in which you can use vowels for free but have to buy shell control characters) you couldn't illustrate the premise like <i>If you duz good, you can has a cookie from teh Life Tree in teh garden of teh ceiling <b>cat</i> /dev/mem | strings | grep -i llama</b>. So HTML contains text, hierarchically. Language is not really hierarchical. You can sort of make it appear that way by diagramming sentences, but even that doesn't really represent all the relationships between words in a sentence. It's an awfully reductionist way of looking at language that doesn't do justice to its subtleties, nor to how it evolves. So HTML does a pretty good job of organizing the things that can be organized hierarchically, and leaves the rest to the human ability to process language. Otherwise we'd have this: <p><sentence type="command"><subject implied="yes">you</subject><verb>see</verb><object direct="yes"><noun id="1">Jane</noun><verb performer="1">run</verb></object></sentence><sentence><verb id="2">run</verb><subject>Jane</subject><restate id="2" /></sentence></p>. And if that hurts your head, view the source of it and see the shit I had to write to express an HTML-like language in HTML. I need beer.
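To make the "only allowed to overlap hierarchically" rule concrete, here's a rough sketch of a nesting check in Python's html.parser. It keeps a stack of open tags and complains when a closing tag doesn't match the most recently opened one. The NestingChecker name and the shortened test strings are mine, and the list of void elements it skips is deliberately incomplete.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Record closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.problems = []   # (closing tag seen, tag that should have closed first)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        expected = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, expected))
        # Recover by dropping the nearest open tag with this name, if any.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]


def nesting_problems(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


overlapping = "<i>teh ceiling <b>cat</i> /dev/mem | strings</b>"
nested = "<i>teh ceiling <b>cat</b></i> /dev/mem | strings"
print(nesting_problems(overlapping))
print(nesting_problems(nested))
```

The overlapping version is exactly the thing a tree can't hold: the bold text starts inside the italics and ends outside them.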

<sentence><subject>Al</subject><verb>drink</verb><object with-article="indefinite" plural="yes">beer</object></sentence>

My point is that as long as our Brontosaur-loving friend writes his rants in a natural human language, computers will never understand his pleas. They'll go right on renaming dinosaurs, reclassifying planets, and destroying the game of baseball. And he'll never get anything written in a computer-friendly language, especially if he has to spell out all the relationships between the words and the concepts they refer to, which he will, because computers aren't any good at figuring them out (there are attempts to make computer-friendly language models that suck much less than my token attempt; one is predictably named Babel, and it doesn't appear to represent non-hierarchical word representations at all). His clever use of stylesheets and classes comes out to little more than a very indirect way to say <i> (yes, I know there are theoretically advantages in presentation flexibility and maintainability; I submit that they just about never matter in the case where you're making up your own classes).

Now there's XML. XML is a container language similar to HTML, aimed at representing hierarchical data. My language-modeling language up there is basically XML, though I'd need to flesh out a schema (a definition of the possible elements and how they can relate) for the data to be usable in computer programs. Because I used a lot of well-known grammatical terms (and probably misused some) a human could figure out much of the schema just from reading those simple examples. There seems to be a bit of an XML dream, related to the promise of the semantic web. People get all excited when they see the flexibility of XML and think that it could model all information some day, when computers have the capacity. But it's not that simple. Can the computer program deal with the ambiguity of the same information arriving in different forms and from different sources? Only if the programmer specifically knows about the ambiguity and writes code to handle it. So in reality XML is just used as a way to pass arbitrary hierarchical data around the Internet, much like ASCII is used to pass sequential data around in Unix programs, especially shells. Both present plenty of flexibility and transparency to make up for their inefficiency compared to more specialized data formats, but neither can describe the universe. Only Nil can do that (also see here for an example of a Nil paradox).
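Here's a small sketch of that ambiguity problem, using Python's xml.etree.ElementTree. The three XML snippets and the planet_names function are invented for illustration. The code copes with the two layouts I thought of ahead of time (name as an attribute, name as a child element) and silently misses the third.

```python
import xml.etree.ElementTree as ET


def planet_names(xml_text):
    """Return planet names from the two document layouts this code knows about."""
    root = ET.fromstring(xml_text)
    names = []
    for planet in root.iter("planet"):
        # Layout 1: <planet name="Pluto"/>
        name = planet.get("name")
        if name is None:
            # Layout 2: <planet><name>Pluto</name></planet>
            name = planet.findtext("name")
        if name is not None:
            names.append(name.strip())
    return names


as_attribute = '<sky><planet name="Pluto"/></sky>'
as_child = "<sky><planet><name>Pluto</name></planet></sky>"
unanticipated = '<sky><body kind="planet">Pluto</body></sky>'

print(planet_names(as_attribute))
print(planet_names(as_child))
print(planet_names(unanticipated))  # same fact, but nobody wrote code for this form
```

All three documents say the same thing to a human. The program only hears the ones its programmer planned for.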

So as far as my little program went, I tried to use the hierarchy of the HTML document to interpret its contents (using Perl's HTML::TreeBuilder). It didn't work very well. An HTML document expressing the same information can be structured in a staggering number of different ways. But in order to make sense to its readers it has to present its text in a natural order. So it turned out to be fairly easy to put together a much more robust version of the program by scanning through the document sequentially with HTML::Parser. And it's faster and uses less memory that way, too (I don't expect that the performance difference will matter, but it's a good example of an optimization coming from thinking about a problem in a better way). Instead of viewing the documents as structures of containers holding text, I viewed them as text with some helpful tags giving hints to their structure. Human sequential understanding, 1; computer hierarchical understanding, 0.
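Here's a minimal sketch of the scanning idea, written with Python's html.parser rather than the Perl HTML::Parser the real program uses. It is not the actual program: the "Author:" label, the sample documents, and the LabelScanner and find_value names are all made up. It reads text chunks in document order and grabs whatever follows a label, ignoring how the tags happen to be nested.

```python
from html.parser import HTMLParser


class LabelScanner(HTMLParser):
    """Scan text in document order and keep the first text after a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.seen_label = False
        self.value = None

    def handle_data(self, data):
        if self.value is not None:
            return  # already found what we came for
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self.seen_label:
            self.value = text
        elif self.label in text:
            # The value may sit in the same chunk as the label, or in the next one.
            rest = text.split(self.label, 1)[1].strip()
            if rest:
                self.value = rest
            else:
                self.seen_label = True


def find_value(html, label):
    scanner = LabelScanner(label)
    scanner.feed(html)
    scanner.close()
    return scanner.value


# Three hypothetical layouts of the same fact.
table_doc = "<table><tr><td><b>Author:</b></td><td>Al</td></tr></table>"
para_doc = "<div><p>Author: <em>Al</em></p></div>"
flat_doc = "<p>Author: Al</p>"

for doc in (table_doc, para_doc, flat_doc):
    print(find_value(doc, "Author:"))
```

A tree-walking version would need a separate path for each of those layouts. The scanner doesn't care, because the reading order is the same in all three.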

1 comment:

Silent Five said...

Brief Note Of Appreciation: I have just come off of reading Steven Pinker's "The Language Instinct," and this was really fascinating and topical. Rock.