Will Texai Parse Natural Texts?
On the AGI-list, Mike Tintner offered:
…my guess is that if you are typical of both NLP and most of linguistics, your approach to language will have been focused on sentences (typically, toy, artificial sentences). I do recommend – nay insist – that the focus should also be natural texts – and the structure of texts. Take any newspaper’s articles.
Mike is right on two counts:
1. Texai will initially have tightly scripted conversations with its volunteer mentors for the purpose of being taught the lexical knowledge that it currently lacks, e.g. the plurals of many nouns, or which category (noun, verb, adjective or adverb) many unknown words fall into. These scripts of prompt utterances and mentor responses are necessarily artificial.
2. The purpose of the initial phase of the Texai bootstrap English dialog system is to gather a sufficient lexicon for all parts of speech, and to gather enough commonsense, contextualized relevance facts to disambiguate word senses in the English text that it reads. As Texai is taught more about English words, it will be able to understand a greater variety of utterances. I am using Construction Grammar for implementation and Double R Grammar for phrase structure, so that Texai may eventually be taught enough English grammar constructions to read any newspaper’s articles, although the first goal, beyond understanding its own word sense definitions, will be to read and understand Wikipedia articles.
Mike goes on to support his recommendation with a reference to work on discourse analysis:
What you will rapidly realise when you look at them, is that their structure is *not* logical. Each sentence is *not* an inevitable sequitur from the next. Sentences connect/relate but not logically or in any way, rationally. How they do relate – how the mind forms collages of ideas – is a fascinating and essential subject.
There has been some work on it – on “textual grammars” and the like. You might find this interesting:
From Text Grammar to Critical Discourse Analysis
A brief academic autobiography
Version 2.0. August 2004
Teun A. van Dijk
Universitat Pompeu Fabra, Barcelona
[Some interesting conclusions:]
Kintsch and I introduced another crucial notion, viz., that of a (situation) model, a notion that was also used, though in a different way, by the psycholinguist Johnson-Laird in his books Mental Models (1983). The point of that notion is that language users do not merely construct a (semantic) representation of the text in their episodic memory, but also a representation of the event or situation the text is about. This notion of model proved to be very successful. It explained many things that hitherto were obscure or ignored:
1. First of all, it beautifully ‘grounded’ the theory of referential coherence: Sentences (or their propositions) were simply defined to be coherent relative to a model.
2. Secondly, macrostructures of texts can be explained in terms of the higher level ‘macrostructures’ of models: They may not be directly visible or expressed in the discourse itself, but the fact that people know what its general topics are is represented in their mental model of an event.
3. What people remember of a text is not so much its meaning, as rather the subjective model they build about the event the text is about.
It is noteworthy that Mike Tintner uses Teun A. van Dijk to support his points, which I agree with. Walter Kintsch was not only van Dijk’s inspiration and mentor, but is also my own with regard to how Texai performs reading comprehension.
Construction grammar is not limited to explaining the pairing of form and meaning in words, phrases, utterances and sentences. I hypothesize that it can easily be extended to multiple sentences, paragraphs, whole documents, series of documents and layout of text on a web page. Wherever there is pairing between some form and a particular meaning, then that pairing can be represented as a construction.
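To make the idea concrete, here is a minimal sketch of a construction as a form-meaning pairing that works the same way at the phrase level and at the document level. It is only an illustration under stated assumptions: the names Unit, Construction and matches, the category labels, and the meaning labels are all hypothetical, and none of this is taken from the Texai code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    """A piece of text plus its category (hypothetical labels, for illustration)."""
    text: str
    category: str


@dataclass(frozen=True)
class Construction:
    """A hypothetical pairing of a form (sequence of categories) with a meaning."""
    name: str
    level: str               # e.g. "phrase", "sentence", "document"
    form: Tuple[str, ...]    # ordered categories that make up the form
    meaning: str             # label for the meaning paired with that form


def matches(construction: Construction, units: List[Unit]) -> bool:
    """Return True when the units' categories line up one-for-one with the form."""
    if len(units) != len(construction.form):
        return False
    return all(expected == unit.category
               for expected, unit in zip(construction.form, units))


# The same structure describes a phrase-level pairing...
NOUN_PHRASE = Construction(
    name="determiner-noun",
    level="phrase",
    form=("determiner", "noun"),
    meaning="referring-expression",
)

# ...and a document-level pairing (an assumed, simplified article layout).
ARTICLE = Construction(
    name="encyclopedia-article",
    level="document",
    form=("title", "lead-paragraph", "body"),
    meaning="description-of-a-topic",
)

if __name__ == "__main__":
    phrase = [Unit("the", "determiner"), Unit("book", "noun")]
    page = [
        Unit("Construction grammar", "title"),
        Unit("Construction grammar is a family of theories...", "lead-paragraph"),
        Unit("History. Overview. ...", "body"),
    ]
    print(matches(NOUN_PHRASE, phrase))
    print(matches(ARTICLE, page))
    print(matches(ARTICLE, phrase))
```

The point of the sketch is only that nothing in the pairing itself is tied to a particular size of unit; a real system would need much richer forms (optional elements, nesting, constraints) than a flat list of categories.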



Abram Demski on 14 Jun 2009 at 10:18 pm #
Hi,
I hope I am not being redundant here, as I have not read all that there is to read on Texai. But I am curious. It seems to me that there are several types of indirect language that need to be handled before Texai can read Wikipedia or other natural texts. Among them:
–Figuring out what general concept people are pointing out when they use an example as an explanation
–Determining what is meant by metaphors, especially brief metaphors that people may use without thinking (but which may nonetheless not be standard phrases)
–Figuring out what people intend when they are using sarcasm, beating around the bush, or qualifying their statements heavily to avoid controversy.
The plans I have read seem aimed more at learning English sentence structure and word meanings. Do you have plans for these more abstract linguistic constructs?
Steve Reed on 15 Jun 2009 at 7:40 am #
Abram,
I distinguish Wikipedia from other natural texts because it is much less likely to display the harder-to-understand features that you mention. The last point in particular is discouraged for Wikipedia editors.
There are strategies for dealing with each of these features, but I would like to postpone dealing with them until after Texai is capable of skill acquisition. Then it can be taught these strategies, as skills, by volunteer mentors, rather than by my programming experimental solutions for them.
-Steve
Abram Demski on 15 Jun 2009 at 12:44 pm #
Stephen,
Sampling a random Wikipedia page, I have to admit that you are correct! I saw practically no indirect language of the kind I was referring to.
By the way, are you planning on reading simple.wikipedia.org first?
Steve Reed on 16 Jun 2009 at 1:57 pm #
Derek,
Thanks for the reminder. When I first learned about the Simple English Wikipedia, I planned to try it as a knowledge acquisition target, but I had forgotten about it. Simplified English chooses words from Basic English, which is a controlled language consisting of basic grammatical constructions and a vocabulary of 850 words, or from the larger BE 1500 or Voice of America (VOA) Special English vocabularies.
I would like to move on to Wikipedia as soon as possible after parsing the Simple English Wikipedia. This is because the contextualized spreading activation relevance links learned from simplified English may be misleading when applied to the more complex vocabulary and grammatical constructions found in the more natural English of the full Wikipedia.
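For readers unfamiliar with the term, the following is a generic sketch of spreading activation over weighted links, used here to pick the more relevant of two word senses. It is not the Texai implementation: the function name spread_activation, the decay and steps parameters, and the example words and weights are all assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def spread_activation(
    links: Dict[str, Dict[str, float]],
    sources: Iterable[str],
    decay: float = 0.5,
    steps: int = 2,
) -> Dict[str, float]:
    """Spread activation outward from the source nodes along weighted links.

    links maps a node to its neighbors and link weights (assumed in [0, 1]).
    Each step passes on value * weight * decay to every neighbor.
    """
    activation: Dict[str, float] = defaultdict(float)
    frontier: Dict[str, float] = {node: 1.0 for node in sources}
    for node, value in frontier.items():
        activation[node] += value

    for _ in range(steps):
        next_frontier: Dict[str, float] = defaultdict(float)
        for node, value in frontier.items():
            for neighbor, weight in links.get(node, {}).items():
                next_frontier[neighbor] += value * weight * decay
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier

    return dict(activation)


# Made-up relevance links between context words and two senses of "bank".
EXAMPLE_LINKS = {
    "loan": {"bank/financial": 0.9},
    "money": {"bank/financial": 0.8, "bank/river": 0.1},
    "water": {"bank/river": 0.7},
}

if __name__ == "__main__":
    context = ["loan", "money"]
    scores = spread_activation(EXAMPLE_LINKS, context, steps=1)
    senses = ["bank/financial", "bank/river"]
    best = max(senses, key=lambda sense: scores.get(sense, 0.0))
    print(best)
```

In a scheme like this, the link weights carry everything the system has learned about relevance, which is why weights learned from one style of text could mislead when applied to another.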