On the AGI-list, Mike Tintner offered:

…my guess is that if you are typical of both NLP and most of linguistics, your approach to language will have been focused on sentences (typically, toy, artificial sentences). I do recommend – nay insist – that the focus should also be natural texts – and the structure of texts. Take any newspaper’s articles.

Mike is right on two counts:

1. Texai will initially have tightly scripted conversations with its volunteer mentors so that it can be taught the lexical knowledge it currently lacks, e.g. the plurals of many nouns, or which of the categories noun, verb, adjective and adverb many unknown words fall into. These scripts of prompt utterances and mentor responses are necessarily artificial.

2. The purpose of the initial phase of the Texai bootstrap English dialog system is to gather a sufficient lexicon, for all parts of speech, and to gather enough commonsense, contextualized, relevance facts to disambiguate word senses in the English text that it reads. As Texai is taught more about English words, it will be able to understand a greater variety of utterances. I am using Construction Grammar for implementation, and Double R Grammar for phrase structure, so that Texai may eventually be taught enough English grammar constructions to read any newspaper’s articles, although the first goal, beyond understanding its own word sense definitions, will be to read and understand Wikipedia articles.
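
As a rough illustration of the scripted exchanges described in point 1, here is a minimal Python sketch of a prompt-and-response script that records a word's part of speech and a noun's plural. It is not Texai's actual code: the names `Lexicon`, `pos_prompt`, `plural_prompt`, `record_pos` and `record_plural`, and the wording of the prompts, are all assumptions made up for this example.

```python
from dataclasses import dataclass, field

# The four categories named in the text above.
POS_CATEGORIES = ("noun", "verb", "adjective", "adverb")


@dataclass
class Lexicon:
    """Hypothetical store for what the mentors have taught so far."""
    pos: dict = field(default_factory=dict)      # word -> part of speech
    plurals: dict = field(default_factory=dict)  # noun -> plural form


def pos_prompt(word):
    """Scripted prompt utterance asking for a word's category."""
    return f'Is "{word}" a noun, verb, adjective or adverb?'


def plural_prompt(noun):
    """Scripted prompt utterance asking for a noun's plural."""
    return f'What is the plural of "{noun}"?'


def record_pos(lexicon, word, response):
    """Store the mentor's answer if it names one of the four categories."""
    answer = response.strip().lower()
    if answer not in POS_CATEGORIES:
        return False  # off-script response; the prompt would be repeated
    lexicon.pos[word] = answer
    return True


def record_plural(lexicon, noun, response):
    """Store a plural, but only for a word already taught as a noun."""
    answer = response.strip().lower()
    if not answer or lexicon.pos.get(noun) != "noun":
        return False
    lexicon.plurals[noun] = answer
    return True
```

The script is deliberately rigid: any response outside the expected answers is rejected rather than interpreted, which is what makes such conversations artificial but easy to bootstrap from.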

Mike goes on to support his recommendation with a reference to work on discourse analysis:

What you will rapidly realise when you look at them, is that their structure is *not* logical. Each sentence is *not* an inevitable sequitur from the next. Sentences connect/relate but not logically or in any way, rationally. How they do relate – how the mind forms collages of ideas – is a fascinating and essential subject.
There has been some work on it – on “textual grammars” and the like. You might find this interesting:

http://www.discourses.org/OldArticles/From%20text%20grammar%20to%20critical%20discourse%20analysis.pdf

From Text Grammar to Critical Discourse Analysis
A brief academic autobiography
Version 2.0. August 2004
Teun A. van Dijk
Universitat Pompeu Fabra, Barcelona

[Some interesting conclusions:]
Kintsch and I introduced another crucial notion, viz., that of a (situation) model, a notion that was also used, though in a different way, by the psycholinguist Johnson-Laird in his books Mental Models (1983). The point of that notion is that language users do not merely construct a (semantic) representation of the text in their episodic memory, but also a representation of the event or situation the text is about. This notion of model proved to be very successful. It explained many things that hitherto were obscure or ignored:

1. First of all, it beautifully ‘grounded’ the theory of referential coherence: Sentences (or their propositions) were simply defined to be coherent relative to a model.

2. Secondly, macrostructures of texts can be explained in terms of the higher level ‘macrostructures’ of models: They may not be directly visible or expressed in the discourse itself, but the fact that people know what its general topics are is represented in their mental model of an event.

3. What people remember of a text is not so much its meaning, as rather the subjective model they build about the event the text is about.

It is noteworthy that Mike Tintner cites Teun A. van Dijk in support of his points, and I agree with those points. Walter Kintsch was not only van Dijk’s inspiration and mentor; his work is also my own inspiration for how Texai performs reading comprehension.

Construction grammar is not limited to explaining the pairing of form and meaning in words, phrases, utterances and sentences. I hypothesize that it can easily be extended to multiple sentences, paragraphs, whole documents, series of documents and the layout of text on a web page. Wherever there is a pairing between some form and a particular meaning, that pairing can be represented as a construction.
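
To make the idea concrete, here is a small Python sketch in which one record type holds a form-meaning pairing at any level, from a single word up to a whole document. It is only an illustration under my own assumptions, not how Texai represents constructions: the `Construction` class, its fields, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construction:
    """A pairing of some form with a particular meaning."""
    level: str    # granularity, e.g. "word", "phrase", "document"
    form: str     # informal description of the form (hypothetical notation)
    meaning: str  # informal description of the paired meaning


# Hypothetical sample entries; the same record type serves every level.
CONSTRUCTIONS = [
    Construction("word", "book", "a bound written work"),
    Construction("phrase", "the <noun>", "a definite referent of <noun>"),
    Construction(
        "document",
        "headline, then lead paragraph, then details",
        "a news report whose lead summarizes the event",
    ),
]


def find_by_level(constructions, level):
    """Return the constructions defined at one level of granularity."""
    return [c for c in constructions if c.level == level]


def levels_covered(constructions):
    """Return the distinct levels represented, in first-seen order."""
    seen = []
    for c in constructions:
        if c.level not in seen:
            seen.append(c.level)
    return seen
```

Nothing in the record type is specific to sentences, which is the point of the hypothesis: a document-level pattern such as a news article's layout is stored and looked up in exactly the same way as a word.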