Linking the Visual Module with the Language Module

Once the visual module is built, what good is it? By itself, not much. It becomes useful only when it is linked by knowledge with other cognitive modules. This sub-section sketches an example of how, via instruction by a human educator, a vision module could be usefully linked with a language architecture.

A widely considered problem is the automated text annotation of video: describing objects within video scenes and some of those objects' attributes. For example, such annotations might be useful for blind people if the images being annotated were taken by a camera mounted on a pair of glasses (and the annotations were synthesized into speech provided by the glasses to the wearer's ears via small tubes issuing from the temples of the glasses near the ears).

Figure 7.13 illustrates a simple concept for such a text annotation system. Video input from the eyeglasses-mounted camera is operated upon by the gaze controller, and the objects it selects are segmented and represented by the already-developed visual module, as described in the previous sub-section. The objects used in the visual module development process were those that a blind person would want to be informed of (curbs, roads, cars, people, etc.). Thus, by virtue of its development, the visual module will search each new frame of video for an object of operational interest (because these were the objects sought out by the human educator whose examples were used to train the gaze controller perceptron); that object will then be segmented and, after multi-confabulation, represented by the architecture on all three of its layers.

Fig. 7.13. Image text annotation. A simple example of linking a visual architecture with a (text) language architecture. See text for description
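
To make the data flow concrete, here is a minimal sketch of this per-frame front end in Python. Everything in it is illustrative: the names (select_fixation, segment, represent) merely stand in for the trained gaze-controller perceptron, the segmenter, and the three-layer representation produced by multi-confabulation; none of them are specified in the text.

```python
from dataclasses import dataclass

@dataclass
class VisualRepresentation:
    """Symbols active on the three layers of the visual architecture (illustrative)."""
    primary: set     # active symbols on the primary-layer modules
    secondary: set   # active symbols on the secondary-layer modules
    tertiary: set    # active symbols on the tertiary-layer modules

def process_frame(frame, gaze_controller, segmenter, visual_module):
    """Select an object of operational interest in one video frame and represent it."""
    fixation = gaze_controller.select_fixation(frame)   # trained on the educator's examples
    if fixation is None:                                 # nothing of operational interest
        return None
    eyeball_image = segmenter.segment(frame, fixation)   # isolate the selected object
    return visual_module.represent(eyeball_image)        # multi-confabulation -> three layers
```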

To build the knowledge links from the visual architecture to the text architecture, another human educator is used. This educator looks at each fixation point object selected by the vision architecture (while it is being used out on the street in an operationally realistic manner), and, if this is indeed an object that would be of interest to a blind person, enters a few sentences describing that object. These sentences are designed to convey to the blind person useful information about the nature of the object and its visual attributes (information that can be extracted by the human educator just by looking at the visual representation of the object).
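
One way to picture the raw material this education step produces is as a simple record pairing each accepted fixation object with the educator's sentences (and, optionally, sub-component windows with their own sentences, as described below). The structure and field names are purely illustrative, not taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EducationExample:
    """One educator-labelled fixation object (field names are illustrative)."""
    visual_rep: "VisualRepresentation"   # three-layer representation of the object
    sentences: List[str]                 # educator's descriptive sentences, in order
    # optional: (representation of a local sub-component window, its sentence(s))
    sub_windows: List[Tuple["VisualRepresentation", List[str]]] = field(default_factory=list)
```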

To train the links from the vision architecture to the language architecture (every visual module is afforded a knowledge base to every phrase module), the educator's sentences are entered, in order, into the word modules of the sentence architecture (each of which represents one sentence - see Fig. 7.13); each sentence is parsed into phrases (see Sect. 7.4); and these phrases are represented on the sentence summary module of each sentence. Counts are accumulated between the symbols active on the visual architecture's tertiary modules and those active on the summary modules. If the educator wishes to describe specific visual sub-components of the object, they may designate a local window in the eyeball image for each sub-component and supply the sentence(s) describing each such sub-component. The secondary and tertiary module symbols representing the sub-components within each image are then linked to the summary modules of the associated sentences. Before being used in this application, all of the internal knowledge bases of the language architecture have already been trained using a huge text training corpus.
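
A hedged sketch of this count accumulation: for each education example, co-occurrence counts are tallied between the symbols active on the visual architecture's modules and the symbols active on the sentence summary modules. The summary_symbols call is a stand-in for parsing the educator's sentences into phrases and reading off the resulting summary-module symbols; it is an assumption, not something specified in the text.

```python
from collections import defaultdict

# link_counts[(visual_symbol, summary_symbol)] -> co-occurrence use count
link_counts = defaultdict(int)
# source_counts[visual_symbol] -> number of tallies in which that symbol was active
source_counts = defaultdict(int)

def tally(visual_symbols, summary_symbols):
    """Accumulate use counts between one set of visual symbols and one set of
    sentence-summary symbols."""
    for v in visual_symbols:
        source_counts[v] += 1
        for s in summary_symbols:
            link_counts[(v, s)] += 1

def accumulate_counts(example, language_arch):
    """Tally counts for the object-level description and, if present, for each
    sub-component window and its own sentences."""
    tally(example.visual_rep.tertiary,
          language_arch.summary_symbols(example.sentences))
    for sub_rep, sub_sentences in example.sub_windows:
        tally(sub_rep.secondary | sub_rep.tertiary,
              language_arch.summary_symbols(sub_sentences))
```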

After a sufficient number of education examples have been accumulated (as determined by final performance - described below), the link use counts are converted into p(y|X) probabilities and frozen. The knowledge bases from the visual architecture's modules to all of the sentence summary modules are then combined (so that the available long-range context can be exploited by a sentence in any position in the sequence of sentences to be generated). The annotation system is now ready for testing.
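
A minimal sketch of the freezing step, assuming each p(y|x) is estimated as a simple relative frequency (the actual estimator and any smoothing are not specified here). Because the result is keyed by (visual symbol, summary symbol) pairs rather than by individual modules, it already acts as the combined knowledge base that a sentence in any position can draw on.

```python
def freeze_links(link_counts, source_counts, min_count=1):
    """Convert accumulated use counts into frozen link probabilities p(y|x)."""
    frozen = {}
    for (v, s), c in link_counts.items():
        if c >= min_count:                      # optionally discard rare, noisy links
            frozen[(v, s)] = c / source_counts[v]   # simple relative-frequency estimate
    return frozen
```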

The testing phase is carried out by having a sighted evaluator walk down the street wearing the system (yes, the idea is that the entire system is in the form of a pair of glasses!). As the visual module selects and describes each object, knowledge link inputs are sent to the language module. These inputs are used, much as in the example of Sect. 7.3: as context that drives formation of complete sentences. Using multiconfabulation, the language architecture composes one or more grammatical sentences that describe the object and its attributes (see Chap. 2 and the DVD video presentation for examples of whole-sentence generation).
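
How the frozen links deliver that context might be sketched as follows, assuming the usual confabulation convention that each active visual source symbol adds an excitation term to every summary symbol it links to. The logarithmic form and the floor probability p0 are assumptions for this sketch, not details given in the text.

```python
import math
from collections import defaultdict

def summary_excitations(active_visual_symbols, frozen_links, p0=1e-4):
    """Excite sentence-summary symbols from the visual context.

    Each active visual symbol contributes ln(p/p0) to every summary symbol it
    links to, so symbols supported by more (and stronger) links end up more excited.
    """
    excitation = defaultdict(float)
    for (v, s), p in frozen_links.items():
        if v in active_visual_symbols:
            excitation[s] += math.log(max(p, p0) / p0)
    return dict(excitation)
```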

The number of sentences is determined by a meaning content critic subsystem (not shown in Fig. 7.13) which stops sentence generation when all of the distinctive, excited, sentence summary module symbols have been "used" in one or more of the generated sentences.
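
A sketch of that stopping rule, assuming a hypothetical generate_sentence routine that runs one round of multiconfabulation and returns a sentence together with the set of summary symbols it "used". The threshold separating distinctive excitations from background, and the cap on sentence count, are likewise assumptions.

```python
def annotate(excitation, generate_sentence, threshold, max_sentences=5):
    """Generate sentences until every distinctive excited summary symbol is covered."""
    distinctive = {s for s, e in excitation.items() if e >= threshold}
    remaining, sentences = set(distinctive), []
    while remaining and len(sentences) < max_sentences:
        sentence, used = generate_sentence(excitation, remaining)
        sentences.append(sentence)
        remaining -= used   # stop once all distinctive symbols have been used
    return sentences
```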

This sketch illustrates the monkey-see/monkey-do principle of cognition: there is never any complicated algorithm or software. No deeply principled system of rules or mathematical constraints. Just confabulation and multiconfabulation. It is a lot like that famous cartoon where scientists are working at a blackboard, attempting, unsuccessfully, to connect up a set of facts on the left with a desired conclusion on the right, via a complicated scientific argument spanning the gap between them. In frustration, one of the scientists erases a band in the middle of the argument and puts in a box (equipped with input and output arrows) labeled "And Then a Miracle Occurs." That is the nature of cognition.
