Representation by similarity works on the same principle as a global positioning system (GPS), an electronic navigation aid that so conveniently absolves us from the need to know how to get where we’re going. A GPS receiver figures out where it is by estimating how far it is from each of the orbiting satellites, whose own positions are precisely determined at all times. Likewise, a cell phone can triangulate its location from the strengths of the signals it receives from several relay towers. This robust analogy between positioning by triangulation, on the one hand, and representation by similarity, on the other hand, leads us right up to a key conceptual tool for understanding how the mind works: representation spaces.
Representation Space: The Final Frontier
To stake a claim in the conceptual frontier-land of mind science, think of each face snapshot as a point in a big representation space—let’s call it the face space. (The concept of a point in a representation space is so widely applicable that we might just as well give it a name: this is our no. 4 conceptual tool.) Storing a bunch of face snapshots then becomes a matter of marking their representations in the face space; think of sticking little flags on a map in the war room. Once the “flags” are up, a new face can be described precisely and efficiently by how close it is to each of these reference points, because proximity between points in face space corresponds to similarity between the faces.
If a face-space point represents a particular snapshot, then all the different directions away from it correspond to all the different ways in which the appearance of the face can change. We already know that some of these changes actually need to be ignored because they stem from illumination or orientation shifts and are of no consequence to face identity. A smart procedure for gauging face similarity—that is, face-space proximity—must therefore give less weight to these face-space directions and more to those that correspond to identity differences. It turns out that brains can learn this kind of smarts on the job, simply by storing a few examples that represent each kind of change.
A stored example is represented in the brain by the standard building block—a neuron. A typical neuron in the visual system can learn to represent a snapshot of a face (or of some other object) by being “imprinted” with it—having its input synapses adjusted so as to evoke a selective response to the stimulus in question. Such learning results in the neuron becoming tuned to the stimulus, so that subsequently it responds most strongly to it and progressively less strongly to stimuli that are less and less similar. An ensemble of such coarsely tuned neurons has exactly what is needed to pinpoint the face-space location of a new face, using triangulation by graded similarity (the same computation used by GPS receivers).
19
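Triangulation by graded similarity can be sketched in a few lines of Python. This is a toy illustration rather than a model of real neurons: the Gaussian tuning curve, the `width` parameter, the optional per-direction `weights`, and the two-dimensional "measurement space" are all assumptions introduced here. Each stored example plays the role of a tuned unit, and a new face is described by the list of graded responses it evokes.

```python
import math

def tuned_response(stimulus, preferred, width=1.0, weights=None):
    """Graded response of one unit: 1.0 at its preferred stimulus,
    falling off smoothly as the stimulus becomes less similar.
    The Gaussian shape is an assumption made for this sketch."""
    if weights is None:
        weights = [1.0] * len(preferred)
    # Weighted squared distance; a small weight discounts that direction.
    sq_dist = sum(w * (s - p) ** 2
                  for w, s, p in zip(weights, stimulus, preferred))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def similarity_code(stimulus, stored_examples, width=1.0, weights=None):
    """Describe a stimulus by its similarity to each stored example."""
    return [tuned_response(stimulus, example, width, weights)
            for example in stored_examples]

# Three stored "snapshots" (the flags on the map) in a toy 2-D space.
stored = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
new_face = (1.0, 1.0)

# One number per stored example: together they locate the new face.
code = similarity_code(new_face, stored, width=2.0)
```

Setting a weight to zero makes the code blind to changes along that direction, which is one simple way of giving less weight to directions that do not matter for identity.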
But how can this mechanism distinguish between important and non-important face-space directions? By seeking regularities in the skein of face-space tracks laid down in the representation space by individual faces as they are observed under varying conditions.
As Romeo sees Juliet for the first time, some of his neurons that usually respond to visual objects become active, each one just so—depending on how excited it gets by being exposed to Juliet’s features, which in turn depends on how close these are to this particular neuron’s prior experience. The list of numbers that denote the activities of this ensemble of neurons defines the point in Romeo’s face space that lights up in response to Juliet’s face (as Shakespeare almost wrote—“It is the brain, and Juliet is the sun”²⁰). Here’s what happens when Juliet turns aside: as the orientation of her face changes gradually, its representation in Romeo’s face space splits off from the original point and gradually moves away, along a very specific face-space track. Romeo has never seen Juliet’s face from the side before; will his visual system now do the smart thing?
To realize that Juliet seen from any angle is still Juliet, Romeo’s visual system can draw on the memories of its experience with other faces. By the time he meets Juliet, he has seen other faces undergo the same turning-aside transformation. For each of those faces, he has stored a sparse sequence of representations that blaze the “turning-aside” track across face space. He can now use this knowledge, distilled from the regularities observed in the past experience, to interpret the ongoing change in the appearance of Juliet’s face.
Distilling knowledge from visual experience puts available memory to good use. Instead of grabbing face snapshots indiscriminately, the visual system samples and strings them together in strands that run through the face space in parallel to each other, corresponding to different faces undergoing the same transformation. As a result, it can make up for changes in the appearance of a new face through a kind of analogy, by interpolating its prior experience with other, similar faces. The very same trick also works for other categories of visual objects and, interestingly enough, for other cognitive tasks, such as motor control, reasoning, and language: knowledge grows out of experience, and analogy rules all.
21
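The interpolation-by-analogy idea can also be put in code. The sketch below assumes that each familiar face contributes one pair of face-space points, a frontal view and a turned-aside view, and that the shift between the two is roughly parallel across faces. The function name `predict_by_analogy`, the Gaussian weighting, and the toy coordinates are hypothetical choices made for illustration.

```python
import math

def predict_by_analogy(new_frontal, known_tracks, width=1.0):
    """Guess where a new face will land after turning aside.

    known_tracks holds (frontal_point, turned_point) pairs for familiar
    faces. Each pair contributes its frontal-to-turned shift, weighted
    by how similar its frontal view is to the new face.
    """
    weights = []
    for frontal, _ in known_tracks:
        sq_dist = sum((a - b) ** 2 for a, b in zip(new_frontal, frontal))
        weights.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    total = sum(weights)

    # Similarity-weighted average of the shifts seen in past experience.
    shift = [
        sum(w * (turned[i] - frontal[i])
            for w, (frontal, turned) in zip(weights, known_tracks)) / total
        for i in range(len(new_frontal))
    ]
    return [x + s for x, s in zip(new_frontal, shift)]

# Two familiar faces, each seen frontally and turned aside.
tracks = [((0.0, 0.0), (1.0, 0.5)),
          ((2.0, 1.0), (3.0, 1.5))]

juliet_frontal = (1.0, 1.0)
juliet_turned = predict_by_analogy(juliet_frontal, tracks)
```

Because both familiar faces shift by the same amount when they turn, the prediction for the new face is simply its frontal point displaced by that shared amount; with less regular tracks, the more similar faces would dominate the estimate.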
Being in the World
For the mind’s war cabinet that meets in the dark shelter of the skull, the patterns of data that flicker across the sensor arrays are the only signs that it can ever have of the external world. What if there is danger out there? Imagine how terrifying it would be to know that there is a madman with a sword a few paces away from you, but not whether he is before or behind you. For obvious evolutionary reasons, such an ability to recognize the “what” without the “where” of a stimulus is unheard of in animal cognition, where even a yeast cell’s chemical sense is directional.
In us primates, who are equipped with camera-like eyes capable of high-quality, highly directional imaging, the face space and the representation spaces for other objects are all yoked, as it were, to a common “space” space. This common underlying spatial scaffolding supports not only vision but also other senses, the data from all of which are brought into spatial register. This sensory integration effectively creates a complete virtual world, centered on the perceiver.
The realization that one’s perceptual world is a virtual construct (even if it does reflect faithfully many aspects of external reality) may take some time to sink in. As a first aid in overcoming the understandable, yet false, belief that minds have an unmediated grasp of the real world, ponder this: your vantage point in this world seems to be located right behind the bridge of your nose, yet what you see in front of you is this page, not the inside of your skull, where, as I keep pointing out, it is perpetually dark. This little observation suggests that our perception of the external world, effective as it may be in delivering those cues that matter for our evolutionary fitness, is not to be trusted blindly in all matters. And yet, trust it blindly we must, because we have no choice: the cabinet members sequestered in the war room may become collectively aware of their predicament, but they may not escape it.
22
The predicament of being confined to a virtual reality rig may feel disturbing when one first becomes aware of it, but in practical terms it is quite benign, because the rig has evolved to function well enough in the environment it is situated in. The first order of business for the senses is to make the spatial structure of this environment available for the decision-making processes in the rig. Given that sensations are spatially tagged, the proper way of describing how they present themselves for the executive cabinet’s consideration is to envisage a grand annotated map that wraps around the war room.
The grand map represents the surrounding space, which is how it conveys the “where” information. It is also annotated with “what” information in the form of multiple face/object spaces, each attached to some location on the master map. The function of this map is to generate behavior by coordinating perception and action—battle intelligence and battle plans. It does so by expressing the inputs from the senses and the potential outputs from the behavioral control processes in the same embodied and situated language of cues and action “handles,” all anchored in extra-personal space.
It is this kind of battle intelligence that guides young Romeo’s behavior as he scales the wall of the Capulets’ orchard, on the night of the masked ball, where he is about to fall for the daughter of the lady of the house. A pale visage appears in the dusk. On the grand map in Romeo’s brain, it receives a tentative label of “female face” and is represented by several numbers conveying its similarities to the faces of some women known to him. It is attached to a particular location in the visual field: up there, on some kind of balcony. He is drawn to it; he approaches. The face, as it comes into focus, resolves itself into that of Juliet.
The Instruments of Change
The grand map that presents itself to the mind’s executive cabinet is far removed from the raw data gathered by the senses. It is loaded with actionable, useful information, which the perceptual processes work hard to extract and refine. What form does that information take? Sci-fi movies that involve robots occasionally offer the viewer a glimpse of an imagined cybernetic protagonist’s display-like internal map, on which various objects of interest are labeled in plain English (set in a typeface intended to look futuristic). This cinematic trope does get one thing right: any embodied agent, including us biological robots, needs to coordinate perception and action, and a map is a very convenient prop for doing so. There is also, however, something very wrong here: a map that is annotated in English cannot be a part of a mind, because it would only be intelligible to a whole mind that can read. The labels on the human mind’s grand map are not human-readable, because the consumers of the information that they carry are not human: they are the many computations that may in themselves be simple, yet are collectively complex enough to be a mind.
23
Some of these computations are directly aimed at making things happen by steering the body to physically engage the rest of the world. A mind can only fulfill its function of channeling forethought if it is capable of bringing about change, by moving the body of which it is part, or by using the body to move other things. As in perception, in the control of movement there is a need for representations that downplay irrelevant variation, such as the postural context in which a reaching movement needs to be executed. This is why motor control “scripts,” like perceptual representations, rely on similarity to stored examples and, more generally, on analogy. You learn to balance yourself on a snowboard by just doing it (on moderate enough slopes) while retaining and organizing motor memories of the more successful of your moves; the resulting experience is likely to help you somewhat in your first attempts at surfing, despite the differences between the kinds of support and resistance offered by snow and water.
Animal brains control the posture and movement of the bodies in which they reside by sending to muscles signals that cause them to contract and exert force. To move a body around, or even just to prevent it from collapsing on itself like a skin stuffed with organs, a brain needs to compute some numbers, one per muscle, and send them to their destinations. Some simple actions may be completely specified by a single number, as in the case of a scallop closing its shell in response to the passing shadow of a cuttlefish. In comparison, to animate a human skeleton, a bundle of scripts controlling a multitude of muscles and unfolding in lockstep needs to be played out, each consisting of a sequence of numbers, generated in the proper order and with proper timing.
These numbers, their order, and their timing all depend very much on the mechanics of the body part or parts that the script needs to control and on the kind of environment in which the body is situated. In a body that has a few hundred muscles with which to pull itself around, the computational problem of motor control is difficult indeed. Which muscles to activate, how strongly, for how long, and in what order—all these details need to be figured out, and this needs to happen in a timely fashion (whether to avoid becoming someone’s dinner or to turn someone else into your dinner, bar access to your gene pool, or gain access to someone else’s). Because of all that, in behaviorally sophisticated animals with mechanically complex bodies, such as humans or ravens or octopuses, motor control has to be hierarchical, with simpler muscle synergies serving as building blocks for the construction of progressively more and more complex ones.
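To make the idea of scripts and building blocks concrete, here is a minimal Python sketch. Everything in it is hypothetical: the muscle names, the two synergies, and the activation levels are invented for illustration and carry no anatomical authority. A synergy assigns one number to each of a few muscles, a mix of synergies adds up to a full muscle command, and a script is a timed sequence of such mixes.

```python
# Hypothetical low-level building blocks: each synergy assigns an
# activation level to a handful of muscles.
SYNERGIES = {
    "grip": {"finger_flexor": 0.8, "thumb_flexor": 0.6},
    "lift": {"biceps": 0.7, "deltoid": 0.4},
}

def activate(synergy_mix, synergies=SYNERGIES):
    """Combine weighted synergies into one command: a number per muscle."""
    command = {}
    for name, gain in synergy_mix.items():
        for muscle, level in synergies[name].items():
            command[muscle] = command.get(muscle, 0.0) + gain * level
    return command

def play_script(script):
    """Expand a script, given as (time, synergy mix) steps in order,
    into the (time, muscle command) pairs sent to the muscles."""
    return [(t, activate(mix)) for t, mix in script]

# A higher-level "pick up" script built from the simpler synergies:
# first grip, then keep gripping while lifting.
pick_up = [
    (0.0, {"grip": 1.0}),
    (0.2, {"grip": 1.0, "lift": 0.5}),
]
commands = play_script(pick_up)
```

The higher-level script never mentions individual muscles; it only orders and scales the synergies, which is the sense in which simpler units serve as building blocks for more complex ones.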
The motor control problem can be made more tractable by perceiving in the geometry and physics of the body and the environment not just obstacles that must be overcome but opportunities that can be exploited. Such opportunities are called affordances: a flat horizontal surface affords sitting on, a basketball backboard affords directing the ball into the hoop, and a pond surface affords running on if you are a water spider or a basilisk lizard.
24
Experienced biological perceptual systems seek out affordances automatically, because in animals perception and motor control are inextricably interwoven: a body can get eaten because of a failure of either.
If predators, prey, competitors, or other objects of interest in your ecological niche move fast, and if your performance depends on matching their speed, your motor control system cannot rely too much on incremental corrections driven by perceptual feedback. The only resort in such cases is to distill behavioral experiences into task-specific models that capture as much as possible of the mechanics of your own body (“if these commands are sent to those muscles while I am in that posture, I duck”), the situation (“if a sword thrust is coming from there, duck”), and, if the situation involves other animate agents, their own statistically likely behavior (“what would Tybalt do?”). Such process models are embodiments of forethought, used to simulate ahead of time both what may happen in a highly dynamic situation and how to deal with it, in the most literal physical sense.
25
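Simulating ahead of time, instead of waiting for feedback, can be illustrated with a deliberately crude Python sketch. The one-line `forward_model`, the `safety_margin`, and the numbers are assumptions made up for this example; a real body model would be learned from experience and far richer.

```python
def forward_model(head_height, command):
    """Hypothetical body model: predicted head height (in meters)
    after a motor command that raises or lowers the body."""
    return head_height + command

def choose_command(head_height, threat_height, candidates, safety_margin=0.3):
    """Try out each candidate command in simulation and keep those whose
    predicted outcome clears the threat; prefer the smallest movement.
    Returns None if no candidate is predicted to be safe."""
    safe = [c for c in candidates
            if abs(forward_model(head_height, c) - threat_height) >= safety_margin]
    return min(safe, key=abs) if safe else None

# A thrust is predicted at 1.6 m; the head is at 1.7 m.
# Candidates: stay put, dip slightly, or duck properly.
best = choose_command(1.7, 1.6, candidates=[0.0, -0.2, -0.5])
```

Here the decision is made entirely on predicted outcomes: staying put and dipping slightly are both rejected in simulation, and only the full duck survives, before any movement has actually taken place.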