Neural Basis of Motivational and Cognitive Control

Rogier B. Mars, Jerome Sallet, Matthew F. S. Rushworth, and Nick Yeung

Print publication date: 2011

Print ISBN-13: 9780262016438

Published to MIT Press Scholarship Online: August 2013

DOI: 10.7551/mitpress/9780262016438.001.0001



Neural Correlates of Hierarchical Reinforcement Learning


Chapter:
16 Neural Correlates of Hierarchical Reinforcement Learning
Source:
Neural Basis of Motivational and Cognitive Control
Author(s):

José J. F. Ribas-Fernandes

Yael Niv

Matthew M. Botvinick

Publisher:
The MIT Press
DOI: 10.7551/mitpress/9780262016438.003.0016

Abstract and Keywords

This chapter discusses the relevance of reinforcement learning (RL) to the hierarchical structure of behavior. It first reviews the fundamentals of RL, with a focus on temporal-difference learning in actor-critic models. Next, it discusses the scaling problem and the computational issues that stimulated the development of hierarchical reinforcement learning (HRL). It then describes potential neuroscientific correlates of HRL, presents the results of some initial empirical tests, and ends with directions for further research.

Keywords:   reinforcement learning, RL, hierarchical reinforcement learning, HRL, temporal-difference learning, actor-critic models

Over the past two decades, ideas from computational reinforcement learning (RL) have had an important and growing effect on neuroscience and psychology. The impact of RL was initially felt in research on classical and instrumental conditioning.13,104,112 Soon thereafter, its reach extended to research on midbrain dopaminergic function, where the temporal-difference learning paradigm provided a framework for interpreting temporal profiles of dopaminergic activity.10,51,67,93 Subsequently, actor-critic architectures for RL have inspired new interpretations of functional divisions of labor within the basal ganglia and cerebral cortex52 (see also chapters 17 and 18, this volume), and RL-based accounts have been advanced to address issues as diverse as motor control,66 working memory,74 performance monitoring,48 and the distinction between habitual and goal-directed behavior.31

Despite this widespread absorption of ideas from RL into neurobiology and cognitive science, important questions remain concerning the scope of its relevance. In particular, RL-inspired research has generally focused on highly simplified decision-making situations involving choice among a small set of elementary actions (e.g., left vs. right saccades), or on Pavlovian settings involving no action selection at all. It thus remains uncertain whether RL principles can help us understand learning and action selection in more complex behavioral settings, akin to those arising in everyday life.30

In the present chapter, we consider the potential relevance of RL to one particular aspect of complex behavior, namely, its hierarchical structure. Since the inception of cognitive psychology, it has been noted that naturalistic behavior displays a stratified or layered organization.58,64,89 As stated by Fuster,38 “Successive units with limited short-term goals make larger and longer units with longer-term objectives. … Thus we have a pyramidal hierarchy of structural units of increasing duration and complexity serving a corresponding hierarchy of purposes” (p. 159). A concern with hierarchical action structure has continued to inform behavioral research to the present day,19,25,91,117 and has figured importantly in neuroscientific research bearing on the prefrontal cortex, where evidence has arisen that representations of successive levels of task structure may map topographically onto the cortical surface.7,20,27,38,57,114

Can hierarchical behavior be understood in terms provided by RL, or does it involve fundamentally different computational principles? One lead in pursuing this question can be gleaned from recent RL research. As it turns out, a great deal of recent work in computational RL has focused precisely on the question of how RL methods might be elaborated to accommodate hierarchical behavioral structure. This has given rise to a general framework referred to, aptly enough, as hierarchical reinforcement learning (HRL).11,34,106 In considering whether RL might be relevant to hierarchical behavior in animals and humans, a natural approach is to evaluate whether the brain might implement anything like the mechanisms stipulated in computational HRL. In recent work, we have undertaken this project.21,26,82 Our objective in the present chapter is to review the results obtained so far, and to offer an interim evaluation of the HRL hypothesis.

Before getting to neuroscientific data, a number of preliminaries are in order. We begin by reviewing the basics of RL, and in particular temporal-difference learning within the actor-critic architecture. Next, we discuss the computational issues that stimulated the development of HRL, and introduce the fundamental elements of HRL itself. With this foundation in place, we consider potential neuroscientific correlates of HRL, describe results of some initial empirical tests, and finally chart out some directions for further research.

Fundamentals of RL: Temporal Difference Learning in Actor-Critic Models

RL problems comprise four elements: a set of world states; a set of actions available to the agent in each state; a transition function, which specifies the probability of transitioning from one state to another when performing each action; and a reward function, which indicates the amount of reward (or cost) associated with each such transition. Given these elements, the objective for learning is to discover a policy, that is, a mapping from states to actions, that maximizes cumulative discounted long-term reward.
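As a concrete (and purely illustrative) rendering of these four elements, the problem can be written down as a small Python structure; all of the names below are ours, not the chapter's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str
Action = str

@dataclass
class RLProblem:
    """The four elements of an RL problem listed above (illustrative sketch)."""
    states: List[State]                                         # set of world states
    actions: Dict[State, List[Action]]                          # actions available in each state
    transition: Callable[[State, Action], Dict[State, float]]   # (s, a) -> {s': P(s' | s, a)}
    reward: Callable[[State, Action, State], float]             # reward (or cost) for each transition

# A policy maps states to actions; the learning objective is to find the policy
# that maximizes cumulative discounted long-term reward.
Policy = Dict[State, Action]
```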

There are a variety of specific algorithmic approaches to solving RL problems.16,105 We focus here on the approach that has arguably had the most direct influence on neuroscientific translations of RL, referred to as the actor-critic paradigm.10,52 In actor-critic implementations of RL,14,51,52,103 the learning agent is divided into two parts, an actor and a critic (see figure 16.1a). The actor selects actions according to a modifiable policy, π(s), which is based on a set of weighted associations from states to actions, often called action strengths. The critic maintains a value function, V(s), associating each state with an estimate of the cumulative, long-term reward that can be expected subsequent to visiting that state.


Figure 16.1 An actor-critic implementation. (a) Schematic of the basic actor-critic architecture. (b) An actor-critic implementation of HRL. (c, d) Putative neural correlates of the elements diagrammed in panels a and b. DA, dopamine; DLPFC, dorsolateral prefrontal cortex, plus other frontal structures potentially including premotor, supplementary motor, and pre-supplementary motor cortices; DLS, dorsolateral striatum; HT+, hypothalamus and other structures, potentially including the habenula, the pedunculopontine nucleus, and the superior colliculus; OFC, orbitofrontal cortex; VS, ventral striatum. Adapted from Botvinick, Niv, and Barto.21

Importantly, both the action strengths and the value function must be learned based on experience with the environment. At the outset of learning, the value function and the actor’s action strengths are initialized, for instance, uniformly or randomly, and the agent is placed in some initial state. The actor then selects an action, following a rule that favors high-strength actions but also allows for exploration. Once the resulting state is reached and its associated reward is collected, the critic computes a temporal-difference prediction error, δ. Here, the value that was attached to the previous state is treated as a prediction of the reward that would be received in the successor state, R(s), plus the value attached to that successor state. A positive prediction error indicates that this prediction was too low, meaning that things turned out better than expected. Of course, things can also turn out worse than expected, yielding a negative prediction error.
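Written out in symbols (our rendering; the verbal description leaves the discount factor γ implicit), the prediction error computed on arriving in successor state s_{t+1} from state s_t is:

```latex
\delta_t = R(s_{t+1}) + \gamma\, V(s_{t+1}) - V(s_t)
```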

The prediction error is used to update both the value attached to the previous state and the strength of the action that was selected in that state. A positive prediction error leads to an increase in the value of the previous state and the propensity to perform the chosen action at that state. A negative error leads to a reduction in these values. After the appropriate adjustments, the agent selects a new action, a new state is reached, a new prediction error is computed, and so forth. As the agent explores its environment and this procedure is repeated, the critic’s value function becomes progressively more accurate, and the actor’s action strengths change so as to yield progressive improvements in behavior, in terms of the amount of reward obtained.
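A minimal tabular sketch of this learning loop is given below. It rests on assumptions of our own (a hypothetical `env` object exposing `reset()` and `step(action)`, softmax action selection, and a shared learning rate for actor and critic) and is meant to illustrate the bookkeeping rather than reproduce any particular model discussed in the chapter.

```python
import math
import random

def softmax_choice(action_strengths, temperature=1.0):
    """Favor high-strength actions while still allowing exploration."""
    actions = list(action_strengths)
    weights = [math.exp(action_strengths[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def actor_critic_episode(env, V, strengths, alpha=0.1, gamma=0.95):
    """One episode of tabular actor-critic temporal-difference learning (sketch).

    Assumed interface (hypothetical): env.reset() -> initial state;
    env.step(action) -> (next_state, reward, done). V[s] holds the critic's
    value estimate for state s; strengths[s][a] holds the actor's strength
    for action a in state s.
    """
    s = env.reset()
    done = False
    while not done:
        a = softmax_choice(strengths[s])                    # actor selects an action
        s_next, r, done = env.step(a)                       # observe successor state and reward
        target = r + (0.0 if done else gamma * V[s_next])   # reward plus value of successor state
        delta = target - V[s]                               # temporal-difference prediction error
        V[s] += alpha * delta                               # critic: update value of previous state
        strengths[s][a] += alpha * delta                    # actor: update propensity for chosen action
        s = s_next
```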

The actor-critic architecture, and the temporal-difference learning procedure it implements, have provided a very useful framework for decoding the neural substrates of learning and decision making. Although accounts relating the actor-critic architecture to neural structures do vary,52 one influential approach has been to identify the actor with the dorsolateral striatum (DLS), and the critic with the ventral striatum (VS) and the mesolimbic dopaminergic system32,71 (figure 16.1c). Dopamine (DA), in particular, has been associated with the function of conveying reward prediction errors to both actor and critic.10,67,93 This set of correspondences provides an important backdrop for our later discussion of HRL and its neural correlates.

The Scaling Problem in RL

Even as excitement initially grew concerning potential applications of RL within neuroscience, concerns were already arising in computer science over the limitations of RL. In particular, it became clear very early on in the history of RL research that RL algorithms face a scaling problem: They do not cope well with tasks involving a large space of environmental states or possible actions. It is this scaling problem that immediately stimulated the development of HRL, and it is therefore worth characterizing the problem before turning to the details of HRL itself.

A key source of the scaling problem is the fact that an RL agent can learn to behave adaptively only by exploring its environment, trying out different courses of action in different situations or states of the environment, and sampling their consequences. As a result of this requirement, the time needed to arrive at a stable behavioral policy increases with both the number of different states in the environment and the number of available actions. In most contexts, the relationship between training time and the number of environmental states or actions is a positively accelerating function. Thus, as problem size increases, standard RL eventually becomes infeasible.

Several computational maneuvers have been proposed to address the scaling problem. For example, one important approach is to simplify the state space by treating subsets of environmental states as behaviorally equivalent, a measure referred to as state abstraction.60 Another approach aims at optimizing the search for an optimal behavioral policy by balancing judiciously between exploration and exploitation of established knowledge.55

HRL methods arose as another way of addressing the scaling problem in RL. The key to HRL is the use of temporal abstraction.11,34,77,106 Here, the basic RL framework is expanded to include “temporally abstract” actions, representations that group together a set of interrelated actions (for example, grasping a spoon, using it to scoop up some sugar, moving the spoon into position over a cup, and depositing the sugar), casting them as a single higher-level action or skill (“add sugar”). These new representations are described as temporal abstractions because they abstract over temporally extended, and potentially variable, sequences of lower-level steps. A number of other terms have been used as well, including “skills,” “operators,” “macro-operators,” and “macro-actions.” In what follows, we often refer to temporally abstract actions as options.106

In most versions of RL that use temporal abstraction, it is assumed that options can be assembled into higher-level skills in a hierarchical arrangement. Thus, for example, an option for adding sugar might form part of other options for making coffee and tea. It is the importance of such hierarchical structures in work using temporal abstraction that gave rise to the moniker HRL.

Adding temporal abstraction to RL can ease the scaling problem in two ways. The first way is through its impact on the exploration process. In order to see how this works, it is useful to picture the agent as searching a tree structure (figure 16.2a). At the apex is a node representing the state occupied by the agent at the outset of exploration. Branching out from this node are links representing primitive actions, each leading to a node representing the state (and, possibly, reward) consequent on that action.


Figure 16.2 The options framework in HRL. (a–c) illustrate how options can facilitate search. (a) A search tree with arrows indicating the pathway to a goal state. A specific sequence of seven independently selected actions is required to reach the goal. (b) The same tree and trajectory, the arrows indicating that the first four and the last three actions have been aggregated into options. Here, the goal state is reached after only two independent choices (option selections). (c) Search using option models allows the consequences of options to be forecast without requiring consideration of the lower-level steps involved in executing the option. (d) Schematic illustration of HRL dynamics. a, primitive actions; o, option. On the first time step (t = 1), the agent executes a primitive action (forward arrow). Based on the consequent state (i.e., the state at t = 2), a prediction error δ is computed (arrow running from t = 2 to t = 1), and used to update the value (V) and action/option strengths (π) associated with the preceding state. At t = 2, the agent selects an option (long forward arrow), which remains active through t = 5. During this time, primitive actions are selected according to the option’s policy (lower tier of forward arrows), with prediction errors (lower tier of curved arrows) used to update Vo and πo associated with the preceding state, taking into account pseudo-reward received throughout option execution (lower asterisk). The option is terminated once its subgoal state is reached. The prediction error computed for the entire option (long curved arrow) is used to update the values and option strengths associated with the state in which the option was initiated. The agent then selects a new action at the top level, yielding external reward (higher asterisk). The prediction errors computed at the top level, but not at the level below, take this reward into account. Adapted from Botvinick, Niv, and Barto,21 with permission.

Further action links project from each of these nodes, leading to their consequent states, and so forth. The agent’s objective is to discover paths through the decision tree that lead to maximal accumulated rewards. However, the number of possible paths grows with both the number of actions available to the agent and the number of reachable states. With increasing numbers of either, it becomes progressively more difficult to discover, through exploration, the specific traversals of the tree that would maximize reward.

Temporally abstract actions can alleviate this problem by introducing structure into the exploration process. Specifically, the policies associated with temporally abstract actions can guide exploration down specific partial paths through the search tree, potentially allowing earlier discovery of high-value traversals. The principle is illustrated in figure 16.2. Discovering the pathway illustrated in figure 16.2a, using only primitive, one-step actions, would require a specific sequence of seven independent choices. This changes if the agent has acquired—say, through prior experience with related problems—two options corresponding to the differently colored subsequences in figure 16.2b. Equipped with these, the agent would only need to make two independent decisions to discover the overall trajectory, namely, select the two options. Here, options reduce the effective size of the search space, making it easier for the agent to discover an optimal trajectory.

The second, and closely related, way in which temporally abstract actions can ease the scaling problem is by allowing the agent to learn more efficiently from its experiences. Without temporal abstraction, learning to follow the trajectory illustrated in figure 16.2a would involve adjusting parameters at seven separate decision points. With predefined options (figure 16.2c), policy learning is required at only two decision points, the points at which the two options are to be selected. Thus, temporally abstract actions allow the agent not only to explore more efficiently, but also to make better use of its experiences.

Hierarchical Reinforcement Learning

In order to frame specific hypotheses concerning neural correlates of HRL, it is necessary to get into the specifics of how HRL works. In this section we introduce the essentials of HRL, focusing on the options framework106 as adapted to the actor-critic framework.21 We focus on aspects of HRL that we believe are potentially most relevant to neuroscience (see ref. 11 for a detailed and comparative discussion of HRL algorithms).

The options framework supplements the set of single-step, primitive actions with a set of temporally abstract actions or options. An option is, in a sense, a “mini-policy.” It is defined by an initiation set, indicating the states in which the option can be selected; a termination function, which specifies a set of states that will trigger termination of the option; and an option-specific policy, mapping from states to actions (which now include other options).
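A direct, if simplified, transcription of this definition into code might look as follows (field names are ours; the full options framework defines termination as a probability over states, which we reduce here to a simple predicate):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

State = str
Action = str  # an "action" here may itself be the name of another option

@dataclass
class Option:
    """A temporally abstract action in the options framework (illustrative sketch)."""
    name: str
    initiation_set: Set[State]              # states in which the option can be selected
    terminates_in: Callable[[State], bool]  # termination function over states
    policy: Dict[State, Action]             # option-specific mapping from states to actions

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set
```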

Like primitive actions, options are associated with strengths, and on any time step the actor may select either a primitive action or an option. Once an option is selected, actions are selected based on that option’s policy until the option terminates. At that point, a prediction error for the option is computed (figure 16.2d). This error is defined as the difference between the value of the state where the option terminated and the value of the state where the option was initiated, plus whatever rewards were accrued during execution of the option. A positive prediction error indicates that things went better than expected since leaving the initiation state, and a negative prediction error means that things went worse. As in the case of primitive actions, the prediction error is used to update the value associated with the initiation state, as well as the action strength associating the option with that state.
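Stated in symbols (our notation), if an option is initiated in state s, runs for τ steps while collecting rewards r_1, …, r_τ, and terminates in state s′, the option-level prediction error, with temporal discounting, is:

```latex
\delta = \sum_{k=1}^{\tau} \gamma^{\,k-1} r_k + \gamma^{\tau} V(s') - V(s)
```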

Implementing this new functionality requires several extensions to the actor-critic architecture, as illustrated in figure 16.1b. First, the actor must maintain a representation of which option is currently in control of behavior (o) or, in the case of options calling other options, of the entire set of active options and their calling relations. Second, because the agent’s policy now varies depending on which option is in control, the actor must maintain a separate set of action strengths for each option, πo(s), together with option-dependent reward functions, Ro(s), and value functions, Vo(s). Important changes are also required in the critic. Because prediction errors are computed when options terminate, the critic must receive input from the actor, telling it when such terminations occur (the arrow from o to δ). Finally, to be able to compute the prediction error at these points, the critic must also keep track of the amount of reward accumulated during each option’s execution and the identity of the state in which the option was initiated.
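This extra bookkeeping can be sketched as a pair of hypothetical structures of our own (not an implementation from the chapter): the actor tracks a stack of active options, and for each active option the critic records what it will need in order to compute the prediction error at termination.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ActiveOption:
    """Per-option bookkeeping needed by the HRL critic (illustrative sketch)."""
    name: str
    initiation_state: str            # state in which the option was selected
    accumulated_reward: float = 0.0  # reward accrued so far during execution
    elapsed_steps: int = 0           # needed for temporal discounting at termination

@dataclass
class HRLControlState:
    """The actor's extra state under HRL: the stack of currently active options."""
    active_options: List[ActiveOption] = field(default_factory=list)

    @property
    def controlling_option(self) -> Optional[ActiveOption]:
        # the innermost (most recently selected) option currently guides behavior
        return self.active_options[-1] if self.active_options else None
```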

Learning Option Policies

The description provided so far explains how the agent learns a top- or root-level policy, which determines what action or option to select when no option is currently in control of behavior. We turn now to the question of how option-specific policies are learned.

In versions of the options framework that address such learning, it is often assumed that options are initially defined in terms of specific subgoal states. The question of where these subgoals come from is an important one, to which we will return later. It is further assumed that when an active option reaches its subgoal, the actions leading up to the subgoal are reinforced. To distinguish this reinforcing effect from the one associated with external rewards, subgoal attainment is said to yield pseudo-reward.34

In order for subgoals and pseudo-reward to shape option policies, the critic in HRL must maintain not only its usual value function, but also a set of option-specific value functions, Vo(s) (see figure 16.1b). As in ordinary RL, these value functions predict the cumulative long-term reward that will be received subsequent to occupation of a particular state. However, they are option-specific in the sense that they take into account the pseudo-reward that is associated with each option’s subgoal state. A second reason that option-specific value functions are needed is that the reward (and pseudo-reward) that the agent will receive following any given state depends on the actions it will select. These depend, by definition, on the agent’s policy, and under HRL the policy depends on which option is currently in control of behavior. Thus, only an option-specific value function can accurately predict future rewards.

Despite the additions discussed here, option-specific policies are learned in quite the usual way: On each step of an option’s execution, a prediction error is computed based on the (option-specific) values of the states visited and the reward received (including pseudo-reward). This prediction error is then used to update the option’s action strengths and the values attached to each state visited during the option (figure 16.2d). With repeated cycles through this procedure, the option’s policy evolves so as to guide behavior, with increasing directness, toward the option’s subgoals.
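A single step of this option-internal learning could be sketched as follows (our names and simplifications: one learning rate, pseudo-reward delivered only at the subgoal, and option-specific tables V_o and strengths_o):

```python
def option_step_update(V_o, strengths_o, s, a, s_next, reward, pseudo_reward,
                       subgoal_reached, alpha=0.1, gamma=0.95):
    """One intra-option learning step (illustrative sketch).

    V_o and strengths_o are the option-specific value function and action
    strengths; pseudo_reward is nonzero only when the option's subgoal is reached.
    """
    target = reward + pseudo_reward + (0.0 if subgoal_reached else gamma * V_o[s_next])
    delta = target - V_o[s]             # prediction error, pseudo-reward included
    V_o[s] += alpha * delta             # update option-specific state value
    strengths_o[s][a] += alpha * delta  # update option-specific action strength
    return delta
```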

Potential Neural Correlates

Having laid out the basic mechanisms of HRL, we are now in a position to consider its potential implications for understanding neural function. To make these concrete, we will leverage the actor-critic formulation of HRL21 presented earlier. As previously noted, existing research has proposed parallels between the elements of the actor-critic framework and specific neuroanatomical structures. Situating HRL within the actor-critic framework thus facilitates the formation of hypotheses concerning how HRL might map onto functional neuroanatomy.

As figure 16.1 makes evident, elaborating the actor-critic architecture for HRL requires only insertion of a very few new elements. The most obvious of these is the component labeled “o” in figure 16.1b. As established previously, the role of this component is to represent the identity of the option currently in control of behavior. From a neuroscientific point of view, this function seems very closely related to those commonly ascribed to the dorsolateral prefrontal cortex (DLPFC). The DLPFC has long been considered to house representations that guide temporally integrated, goal-directed behavior.38,40,41,78,96,114 Recent work has refined this idea by demonstrating that DLPFC neurons play a direct role in representing task sets. Here, a single pattern of DLPFC activation serves to represent an entire mapping from stimuli to responses, that is, a policy.5,23,50,53,85,99,108,110 According to the guided activation theory,63 prefrontal representations do not implement policies directly, but instead select among stimulus-response pathways implemented outside the prefrontal cortex. This division of labor fits well with the distinction in HRL between an option’s identifier and the policy with which it is associated.

There is evidence that, in addition to the DLPFC, other frontal areas may also carry representations of task set, including presupplementary motor area (pre-SMA)86 and premotor cortex (PMC).69,109 Furthermore, like options in HRL, neurons in several frontal areas including DLPFC, pre-SMA, and supplementary motor area (SMA) have been shown to code for particular sequences of low-level actions.6,17,97,98 Research on frontal cortex also accords well with the stipulation in HRL that temporally abstract actions may organize into hierarchies, with the policy for one option (say, an option for making coffee) calling other, lower-level options (say, options for adding sugar or cream). This fits with numerous accounts suggesting that the frontal cortex serves to represent action at multiple, nested levels of temporal structure,41,102,114,118 possibly in such a way that higher levels of structure are represented more anteriorly.20,39,40,46,57

As reviewed earlier, neuroscientific interpretations of the basic actor-critic architecture generally place policy representations within the DLS. It is thus relevant that such regions as the DLPFC, SMA, pre-SMA, and PMC—areas potentially representing options—all project heavily to the DLS.4,76 Frank, O’Reilly, and colleagues36,74,85 (see also chapter 17, this volume) have put forth detailed computational models that show how frontal inputs to the striatum could switch among different stimulus-response pathways. Here, as in guided activation theory, temporally abstract action representations in frontal cortex select among alternative (i.e., option-specific) policies.

In order to support option-specific policies, the DLS would need to integrate information about the currently controlling option with information about the current environmental state, as is indicated by the arrows converging on the policy module in figure 16.1b. This is consistent with neurophysiological data showing that some DLS neurons respond to stimuli in a way that varies with task context.81,88 Other studies have shown that action representations within the DLS can also be task dependent.2,42,43,59 For example, in rats, different DLS neurons fire in conjunction with simple grooming movements, depending on whether those actions are performed in isolation or as part of a grooming sequence.1 This is consistent with the idea that option-specific policies (action strengths) might be implemented in the DLS, since this would imply that a particular motor behavior, when performed in different task contexts, would be selected via different neural pathways.

Unlike the selection of primitive actions, the selection of options in HRL involves initiation, maintenance, and termination phases. At the neural level, the maintenance phase would be naturally supported within DLPFC, which has been extensively implicated in working memory function.27,28,80 With regard to initiation and termination, it is intriguing that phasic activity has been observed, both within the DLS and in several areas of frontal cortex, at the boundaries of temporally extended action sequences.37,68,116 Since these boundaries correspond to points where new options would be selected, boundary-aligned activity in the DLS and frontal cortex is also consistent with a proposed role of the DLS in gating information into prefrontal working memory circuits.74,85

The points considered so far all relate to control, that is, the guidance of action selection. Also critical to HRL is the machinery that drives learning, centering on the temporal-difference prediction error. Here, too, HRL gives us some very specific things to look for in terms of neural correlates. In particular, moving from RL to HRL brings about important alterations in the way the prediction error is computed. One important change is that HRL widens the scope of the events that the prediction error addresses. In standard RL, the prediction error indicates whether things went better or worse than expected since the immediately preceding single-step action. HRL, in addition, evaluates at the completion of an option whether things have gone better or worse than expected since the option was initiated (see figure 16.2d). Thus, unlike standard RL, the prediction errors associated with options in HRL are framed around temporally extended events. Formally speaking, the HRL setting is no longer a Markov decision process, but rather a semi-Markov decision process (SMDP).

The widened scope of the prediction error computation in HRL resonates with work on midbrain DA function. In particular, Daw29 suggested, based on midbrain responses to delayed rewards, that dopaminergic function is driven by representations that divide event sequences into temporally extended segments. In articulating this account, Daw provided a formal analysis of DA function that draws on precisely the same principles of temporal abstraction that also provide the foundation for HRL, namely, an SMDP framework.

Note that in HRL, in order to compute a prediction error when an option terminates, certain information is needed. In particular, the critic needs access to the reward prediction it made when the option was initially selected, and for purposes of temporal discounting it also needs to know how much time has passed since that prediction was made. These requirements of HRL resonate with data concerning the orbitofrontal cortex (OFC). Neurophysiologic data have shown that within OFC, unlike some other areas, reward-predictive activity tends to be sustained, spanning temporally extended segments of task structure.94 In addition, in line with the integration of reward and delay information in HRL, the response of OFC neurons to the receipt of primary rewards has been shown to vary depending on the wait time leading up to the reward.83

Another difference between HRL and ordinary temporal-difference learning is that prediction errors in HRL occur at all levels of task structure (see figure 16.2d). At the top-most or root level, prediction errors signal unanticipated changes in the prospects for primary reward. However, in addition, once the HRL agent enters a subroutine, separate prediction error signals indicate the degree to which each action has carried the agent toward the currently relevant subgoal and its associated pseudo-reward. Note that these subroutine-specific prediction errors are unique to HRL. In what follows, we refer to them as pseudo-reward prediction errors (PPE), reserving reward prediction error (RPE) for prediction errors relating to primary reward.

Because the PPE is not found in ordinary RL, it can be considered a functional signature of HRL. If the neural mechanisms underlying hierarchical behavior are related to those found in HRL, it should be possible to uncover a neural correlate of the PPE. On grounds of parsimony, one would expect to find PPE signals in the same structures that have been shown to carry RPE-related signals, in particular targets of midbrain dopaminergic projections including VS45,71,75 and anterior cingulate cortex,48,49 as well as other structures including the habenula62,87,107 and amygdala.22,115 Unlike some of the other predictions from HRL that we have discussed, for which at least circumstantial evidence can be drawn from the literature, we are aware of no previous work that sheds light on whether anything like the PPE is computed in the brain. Given this, we undertook a set of experiments assaying for neural correlates of the PPE. In the following section, we present an overview of these experiments and their results. (For full details, see ref. 82.)

Testing for the Pseudo-Reward Prediction Error

To ground the distinction between the RPE and PPE, and to set the scene for our experiments, consider the video game illustrated in figure 16.3, which is based on a benchmark task from the computational HRL literature.33 Only the icon elements in the figure (truck, house, and package) appear in the task display. The overall objective of the game is to complete a “delivery” as quickly as possible, using joystick movements to guide the truck first to the package and from there to the house. The task has a transparent hierarchical structure, with delivery serving as the (externally rewarded) top-level goal and acquisition of the package as an obvious subgoal. For an HRL agent, delivery would be associated with primary reward, and acquisition of the package with pseudo-reward.

Consider now a version of the task in which the package sometimes unexpectedly jumps to a new location before the truck reaches it. According to RL, a jump to point A in the figure, or any location within the ellipse shown, should trigger a positive RPE, because the total distance that must be covered to deliver the package has decreased. We assume temporal discounting or effort costs that imply attaining (p.297)


Figure 16.3 Left: Task display and geometry for the delivery task. Right: Prediction errors elicited by the four jump destinations in the task display. + and – indicate positive and negative prediction errors, respectively.

We assume temporal discounting or effort costs that imply that attaining the goal faster is more rewarding. This was enforced by making each movement of the truck effortful. By the same token, a jump to point B (or any other exterior point) should trigger a negative RPE. Cases C through E are quite different. Here, there is no change in the overall distance to the goal, and so no RPE should be triggered. However, in case C, the distance to the subgoal has decreased. According to HRL, a jump to this location should thus trigger a positive PPE. Similarly, a jump to location D should trigger a negative PPE. (Note that location E is special, being the only location that should trigger neither an RPE nor a PPE.)
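The geometry behind these predictions can be made concrete with a short sketch (our construction, using straight-line distances and treating each predicted error as proportional to the change in remaining distance):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def jump_prediction_errors(truck, house, old_package, new_package):
    """Predicted signs of the RPE and PPE when the package jumps (illustrative sketch).

    The top-level (reward) prediction tracks the total remaining distance,
    truck -> package -> house; the subgoal-level (pseudo-reward) prediction
    tracks only the truck -> package leg. With effort or discounting costs,
    less remaining distance means things are better than expected.
    """
    old_total = dist(truck, old_package) + dist(old_package, house)
    new_total = dist(truck, new_package) + dist(new_package, house)
    old_leg = dist(truck, old_package)
    new_leg = dist(truck, new_package)
    rpe = old_total - new_total   # > 0 inside the ellipse (case A), < 0 outside it (case B)
    ppe = old_leg - new_leg       # > 0 when the subgoal moves closer (C), < 0 when farther (D)
    return rpe, ppe
```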

These points translate directly into neuroscientific predictions. As noted earlier, previous research has revealed neural correlates of the RPE in numerous structures.15,22,45,48,49,73,75,107,115 HRL predicts that neural correlates should also exist for the PPE. To test this, we had normal undergraduate participants perform the delivery task from figure 16.3 while undergoing electroencephalography (EEG) and, in two further experiments, functional magnetic resonance imaging (fMRI).

In our first experiment, participants performed the delivery task while undergoing EEG recording. Over the course of the recording session, one third of trials involved a jump event of type D from figure 16.3; these events were intended to elicit a negative pseudo-reward prediction error. Earlier EEG research indicates that ordinary negative reward prediction errors trigger a midline negativity, commonly referred to as the feedback error-related negativity, or fERN48,49,65 (see also chapters 17 and 18, this volume). Based on HRL, we predicted that a similar negativity would occur following the critical changes in pseudo-reward. To provide a baseline for comparison, another third of trials included jump events of type E (following figure 16.3).



Figure 16.4 (a) Evoked potentials at electrode Cz, aligned to jump events. D and E refer to jump destinations in figure 16.3. The data series labeled D-E shows the difference between curves D and E, isolating the pseudo-reward prediction error effect. (b) Regions displaying a positive correlation with the pseudo-reward prediction error (independent of subgoal displacement per se) in the first fMRI experiment. The [x y z] coordinates (Talairach space) of peak statistical significance are, for dorsal anterior cingulate cortex, [0 9 39]; left anterior insula, [–45 9 –3]; right anterior insula, [45 12 0]; and lingual gyrus, [0 –66 0].

Stimulus-aligned EEG averages indicated that class-D jump events, which should induce negative pseudo-reward prediction errors, triggered a phasic negativity in the EEG as shown in figure 16.4a. Like the fERN, this negativity was largest in the midline leads, and the time course was consistent with the fERN, as observed in studies where information about the outcome and the appropriate stimulus-response mapping are shown simultaneously.8

In a second experiment, we examined neural correlates of the PPE using fMRI. A new group of normal participants underwent fMRI while performing a slightly different version of the delivery task. The task was again designed to elicit negative pseudo-reward prediction errors. As in the EEG experiment, one third of trials included a jump of type D (as in figure 16.3) and another third included a jump of type E. In contrast to the EEG task, the increase in subgoal distance in this experiment varied in size across trials. By this means, type D jumps were intended to induce PPEs that varied in magnitude. Our analyses tested for regions that showed phasic activation correlating with predicted PPE size.

A whole-brain general linear model analysis revealed such a correlation, negative in sign, in the dorsal anterior cingulate cortex (ACC; figure 16.4b). This region is believed to contain the generator of the fERN,48 and the fMRI result is thus consistent with the result of our EEG experiment. The same parametric fMRI effect was also observed bilaterally in the anterior insula, a region often coactivated with ACC in the setting of unanticipated negative events.79 The only other region displaying the same effect was a small focus within the lingual gyrus.

A set of region-of-interest (ROI) analyses focused on additional neural structures that, like the ACC, were previously proposed to encode negative reward prediction errors: the habenular complex,87,107 nucleus accumbens,95 and amygdala.22,115 The habenular complex was found to display greater activity following type D than type E jumps, consistent with the idea that this structure is also engaged by negative pseudo-reward prediction errors. A comparable effect was also observed in the right, though not the left, amygdala. In the nucleus accumbens (NAcc), where some studies have observed deactivation accompanying negative reward prediction errors,56 no significant pseudo-reward prediction error effect was observed in this first fMRI study. However, it should be noted that NAcc deactivation with negative reward prediction errors has been an inconsistent finding in previous work.24,72 More robust is the association between NAcc activation and positive reward prediction errors.15,45,71,75 With this in mind, we ran a second, smaller fMRI study, using an NAcc region of interest, to test for activation associated with positive pseudo-reward prediction errors. Fourteen participants performed the delivery task, with jumps of type C (in figure 16.3) occurring on one third of trials, and jumps of type E on another third. As described earlier, a positive pseudo-reward prediction error is predicted to occur in association with type C jumps, and in this setting significant activation was observed in the right NAcc, scaling with predicted pseudo-reward prediction error magnitude.

Directions for Further Investigation

Our initial experiments, together with the evidence we have pieced together from the existing literature, suggest that HRL may provide a useful framework for investigating the neural basis of hierarchical behavior. As detailed in our recent work,21 HRL also gives rise to further testable predictions, each of which presents an opportunity for further research.

To take just one example, HRL predicts that neural correlates should exist for option-specific state-value representations. As explained earlier, in addition to the top-level state-value function, the critic in HRL must also maintain a set of option-specific value functions. This is because the value function indicates how well things are expected to go following arrival at a given state, which obviously depends on which actions the agent will select. Under HRL, the option that is currently in control of behavior determines action selection, and also determines which actions will yield pseudo-reward. Thus, whenever an option is guiding behavior, the value attached to a state must take the identity of that option into account. If there is a neural structure that computes something like option-specific state values, this structure would be expected to communicate closely with the VS, the region typically identified with the locus of state or state-action values in RL. However, the structure would also be expected to receive inputs from the portions of frontal cortex that we have identified as representing options. One brain region that meets both of these criteria is the orbitofrontal cortex (OFC), an area that has strong connections with both VS and DLPFC.3,84 The idea that the OFC might participate in computing option-specific state values also fits well with the behavior of individual neurons within this cortical region. OFC neurons have been extensively implicated in representing the reward value associated with environmental states.84,94 However, other data suggest that OFC neurons can also be sensitive to shifts in response policy or task set.70 Critically, OFC representations of event value have been observed to change in parallel with shifts in strategy,92 a finding that fits precisely with the idea that the OFC might represent option-specific state values. Although these findings are consistent with an HRL interpretation, further research on OFC directly guided by the relevant predictions from HRL could be quite informative.

Discovering Hierarchical Structure

Another avenue for further research arises from a question we have so far avoided: Where do options come from? Throughout this chapter, we have assumed that the HRL agent simply has a toolbox full of options available for selection. The same assumption is adopted in much purely computational work in HRL. The question inevitably arises, however, of how this toolbox of options is initially assembled. This question, sometimes referred to as the option discovery problem, is obviously relevant to human learning.21 Indeed, influential behavioral work has characterized childhood development as involving a process of building up a hierarchical set of skills.35,111 Characterizing this building-up process in specific computational terms turns out to be a challenging task. Indeed, option discovery stands as an open problem in computational HRL.12,101

One interesting recent machine-learning proposal for how an HRL agent might discover useful options centers on the notion of bottleneck states.100 These are states that give access to a wide range of subsequent states. As an example, consider a stairwell connecting two floors in a house. To reach any location on one floor from any location on the other, one must pass through the stairwell. The stairwell is, in this sense, a bottleneck location in the house. Both intuitively and formally, such bottleneck locations make good subgoals, around which useful subroutines can be built.

An explicit definition of what makes a state a bottleneck can be derived from graph theory, where the property of being a bottleneck state corresponds to a measure called betweenness centrality.100


Figure 16.5 Betweenness for states of a graph. Note how state E is a clear bottleneck for transitioning from states A–D to F–I. This is reflected in the value of betweenness, which singles out this state as a useful subgoal.

Figure 16.5 shows a graph with an obvious bottleneck, with each node labeled with its corresponding betweenness value. If this graph represented a behavioral domain, and an agent wanted to carve this domain “at its joints” by identifying useful subgoal states, the node at the center of the graph would make a good candidate, a point that simulation work has borne out.100
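For concreteness, a few lines of Python using the networkx library illustrate the idea on a toy graph of our own construction (the exact layout of figure 16.5 is not reproduced here, only its bottleneck structure):

```python
import itertools
import networkx as nx

# Two fully connected clusters, A-D and F-I, joined only through E
# (a layout of our own, consistent with the description of figure 16.5).
left, right = ["A", "B", "C", "D"], ["F", "G", "H", "I"]
G = nx.Graph()
G.add_edges_from(itertools.combinations(left, 2))
G.add_edges_from(itertools.combinations(right, 2))
G.add_edges_from(("E", node) for node in left + right)

centrality = nx.betweenness_centrality(G)            # fraction of shortest paths through each node
best_subgoal = max(centrality, key=centrality.get)   # "E", the bottleneck, scores highest
print(best_subgoal, round(centrality[best_subgoal], 3))
```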

One may well ask, however, whether human learners identify bottleneck states, and if so whether they use such states as a basis for parsing tasks. Some affirmation of this is provided by Cordova and colleagues.26 In this study, participants were presented with a set of “landmarks” (e.g., post office, school), and learned their adjacency relations within a fictive town. These adjacencies were based on simple graphs like the one in figure 16.5, each of which included a clear bottleneck (although the graph itself was never shown to the participants). Once the adjacency relations among the town’s landmarks had been learned, participants were then asked to make “deliveries” in the town, navigating each time from a specified point of origin to a specified goal, and receiving a reward that varied inversely with the number of steps taken to complete the delivery. Before beginning these deliveries, however, the participant was asked to select one landmark as a location for a “bus stop,” understanding that he or she could “jump” to this location from any other during the deliveries, potentially reducing the number of steps taken. Without knowledge of the specific upcoming delivery assignments, the optimal choice for the bus stop location corresponds to the bottleneck location, and participants overwhelmingly selected this location.

In a related experiment, we have shown that human learners not only identify bottleneck states, but also use these as a basis for segmentation.90 Here, a distinctive visual stimulus was assigned to each vertex of a bottleneck graph (which was itself not shown), and participants viewed these stimuli in sequences generated based on a random walk through the underlying graph. Subjects were then asked to parse a further set of stimulus sequences, generated in the same way, pressing a button when they perceived a transition between subsequences. Participants showed a significant tendency to parse at junctures where the sequence traversed a bottleneck in the underlying graph, consistent with the idea that events or tasks can be decomposed based on an analysis of their underlying topological structure. Together, these initial findings suggest that ideas from computational HRL may help answer the question of how humans discover hierarchical structure in the task environment, developing action hierarchies tailored to this structure.

Dual Modes of Control

We believe that further HRL-related research, and indeed all research drawing on RL, must attend to an important but often neglected distinction between two forms of learning or decision making. Work on animal and human behavior suggests that instrumental actions arise from two modes of control, one built on established stimulus-response links or “habits,” and the other on prospective planning.9 Recent work has mapped these modes of control onto RL constructs,31 characterizing the former as relying on cached action values or strengths and model-free RL, and the latter as looking ahead based on an internal model relating actions to their likely effects, that is, model-based RL.18 Here, we have cast HRL in terms of the cache-based system, both because this is most representative of existing work on HRL and because the principles of model-based search have not yet been as fully explored, either at the computational level or in terms of neural correlates. However, incorporating temporal abstraction into model-based, prospective control is straightforward. This is accomplished by assuming that each option is associated with an option model, a knowledge structure indicating the ultimate outcomes likely to result from selecting the option, the reward or cost likely to be accrued during its execution, and the amount of time this execution is likely to take.106 Equipped with models of this kind, the agent can use them to look ahead, evaluating potential courses of action. Importantly, the search process can now “skip over” potentially large sequences of primitive actions, effectively reducing the size of the search tree.47,54,61 This kind of saltatory search process seems to fit well with everyday planning, which introspectively seems to operate at the level of temporally abstract actions (“Perhaps I should buy one of those new cell phones. … Well, that would cost me a few hundred dollars. … But if I bought one, I could use it to check my email …”). The idea of action models, in general, also fits well with work on motor control, which strongly suggests the involvement of predictive models in guiding bodily movements.113 Because option models encode the consequences of interventions, it is interesting to note that recent neuroimaging work has mapped representations of action outcome information in part to prefrontal cortex,44 a region whose potential links with HRL we have already considered. Further investigating the potential relevance of model-based HRL to human planning and decision making offers an inviting area for further research.
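In the spirit of the sketches above, an option model can be thought of as nothing more than a predictive summary of an option, which a planner can apply to jump directly from the option's initiation state to its likely termination states (all field names below are ours):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class OptionModel:
    """Predictive summary of an option for model-based planning (illustrative sketch)."""
    outcome_probabilities: Dict[str, float]  # likely termination states and their probabilities
    expected_reward: float                   # reward or cost expected to accrue during execution
    expected_duration: float                 # expected number of primitive steps, for discounting
```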

Conclusion

Computational RL has proved extremely useful in research on behavior and brain function. Our aim here has been to explore whether HRL might prove similarly applicable. An initial motivation for considering this question derives from the fact that HRL addresses an inherent limitation of RL, the scaling problem, which would clearly be relevant to any organism relying on RL-like learning mechanisms. Implementing HRL along the lines of the actor-critic framework, thereby bringing it into alignment with existing mappings between RL and neuroscience, reveals direct parallels between components of HRL and specific functional neuroanatomic structures, including the DLPFC and OFC. HRL suggests new ways of interpreting neural activity in these as well as several other regions. We have reported results from initial experiments prospectively testing predictions from HRL, which provide evidence for a novel form of prediction-error signal. All things considered, HRL appears to offer a potentially useful set of tools for further investigating the computational and neural basis of hierarchically structured behavior.

Acknowledgments

The present work was completed with support from Fundação para a Ciência e Tecnologia (SFRH/BD/33273/2007, J. R-F.), the National Institute of Mental Health (P50 MH062196, M.M.B.), and the James S. McDonnell Foundation (M.M.B.).

Outstanding Questions

  • How is hierarchically structured behavior represented at the neural level? Are levels of behavioral structure represented discretely, maximizing compositional flexibility, or in a continuous distributed fashion, maximizing generalization and information sharing? Does the answer to this question differ across decision-making systems (habitual versus goal-directed, for example)?

  • How are internal representations of hierarchical behavior learned? How does the brain establish and refine a toolbox of subroutines, skills, subtasks, or options, which can be exploited across a wide range of potential future activities? How are useful subgoals discovered, when they are not associated with primary reward?

  • Given a set of options, how does learning discover the particular combinations and sequences that solve new task challenges?

  • Does the framework of computational reinforcement learning provide a useful heuristic for pursuing the answers to the above questions? In particular, does the resemblance between dopaminergic function and temporal-difference learning extend to the hierarchical case, and does recent evidence concerning hierarchical representation in prefrontal cortex bear any logical relation to hierarchical representation in HRL?

Further Reading

1. Badre D. 2008. Cognitive control, hierarchy, and the rostro-caudal organization of the frontal lobes. Trends Cogn Sci 12: 193–200. An excellent brief review of empirical findings concerning hierarchical representation in prefrontal cortex.

2. Badre D, Frank MJ. in press. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 2: Evidence from fMRI. Cereb Cortex. Frank MJ, Badre D. in press. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cereb Cortex. Companion empirical and theoretical papers, voicing a different but not incompatible perspective on hierarchical learning mechanisms.

3. Barto AG, Mahadevan S. 2003. Recent advances in hierarchical reinforcement learning. Discret Event Dyn Syst 13: 41–77. A review of the HRL framework and its various implementations, from a machine-learning perspective.

4. Botvinick MM. 2008. Hierarchical models of behavior and prefrontal function. Trends Cogn Sci 12: 201–208. An overview of computational models that have addressed hierarchical action and its neural underpinnings.

5. Botvinick MM, Niv Y, Barto AC. 2009. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113: 262–280. An introduction to HRL and its potential neural correlates.

6. Ribas-Fernandes J, Solway A, Diuk C, Barto AG, Niv Y, Botvinick M. in press. A neural signature of hierarchical reinforcement learning. Neuron. A set of neuroimaging experiments testing predictions from HRL.

References

Bibliography references:

1. Aldridge JW, Berridge KC. 1998. Coding of serial order by neostriatal neurons: a “natural action” approach to movement sequence. J Neurosci 18: 2777–2787.

2. Aldridge JW, Berridge KC, Rosen AR. 2004. Basal ganglia neural mechanisms of natural movement sequences. Can J Physiol Pharmacol 82: 732–739.

3. Alexander GE, Crutcher MD, DeLong MR. 1990. Basal ganglia-thalamocortical circuits: parallel substrates for motor, oculomotor, “prefrontal” and “limbic” functions. Prog Brain Res 85: 119–146.

4. Alexander GE, DeLong MR, Strick PL. 1986. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu Rev Neurosci 9: 357–381.

5. Asaad WF, Rainer G, Miller EK. 2000. Task-specific neural activity in the primate prefrontal cortex. J Neurophysiol 84: 451–459.

6. Averbeck BB, Lee D. 2007. Prefrontal neural correlates of memory for sequences. J Neurosci 27: 2204–2211.

7. Badre D. 2008. Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes. Trends Cogn Sci 12: 193–200.

8. Baker TE, Holroyd CB. in press. Dissociated roles of the anterior cingulate cortex in reward and conflict processing as revealed by the feedback error-related negativity and N200. Biol Psychol.

9. Balleine BW, Dickinson A. 1998. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37: 407–419.

10. Barto AG. 1995. Adaptive critics and the basal ganglia. In: Models of Information Processing in the Basal Ganglia (Houck JC, Davis J, Beiser D, eds), pp 215–232. Cambridge, MA: MIT Press.

11. Barto AG, Mahadevan S. 2003. Recent advances in hierarchical reinforcement learning. Discret Event Dyn Syst: Theory Appl 13: 41–77.

12. Barto AG, Singh S, Chentanez N. 2004. Intrinsically motivated learning of hierarchical collections of skills. In: Proceedings of the 3rd International Conference on Development and Learning.

13. Barto AG, Sutton RS. 1981. Toward a modern theory of adaptive networks: expectation and prediction. Psychol Rev 88: 135–170.

14. Barto AG, Sutton RS, Anderson CW. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern 13: 834–846.

15. Berns GS, McClure SM, Pagnoni G, Montague PR. 2001. Predictability modulates human brain response to reward. J Neurosci 21: 2793–2798.

16. Bertsekas DP, Tsitsiklis JN. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

17. Bor D, Duncan J, Wiseman RJ, Owen AM. 2003. Encoding strategies dissociate prefrontal activity from working memory demand. Neuron 27: 361–367.

18. Botvinick M, An J. 2009. Goal-directed decision making in prefrontal cortex: a computational framework. In: Advances in Neural Information Processing Systems 21 (Koller D, Schuurmans D, Bengio Y, Bottou L, eds). Cambridge, MA: MIT Press.

19. Botvinick M, Plaut DC. 2004. Doing without schema hierarchies: a recurrent connectionist approach to normal and impaired routine sequential action. Psychol Rev 111: 395–429.

20. Botvinick MM. 2008. Hierarchical models of behavior and prefrontal function. Trends Cogn Sci 12: 201–208.

21. Botvinick MM, Niv Y, Barto AC. 2009. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113: 262–280.

22. Breiter HC, Aharon I, Kahneman D, Dale A, Shizgal P. 2001. Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron 30: 619–639.

23. Bunge SA. 2004. How we use rules to select actions: a review of evidence from cognitive neuroscience. Cogn Affect Behav Neurosci 4: 564–579.

24. Cooper JC, Knutson B. 2008. Valence and salience contribute to nucleus accumbens activation. Neuroimage 39: 538–547.

25. Cooper RP, Shallice T. 2006. Hierarchical schemas and goals in the control of sequential behavior. Psychol Rev 113: 887–916.

26. Cordova N, Diuk C, Niv Y, Botvinick MM (submitted). Discovering hierarchical task structure.

27. Courtney SM, Roth JK, Sala JB. 2007. A hierarchical biased-competition model of domain-dependent working memory maintenance and executive control. In: Working Memory: Behavioural and Neural Correlates (Osaka N, Logie R, D’Esposito M, eds). Oxford: Oxford University Press.

28. D’Esposito M. 2007. From cognitive to neural models of working memory. Philos Trans R Soc Lond B Biol Sci 362: 761–772.

29. Daw ND, Courville AC, Touretzky DS. 2003. Timing and partial observability in the dopamine system. In: Advances in Neural Information Processing Systems 15 (Becker S, Thrun S, Obermayer K, eds), pp 99–106. Cambridge, MA: MIT Press.

30. Daw ND, Frank MJ. 2009. Reinforcement learning and higher level cognition: introduction to special issue. Cognition 113: 259–261.

31. Daw ND, Niv Y, Dayan P. 2005. Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nat Neurosci 8: 1704–1711.

32. Daw ND, Niv Y, Dayan P. 2006. Actions, policies, values and the basal ganglia. In: Recent Breakthroughs in Basal Ganglia Research (Bezard E, ed), pp 111–130. New York: Nova Science Publishers.

(p.306) 33. Dietterich TG. 1998. The MAXQ method for hierarchical reinforcement learning. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp 118–126. San Francisco: Morgan Kaufmann Publishers.

34. Dietterich TG. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. J Artif Intell Res 13: 227–303.

35. Fischer KW. 1980. A theory of cognitive development: the control and construction of hierarchies of skills. Psychol Rev 87: 477–531.

36. Frank MJ, Claus ED. 2006. Anatomy of a decision: striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychol Rev 113: 300–326.

37. Fujii N, Graybiel AM. 2003. Representation of action sequence boundaries by macaque prefrontal cortical neurons. Science 301: 1246–1249.

38. Fuster JM. 1997. The Prefrontal Cortex: Anatomy, Physiology, and Neuropsychology of the Frontal Lobe. Philadelphia: Lippincott-Raven.

39. Fuster JM. 2001. The prefrontal cortex—an update: time is of the essence. Neuron 30: 319–333.

40. Fuster JM. 2004. Upper processing stages of the perception-action cycle. Trends Cogn Sci 8: 143–145.

41. Grafman J. 2002. The human prefrontal cortex has evolved to represent components of structured event complexes. In: Handbook of Neuropsychology (Grafman J, ed), pp 157–174. Amsterdam: Elsevier.

42. Graybiel AM. 1995. Building action repertoires: memory and learning functions of the basal ganglia. Curr Opin Neurobiol 5: 733–741.

43. Graybiel AM. 1998. The basal ganglia and chunking of action repertoires. Neurobiol Learn Mem 70: 119–136.

44. Hamilton AFdeC, Grafton ST. 2008. Action outcomes are represented in human inferior frontoparietal cortex. Cereb Cortex 18: 1160–1168.

45. Hare TA, O’Doherty J, Camerer CF, Schultz W, Rangel A. 2008. Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors. J Neurosci 28: 5623–5630.

46. Haruno M, Kawato M. 2006. Heterarchical reinforcement-learning model for integration of multiple cortico-striatal loops: fMRI examination in stimulus-action-reward association learning. Neural Netw 19: 1242–1254.

47. Hayes-Roth B, Hayes-Roth F. 1979. A cognitive model of planning. Cogn Sci 3: 275–310.

48. Holroyd CB, Coles MG. 2002. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychol Rev 109: 679–709.

49. Holroyd CB, Nieuwenhuis S, Yeung N, Cohen JD. 2003. Errors in reward prediction are reflected in the event-related brain potential. Neuroreport 14: 2481–2484.

50. Hoshi E, Shima K, Tanji J. 1998. Task-dependent selectivity of movement-related neuronal activity in the primate prefrontal cortex. J Neurophysiol 80: 3392–3397.

51. Houk JC, Adams CM, Barto AG. 1995. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In: Models of Information Processing in the Basal Ganglia (Houk JC, Davis JL, Beiser DG, eds), pp 249–270. Cambridge, MA: MIT Press.

52. Joel D, Niv Y, Ruppin E. 2002. Actor-critic models of the basal ganglia: new anatomical and computational perspectives. Neural Netw 15: 535–547.

53. Johnston K, Everling S. 2006. Neural activity in monkey prefrontal cortex is modulated by task context and behavioral instruction during delayed-match-to-sample and conditional prosaccade–antisaccade tasks. J Cogn Neurosci 18: 749–765.

54. Kambhampati S, Mali AD, Srivastava B. 1998. Hybrid planning for partially hierarchical domains. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp 882–888. Madison, WI: AAAI Press.

(p.307) 55. Kearns M, Singh S. 2002. Near-optimal reinforcement learning in polynomial time. Mach Learn 49: 209–232.

56. Knutson B, Taylor J, Kaufman M, Petersen R, Glover G. 2005. Distributed neural representation of expected value. J Neurosci 25: 4806–4812.

57. Koechlin E, Ody C, Kouneiher F. 2003. The architecture of cognitive control in the human prefrontal cortex. Science 302: 1181–1185.

58. Lashley KS. 1951. The problem of serial order in behavior. In: Cerebral Mechanisms in Behavior: The Hixon Symposium (Jeffress LA, ed), pp 112–136. New York: Wiley.

59. Lee IH, Seitz AR, Assad JA. 2006. Activity of tonically active neurons in the monkey putamen during initiation and withholding of movement. J Neurophysiol 95: 2391–2403.

60. Li L, Walsh TJ, Littman ML. 2006. Towards a unified theory of state abstraction for MDPs. In: Ninth International Symposium on Artificial Intelligence and Mathematics, pp 531–539.

61. Marthi B, Russell SJ, Wolfe J. 2007. Angelic semantics for high-level actions. In: Seventeenth International Conference on Automated Planning and Scheduling (ICAPS 2007).

62. Matsumoto M, Hikosaka O. 2007. Lateral habenula as a source of negative reward signals in dopamine neurons. Nature 447: 1111–1115.

63. Miller EK, Cohen JD. 2001. An integrative theory of prefrontal cortex function. Annu Rev Neurosci 24: 167–202.

64. Miller GA, Galanter E, Pribram KH. 1960. Plans and the Structure of Behavior. New York: Holt, Rinehart & Winston.

65. Miltner WHR, Braun CH, Coles MGH. 1997. Event-related brain potentials following incorrect feedback in a time-estimation task: evidence for a “generic” neural system for error detection. J Cogn Neurosci 9: 788–798.

66. Miyamoto H, Morimoto J, Doya K, Kawato M. 2004. Reinforcement learning with via-point representation. Neural Netw 17: 299–305.

67. Montague P, Dayan P, Sejnowski T. 1996. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16: 1936–1947.

68. Morris G, Arkadir D, Nevet A, Vaadia E, Bergman H. 2004. Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron 43: 133–143.

69. Muhammad R, Wallis JD, Miller EK. 2006. A comparison of abstract rules in the prefrontal cortex, premotor cortex, inferior temporal cortex, and striatum. J Cogn Neurosci 18: 974–989.

70. O’Doherty J, Critchley H, Deichmann R, Dolan RJ. 2003. Dissociating valence of outcome from behavioral control in human orbital and ventral prefrontal cortices. J Neurosci 23: 7931–7939.

71. O’Doherty J, Dayan P, Schultz J, Deichmann R, Friston K, Dolan RJ. 2004. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304: 452–454.

72. O’Doherty JP, Buchanan TW, Seymour B, Dolan R. 2006. Predictive neural coding of reward preference involves dissociable responses in human ventral midbrain and ventral striatum. Neuron 49: 157–166.

73. O’Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. 2003. Temporal difference models and reward-related learning in the human brain. Neuron 38: 329–337.

74. O’Reilly RC, Frank MJ. 2006. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput 18: 283–328.

75. Pagnoni G, Zink CF, Montague PR, Berns GS. 2002. Activity in human ventral striatum locked to errors of reward prediction. Nat Neurosci 5: 97–98.

76. Parent A, Hazrati LN. 1995. Functional anatomy of the basal ganglia. I. The cortico-basal ganglia-thalamo-cortical loop. Brain Res Brain Res Rev 20: 91–127.

77. Parr R, Russell S. 1998. Reinforcement learning with hierarchies of machines. In: Advances in Neural Information Processing Systems 10 (Jordan MI, Kearns MJ, Solla SA, eds), pp 1043–1049. Cambridge, MA: MIT Press.

(p.308) 78. Petrides M. 1995. Impairments on nonspatial self-ordered and externally ordered working memory tasks after lesions to the mid-dorsal part of the lateral frontal cortex in the monkey. J Neurosci 15: 359–375.

79. Phan KL, Wager TD, Taylor SF, Liberzon I. 2004. Functional neuroimaging studies of human emotions. CNS Spectr 9: 258–266.

80. Postle BR. 2006. Working memory as an emergent property of the mind and brain. Neuroscience 139: 23–28.

81. Ravel S, Sardo P, Legallet E, Apicella P. 2006. Influence of spatial information on responses of tonically active neurons in the monkey striatum. J Neurophysiol 95: 2975–2986.

82. Ribas-Fernandes JJF, Solway A, Diuk C, McGuire JT, Barto AG, Niv Y, Botvinick MM (under review). A neural signature of hierarchical reinforcement learning.

83. Roesch MR, Taylor AR, Schoenbaum G. 2006. Encoding of time-discounted rewards in orbitofrontal cortex is independent of value representation. Neuron 51: 509–520.

84. Rolls ET. 2004. The functions of the orbitofrontal cortex. Brain Cogn 55: 11–29.

85. Rougier NP, Noelle DC, Braver TS, Cohen JD, O’Reilly RC. 2005. Prefrontal cortex and flexible cognitive control: rules without symbols. Proc Natl Acad Sci USA 102: 7338–7343.

86. Rushworth MF, Walton ME, Kennerley SW, Bannerman DM. 2004. Action sets and decisions in the medial frontal cortex. Trends Cogn Sci 8: 410–417.

87. Salas R, Baldwin P, de Biasi M, Montague PR. 2010. BOLD responses to negative reward prediction errors in human habenula. Front Hum Neurosci 4: 36.

88. Salinas E. 2004. Fast remapping of sensory stimuli onto motor actions on the basis of contextual modulation. J Neurosci 24: 1113–1118.

89. Schank RC, Abelson RP. 1977. Scripts, Plans, Goals and Understanding. Hillsdale, NJ: Erlbaum.

90. Schapiro A, Rogers T, Botvinick MM. 2010. Beyond uncertainty: behavioral and computational investigations of the structure of event representations. Paper presented at the Annual Meeting of the Cognitive Science Society.

91. Schneider DW, Logan GD. 2006. Hierarchical control of cognitive processes: switching tasks in sequences. J Exp Psychol Gen 135: 623–640.

92. Schoenbaum G, Chiba AA, Gallagher M. 1999. Neural encoding in orbitofrontal cortex and basolateral amygdala during olfactory discrimination learning. J Neurosci 19: 1876–1884.

93. Schultz W, Dayan P, Montague P. 1997. A neural substrate of prediction and reward. Science 275: 1593–1599.

94. Schultz W, Tremblay KL, Hollerman JR. 2000. Reward processing in primate orbitofrontal cortex and basal ganglia. Cereb Cortex 10: 272–283.

95. Seymour B, Daw ND, Dayan P, Singer T, Dolan RJ. 2007. Differential encoding of losses and gains in the human striatum. J Neurosci 27: 4826–4831.

96. Shallice T, Burgess PW. 1991. Deficits in strategy application following frontal lobe damage in man. Brain 114: 727–741.

97. Shima K, Isoda M, Mushiake H, Tanji J. 2007. Categorization of behavioural sequences in the prefrontal cortex. Nature 445: 315–318.

98. Shima K, Tanji J. 2000. Neuronal activity in the supplementary and presupplementary motor areas for temporal organization of multiple movements. J Neurophysiol 84: 2148–2160.

99. Shimamura AP. 2000. The role of the prefrontal cortex in dynamic filtering. Psychobiology 28: 207–218.

100. Şimşek Ö, Wolfe A, Barto A. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In: Proceedings of the 22nd International Conference on Machine Learning, pp 816–823. New York: ACM.

101. Singh S, Barto AG, Chentanez N. 2005. Intrinsically motivated reinforcement learning. In: Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference (Saul LK, Weiss Y, Bottou L, eds), pp 1281–1288. Cambridge, MA: MIT Press.

(p.309) 102. Sirigu A, Zalla T, Pillon B, Dubois B, Grafman J, Agid Y. 1995. Selective impairments in managerial knowledge in patients with prefrontal cortex lesions. Cortex 31: 301–316.

103. Suri RE, Bargas J, Arbib MA. 2001. Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience 103: 65–85.

104. Sutton RS, Barto AG. 1990. Time-derivative models of Pavlovian reinforcement. In: Learning and Computational Neuroscience: Foundations of Adaptive Networks (Gabriel M, Moore J, eds), pp 497–537. Cambridge, MA: MIT Press.

105. Sutton RS, Barto AG. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

106. Sutton RS, Precup D, Singh S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artif Intell 112: 181–211.

107. Ullsperger M, von Cramon DY. 2003. Error monitoring using external feedback: specific roles of the habenular complex, the reward system, and the cingulate motor area revealed by functional magnetic resonance imaging. J Neurosci 23: 4308–4314.

108. Wallis JD, Anderson KC, Miller EK. 2001. Single neurons in prefrontal cortex encode abstract rules. Nature 411: 953–956.

109. Wallis JD, Miller EK. 2003. From rule to response: neuronal processes in the premotor and prefrontal cortex. J Neurophysiol 90: 1790–1806.

110. White IM, Wise SP. 1999. Rule-dependent neuronal activity in the prefrontal cortex. Exp Brain Res 126: 315–335.

111. White RW. 1959. Motivation reconsidered: the concept of competence. Psychol Rev 66: 297–333.

112. Wickens J, Kotter R, Houk JC. 1995. Cellular models of reinforcement. In: Models of Information Processing in the Basal Ganglia (Houk JC, Davis JL, Beiser DG, eds), pp 187–214. Cambridge, MA: MIT Press.

113. Wolpert D, Flanagan J. 2001. Motor prediction. Curr Biol 11: R729–R732.

114. Wood JN, Grafman J. 2003. Human prefrontal cortex: processing and representational perspectives. Nat Rev Neurosci 4: 139–147.

115. Yacubian J, Gläscher J, Schroeder K, Sommer T, Braus DF, Büchel C. 2006. Dissociable systems for gain- and loss-related value predictions and errors of prediction in the human brain. J Neurosci 26: 9530–9537.

116. Zacks JM, Braver TS, Sheridan MA, Donaldson DI, Snyder AZ, Ollinger JM, Buckner RL, Raichle ME. 2001. Human brain activity time-locked to perceptual event boundaries. Nat Neurosci 4: 651–655.

117. Zacks JM, Speer NK, Swallow KM, Braver TS, Reynolds JR. 2007. Event perception: a mind/brain perspective. Psychol Bull 133: 273–293.

118. Zalla T, Pradat-Diehl P, Sirigu A. 2003. Perception of action boundaries in patients with frontal lobe damage. Neuropsychologia 41: 1619–1627.