ucl-dark.github.io/sitedata/papers.yml at master · ucl-dark/ucl-dark.github.io · GitHub

-   UID: samvelyan2024rainbow
    title: "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts"
    authors: Mikayel Samvelyan|Sharath Chandra Raparthy|Andrei Lupu|Eric Hambro|Aram H. Markosyan|Manish Bhatt|Yuning Mao|Minqi Jiang|Jack Parker-Holder|Jakob Foerster|Tim Rocktäschel|Roberta Raileanu
    abstract: As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that prompts generated by Rainbow Teaming are highly transferable and that fine-tuning models with synthetic data generated by our method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.
    keywords: open-endedness|large language models|safety|diversity
    proceedings: NeurIPS
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2402.16822
-   UID: rutherford2024jaxmarl
    title: "JaxMARL: Multi-Agent RL Environments in JAX"
    authors: Alexander Rutherford|Benjamin Ellis|Matteo Gallici|Jonathan Cook|Andrei Lupu|Gardar Ingvarsson|Timon Willi|Akbir Khan|Christian Schroeder de Witt|Alexandra Souly|Saptarashmi Bandyopadhyay|Mikayel Samvelyan|Minqi Jiang|Robert Tjarko Lange|Shimon Whiteson|Bruno Lacerda|Nick Hawes|Tim Rocktäschel|Chris Lu|Jakob Nicolaus Foerster
    abstract: Benchmarks play an important role in the development of machine learning algorithms. For example, research in reinforcement learning (RL) has been heavily influenced by available environments and benchmarks. However, RL environments are traditionally run on the CPU, limiting their scalability with typical academic compute. Recent advancements in JAX have enabled the wider use of hardware acceleration to overcome these computational hurdles, enabling massively parallel RL training pipelines and environments. This is particularly useful for multi-agent reinforcement learning (MARL) research. First of all, multiple agents must be considered at each environment step, adding computational burden, and secondly, the sample complexity is increased due to non-stationarity, decentralised partial observability, or other MARL challenges. In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. When considering wall clock time, our experiments show that per-run our JAX-based training pipeline is up to 12500x faster than existing approaches. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field. We also introduce and benchmark SMAX, a vectorised, simplified version of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL.
    keywords: reinforcement learning|multi-agent|jax|environment
    proceedings: NeurIPS
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2311.10090
-   UID: jiang2023hgap
    title: "H-GAP: Humanoid Control with a Generalist Planner"
    authors: Zhengyao Jiang|Yingchen Xu|Nolan Wagener|Yicheng Luo|Michael Janner|Edward Grefenstette|Tim Rocktäschel|Yuandong Tian
    abstract: Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For a humanoid with 56 degrees of freedom, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviors. Further, without any learning from online interactions, it can also flexibly transfer these behaviors to solve novel downstream control tasks via planning. Notably, H-GAP outperforms established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not additional compute.
    keywords: generative model|model-based reinforcement learning|humanoids
    proceedings: ICLR
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2312.02682
-   UID: chitnis2023iqltdmpc
    title: "IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control"
    authors: Rohan Chitnis|Yingchen Xu|Bobak Hashemi|Lucas Lehnert|Urun Dogan|Zheqing Zhu|Olivier Delalleau
    abstract: 'Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents'' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.'
    keywords: hierarchical reinforcement learning|model-based reinforcement learning|offline reinforcement learning
    proceedings: ICRA
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2306.00867
-   UID: raparthy2023generalization
    title: "Generalization to New Sequential Decision Making Tasks with In-Context Learning"
    authors: Sharath Chandra Raparthy|Eric Hambro|Robert Kirk|Mikael Henaff|Roberta Raileanu
    abstract: Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.
    keywords: generalisation|in-context learning|transformers|reinforcement learning
    proceedings: arXiv
    year: 2023
    type: Preprint
    url: https://arxiv.org/abs/2312.03801
-   UID: coste2023reward
    title: "Reward Model Ensembles Help Mitigate Overoptimization"
    authors: Thomas Coste|Usman Anwar|Robert Kirk|David Krueger
    abstract: 'Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.'
    keywords: large language models|fine-tuning|overoptimisation|alignment
    proceedings: ICLR
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2310.02743
-   UID: jain2023mechanistically
    title: "Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks"
    authors: Samyak Jain|Robert Kirk|Ekdeep Singh Lubana|Robert P. Dick|Hidenori Tanaka|Edward Grefenstette|Tim Rocktäschel|David Scott Krueger
    abstract: "Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on, e.g., a superficially unrelated downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup."
    keywords: large language models|fine-tuning|generalisation|interpretability
    proceedings: ICLR
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2311.12786
-   UID: kirk2023understanding
    title: "Understanding the Effects of RLHF on LLM Generalisation and Diversity"
    authors: Robert Kirk|Ishita Mediratta|Christoforos Nalmpantis|Jelena Luketina|Eric Hambro|Edward Grefenstette|Roberta Raileanu
    abstract: "Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e., supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity."
    keywords: large language models|rlhf|generalisation|diversity
    proceedings: ICLR
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2310.06452
-   UID: samvelyan2024multiagent
    title: "Multi-Agent Diagnostics for Robustness via Illuminated Diversity"
    authors: Mikayel Samvelyan|Davide Paglieri|Minqi Jiang|Jack Parker-Holder|Tim Rocktäschel
    abstract: In the rapidly advancing field of multi-agent systems, ensuring robustness in unfamiliar and adversarial settings is crucial. Notwithstanding their outstanding performance in familiar environments, these systems often falter in new situations due to overfitting during the training phase. This is especially pronounced in settings where both cooperative and competitive behaviours are present, encapsulating a dual nature of overfitting and generalisation challenges. To address this issue, we present Multi-Agent Diagnostics for Robustness via Illuminated Diversity (MADRID), a novel approach for generating diverse adversarial scenarios that expose strategic vulnerabilities in pre-trained multi-agent policies. Leveraging the concepts from open-ended learning, MADRID navigates the vast space of adversarial settings, employing a target policy's regret to gauge the vulnerabilities of these settings. We evaluate the effectiveness of MADRID on the 11vs11 version of Google Research Football, one of the most complex environments for multi-agent reinforcement learning. Specifically, we employ MADRID for generating a diverse array of adversarial settings for TiZero, the state-of-the-art approach which "masters" the game through 45 days of training on a large-scale distributed infrastructure. We expose key shortcomings in TiZero's tactical decision-making, underlining the crucial importance of rigorous evaluation in multi-agent systems.
    keywords: reinforcement learning|multi-agent|open-endedness|environment design
    proceedings: AAMAS
    year: 2024
    type: Conference
    url: https://arxiv.org/abs/2401.13460
-   UID: samvelyan2023maestro
    title: "MAESTRO: Open-Ended Environment Design for Multi-Agent Reinforcement Learning"
    authors: Mikayel Samvelyan|Akbir Khan|Michael Dennis|Minqi Jiang|Jack Parker-Holder|Jakob Foerster|Roberta Raileanu|Tim Rocktäschel
    abstract: Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environments and co-players and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player games, spanning discrete and continuous control settings.
    keywords: reinforcement learning|multi-agent|open-endedness|environment design
    proceedings: ICLR
    year: 2023
    type: Conference
    url: https://arxiv.org/abs/2303.03376
-   UID: ellis2022smacv2
    title: "SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning"
    authors: Benjamin Ellis|Jonathan Cook|Skander Moalla|Mikayel Samvelyan|Mingfei Sun|Anuj Mahajan|Jakob Foerster|Shimon Whiteson
    abstract: The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct new analysis demonstrating that SMAC is not sufficiently stochastic to require complex closed-loop policies. In particular, we show that an open-loop policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We show that these changes ensure the benchmark requires the use of closed-loop policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2.
    keywords: reinforcement learning|multi-agent|generalization|benchmark
    proceedings: NeurIPS
    year: 2023
    type: Conference
    url: https://arxiv.org/abs/2212.07489
-   UID: jiang2022grounding
    title: "Grounding Aleatoric Uncertainty for Unsupervised Environment Design"
    authors: Minqi Jiang|Michael Dennis|Jack Parker-Holder|Andrei Lupu|Heinrich Küttler|Edward Grefenstette|Tim Rocktäschel|Jakob Foerster
    abstract: 'Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.'
    keywords: reinforcement learning|generalization|environment design|curriculum learning|procedural content generation
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2207.05219
-   UID: henaff2022exploration
    title: "Exploration via Elliptical Episodic Bonuses"
    authors: Mikael Henaff|Minqi Jiang|Roberta Raileanu
    abstract: 'In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments.'
    keywords: reinforcement learning|exploration
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://e3bagent.github.io/
-   UID: mu2022improving
    title: "Improving Intrinsic Exploration with Language Abstractions"
    authors: Jesse Mu|Victor Zhong|Roberta Raileanu|Minqi Jiang|Noah Goodman|Tim Rocktäschel|Edward Grefenstette
    abstract: 'Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.'
    keywords: reinforcement learning|exploration|language
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2202.08938
-   UID: zhong2022improving
    title: "Improving Policy Learning via Language Dynamics Distillation"
    authors: Victor Zhong|Jesse Mu|Luke Zettlemoyer|Edward Grefenstette|Tim Rocktäschel
    abstract: 'Recent work has shown that augmenting environments with language descriptions improves policy learning. However, for environments with complex language abstractions, learning how to ground language to observations is difficult due to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics given demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning (RL). In this way, the model is trained to both maximize expected reward and retain knowledge about how language relates to environment dynamics. On SILG, a benchmark of five tasks with language descriptions that evaluate distinct generalization challenges on unseen environments (NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demonstrations in inverse RL and reward shaping with pretrained experts. In our analyses, we show that language descriptions in demonstrations improve sample-efficiency and generalization across environments, and that dynamics modelling with expert demonstrations is more effective than with non-experts.'
    keywords: reinforcement learning|language|transfer learning
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2210.00066
-   UID: hambro2022dungeons
    title: "Dungeons and Data: A Large-Scale NetHack Dataset"
    authors: Eric Hambro|Roberta Raileanu|Danielle Rothermel|Vegard Mella|Tim Rocktäschel|Heinrich Küttler|Naila Murray
    abstract: 'Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go, StarCraft, or DOTA, have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost to work with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and, accompanying code for users to record, load and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms including online and offline RL, as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.'
    keywords: reinforcement learning|offline learning|environments|dataset
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2211.00539
-   UID: xu2022cascade
    title: "Learning General World Models in a Handful of Reward-Free Deployments"
    authors: Yingchen Xu|Jack Parker-Holder|Aldo Pacchiano|Philip J. Ball|Oleh Rybkin|Stephen J. Roberts|Tim Rocktäschel|Edward Grefenstette
    abstract: 'Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE which we show in a tabular setting improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite.'
    keywords: reinforcement learning|world model|generalist agent|exploration|model-based|reward-free|unsupervised learning
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2210.12719
-   UID: ruis2022implicature
    title: "Large language models are not zero-shot communicators"
    authors: Laura Ruis|Akbir Khan|Stella Biderman|Sara Hooker|Tim Rocktäschel|Edward Grefenstette
    abstract: 'Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.'
    keywords: natural language processing|pragmatics|implicature|large language models
    proceedings: NeurIPS
    year: 2023
    type: Conference
    url: https://arxiv.org/abs/2210.14986
-   UID: bamford2022griddlyjs
    title: "GriddlyJS: A Web IDE for Reinforcement Learning"
    authors: Christopher Bamford|Minqi Jiang|Mikayel Samvelyan|Tim Rocktäschel
    abstract: Progress in reinforcement learning (RL) research is often driven by the design of new, challenging environments—a costly undertaking requiring skills orthogonal to that of a typical machine learning researcher. The complexity of environment development has only increased with the rise of procedural-content generation (PCG) as the prevailing paradigm for producing varied environments capable of testing the robustness and generalization of RL agents. Moreover, existing environments often require complex build processes, making reproducing results difficult. To address these issues, we introduce GriddlyJS, a web-based Integrated Development Environment (IDE) based on the Griddly engine. GriddlyJS allows researchers to easily design and debug arbitrary, complex PCG grid-world environments, as well as visualize, evaluate, and record the performance of trained agent models. By connecting the RL workflow to the advanced functionality enabled by modern web standards, GriddlyJS allows publishing interactive agent-environment demos that reproduce experimental results directly to the web. To demonstrate the versatility of GriddlyJS, we use it to quickly develop a complex compositional puzzle-solving environment alongside arbitrary human-designed environment configurations and their solutions for use in an automatic curriculum learning and offline RL context. The GriddlyJS IDE is open source and freely available at https://griddly.ai.
    keywords: reinforcement learning|open-endedness|environment design|environment
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2207.06105
-   UID: chen2022refactor
    title: "ReFactor GNNs: Revisiting Factorisation-based Models from a Message-Passing Perspective"
    authors: Yihong Chen|Pushkar Mishra|Luca Franceschi|Pasquale Minervini|Pontus Stenetorp|Sebastian Riedel
    abstract: Factorisation-based Models (FMs), such as DistMult, have enjoyed enduring success for Knowledge Graph Completion (KGC) tasks, often outperforming Graph Neural Networks (GNNs). However, unlike GNNs, FMs struggle to incorporate node features and generalise to unseen nodes in inductive settings. Our work bridges the gap between FMs and GNNs by proposing ReFactor GNNs. This new architecture draws upon both modelling paradigms, which previously were largely thought of as disjoint. Concretely, using a message-passing formalism, we show how FMs can be cast as GNNs by reformulating the gradient descent procedure as message-passing operations, which forms the basis of our ReFactor GNNs. Across a multitude of well-established KGC benchmarks, our ReFactor GNNs achieve comparable transductive performance to FMs, and state-of-the-art inductive performance while using an order of magnitude fewer parameters.
    keywords: knowledge graph completion|graph neural networks|factorisation-based models|message passing
    proceedings: NeurIPS
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2207.09980
-   UID: jiang2022tap
    title: "Efficient Planning in a Compact Latent Action Space"
    authors: Zhengyao Jiang|Tianjun Zhang|Michael Janner|Yueying Li|Tim Rocktäschel|Edward Grefenstette|Yuandong Tian
    abstract: While planning-based sequence modelling methods have shown great potential in continuous control, scaling them to high-dimensional state-action sequences remains an open challenge due to the high computational complexity and innate difficulty of planning in high-dimensional spaces. We propose the Trajectory Autoencoding Planner (TAP), a planning-based sequence modelling RL method that scales to high state-action dimensionalities. Using a state-conditional Vector-Quantized Variational Autoencoder (VQ-VAE), TAP models the conditional distribution of the trajectories given the current state. When deployed as an RL agent, TAP avoids planning step-by-step in a high-dimensional continuous action space but instead looks for the optimal latent code sequences by beam search. Unlike O(D^3) complexity of Trajectory Transformer, TAP enjoys constant O(C) planning computational complexity regarding state-action dimensionality D. Our empirical evaluation also shows the increasingly strong performance of TAP with the growing dimensionality. For Adroit robotic hand manipulation tasks with high state and action dimensionality, TAP surpasses existing model-based methods, including TT, with a large margin and also beats strong model-free actor-critic baselines.
    keywords: reinforcement learning|offline RL|sequence modelling RL|continuous control
    proceedings: arXiv
    year: 2022
    type: Preprint
    url: https://arxiv.org/abs/2208.10291
-   UID: matthews2022hierarchical
    title: "Hierarchical Kickstarting for Skill Transfer in Reinforcement Learning"
    authors: Michael Matthews|Mikayel Samvelyan|Jack Parker-Holder|Edward Grefenstette|Tim Rocktäschel
    abstract: Practising and honing skills forms a fundamental component of how humans learn, yet artificial agents are rarely specifically trained to perform them. Instead, they are usually trained end-to-end, with the hope being that useful skills will be implicitly learned in order to maximise discounted return of some extrinsic reward function. In this paper, we investigate how skills can be incorporated into the training of reinforcement learning (RL) agents in complex environments with large state-action spaces and sparse rewards. To this end, we created SkillHack, a benchmark of tasks and associated skills based on the game of NetHack. We evaluate a number of baselines on this benchmark, as well as our own novel skill-based method Hierarchical Kickstarting (HKS), which is shown to outperform all other evaluated methods. Our experiments show that learning with a prior knowledge of useful skills can significantly improve the performance of agents on complex problems. We ultimately argue that utilising predefined skills provides a useful inductive bias for RL problems, especially those with large state-action spaces and sparse rewards.
    keywords: reinforcement learning|transfer learning|environment
    proceedings: CoLLAs
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2207.11584
-   UID: jiang2022gb
    title: "Graph Backup: Data Efficient Backup Exploiting Markovian Transitions"
    authors: Zhengyao Jiang|Tianjun Zhang|Robert Kirk|Tim Rocktäschel|Edward Grefenstette
    abstract: The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as n-step Q-Learning and TD(λ), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, Minatar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.
    keywords: reinforcement learning|graph structure|data-efficient RL
    proceedings: arXiv
    year: 2022
    type: Preprint
    url: https://arxiv.org/abs/2205.15824
-   UID: parker-holder2022evolving
    title: "Evolving Curricula with Regret-Based Environment Design"
    authors: Jack Parker-Holder|Minqi Jiang|Michael Dennis|Mikayel Samvelyan|Jakob Foerster|Edward Grefenstette|Tim Rocktäschel
    abstract: It remains a significant challenge to train generally capable agents with reinforcement learning (RL). A promising avenue for improving the robustness of RL agents is through the use of curricula. One such class of methods frames environment design as a game between a student and a teacher, using regret-based objectives to produce environment instantiations (or levels) at the frontier of the student agent's capabilities. These methods benefit from their generality, with theoretical guarantees at equilibrium, yet they often struggle to find effective levels in challenging design spaces. By contrast, evolutionary approaches seek to incrementally alter environment complexity, resulting in potentially open-ended learning, but often rely on domain-specific heuristics and vast amounts of computational resources. In this paper we propose to harness the power of evolution in a principled, regret-based curriculum. Our approach, which we call Adversarially Compounding Complexity by Editing Levels (ACCEL), seeks to constantly produce levels at the frontier of an agent's capabilities, resulting in curricula that start simple but become increasingly complex. ACCEL maintains the theoretical benefits of prior regret-based methods, while providing significant empirical gains in a diverse set of environments. An interactive version of the paper is available at accelagent.github.io.
    keywords: reinforcement learning|generalization|open-endedness|environment design|curriculum learning|procedural content generation
    proceedings: ICML
    year: 2022
    type: Conference
    url: https://arxiv.org/abs/2203.01302
-   UID: mahajan2022generalization
    title: "Generalization in Cooperative Multi-Agent Systems"
    authors: Anuj Mahajan|Mikayel Samvelyan|Tarun Gupta|Benjamin Ellis|Mingfei Sun|Tim Rocktäschel|Shimon Whiteson
    abstract: Collective intelligence is a fundamental trait shared by several species of living organisms. It has allowed them to thrive in the diverse environmental conditions that exist on our planet. From simple organisations in an ant colony to complex systems in human groups, collective intelligence is vital for solving complex survival tasks. As is commonly observed, such natural systems are flexible to changes in their structure. Specifically, they exhibit a high degree of generalization when the abilities or the total number of agents changes within a system. We term this phenomenon as Combinatorial Generalization (CG). CG is a highly desirable trait for autonomous systems as it can increase their utility and deployability across a wide range of applications. While recent works addressing specific aspects of CG have shown impressive results on complex domains, they provide no performance guarantees when generalizing towards novel situations. In this work, we shed light on the theoretical underpinnings of CG for cooperative multi-agent systems (MAS). Specifically, we study generalization bounds under a linear dependence of the underlying dynamics on the agent capabilities, which can be seen as a generalization of Successor Features to MAS. We then extend the results first for Lipschitz and then arbitrary dependence of rewards on team capabilities. Finally, empirical analysis on various domains using the framework of multi-agent reinforcement learning highlights important desiderata for multi-agent algorithms towards ensuring CG.
    keywords: reinforcement learning|multi-agent|generalization
    proceedings: arXiv
    year: 2022
    type: Preprint
    url: https://arxiv.org/abs/2202.00104
-   UID: kirk2021survey
    title: A Survey of Zero-shot Generalisation in Deep Reinforcement Learning
    authors: Robert Kirk|Amy Zhang|Edward Grefenstette|Tim Rocktäschel
    abstract: The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.
    keywords: reinforcement learning|generalization|survey|review
    proceedings: Journal of Artificial Intelligence Research
    year: 2023
    type: Journal
    blog: generalization_survey.md
    url: https://arxiv.org/abs/2111.09794
-   UID: samvelyan2021minihack
    title: "MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research"
    authors: Mikayel Samvelyan|Robert Kirk|Vitaly Kurin|Jack Parker-Holder|Minqi Jiang|Eric Hambro|Fabio Petroni|Heinrich Küttler|Edward Grefenstette|Tim Rocktäschel
    abstract: The progress in deep reinforcement learning (RL) is heavily driven by the availability of challenging benchmarks used for training agents. However, benchmarks that are widely adopted by the community are not explicitly designed for evaluating specific capabilities of RL methods. While there exist environments for assessing particular open problems in RL (such as exploration, transfer learning, unsupervised environment design, or even language-assisted RL), it is generally difficult to extend these to richer, more complex environments once research goes beyond proof-of-concept results. We present MiniHack, a powerful sandbox framework for easily designing novel RL environments. MiniHack is a one-stop shop for RL experiments with environments ranging from small rooms to complex, procedurally generated worlds. By leveraging the full set of entities and environment dynamics from NetHack, one of the richest grid-based video games, MiniHack allows designing custom RL testbeds that are fast and convenient to use. With this sandbox framework, novel environments can be designed easily, either using a human-readable description language or a simple Python interface. In addition to a variety of RL tasks and baselines, MiniHack can wrap existing RL benchmarks and provide ways to seamlessly add additional complexity.
    keywords: reinforcement learning|open-endedness|environment design|environment
    proceedings: NeurIPS
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2109.13202
-   UID: jiang2021replay
    title: "Replay-Guided Adversarial Environment Design"
    authors: Minqi Jiang|Michael Dennis|Jack Parker-Holder|Jakob Foerster|Edward Grefenstette|Tim Rocktäschel
    abstract: "Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent's capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly-generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (training on less data), we can improve the convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR⊥, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that PLR⊥ improves the performance of PAIRED, from which it inherited its theoretical framework."
    keywords: reinforcement learning|generalization|curriculum learning|environment design|procedural content generation
    proceedings: NeurIPS
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2110.02439
-   UID: jiang2020prioritized
    title: Prioritized Level Replay
    authors: Minqi Jiang|Edward Grefenstette|Tim Rocktäschel
    abstract: 'Simulated environments with procedurally generated content have become popular benchmarks for testing systematic generalization of reinforcement learning agents. Every level in such an environment is algorithmically created, thereby exhibiting a unique configuration of underlying factors of variation, such as layout, positions of entities, asset appearances, or even the rules governing environment transitions. Fixed sets of training levels can be determined to aid comparison and reproducibility, and test levels can be held out to evaluate the generalization and robustness of agents. While prior work samples training levels in a direct way (e.g. uniformly) for the agent to learn from, we investigate the hypothesis that different levels provide different learning progress for an agent at specific times during training. We introduce Prioritized Level Replay, a general framework for estimating the future learning potential of a level given the current state of the agent''s policy. We find that temporal-difference (TD) errors, while previously used to selectively sample past transitions, also prove effective for scoring a level''s future learning potential when the agent replays (that is, revisits) that level to generate entirely new episodes of experiences from it. We report significantly improved sample-efficiency and generalization on the majority of Procgen Benchmark environments as well as two challenging MiniGrid environments. Lastly, we present a qualitative analysis showing that Prioritized Level Replay induces an implicit curriculum, taking the agent gradually from easier to harder levels'
    keywords: reinforcement learning|curriculum learning|generalization|procedural content generation
    proceedings: ICML
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2010.03934
-   UID: jiang2021gtg
    title: "Grid-to-Graph: Flexible Spatial Relational Inductive Biases for Reinforcement Learning"
    authors: Zhengyao Jiang|Pasquale Minervini|Minqi Jiang|Tim Rocktäschel
    abstract: 'Although reinforcement learning has been successfully applied in many domains in recent years, we still lack agents that can systematically generalize. While relational inductive biases that fit a task can improve generalization of RL agents, these biases are commonly hard-coded directly in the agent''s neural architecture. In this work, we show that we can incorporate relational inductive biases, encoded in the form of relational graphs, into agents. Based on this insight, we propose Grid-to-Graph (GTG), a mapping from grid structures to relational graphs that carry useful spatial relational inductive biases when processed through a Relational Graph Convolution Network (R-GCN). We show that, with GTG, R-GCNs generalize better both in terms of in-distribution and out-of-distribution compared to baselines based on Convolutional Neural Networks and Neural Logic Machines on challenging procedurally generated environments and MinAtar. Furthermore, we show that GTG produces agents that can jointly reason over observations and environment dynamics encoded in knowledge bases.'
    keywords: relational inductive bias|reinforcement learning|graph neural network
    proceedings: AAMAS
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2102.04220
-   UID: mahajan2021tesseract
    title: "Tesseract: Tensorised Actors for Multi-Agent Reinforcement Learning"
    authors: Anuj Mahajan|Mikayel Samvelyan|Lei Mao|Viktor Makoviychuk|Animesh Garg|Jean Kossaifi|Shimon Whiteson|Yuke Zhu|Animashree Anandkumar
    abstract: 'Reinforcement Learning in large action spaces is a challenging problem. Cooperative multi-agent reinforcement learning (MARL) exacerbates matters by imposing various constraints on communication and observability. In this work, we consider the fundamental hurdle affecting both value-based and policy-gradient approaches: an exponential blowup of the action space with the number of agents. For value-based methods, it poses challenges in accurately representing the optimal value function. For policy gradient methods, it makes training the critic difficult and exacerbates the problem of the lagging critic. We show that from a learning theory perspective, both problems can be addressed by accurately representing the associated action-value function with a low-complexity hypothesis class. This requires accurately modelling the agent interactions in a sample efficient way. To this end, we propose a novel tensorised formulation of the Bellman equation. This gives rise to our method Tesseract, which views the Q-function as a tensor whose modes correspond to the action spaces of different agents. Algorithms derived from Tesseract decompose the Q-tensor across agents and utilise low-rank tensor approximations to model agent interactions relevant to the task. We provide PAC analysis for Tesseract-based algorithms and highlight their relevance to the class of rich observation MDPs. Empirical results in different domains confirm Tesseract''s gains in sample efficiency predicted by the theory.'
    keywords: reinforcement learning|multi-agent learning
    proceedings: ICML
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2106.00136
-   UID: niepert2021imle
    title: "Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions"
    authors: Mathias Niepert|Pasquale Minervini|Luca Franceschi
    abstract: Combining discrete probability distributions and combinatorial optimization problems with neural network components has numerous applications but poses several challenges. We propose Implicit Maximum Likelihood Estimation (I-MLE), a framework for end-to-end learning of models combining discrete exponential family distributions and differentiable neural components. I-MLE is widely applicable as it only requires the ability to compute the most probable states and does not rely on smooth relaxations. The framework encompasses several approaches such as perturbation-based implicit differentiation and recent methods to differentiate through black-box combinatorial solvers. We introduce a novel class of noise distributions for approximating marginals via perturb-and-MAP. Moreover, we show that I-MLE simplifies to maximum likelihood estimation when used in some recently studied learning settings that involve combinatorial solvers. Experiments on several datasets suggest that I-MLE is competitive with and often outperforms existing approaches which rely on problem-specific relaxations.
    keywords: reasoning|planning|gradient estimation|discrete distributions|backpropagation
    proceedings: NeurIPS
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2106.01798
-   UID: campero2021amigo
    title: "Learning with AMIGo: Adversarially Motivated Intrinsic Goals"
    authors: Andres Campero|Roberta Raileanu|Heinrich Küttler|Joshua B. Tenenbaum|Tim Rocktäschel|Edward Grefenstette
    abstract: 'A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned ''student'' policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective ''constructively adversarial'' objective, the teacher learns to propose increasingly challenging—yet achievable—goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks where other forms of intrinsic motivation and state-of-the-art RL methods fail.'
    keywords: exploration|reinforcement learning
    proceedings: ICLR
    year: 2021
    type: Conference
    url: https://arxiv.org/abs/2006.12122
-   UID: arakelyan2021cqd
    title: Complex Query Answering with Neural Link Predictors
    authors: Erik Arakelyan|Daniel Daza|Pasquale Minervini|Michael Cochez
    abstract: Neural link predictors are immensely useful for identifying missing edges in large scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries that arise in a number of domains, such as queries using logical conjunctions (∧), disjunctions (∨) and existential quantifiers (∃), while accounting for missing edges. In this work, we propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods -- black-box neural models trained on millions of generated queries -- without the need of training on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across different knowledge graphs containing factual information. Finally, we demonstrate that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms. All our source code and datasets are available online.
    keywords: complex query answering|knowledge graphs
    proceedings: ICLR
    year: 2021 (Outstanding Paper Award)
    type: Conference
    url: https://arxiv.org/abs/2011.03459
-   UID: kuettler2020nethack
    title: The NetHack Learning Environment
    authors: Heinrich Küttler|Nantas Nardelli|Alexander H. Miller|Roberta Raileanu|Marco Selvatici|Edward Grefenstette|Tim Rocktäschel
    abstract: Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at https://github.com/facebookresearch/nle.
    keywords: environment|reinforcement learning
    proceedings: NeurIPS
    year: 2020
    type: Conference
    url: https://arxiv.org/abs/2006.13760
-   UID: jiang2020wordcraft
    title: "WordCraft: An Environment for Benchmarking Commonsense Agents"
    authors: Minqi Jiang|Jelena Luketina|Nantas Nardelli|Pasquale Minervini|Philip H.S. Torr|Shimon Whiteson|Tim Rocktäschel
    abstract: The ability to quickly solve a wide range of real-world tasks requires a commonsense understanding of the world. Yet, how to best extract such knowledge from natural language corpora and integrate it with reinforcement learning (RL) agents remains an open challenge. This is partly due to the lack of lightweight simulation environments that sufficiently reflect the semantics of the real world and provide knowledge sources grounded with respect to observations in an RL environment. To better enable research on agents making use of commonsense knowledge, we propose WordCraft, an RL environment based on Little Alchemy 2. This lightweight environment is fast to run and built upon entities and relations inspired by real-world semantics. We evaluate several representation learning methods on this new benchmark and propose a new method for integrating knowledge graphs with an RL agent.
    keywords: reinforcement learning|commonsense reasoning|natural language processing|procedural content generation
    proceedings: Workshop on Language in Reinforcement Learning at ICML
    year: 2020
    type: Conference
    url: https://arxiv.org/abs/2007.09185
-   UID: raileanu2020ride
    title: "RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments"
    authors: Roberta Raileanu|Tim Rocktäschel
    abstract: Exploration in sparse reward environments remains one of the key challenges of model-free reinforcement learning. Instead of solely relying on extrinsic rewards provided by the environment, many state-of-the-art methods use intrinsic rewards to encourage exploration. However, we show that existing methods fall short in procedurally-generated environments where an agent is unlikely to visit a state more than once. We propose a novel type of intrinsic reward which encourages the agent to take actions that lead to significant changes in its learned state representation. We evaluate our method on multiple challenging procedurally-generated tasks in MiniGrid, as well as on tasks with high-dimensional observations used in prior work. Our experiments demonstrate that this approach is more sample efficient than existing exploration methods, particularly for procedurally-generated MiniGrid environments. Furthermore, we analyze the learned behavior as well as the intrinsic reward received by our agent. In contrast to previous approaches, our intrinsic reward does not diminish during the course of training and it rewards the agent substantially more for interacting with objects that it can control.
    keywords: exploration|reinforcement learning
    proceedings: ICLR
    year: 2020
    type: Conference
    url: https://arxiv.org/abs/2002.12292
-   UID: zhong2020rtfm
    title: "RTFM: Generalising to New Environment Dynamics via Reading"
    authors: Victor Zhong|Tim Rocktäschel|Edward Grefenstette
    abstract: Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments. We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps.
    keywords: natural language processing|reasoning|reinforcement learning
    proceedings: ICLR
    year: 2020
    type: Conference
    url: https://arxiv.org/abs/1910.08210