Algorithms typically assume access to a loss, goal, reward function, or other objective that adequately captures what a system designer wants the algorithm to do. This is a poor fit for settings where the real objective is difficult to write down by hand (e.g. synthesising aesthetically pleasing images or music), or where the algorithm has enough flexibility to realise simple, hand-specified goals in surprising and undesirable ways. Value learning is the task of automatically discovering what objectives a system ought to pursue, typically by having an algorithm query and interact with people. I view this as a stepping stone to better aligning the behaviour of algorithms with the long-term interests of society, which are by nature fluid and difficult to formalise. As preliminary work along these lines, I've been looking at how Inverse Reinforcement Learning (IRL), which infers reward functions from demonstrations of desirable behaviour, can be made more reliable in practice.
X Chen*, S Toyer*, C Wild*, S Emmons, I Fischer, KH Lee, N Alex, SH Wang, P Luo, S Russell, P Abbeel, R Shah. “An Empirical Investigation of Representation Learning for Imitation”. NeurIPS 2021 (Datasets & Benchmarks Track). [code]
XB Peng, A Kanazawa, S Toyer, P Abbeel, S Levine. “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”. ICLR 2019 (poster). [IRL code]
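To make the IRL setting above concrete, here is a toy sketch of reward inference on a five-state chain MDP (standard library only): the reward is linear in one-hot state features, a soft-value-iteration policy is fit under the current reward, and the reward weights are nudged towards the expert's feature expectations. The chain MDP, horizon, and step size are all invented for illustration; this is a minimal sketch of the general idea, not the method from the papers above.

```python
import math

# Toy IRL on a five-state chain MDP. States are 0..4; action 0 moves left,
# action 1 moves right; the "expert" always moves right, and we recover a
# reward r(s) = w[s] that explains this behaviour.
N, H = 5, 6                      # number of states, planning horizon

def step(s, a):
    return max(0, s - 1) if a == 0 else min(N - 1, s + 1)

def soft_policy(w):
    """Finite-horizon soft value iteration under linear reward r(s) = w[s]."""
    V = [0.0] * N
    pi = [[0.5, 0.5] for _ in range(N)]
    for _ in range(H):
        new_pi, new_V = [], []
        for s in range(N):
            Q = [w[s] + V[step(s, a)] for a in (0, 1)]
            m = max(Q)
            e = [math.exp(q - m) for q in Q]   # subtract max for stability
            Z = sum(e)
            new_pi.append([p / Z for p in e])  # softmax over action values
            new_V.append(m + math.log(Z))      # soft (log-sum-exp) backup
        pi, V = new_pi, new_V
    return pi

def feature_expectations(pi):
    """Expected one-hot state-visit counts over H steps, starting in state 0."""
    d = [0.0] * N
    d[0] = 1.0
    mu = [0.0] * N
    for _ in range(H):
        for s in range(N):
            mu[s] += d[s]
        nd = [0.0] * N
        for s in range(N):
            for a in (0, 1):
                nd[step(s, a)] += d[s] * pi[s][a]
        d = nd
    return mu

# Demonstrations: the expert deterministically moves right.
mu_expert = feature_expectations([[0.0, 1.0]] * N)
w = [0.0] * N
for _ in range(200):             # gradient ascent: match feature expectations
    mu = feature_expectations(soft_policy(w))
    w = [wi + 0.1 * (e - m) for wi, e, m in zip(w, mu_expert, mu)]
```

After training, the learned reward is highest at the rightmost state, and a policy planned under it reproduces the expert's rightward behaviour.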
A generalised policy for a family of planning problems is a function which can suggest an appropriate action to take in any state of any problem from that family. For instance, if one wants to solve a series of slightly different truck routing problems (e.g. with different numbers of trucks or different road networks), then it might be possible to obtain a single generalised policy which solves every problem in the family, rather than computing a separate policy for each one. As an undergraduate, I explored how deep learning could be used to represent generalised policies for factored MDPs and classical planning problems.
W Shen, FW Trevizan, S Toyer, S Thiébaux, L Xie. “Guiding Search with Generalized Policies for Probabilistic Planning”. SOCS 2019.
S Toyer, FW Trevizan, S Thiébaux, L Xie. “Action Schema Networks: Generalised Policies with Deep Learning”. AAAI 2018. [code]
S Toyer. “Generalized Policies for Probabilistic Planning with Deep Learning”. Honours thesis, ANU 2017. [code]
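As a concrete (if hand-coded) illustration of the idea, the sketch below defines a single function that acts as a generalised policy for an entire family of shortest-path navigation problems: given any road network (assumed symmetric), any goal, and any current node, it suggests the next move. The graphs and function names are invented for illustration; the papers above learn such policies with neural networks rather than hand-coding them.

```python
from collections import deque

def generalised_policy(graph, goal, state):
    """One policy for the whole family: works for ANY graph and goal.

    graph maps each node to its list of neighbours (roads assumed two-way).
    """
    # BFS outwards from the goal gives distance-to-go for every node.
    dist = {goal: 0}
    q = deque([goal])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    # Greedy action: move to the neighbour closest to the goal.
    return min(graph[state], key=lambda v: dist.get(v, float("inf")))

def rollout(graph, goal, state, max_steps=20):
    """Follow the generalised policy from `state` until the goal (or a cap)."""
    path = [state]
    while state != goal and len(path) <= max_steps:
        state = generalised_policy(graph, goal, state)
        path.append(state)
    return path

# Two different problem instances from the same family, one policy for both.
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

The same `generalised_policy` solves both `ring` and `chain` without per-instance training, which is the property the learned policies above aim for.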
Pose forecasting is the task of predicting a person's future motion from an observed sequence of motions. It has potential applications in collaborative robotics, as it can be helpful for a robot to anticipate a person's movement in order to coordinate with them or avoid colliding with them. In a DICTA paper, we proposed a novel pose forecasting method which used recurrent variational autoencoders to model a distribution over future poses. In contrast, previous methods could predict only a single trajectory, and thus did not adequately account for the uncertainty inherent in predicting human motion. Our DICTA paper also introduced Ikea Furniture Assembly, a new dataset for pose forecasting, action anticipation, and related tasks.
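To illustrate why modelling a distribution over futures matters, here is a schematic, untrained stand-in for a VAE-style forecaster, using only the standard library, 1-D "poses", and hand-picked constants (all invented for illustration, not the model from the paper): sampling different latent codes yields different plausible futures, whereas a single-trajectory predictor must commit to just one.

```python
import random

random.seed(0)

def decode(last_pose, z, steps=5):
    """Toy decoder: extrapolate the last observed pose with a latent-dependent
    drift, standing in for a learned VAE decoder."""
    future, pose = [], last_pose
    for _ in range(steps):
        pose = pose + 0.1 + 0.5 * z   # the latent z selects one plausible motion
        future.append(pose)
    return future

observed = [0.0, 0.1, 0.2]            # a 1-D stand-in for an observed pose sequence
# Sampling z ~ N(0, 1) repeatedly gives a set of distinct plausible futures.
samples = [decode(observed[-1], random.gauss(0, 1)) for _ in range(8)]
endpoints = [future[-1] for future in samples]
```

A deterministic predictor can output only one of these trajectories (or an average that may match none of them), which is the limitation the distributional approach addresses.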
Geoscientists rely on data gathered from a wide range of satellites operated by many different agencies and countries. Satellite data providers often present their information in mutually incompatible formats, which makes it challenging to combine data from different sources. I spent some time working with a student team on interoperable methods for publishing spatio-temporal coverage data using semantic web technology. This led to a Note (the W3C equivalent of a technical report) published by the W3C/OGC Spatial Data on the Web Working Group, as well as some software demonstrating how to publish coverage data on the web.
D Brizhinev, S Toyer, K Taylor. “Publishing and Using Earth Observation Data with the RDF Data Cube and the Discrete Global Grid System”. W3C Working Group Note 2017. [AGU'16 abstract] [code]
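To give a flavour of the approach, the sketch below serialises a single earth-observation value as an RDF Data Cube observation in Turtle. The `qb:` namespace is the real Data Cube vocabulary, but the `ex:` dataset name, cell property, and measure property are simplified placeholders for illustration, not the exact terms used in the Note.

```python
# Build a Turtle snippet for one observation using plain string formatting
# (standard library only; a real publisher would use an RDF library).
PREFIXES = """\
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/eo#> .
"""

def observation(obs_id, cell, time, value):
    """One qb:Observation: a measured value at a grid cell and time."""
    return (f"ex:{obs_id} a qb:Observation ;\n"
            f"    qb:dataSet ex:exampleCube ;\n"
            f"    ex:gridCell \"{cell}\" ;\n"
            f"    ex:time \"{time}\" ;\n"
            f"    ex:reflectance {value} .\n")

turtle = PREFIXES + observation("obs1", "R7852", "2016-01-01", 0.23)
```

Because every observation is addressed by a grid cell and timestamp in a shared vocabulary, data published this way by different providers can be queried and combined uniformly.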