Editor’s Note: Joel is a speaker for ODSC East 2022. Be sure to check out his talk, “Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot,” there!
Homo sapiens is a funny species. Across most of the traditionally studied domains of intelligence, we do not outshine other mammals: our perception, memory, attention, planning, and decision-making skills are nothing special for a mammal of our size (Henrich and Muthukrishna, 2021). Rather, we distinguish ourselves primarily along social-cognitive dimensions. For instance, we improvisationally achieve cooperation through means such as persuasion and negotiation, reciprocity and reputational concern, alliances and incentive alignment, imitation and teaching, leadership and responsibility, commitments and promises, trust and monitoring, sharing and division of labor, social norms and institutions, and many more besides. As AI researchers interested in building human-like AGI, we need to make sure the machine intelligences we build possess all the core skills of the human social-cognitive repertoire before we let them participate in human society.
We propose to take an approach to building advanced AI systems based on reverse engineering human intelligence. The reverse engineering approach is common in AI, especially among researchers who work on the “classic” cognitive abilities like perception, attention, and memory. But, so far, we think reverse engineering is under-explored with regard to the social-cognitive abilities that underlie important skills such as cooperation. Here, what is needed is to figure out which social-cognitive capacities, representations, and motivations underlie human collective intelligence, and then build them into our AGI systems.
For this multi-agent-first, reverse-engineering-oriented AI research program to progress, it needs to settle two main methodological questions: (1) how to train agents so that they develop social-cognitive abilities on par with those of humans, and (2) how to measure progress toward that goal. The methodology we have proposed, Melting Pot, addresses both questions (Leibo et al. 2021).
Melting Pot: the evaluation protocol
Let’s start by talking about how Melting Pot provides a recipe for measuring progress. The key idea is to always measure generalization. Otherwise, agents might overfit to one another’s behavior, yielding policies that are brittle and unlikely to work in the real world. To measure social-cognitive skills, the key type of generalization we need to look at is social generalization, i.e. generalization to interactions in unfamiliar social situations involving both familiar and unfamiliar individuals.
Melting Pot consists of a set of test scenarios and a protocol for using them. A scenario is a multi-agent environment that tests the ability of a focal population of agents to generalize to novel social situations. Each scenario is formed by a substrate and a background population. The term ‘substrate’ refers to the physical part of the world: its spatial layout, where the objects are, how they move, the rules of physics, etc. The term ‘background population’ refers to the part of the simulation that is imbued with agency—excluding the focal population. While the substrate is experienced by the focal population during training, the background population is not. Thus the performance of the focal population in a test scenario measures how well its agents generalize to social situations they were not directly exposed to at training time.
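To make the protocol concrete, here is a minimal, self-contained Python sketch of the evaluation logic. The names (`Scenario`, `evaluate`, `run_episode`) are invented for illustration and do not match the real Melting Pot API; the point is only that a scenario bundles a substrate with frozen background bots, and that the score counts the returns of the focal agents alone.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative sketch only; names do not match the real Melting Pot API.

@dataclass
class Scenario:
    """A test scenario: a substrate plus a frozen background population."""
    substrate_name: str
    background_bots: Sequence[Callable]  # frozen bot policies, unseen during training
    num_focal_slots: int                 # player slots filled by the focal population

def evaluate(scenario: Scenario, focal_policies: Sequence[Callable],
             run_episode: Callable) -> float:
    """Score a focal population: the mean return of the focal agents only.

    Background-bot returns are ignored, since the benchmark measures how
    well the focal population generalizes to its unfamiliar co-players.
    """
    assert len(focal_policies) == scenario.num_focal_slots
    players = list(focal_policies) + list(scenario.background_bots)
    returns = run_episode(scenario.substrate_name, players)  # one return per player
    focal_returns = returns[:scenario.num_focal_slots]
    return sum(focal_returns) / len(focal_returns)
```

Because the background bots are held out of training, a high `evaluate` score can only come from genuinely generalizing to the unfamiliar co-players.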
How to train populations of agents that will perform well in Melting Pot
Now let’s talk about the training side of Melting Pot. Researchers and the agents they train are allowed unlimited access to the substrate. This means that at test time the agents will usually already be familiar with their physical environment. So the kind of generalization Melting Pot probes is mainly along social dimensions, not physical ones. This is in contrast to most work on generalization in reinforcement learning (e.g. Cobbe et al. 2019), which is primarily concerned with generalization along physical dimensions.
In constructing Melting Pot, our guiding principle has been to avoid decisions that restrict researchers’ creativity in how they train their agents. We can get away with few restrictions on training because the evaluation is rigorous: as long as researchers do not cheat by training their agents on the test set, “anything goes” in training. All agents are judged against a common benchmark in the end.
One interpretation of Melting Pot-compatible training processes, which we especially like, is that they give groups of agents the opportunity to form their own artificial civil society, complete with their own conventions, norms, and institutions. Forming a civil society entails generating and successfully resolving all kinds of obstacles to cooperation. The specific obstacles that arise, and the ways they can be resolved, depend on the properties of the substrate. For instance, free riding in public good provision is one obstacle to cooperation, and unsustainable resource usage leading to a tragedy of the commons is quite another (Ostrom 2009). The challenge of overcoming these obstacles is what pushes agents to develop their social-cognitive abilities. For instance, if a particular social dilemma can be resolved by inventing the concept of fairness and enforcing its associated norms, then that very fact motivates agents capable of representing the fairness concept and its norms to actually do so (Leibo et al. 2019).
Melting Pot is open and easy to extend
We want to grow Melting Pot over time and ultimately create a comprehensive platform where most aspects of social intelligence can be assessed. To that end, we designed Melting Pot around the need to establish a scalable process through which it can be expanded. This led us to consider not just modular environment components (which we have), but also a modular process for contributing new scenarios.
As mentioned above, a scenario consists of two parts: a substrate and a background population. We built substrates on DMLab2D (Beattie et al. 2020) using an entity-component system approach similar to that of modern game engines. You write components in Lua and arrange them into substrates in Python. Members of the background population are agents that we trained with reinforcement learning algorithms and then “froze” (turned off further learning). We call them bots to distinguish them from the agents in the focal population. A Melting Pot substrate emits events when interactions occur between agents, or between agents and the environment, such as one player zapping another player or eating an apple. Events can be conditional on the identities of the players involved or the location where the interaction occurred.
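The event mechanism can be illustrated with a small, hypothetical sketch. Melting Pot’s real event system lives in the DMLab2D/Lua layer; the `EventLog` class and the event names below are invented for exposition, to show how events carry payloads that can be filtered on player identity or location.

```python
from collections import defaultdict

# Hypothetical sketch; Melting Pot's real events are emitted from the
# Lua component layer, not from a Python class like this one.

class EventLog:
    """Collects typed events emitted by the substrate during an episode."""

    def __init__(self):
        self.events = defaultdict(list)

    def emit(self, name, **payload):
        # e.g. emit("zap", source=2, target=5, position=(3, 7))
        self.events[name].append(payload)

    def count(self, name, **conditions):
        """Count events matching all conditions, e.g. only zaps by player 2."""
        return sum(
            all(e.get(k) == v for k, v in conditions.items())
            for e in self.events[name]
        )

log = EventLog()
log.emit("eat_apple", player=1, position=(4, 4))
log.emit("zap", source=2, target=5, position=(3, 7))
log.emit("zap", source=2, target=1, position=(0, 1))
print(log.count("zap", source=2))  # → 2
```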
The approach we took to creating the bots to populate Melting Pot’s test scenarios involves three steps: (1) specification, (2) training, and (3) quality control. We describe each in turn.
1. Specification: The designer typically starts with an idea of what they want the final bot’s behavior to look like. Since substrate events provide privileged information about other agents and the substrate, we can often easily specify reward functions that induce the desired behavior. This is a much easier task than the one focal agents must solve: learning from pixels and the final reward alone.
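As a hedged illustration of what such an event-based reward might look like (the function name, event names, and payload fields here are invented, not part of Melting Pot’s API):

```python
# Hypothetical sketch: a pseudo-reward defined over substrate events rather
# than pixels. A bot trained with `zap_reward` learns to zap other players,
# even if the true substrate reward is something else (e.g. apples eaten).

def zap_reward(events):
    """+1 for each zap this bot lands in the current step.

    `events` is a list of (name, payload) pairs emitted by the substrate;
    the "self" marker standing for the learning bot is an invented convention.
    """
    return sum(1.0 for name, payload in events
               if name == "zap" and payload.get("source") == "self")

step_events = [("zap", {"source": "self", "target": 3}),
               ("eat_apple", {"player": 3})]
print(zap_reward(step_events))  # → 1.0
```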
However, sometimes the desired behavior is difficult to specify with a single reward function. In these cases, we generate background populations using techniques inspired by hierarchical reinforcement learning: we create a basic portfolio of behaviors by training bots that use different environment events as their reward signal (as in Horde (Sutton et al. 2011)), and then chain them together with simple Python code. This lets us express complex behaviors in an “if this event, run that behavior” style. For example, in the substrate Prisoner’s Dilemma in the Matrix (the Melting Pot analog of the iterated prisoner’s dilemma), we created a bot that cooperates until its partner defects on it, after which it defects in all future interactions. This is the Melting Pot version of the GRIM TRIGGER strategy from game theory.
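A minimal sketch of this chaining, with `cooperate_policy` and `defect_policy` standing in for two frozen bots trained on different event-based rewards (all names, including the "defection" event, are illustrative rather than actual Melting Pot identifiers):

```python
# Hypothetical sketch of "if this event, run that behavior" chaining,
# used here to express a GRIM TRIGGER bot from two pre-trained behaviors.

class GrimTrigger:
    """Run the cooperator until the partner defects once; defect ever after."""

    def __init__(self, cooperate_policy, defect_policy):
        self.cooperate = cooperate_policy
        self.defect = defect_policy
        self.triggered = False

    def step(self, observation, events):
        # Any "defection" event naming this bot as the victim flips the
        # switch permanently; the bot never returns to cooperating.
        if any(name == "defection" and payload.get("victim") == "self"
               for name, payload in events):
            self.triggered = True
        policy = self.defect if self.triggered else self.cooperate
        return policy(observation)
```

The two base policies never need to know about each other; all the strategic logic lives in a few lines of Python glue.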
2. Training: The decision at this stage is how to train the background population. The thing to keep in mind is that the bots must generalize to the focal population. To this end, we train at least some bots—typically not used in the final scenario—that are likely to develop behaviors resembling those of the focal agents at test time. For instance, in Running With Scissors in the Matrix (the Melting Pot analog of rock-paper-scissors), we train rock, paper, and scissors specialist bots alongside “free” bots that experience the true substrate reward function.
3. Quality control: Bot quality control is done by running 10–30 episodes in which candidate bots interact with other fixed bots, typically a mixture of familiar and unfamiliar ones (i.e. bots trained together with the candidate or trained separately). We verify that agents trained to optimize for a certain event indeed do so, and we reject those that fail.
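A hedged sketch of what such a quality-control check might look like; the function names, the per-episode event-count format, and the acceptance threshold are all invented for illustration:

```python
# Hypothetical sketch of bot quality control: run a candidate bot for a
# batch of evaluation episodes and reject it if the event it was trained
# to optimize does not fire often enough on average.

def passes_quality_control(candidate, run_episode, num_episodes=20,
                           target_event="zap", min_count=1.0):
    """Accept the bot only if, averaged over episodes, its target event
    fires at least `min_count` times per episode.

    `run_episode` is assumed to play one episode with the candidate and
    return a dict of event counts, e.g. {"zap": 3, "eat_apple": 7}.
    """
    totals = [run_episode(candidate).get(target_event, 0)
              for _ in range(num_episodes)]
    return sum(totals) / num_episodes >= min_count
```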
Building Melting Pot is an interdisciplinary effort. To be successful, it will need contributions from people with a variety of backgrounds. Computer scientists can devise new agent-training algorithms that train on Melting Pot and achieve steadily better scores on its test scenarios. Computational neuroscientists and cognitive scientists have skills well suited to developing background populations that (a) have interesting motivations and (b) yield experimentally rigorous tests of hypotheses about the abilities of the focal population; they could also help us improve the list of social-cognitive abilities, representations, and motivations we target. Economists and other social scientists can design substrates where trade and barter behavior can emerge and be studied, and can help us broaden the diversity of substrates based on dilemmas stemming from the use of natural resources. Moral and political philosophers can contribute new substrates and scenarios covering conceptions of the problem of cooperation that we may not yet have thought of.
Our intention is for Melting Pot to eventually cover the full range of social-cognitive abilities, representations, and motivations underlying human collective intelligence. We plan to maintain it, and will be extending it in the coming years to cover more social interactions and generalization scenarios.
Learn more from the Melting Pot GitHub page: https://github.com/deepmind/meltingpot.
Beattie, C., Köppe, T., Duéñez-Guzmán, E.A. and Leibo, J.Z., 2020. DeepMind Lab2D. arXiv preprint arXiv:2011.07027.
Cobbe, K., Klimov, O., Hesse, C., Kim, T. and Schulman, J., 2019. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning (pp. 1282–1289). PMLR.
Henrich, J. and Muthukrishna, M., 2021. The origins and psychology of human cooperation. Annual Review of Psychology, 72, pp.207-240.
Leibo, J.Z., Hughes, E., Lanctot, M. and Graepel, T., 2019. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742.
Leibo, J.Z., Dueñez-Guzman, E.A., Vezhnevets, A., Agapiou, J.P., Sunehag, P., Koster, R., Matyas, J., Beattie, C., Mordatch, I. and Graepel, T., 2021. Scalable evaluation of multi-agent reinforcement learning with Melting Pot. In International Conference on Machine Learning (pp. 6187–6199). PMLR.
Ostrom, E., 2009. Understanding institutional diversity. Princeton university press.
Sutton, R.S., Modayil, J., Delp, M., Degris, T., Pilarski, P.M., White, A. and Precup, D., 2011, May. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2 (pp. 761-768).
About the author/ODSC East 2022 Speaker: Joel Z. Leibo is a research scientist at DeepMind. He obtained his PhD in 2013 from MIT where he worked on the computational neuroscience of face recognition with Tomaso Poggio. Nowadays, Joel’s research is aimed at the following questions:
- How can we get deep reinforcement learning agents to perform complex cognitive behaviors like cooperating with one another in groups?
- How should we evaluate the performance of deep reinforcement learning agents?
- How can we model processes like the cumulative culture that gave rise to unique aspects of human intelligence?