Developer Guide: Big Picture

The session and runner

The entrypoint of the program is the main module, which first creates a Session.

A Session is responsible for parsing command-line arguments, creating a directory that stores all results related to that session (logs, checkpoints, …), and initializing the assistive tools, e.g. loggers, monitoring tools, the visdom server, etc.
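A minimal sketch of what such a Session might look like; the class layout, flag names, and the timestamped directory scheme are assumptions for illustration, not the project's exact API:

```python
import argparse
import logging
import os
import time

class Session:
    """Parses CLI arguments and prepares a per-session results directory."""

    def __init__(self, argv=None):
        parser = argparse.ArgumentParser()
        parser.add_argument("--params", help="path to a parameters file")
        parser.add_argument("--checkpoint", help="path to an existing checkpoint")
        parser.add_argument("--results-root", default="results")
        self.args = parser.parse_args(argv)

        # One directory per session holds logs, checkpoints, etc.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        self.dir = os.path.join(self.args.results_root, stamp)
        os.makedirs(self.dir, exist_ok=True)

        # Assistive tools: here just a file logger; monitoring tools and a
        # visdom server would be started similarly.
        logging.basicConfig(filename=os.path.join(self.dir, "session.log"),
                            level=logging.INFO)
        self.logger = logging.getLogger("session")
```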

After the Session object is created, a Runner object is built, either from an existing checkpoint or from the parameters file specified on the command line. The Runner class then runs the main loop.
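The two construction paths can be sketched as follows; `Runner.from_checkpoint` and the dictionary-based arguments are hypothetical stand-ins for the real constructors:

```python
class Runner:
    def __init__(self, params):
        self.params = params

    @classmethod
    def from_checkpoint(cls, path):
        # A real implementation would restore the full state from `path`.
        return cls({"restored_from": path})

def main(args):
    # Build the Runner either from a checkpoint or from a parameters file.
    if args.get("checkpoint"):
        runner = Runner.from_checkpoint(args["checkpoint"])
    else:
        runner = Runner({"params_file": args["params"]})
    return runner
```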

How the Runner works

The Runner depends on three main classes: Explorer, Memory, and AgentBase. The connection between these classes is intentionally kept simple, as depicted in the following diagram of the general reinforcement-learning loop:

+-------------+               +--------+
|   Explorer  | ------------> | Memory |
+-------------+               +--------+
       ^                          |
       | (ACTIONS)                | (TRAJECTORIES)
       |                          v
       |                     +---------+
       |                     | SAMPLER |
       |                     +---------+
       |                          |
       |  (SAMPLED TRANSITIONS)   |
  +--------+                      |
  | POLICY | <--------------------+
  +--------+

The corresponding (pseudo-)code for the above graph is:

do in loop:
    chunk = self.explorer["train"].update()
    for agent_name in self.agents:
        self.agents[agent_name].update()   # each agent trains on the newly collected data

  • Explorer: The Explorer is responsible for multi-worker environment simulations. It delivers its outputs to the memory as a flattened dictionary (depth 1). The Explorer is written as generally as possible, so it requires the fewest possible modifications when adapting to new methods.
  • Memory: The Memory stores all of the information coming from the Explorer in a dictionary of numpy arrays. It is also written in a very general way, so it is usable with most methods without modification.
  • Agent: The agent uses the sampler and the policy, and is responsible for training the policy and for generating actions for the environment simulations.
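The interplay of the three classes in the main loop can be sketched as below; the method names `store` and `train`, and the toy payloads, are assumptions made for this illustration (the guide only specifies `explorer.update()` and the flat-dictionary format):

```python
class Explorer:
    """Stands in for the multi-worker simulator."""
    def __init__(self):
        self.t = 0
    def update(self):
        # Simulate environments and return a flat (depth-1) dictionary.
        self.t += 1
        return {"obs": [self.t], "reward": [1.0]}

class Memory:
    """Accumulates explorer output per key (numpy arrays in the real code)."""
    def __init__(self):
        self.data = {}
    def store(self, chunk):
        for key, values in chunk.items():
            self.data.setdefault(key, []).extend(values)

class Agent:
    """Would sample transitions from memory and update the policy."""
    def __init__(self):
        self.steps = 0
    def train(self, memory):
        self.steps += 1  # policy update omitted in this sketch

class Runner:
    def __init__(self):
        self.explorer = Explorer()
        self.memory = Memory()
        self.agents = {"a1": Agent()}
    def run(self, n_iters):
        for _ in range(n_iters):
            chunk = self.explorer.update()   # collect trajectories
            self.memory.store(chunk)         # persist them
            for name in self.agents:
                self.agents[name].train(self.memory)  # train each agent
```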