Ray RLlib

The source code of all examples described in this section is available in our DIAMBRA Agents repository.

Getting Ready

We highly recommend using virtual environments to isolate your Python installation and avoid dependency conflicts. In what follows we use Conda, but any other tool should work too.

Create and activate a new dedicated virtual environment:

conda create -n diambra-arena-ray python=3.8
conda activate diambra-arena-ray

Install DIAMBRA Arena with Ray RLlib interface:

pip install diambra-arena[ray-rllib]

This should be enough to prepare your system to execute the following examples. You can refer to the official Ray RLlib documentation or reach out on our Discord server for specific needs.
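
As a quick sanity check, you can verify that both Ray and DIAMBRA Arena import correctly in the environment you just created:

python -c "import ray; import diambra.arena; print('Ray version:', ray.__version__)"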

All the examples presented below are available here: DIAMBRA Agents - Ray RLlib. They have been created following the high-level approach found on the Ray RLlib examples page and its related repository collection, making them easy to extend and helping to clarify how they interface with the different components.

These examples only aim at demonstrating the core functionalities and high-level aspects; they will not generate well-performing agents, even if the training time is extended to cover a large number of training steps. The user will need to build upon them, exploring aspects such as policy network architecture, algorithm hyperparameter tuning, observation space tweaking, reward wrapping and similar ones, a few of which are sketched below.
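
As a starting point for such exploration, many of these aspects map directly onto standard Ray RLlib configuration keys. The snippet below is only an illustrative sketch: the keys are standard RLlib options, while the values are placeholders to be tuned for your own experiments.

# Illustrative only: standard Ray RLlib config keys with placeholder values,
# to be merged into the config dictionaries used in the examples below
tuning_overrides = {
    "train_batch_size": 4000,        # batch size (the examples below use a tiny demo value)
    "lr": 5e-5,                      # learning rate
    "gamma": 0.99,                   # discount factor
    "model": {
        "fcnet_hiddens": [256, 256], # fully connected policy network architecture
    },
}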

Native interface

The DIAMBRA Arena native interface with Ray RLlib covers a wide range of use cases, automating the handling of key aspects such as parallelization. In the majority of cases it will be sufficient for users to directly import and use it, with no need for additional customization.

For the low-level details of the interface, users can review the corresponding source code here.
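
At a glance, all the examples that follow share the same minimal pattern: point Ray RLlib to the DiambraArena environment class, run the configuration through preprocess_ray_config, and build the algorithm from it. The condensed sketch below shows only this skeleton; the full, runnable examples are reported in the next sections.

from diambra.arena import EnvironmentSettings
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO

# Minimal configuration: the native interface takes care of connecting
# the RLlib workers to the available DIAMBRA environment instances
config = {
    "env": DiambraArena,
    "env_config": {"game_id": "doapp", "settings": EnvironmentSettings()},
    "num_workers": 0,
}
config = preprocess_ray_config(config)
agent = PPO(config=config)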

Basic

Basic Example

This example demonstrates how to:

  • Build the config dictionary for Ray RLlib
  • Interface one of Ray RLlib’s algorithms with DIAMBRA Arena using the native interface
  • Train the algorithm
  • Run the trained agent in the environment for one episode

It uses the PPO algorithm and, for demonstration purposes, the algorithm is trained for only 200 steps, so the resulting agent will be far from optimal.

import diambra.arena
from diambra.arena import SpaceTypes, EnvironmentSettings
import gymnasium as gym
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO, PPOConfig
from ray.tune.logger import pretty_print

def main():
    # Environment Settings
    env_settings = EnvironmentSettings()
    env_settings.frame_shape = (84, 84, 1)
    env_settings.action_space = SpaceTypes.DISCRETE

    # env_config
    env_config = {
            "game_id": "doapp",
            "settings": env_settings,
        }

    config = {
        # Define and configure the environment
        "env": DiambraArena,
        "env_config": env_config,
        "num_workers": 0,
        "train_batch_size": 200,
    }

    # Update config file
    config = preprocess_ray_config(config)

    # Instantiating the agent
    agent = PPO(config=config)

    # Run it for n training iterations
    print("\nStarting training ...\n")
    for idx in range(1):
        print("Training iteration:", idx + 1)
        result = agent.train()
        print(pretty_print(result))
    print("\n .. training completed.")

    # Run the trained agent (and render each timestep output).
    print("\nStarting trained agent execution ...\n")

    env = diambra.arena.make("doapp", env_settings, render_mode="human")

    observation, info = env.reset()
    while True:
        env.render()

        action = agent.compute_single_action(observation)
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated or truncated:
            observation, info = env.reset()
            break

    print("\n... trained agent execution completed.\n")

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python basic.py

Saving, loading and evaluating

In addition to what is shown in the previous example, this one demonstrates how to:

  • Print out the policy network architecture
  • Save a trained agent
  • Load a saved agent
  • Evaluate an agent on a given number of episodes
  • Print training and evaluation results

This example uses the same algorithm, policy and number of training steps as the previous one.

from diambra.arena import SpaceTypes, EnvironmentSettings
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO
from ray.tune.logger import pretty_print

def main():
    # Settings
    env_settings = EnvironmentSettings()
    env_settings.frame_shape = (84, 84, 1)
    env_settings.action_space = SpaceTypes.DISCRETE

    config = {
        # Define and configure the environment
        "env": DiambraArena,
        "env_config": {
            "game_id": "doapp",
            "settings": env_settings,
        },
        "num_workers": 0,
        "train_batch_size": 200,
        "framework": "torch",
    }

    # Update config file
    config = preprocess_ray_config(config)

    # Create the RLlib Agent.
    agent = PPO(config=config)
    print("Policy architecture =\n{}".format(agent.get_policy().model))

    # Run it for n training iterations
    print("\nStarting training ...\n")
    for idx in range(1):
        print("Training iteration:", idx + 1)
        results = agent.train()
    print("\n .. training completed.")
    print("Training results:\n{}".format(pretty_print(results)))

    # Save the agent
    checkpoint = agent.save().checkpoint.path
    print("Checkpoint saved at {}".format(checkpoint))
    del agent  # delete trained model to demonstrate loading

    # Load the trained agent
    agent = PPO(config=config)
    agent.restore(checkpoint)
    print("Agent loaded")

    # Evaluate the trained agent (and render each timestep to the shell's
    # output).
    print("\nStarting evaluation ...\n")
    results = agent.evaluate()
    print("\n... evaluation completed.\n")
    print("Evaluation results:\n{}".format(pretty_print(results)))

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python saving_loading_evaluating.py

Parallel Environments

In addition to what is shown in the previous examples, this one demonstrates how to:

  • Run training and evaluation using parallel environments

This example runs multiple environments. In order to execute it properly, the user needs to specify the correct number of environment instances to be created via the DIAMBRA CLI when running the script. In particular, in this case, 6 different instances are needed:

  • 2 rollout workers with 2 environments each, accounting for 4 environments
  • 1 evaluation worker with 2 environments, accounting for the remaining 2 environments

from diambra.arena import SpaceTypes, EnvironmentSettings
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO
from ray.tune.logger import pretty_print

def main():
    # Settings
    env_settings = EnvironmentSettings()
    env_settings.frame_shape = (84, 84, 1)
    env_settings.action_space = SpaceTypes.DISCRETE

    config = {
        # Define and configure the environment
        "env": DiambraArena,
        "env_config": {
            "game_id": "doapp",
            "settings": env_settings,
        },

        "train_batch_size": 200,

        # Use 2 rollout workers
        "num_workers": 2,
        # Use a vectorized env with 2 sub-envs.
        "num_envs_per_worker": 2,

        # Evaluate once per training iteration.
        "evaluation_interval": 1,
        # Run evaluation on (at least) two episodes
        "evaluation_duration": 2,
        # ... using one evaluation worker (setting this to 0 will cause
        # evaluation to run on the local evaluation worker, blocking
        # training until evaluation is done).
        "evaluation_num_workers": 1,
        # Special evaluation config. Keys specified here will override
        # the same keys in the main config, but only for evaluation.
        "evaluation_config": {
            # Render the env while evaluating.
            # Note that this will always only render the 1st RolloutWorker's
            # env and only the 1st sub-env in a vectorized env.
            "render_env": True,
        },
    }

    # Update config file
    config = preprocess_ray_config(config)

    # Create the RLlib Agent.
    agent = PPO(config=config)

    # Run it for n training iterations
    print("\nStarting training ...\n")
    for idx in range(2):
        print("Training iteration:", idx + 1)
        results = agent.train()
    print("\n .. training completed.")
    print("Training results:\n{}".format(pretty_print(results)))

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run -s=6 python parallel_envs.py
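
The value passed to the -s option must match the total number of environment instances required by the configuration above; a quick way to double-check it:

# 2 rollout workers x 2 sub-envs + 1 evaluation worker x 2 sub-envs = 6 instances
num_workers = 2
num_envs_per_worker = 2
evaluation_num_workers = 1

required_instances = (num_workers + evaluation_num_workers) * num_envs_per_worker
print(required_instances)  # 6, matching "diambra run -s=6"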

Advanced

Dictionary Observations

In addition to what is shown in the previous examples, this one demonstrates how to:

  • Activate a complete set of environment wrappers
  • Properly handle dictionary observations with Ray RLlib

The main thing to note in this example is that the library has no constraints on dictionary observation spaces, and it is able to handle nested ones too.

The policy network is automatically generated, properly handling the different types of inputs. The model architecture is then printed to the console output, making it easy to identify the contribution of each input.
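
To make this more concrete, the sketch below shows the kind of nested dictionary observation space this example produces. It is only illustrative: the actual keys and shapes depend on the game and on the active wrappers, so treat the names used here as placeholders rather than the exact DIAMBRA layout.

from gymnasium import spaces
import numpy as np

# Illustrative only: a nested dictionary observation space of the kind Ray RLlib can ingest directly
observation_space = spaces.Dict({
    "frame": spaces.Box(low=0.0, high=1.0, shape=(84, 84, 5), dtype=np.float32),  # 5 stacked, scaled frames
    "action": spaces.MultiDiscrete([9] * 12),                                     # 12 stacked last actions (placeholder sizes)
    "own": spaces.Dict({                                                          # role-relative data for the agent's character
        "health": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
        "side": spaces.Discrete(2),
    }),
    "opp": spaces.Dict({                                                          # role-relative data for the opponent
        "health": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
        "side": spaces.Discrete(2),
    }),
})

With that in mind, here is the complete example: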

from diambra.arena import SpaceTypes, EnvironmentSettings, WrappersSettings
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO
from ray.tune.logger import pretty_print

def main():
    # Settings
    env_settings = EnvironmentSettings()
    env_settings.frame_shape = (84, 84, 1)
    env_settings.characters = ("Kasumi")
    env_settings.action_space = SpaceTypes.DISCRETE

    # Wrappers Settings
    wrappers_settings = WrappersSettings()
    wrappers_settings.normalize_reward = True
    wrappers_settings.add_last_action = True
    wrappers_settings.stack_actions = 12
    wrappers_settings.stack_frames = 5
    wrappers_settings.scale = True
    wrappers_settings.role_relative = True

    config = {
        # Define and configure the environment
        "env": DiambraArena,
        "env_config": {
            "game_id": "doapp",
            "settings": env_settings,
            "wrappers_settings": wrappers_settings,
        },
        "num_workers": 0,
        "train_batch_size": 200,
        "framework": "torch",
    }

    # Update config file
    config = preprocess_ray_config(config)

    # Create the RLlib Agent.
    agent = PPO(config=config)
    print("Policy architecture =\n{}".format(agent.get_policy().model))

    # Run it for n training iterations
    print("\nStarting training ...\n")
    for idx in range(1):
        print("Training iteration:", idx + 1)
        results = agent.train()
    print("\n .. training completed.")
    print("Training results:\n{}".format(pretty_print(results)))

    # Evaluate the trained agent (and render each timestep to the shell's
    # output).
    print("\nStarting evaluation ...\n")
    results = agent.evaluate()
    print("\n... evaluation completed.\n")
    print("Evaluation results:\n{}".format(pretty_print(results)))

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python dict_obs_space.py

Agent Script for Competition

Finally, after the agent training is completed, besides running it locally on your own machine, you may want to submit it to our Competition Platform! To do so, you can use the following script, which provides a ready-to-use, flexible example that can accommodate different models, games and settings.

To submit your trained agent to our platform, compete for the first leaderboard positions, and unlock our achievements, follow the simple steps described in the “How to Submit an Agent” section.

import argparse
import diambra.arena
from diambra.arena import SpaceTypes, EnvironmentSettings, WrappersSettings
from diambra.arena.ray_rllib.make_ray_env import DiambraArena, preprocess_ray_config
from ray.rllib.algorithms.ppo import PPO

# Reference: https://github.com/ray-project/ray/blob/ray-2.0.0/rllib/examples/inference_and_serving/policy_inference_after_training.py

"""This is an example agent based on RL Lib.

Usage:
diambra run python agent.py --trainedModel /absolute/path/to/checkpoint/ --envSpaces /absolute/path/to/environment/spaces/descriptor/
"""

def main(trained_model, env_spaces, test=False):
    # Settings
    env_settings = EnvironmentSettings()
    env_settings.frame_shape = (84, 84, 1)
    env_settings.characters = ("Kasumi")
    env_settings.action_space = SpaceTypes.DISCRETE

    # Wrappers Settings
    wrappers_settings = WrappersSettings()
    wrappers_settings.normalize_reward = True
    wrappers_settings.add_last_action = True
    wrappers_settings.stack_actions = 12
    wrappers_settings.stack_frames = 5
    wrappers_settings.scale = True
    wrappers_settings.role_relative = True

    config = {
        # Define and configure the environment
        "env": DiambraArena,
        "env_config": {
            "game_id": "doapp",
            "settings": env_settings,
            "wrappers_settings": wrappers_settings,
            "load_spaces_from_file": True,
            "env_spaces_file_name": env_spaces,
        },
        "num_workers": 0,
    }

    # Update config file
    config = preprocess_ray_config(config)

    # Load the trained agent
    agent = PPO(config=config)
    agent.restore(trained_model)
    print("Agent loaded")

    # Print the agent policy architecture
    print("Policy architecture =\n{}".format(agent.get_policy().model))

    env = diambra.arena.make("doapp", env_settings, wrappers_settings, render_mode="human")
    obs, info = env.reset()

    while True:
        env.render()

        action = agent.compute_single_action(observation=obs, explore=True, policy_id="default_policy")
        obs, reward, terminated, truncated, info = env.step(action)

        if terminated or truncated:
            obs, info = env.reset()
            if info["env_done"] or test is True:
                break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--trainedModel", type=str, required=True, help="Model path")
    parser.add_argument("--envSpaces", type=str, required=True, help="Environment spaces descriptor file path")
    parser.add_argument("--test", type=int, default=0, help="Test mode")
    opt = parser.parse_args()
    print(opt)

    main(opt.trainedModel, opt.envSpaces, bool(opt.test))

How to run it locally:

diambra run python agent.py --trainedModel /absolute/path/to/checkpoint/ --envSpaces /absolute/path/to/environment/spaces/descriptor/
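
During local development you can also stop after a single episode by enabling the script's test flag:

diambra run python agent.py --trainedModel /absolute/path/to/checkpoint/ --envSpaces /absolute/path/to/environment/spaces/descriptor/ --test 1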