Stable Baselines 3

The source code of all examples described in this section is available in our DIAMBRA Agents repository.

Getting Ready

We highly recommend using virtual environments to isolate your Python installs, especially to avoid dependency conflicts. In what follows we use Conda, but any other tool should work too.

Create and activate a new dedicated virtual environment:

conda create -n diambra-arena-sb3 python=3.8
conda activate diambra-arena-sb3

Install DIAMBRA Arena with Stable Baselines 3 interface:

pip install diambra-arena[stable-baselines3]

This should be enough to prepare your system to execute the following examples. You can refer to the official Stable Baselines 3 documentation or reach out on our Discord server for specific needs.
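
A quick, optional check (not part of the official setup steps) to verify that both packages are importable:

python -c "import diambra.arena, stable_baselines3; print(stable_baselines3.__version__)"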

All the examples presented below are available here: DIAMBRA Agents - Stable Baselines 3. They have been created following the high level approach found on the Stable Baselines 3 examples page, making it easy to extend them and to understand how they interface with the different components.

These examples only aim at demonstrating the core functionalities and high level aspects; they will not generate well performing agents, even if the training time is extended to cover a large number of training steps. The user will need to build upon them, exploring aspects like policy network architecture, algorithm hyper-parameter tuning, observation space tweaking, reward wrapping and other similar ones.

Native interface

The DIAMBRA Arena native interface with Stable Baselines 3 covers a wide range of use cases, automating the handling of vectorized environments and monitoring wrappers. In the majority of cases it will be sufficient for users to directly import and use it, with no need for additional customization. Its signature is reported below, together with a table describing its arguments.

def make_sb3_env(game_id: str, env_settings: dict={}, wrappers_settings: dict={},
                 use_subprocess: bool=True, seed: int=0, log_dir_base: str="/tmp/DIAMBRALog/",
                 start_index: int=0, allow_early_resets: bool=True,
                 start_method: str=None, no_vec: bool=False)
| Argument | Type | Default Value(s) | Description |
| --- | --- | --- | --- |
| game_id | str | - | Game environment identifier |
| env_settings | dict | {} | Environment settings (see more) |
| wrappers_settings | dict | {} | Wrappers settings (see more) |
| use_subprocess | bool | True | Whether to use subprocesses for multi-threaded parallelization |
| seed | int | 0 | Random number generator seed |
| log_dir_base | str | "/tmp/DIAMBRALog/" | Folder where to save execution logs |
| start_index | int | 0 | Starting process rank index |
| allow_early_resets | bool | True | Monitor wrapper argument, allowing the environment to reset before an episode is done |
| start_method | str | None | Method used to spawn subprocesses when active (see more) |
| no_vec | bool | False | If True, avoids using vectorized environments (valid only when using a single instance) |

For the low level details of the interface, users can review the corresponding source code here.
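
A minimal usage sketch (assuming default settings and launch via the diambra CLI, as in the examples below); the full training examples follow in the next sections:

from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env

env, num_envs = make_sb3_env("doapp", {"hardcore": True}, {})
print("Created {} environment(s)".format(num_envs))
env.close()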

Basic

For all the basic examples, the environment is used in hardcore mode, so that the observation space is only of type Box, composed of screen pixels, as in the majority of simple examples found in tutorials and docs. This allows it to be used directly, without the need for further processing.
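
A quick way to inspect it (a minimal sketch; the exact shape depends on the chosen frame_shape setting):

import diambra.arena

env = diambra.arena.make("doapp", {"hardcore": True, "frame_shape": (128, 128, 1)})
print(env.observation_space)  # a Box space containing the screen pixels
env.close()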

Basic Example

This example demonstrates how to:

  • Instantiate a new DIAMBRA Arena environment with its settings
  • Interface it with one of Stable Baselines 3’s algorithms
  • Train the algorithm
  • Run the trained agent in the environment for one episode

It uses the A2C algorithm, with a CnnPolicy policy network to properly process the game frame observation as input. For demonstration purposes, the algorithm is trained for only 200 steps, so the resulting agent will be far from optimal.

import diambra.arena
from stable_baselines3 import A2C

def main():
    env = diambra.arena.make("doapp", {"hardcore": True, "frame_shape": (128, 128, 1)})

    print("\nStarting training ...\n")
    agent = A2C("CnnPolicy", env, verbose=1)
    agent.learn(total_timesteps=200)
    print("\n .. training completed.")

    print("\nStarting trained agent execution ...\n")
    observation = env.reset()
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)

        observation, reward, done, info = env.step(action)

        if done:
            observation = env.reset()
            break
    print("\n... trained agent execution completed.\n")

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python basic.py

Saving, loading and evaluating

In addition to what was shown in the previous example, this one demonstrates how to:

  • Save a trained agent
  • Load a saved agent
  • Evaluate an agent on a given number of episodes

The same algorithm, policy and number of training steps of the previous example are used in this one too.

import diambra.arena
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

def main():
    # Create environment
    env = diambra.arena.make("doapp", {"hardcore": True, "frame_shape": (128, 128, 1)})

    # Instantiate the agent
    agent = A2C("CnnPolicy", env, verbose=1)
    # Train the agent
    agent.learn(total_timesteps=200)
    # Save the agent
    agent.save("a2c_doapp")
    del agent  # delete trained agent to demonstrate loading

    # Load the trained agent
    # NOTE: if you have loading issue, you can pass `print_system_info=True`
    # to compare the system on which the agent was trained vs the current one
    # agent = A2C.load("a2c_doapp", env=env, print_system_info=True)
    agent = A2C.load("a2c_doapp", env=env)

    # Evaluate the agent
    # NOTE: If you use wrappers with your environment that modify rewards,
    #       this will be reflected here. To evaluate with original rewards,
    #       wrap environment in a "Monitor" wrapper before other wrappers.
    mean_reward, std_reward = evaluate_policy(agent, agent.get_env(), n_eval_episodes=3)
    print("Reward: {} (avg) ± {} (std)".format(mean_reward, std_reward))

    # Run trained agent
    observation = env.reset()
    cumulative_reward = 0
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)

        observation, reward, done, info = env.step(action)
        cumulative_reward += reward
        if (reward != 0):
            print("Cumulative reward =", cumulative_reward)

        if done:
            observation = env.reset()
            break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python saving_loading_evaluating.py
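
Regarding the evaluation note inside the script: a minimal sketch of wrapping the environment with Stable Baselines 3's Monitor wrapper before any reward-modifying wrappers, so that evaluate_policy reports the original rewards (an illustration, not part of the script above):

from stable_baselines3.common.monitor import Monitor
import diambra.arena

env = Monitor(diambra.arena.make("doapp", {"hardcore": True, "frame_shape": (128, 128, 1)}))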

Parallel Environments

In addition to what was shown in the previous examples, this one demonstrates how to:

  • Leverage DIAMBRA Arena native Stable Baselines 3 interface
  • Activate environment wrappers
  • Run training using parallel environments
  • Print out the policy network architecture

In this example, the PPO algorithm is used, with the same CnnPolicy seen before. This policy network still works even though an environment wrapper is used to stack multiple game frames, since they are piled along the channel dimension. In this example the policy architecture is also printed to the console output, making it possible to visualize how inputs are processed and “translated” into action probabilities.

This example also runs multiple environments, automatically detecting the number of instances created by DIAMBRA CLI when running the script.

import numpy as np

from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env
from stable_baselines3 import PPO

def main():
    # Settings
    settings = {}
    settings["hardcore"] = True
    settings["frame_shape"] = (128, 128, 1)
    settings["characters"] = ("Kasumi")

    # Wrappers Settings
    wrappers_settings = {}
    wrappers_settings["reward_normalization"] = True
    wrappers_settings["frame_stack"] = 5

    # Create environment
    env, num_envs = make_sb3_env("doapp", settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    print("Observation space shape =", env.observation_space.shape)
    print("Observation space type =", env.observation_space.dtype)

    print("Act_space =", env.action_space)

    # Instantiate the agent
    agent = PPO("CnnPolicy", env, verbose=1)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    # Train the agent
    agent.learn(total_timesteps=200)

    # Run trained agent
    observation = env.reset()
    cumulative_reward = np.zeros(num_envs)  # one accumulator per parallel environment
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)

        observation, reward, done, info = env.step(action)
        cumulative_reward += reward
        if any(x != 0 for x in reward):
            print("Cumulative reward(s) =", cumulative_reward)

        if done.any():
            observation = env.reset()
            break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run -s=2 python parallel_envs.py

Advanced

The next examples make use of the complete observation space of our environments. This is of type Dict, in which different elements are organized as key-value pairs and can be of different types.

Dictionary Observations

In addition to what was shown in the previous examples, this one demonstrates how to:

  • Activate a complete set of environment wrappers
  • Properly handle dictionary observations with Stable Baselines 3

There are two main things to note in this example: how observation normalization and dictionary observations are handled. As can be seen from the snippet below, the normalization wrapper is applied to all elements but the image frame, since Stable Baselines 3 automatically normalizes images and expects their pixels to be in the range [0, 255]. The library also has a specific constraint on dictionary observation spaces: they cannot be nested. For this reason we provide a flattening wrapper that creates a shallow, non-nested dictionary from the original observation space, and in addition allows filtering it by keys.

In this case, the policy network needs to be of class MultiInputPolicy, since it has to handle different types of inputs. Stable Baselines 3 automatically defines the network architecture, properly matching each input type. The architecture is then printed to the console output, making it possible to clearly identify all the different contributions.

import numpy as np

from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env
from stable_baselines3 import PPO

def main():
    # Settings
    settings = {}
    settings["frame_shape"] = (128, 128, 1)
    settings["characters"] = ("Kasumi")

    # Wrappers Settings
    wrappers_settings = {}
    wrappers_settings["reward_normalization"] = True
    wrappers_settings["actions_stack"] = 12
    wrappers_settings["frame_stack"] = 5
    wrappers_settings["scale"] = True
    wrappers_settings["exclude_image_scaling"] = True
    wrappers_settings["flatten"] = True
    wrappers_settings["filter_keys"] = ["stage", "P1_ownHealth", "P1_oppHealth", "P1_ownSide",
                                        "P1_oppSide", "P1_oppChar", "P1_actions_move", "P1_actions_attack"]

    # Create environment
    env, num_envs = make_sb3_env("doapp", settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    print("Observation space =", env.observation_space)
    print("Act_space =", env.action_space)

    # Instantiate the agent
    agent = PPO("MultiInputPolicy", env, verbose=1)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    # Train the agent
    agent.learn(total_timesteps=200)

    # Run trained agent
    observation = env.reset()
    cumulative_reward = np.zeros(num_envs)  # one accumulator per parallel environment
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)

        observation, reward, done, info = env.step(action)
        cumulative_reward += reward
        if any(x != 0 for x in reward):
            print("Cumulative reward(s) =", cumulative_reward)

        if done.any():
            observation = env.reset()
            break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()

How to run it:

diambra run python dict_obs_space.py

Complete Training Script

In addition to what was shown in the previous examples, this one demonstrates how to:

  • Build a complete training script to be used with Stable Baselines 3, driven by a config file
  • Properly handle hyper-parameter scheduling (e.g. learning rate and clip range) via schedule functions, as sketched right after this list
  • Use callbacks for auto-saving
  • Control some policy network and optimizer parameters
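
For parameters like the learning rate and the clip range, Stable Baselines 3 accepts either a constant or a callable taking the remaining training progress (going from 1 at the start of training to 0 at the end). The linear_schedule helper imported from sb3_utils in the script below builds such a callable; a minimal sketch of the idea (hypothetical name, the actual DIAMBRA implementation may differ in details):

def my_linear_schedule(initial_value, final_value):
    # Return a function mapping progress_remaining in [1, 0] to a value
    # interpolated linearly from initial_value (start) to final_value (end)
    def func(progress_remaining):
        return final_value + progress_remaining * (initial_value - final_value)
    return func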

This example shows exactly how we trained our own models on these environments. It should be considered a starting point from which to explore and experiment; the following are just a few options among the most obvious ones:

  • Tweak hyper-parameters for the chosen algorithm
  • Evolve the policy network architecture
  • Test different algorithms, both on and off-policy
  • Try to leverage behavioral cloning / imitation learning
  • Modify the reward function to guide learning in other directions (a wrapper sketch is shown right after this list)
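
As an illustration of the last option, a minimal reward-shaping wrapper sketch (hypothetical, not part of the DIAMBRA repository), assuming the classic Gym API used by these examples:

import gym

class RewardScalingWrapper(gym.RewardWrapper):
    # Hypothetical example: scale the environment reward by a constant factor
    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale

The complete training script is reported below:
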
import os
import time
import yaml
import json
import argparse
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env
from diambra.arena.stable_baselines3.sb3_utils import linear_schedule, AutoSave
from stable_baselines3 import PPO

# diambra run -s 8 python stable_baselines3/training.py --cfgFile $PWD/stable_baselines3/cfg_files/doapp/sr6_128x4_das_nc.yaml

def main(cfg_file):

    # Read the cfg file
    yaml_file = open(cfg_file)
    params = yaml.load(yaml_file, Loader=yaml.FullLoader)
    print("Config parameters = ", json.dumps(params, sort_keys=True, indent=4))
    yaml_file.close()

    time_dep_seed = int((time.time() - int(time.time() - 0.5)) * 1000)
    base_path = os.path.dirname(os.path.abspath(__file__))
    model_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                params["folders"]["model_name"], "model")
    tensor_board_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                        params["folders"]["model_name"], "tb")

    os.makedirs(model_folder, exist_ok=True)

    # Settings
    settings = params["settings"]

    # Wrappers Settings
    wrappers_settings = params["wrappers_settings"]

    # Create environment
    env, num_envs = make_sb3_env(params["settings"]["game_id"], settings, wrappers_settings, seed=time_dep_seed)
    print("Activated {} environment(s)".format(num_envs))

    print("Observation space =", env.observation_space)
    print("Act_space =", env.action_space)

    # Policy param
    policy_kwargs = params["policy_kwargs"]

    # PPO settings
    ppo_settings = params["ppo_settings"]
    gamma = ppo_settings["gamma"]
    model_checkpoint = ppo_settings["model_checkpoint"]

    learning_rate = linear_schedule(ppo_settings["learning_rate"][0], ppo_settings["learning_rate"][1])
    clip_range = linear_schedule(ppo_settings["clip_range"][0], ppo_settings["clip_range"][1])
    clip_range_vf = clip_range
    batch_size = ppo_settings["batch_size"]
    n_epochs = ppo_settings["n_epochs"]
    n_steps = ppo_settings["n_steps"]

    if model_checkpoint == "0":
        # Initialize the agent
        agent = PPO("MultiInputPolicy", env, verbose=1,
                    gamma=gamma, batch_size=batch_size,
                    n_epochs=n_epochs, n_steps=n_steps,
                    learning_rate=learning_rate, clip_range=clip_range,
                    clip_range_vf=clip_range_vf, policy_kwargs=policy_kwargs,
                    tensorboard_log=tensor_board_folder)
    else:
        # Load the trained agent
        agent = PPO.load(os.path.join(model_folder, model_checkpoint), env=env,
                         gamma=gamma, learning_rate=learning_rate, clip_range=clip_range,
                         clip_range_vf=clip_range_vf, policy_kwargs=policy_kwargs,
                         tensorboard_log=tensor_board_folder)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    # Create the callback: autosave every user-defined number of steps
    autosave_freq = ppo_settings["autosave_freq"]
    auto_save_callback = AutoSave(check_freq=autosave_freq, num_envs=num_envs,
                                  save_path=model_folder, filename_prefix=model_checkpoint + "_")

    # Train the agent
    time_steps = ppo_settings["time_steps"]
    agent.learn(total_timesteps=time_steps, callback=auto_save_callback)

    # Save the agent
    new_model_checkpoint = str(int(model_checkpoint) + time_steps)
    model_path = os.path.join(model_folder, new_model_checkpoint)
    agent.save(model_path)

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--cfgFile", type=str, required=True, help="Configuration file")
    opt = parser.parse_args()
    print(opt)

    main(opt.cfgFile)

How to run it:

diambra run python training.py --cfgFile /absolute/path/to/config.yaml

and the configuration file to be used with this training script is reported below:

folders:
  parent_dir: "./results/"
  model_name: "sr6_128x4_das_nc"

settings:
  game_id: "doapp"
  characters: "Kasumi"
  difficulty: 3
  step_ratio: 6
  frame_shape: !!python/tuple [128, 128, 1]
  continue_game: 0.0
  action_space: "multi_discrete"
  attack_but_combination: false
  char_outfits: 2
  player: "Random"
  show_final: false

wrappers_settings:
  frame_stack: 4
  dilation: 1
  actions_stack: 12
  reward_normalization: true
  scale: true
  exclude_image_scaling: true
  flatten: true
  filter_keys:
    [
      "stage",
      "P1_ownHealth",
      "P1_oppHealth",
      "P1_ownSide",
      "P1_oppSide",
      "P1_oppChar",
      "P1_actions_move",
      "P1_actions_attack",
    ]

policy_kwargs:
  #net_arch: [{ pi: [64, 64], vf: [32, 32] }]
  net_arch: [64, 64]

ppo_settings:
  gamma: 0.94
  model_checkpoint: "0"
  learning_rate: [2.5e-4, 2.5e-6] # To start
  clip_range: [0.15, 0.025] # To start
  #learning_rate: [5.0e-5, 2.5e-6] # Fine Tuning
  #clip_range: [0.075, 0.025] # Fine Tuning
  batch_size: 256 #8 #nminibatches gave different batch size depending on the number of environments: batch_size = (n_steps * n_envs) // nminibatches
  n_epochs: 4
  n_steps: 128
  autosave_freq: 256
  time_steps: 512
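
As a worked example of how these values interact (assuming 8 parallel environments, as in the command reported at the top of the training script): each rollout collects n_steps * n_envs = 128 * 8 = 1024 samples, which are then split into minibatches of batch_size = 256 and iterated over for n_epochs = 4 optimization epochs.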

Agent Script for Competition

Finally, after the agent training is completed, besides running it locally on your own machine, you may want to submit it to our competition platform! To do so, you can use the following script, which provides a ready-to-use, flexible example that can accommodate different models, games and settings.

To submit your trained agent to our platform, compete for the first leaderboard positions, and unlock our achievements, follow the simple steps described in the “How to Submit an Agent” section.

import os
import time
import yaml
import json
import argparse
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env
from stable_baselines3 import PPO

"""This is an example agent based on stable baselines 3.

Usage:
diambra run python stable_baselines3/agent.py --cfgFile $PWD/stable_baselines3/cfg_files/doapp/sr6_128x4_das_nc.yaml --trainedModel "model_name"
"""

def main(cfg_file, trained_model):

    # Read the cfg file
    yaml_file = open(cfg_file)
    params = yaml.load(yaml_file, Loader=yaml.FullLoader)
    print("Config parameters = ", json.dumps(params, sort_keys=True, indent=4))
    yaml_file.close()

    time_dep_seed = int((time.time() - int(time.time() - 0.5)) * 1000)

    base_path = os.path.dirname(os.path.abspath(__file__))
    model_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                params["folders"]["model_name"], "model")

    # Settings
    settings = params["settings"]
    settings["player"] = "P1"

    # Wrappers Settings
    wrappers_settings = params["wrappers_settings"]
    wrappers_settings["reward_normalization"] = False

    # Create environment
    env, num_envs = make_sb3_env(params["settings"]["game_id"], settings, wrappers_settings, seed=time_dep_seed, no_vec=True)
    print("Activated {} environment(s)".format(num_envs))

    print("Observation space =", env.observation_space)
    print("Act_space =", env.action_space)

    # Load the trained agent
    model_path = os.path.join(model_folder, trained_model)
    agent = PPO.load(model_path)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    obs = env.reset()

    while True:

        action, _ = agent.predict(obs, deterministic=False)

        obs, reward, done, info = env.step(action.tolist())

        if done:
            obs = env.reset()
            if info["env_done"]:
                break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--cfgFile", type=str, required=True, help="Configuration file")
    parser.add_argument("--trainedModel", type=str, default="model", help="Model checkpoint")
    opt = parser.parse_args()
    print(opt)

    main(opt.cfgFile, opt.trainedModel)

How to run it locally:

diambra run python agent.py --cfgFile /absolute/path/to/config.yaml --trainedModel "model_name"

and the configuration file to be used is the same one used for training, like the one reported in the previous paragraph.