The source code of all examples described in this section is available in our DIAMBRA Agents repository.
We highly recommend using virtual environments to isolate your Python installs, especially to avoid dependency conflicts. In what follows we use Conda, but any other tool should work too.
Create and activate a new dedicated virtual environment:
conda create -n diambra-arena-sb3 python=3.8
conda activate diambra-arena-sb3
Install DIAMBRA Arena with Stable Baselines 3 interface:
pip install diambra-arena[stable-baselines3]
This should be enough to prepare your system to execute the following examples. You can refer to the official Stable Baselines 3 documentation or reach out on our Discord server for specific needs.
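As a quick sanity check after installation, the following minimal sketch reports whether the two packages imported by the examples below, `diambra.arena` and `stable_baselines3`, can be located in the active environment. The helper name `is_importable` is hypothetical and used only for illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment.

    For dotted names such as "diambra.arena", the parent package is imported
    in order to locate the submodule.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "diambra") is missing or broken.
        return False


for name in ("diambra.arena", "stable_baselines3"):
    status = "found" if is_importable(name) else "MISSING"
    print("{}: {}".format(name, status))
```

If either package is reported as missing, check that the dedicated virtual environment is the active one before re-running the install command.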
All the examples presented below are available here: DIAMBRA Agents - Stable Baselines 3. They follow the high-level approach found on the Stable Baselines 3 examples page, which makes them easy to extend and shows how they interface with the different components.
These examples only aim to demonstrate the core functionalities and high-level aspects. They will not generate well-performing agents, even if the training time is extended to cover a large number of training steps. The user will need to build upon them, exploring aspects such as policy network architecture, algorithm hyperparameter tuning, observation space tweaking, reward wrapping, and similar ones.
The DIAMBRA Arena native interface with Stable Baselines 3 covers a wide range of use cases, automating the handling of vectorized environments and monitoring wrappers. In most cases it will be sufficient for users to import and use it directly, with no need for additional customization. Its interface is reported below, followed by a table describing its arguments.
def make_sb3_env(game_id: str, env_settings: EnvironmentSettings=EnvironmentSettings(),
                 wrappers_settings: WrappersSettings=WrappersSettings(),
                 episode_recording_settings: RecordingSettings=RecordingSettings(),
                 render_mode: str="rgb_array", seed: int=None, start_index: int=0,
                 allow_early_resets: bool=True, start_method: str=None, no_vec: bool=False,
                 use_subprocess: bool=True, log_dir_base: str="/tmp/DIAMBRALog/"):
Argument | Type | Default Value(s) | Description |
---|---|---|---|
game_id | str | - | Game environment identifier |
env_settings | EnvironmentSettings | EnvironmentSettings() | Environment settings (see more) |
wrappers_settings | WrappersSettings | WrappersSettings() | Wrappers settings (see more) |
episode_recording_settings | RecordingSettings | RecordingSettings() | Episode recording settings (see more) |
render_mode | str | "rgb_array" | Rendering mode |
seed | int | None | Random number generator seed |
start_index | int | 0 | Starting process rank index |
allow_early_resets | bool | True | Monitor wrapper argument to allow environment reset before it is done |
start_method | str | None | Method to spawn subprocesses when active (see more) |
no_vec | bool | False | If True avoids using vectorized environments (valid only when using a single instance) |
use_subprocess | bool | True | Whether to use subprocesses to run environments in parallel |
log_dir_base | str | "/tmp/DIAMBRALog/" | Folder where to save execution logs |
For low-level details of the interface, users can review the corresponding source code here.
For all the examples there are two main things to note about the observation space.
First, the normalization wrapper is applied to all elements except the image frame, as Stable Baselines 3 automatically normalizes images and expects their pixels to be in the range [0, 255].
Second, the library has a specific constraint on dictionary observation spaces: they cannot be nested. For this reason we provide a flattening wrapper that creates a shallow, non-nested dictionary from the original observation space and can additionally filter it by keys.
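The two points above can be illustrated with a minimal, library-free sketch: scalar elements are scaled to [0, 1] while the frame is left untouched, and the nested dictionary is then flattened and filtered by key. This is not the wrapper code shipped with DIAMBRA Arena; the helper names, the `_` key separator, the bounds, and the sample values are all assumptions chosen to match the key names used in the examples (for instance `own_health`).

```python
def scale_observation(obs, bounds, exclude=("frame",)):
    """Scale entries that have (low, high) bounds to [0, 1]; leave the rest as-is."""
    scaled = {}
    for key, value in obs.items():
        if isinstance(value, dict):
            # Recurse into nested dictionaries with the matching nested bounds.
            scaled[key] = scale_observation(value, bounds.get(key, {}), exclude)
        elif key in exclude or key not in bounds:
            # The image frame (and anything without bounds) is not scaled.
            scaled[key] = value
        else:
            low, high = bounds[key]
            scaled[key] = (value - low) / (high - low)
    return scaled


def flatten_observation(obs, filter_keys=None, prefix=""):
    """Turn a nested dictionary into a shallow one, optionally keeping only some keys."""
    flat = {}
    for key, value in obs.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten_observation(value, prefix=name + "_"))
        else:
            flat[name] = value
    if filter_keys is not None:
        flat = {key: flat[key] for key in filter_keys if key in flat}
    return flat


# Hypothetical nested observation and bounds, for illustration only.
nested_obs = {
    "frame": [[0, 255]],
    "stage": 2,
    "timer": 30,
    "own": {"health": 104, "side": 0},
    "opp": {"health": 208, "side": 1},
}
bounds = {
    "timer": (0, 40),
    "own": {"health": (0, 208)},
    "opp": {"health": (0, 208)},
}

scaled = scale_observation(nested_obs, bounds)
flat = flatten_observation(scaled, filter_keys=["frame", "own_health", "opp_health", "timer"])
print(flat)
```

In this sketch the key filter is a plain whitelist, so the frame is listed explicitly among the keys to keep.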
Stable Baselines 3 automatically defines the network architecture, properly matching the input type. In some of the examples the architecture is printed to the console output, making it possible to clearly identify all the different contributions.
This example demonstrates how to create the environment, train an agent on it, and run the trained agent.
It uses the A2C algorithm with a MultiInputPolicy policy network to properly process the dictionary observation space as input. For demonstration purposes, the algorithm is trained for only 200 steps, so the resulting agent will be far from optimal.
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env, EnvironmentSettings, WrappersSettings
from stable_baselines3 import A2C

def main():
    # Settings
    settings = EnvironmentSettings()
    settings.frame_shape = (128, 128, 1)
    settings.characters = ("Kasumi")

    # Wrappers Settings
    wrappers_settings = WrappersSettings()
    wrappers_settings.normalize_reward = True
    wrappers_settings.stack_frames = 5
    wrappers_settings.add_last_action = True
    wrappers_settings.stack_actions = 12
    wrappers_settings.scale = True
    wrappers_settings.exclude_image_scaling = True
    wrappers_settings.role_relative = True
    wrappers_settings.flatten = True
    wrappers_settings.filter_keys = ["action", "own_health", "opp_health", "own_side", "opp_side", "opp_character", "stage", "timer"]

    # Create environment
    env, num_envs = make_sb3_env("doapp", settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    print("\nStarting training ...\n")
    agent = A2C("MultiInputPolicy", env, verbose=1)
    agent.learn(total_timesteps=200)
    print("\n .. training completed.")

    print("\nStarting trained agent execution ...\n")
    observation = env.reset()
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)
        observation, reward, done, info = env.step(action)

        if done:
            observation = env.reset()
            break
    print("\n... trained agent execution completed.\n")

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()
How to run it:
diambra run python basic.py
In addition to what was seen in the previous example, this one demonstrates how to save, load, and evaluate a trained agent.
The same algorithm, policy, and number of training steps as in the previous example are used here too.
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env, EnvironmentSettings, WrappersSettings
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

def main():
    # Settings
    settings = EnvironmentSettings()
    settings.frame_shape = (128, 128, 1)
    settings.characters = ("Kasumi")

    # Wrappers Settings
    wrappers_settings = WrappersSettings()
    wrappers_settings.normalize_reward = True
    wrappers_settings.stack_frames = 5
    wrappers_settings.add_last_action = True
    wrappers_settings.stack_actions = 12
    wrappers_settings.scale = True
    wrappers_settings.exclude_image_scaling = True
    wrappers_settings.role_relative = True
    wrappers_settings.flatten = True
    wrappers_settings.filter_keys = ["action", "own_health", "opp_health", "own_side", "opp_side", "opp_character", "stage", "timer"]

    # Create environment
    env, num_envs = make_sb3_env("doapp", settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    # Instantiate the agent
    agent = A2C("MultiInputPolicy", env, verbose=1)
    # Train the agent
    agent.learn(total_timesteps=200)
    # Save the agent
    agent.save("a2c_doapp")
    del agent  # delete trained agent to demonstrate loading

    # Load the trained agent
    # NOTE: if you have loading issue, you can pass `print_system_info=True`
    # to compare the system on which the agent was trained vs the current one
    # agent = A2C.load("a2c_doapp", env=env, print_system_info=True)
    agent = A2C.load("a2c_doapp", env=env)

    # Evaluate the agent
    # NOTE: If you use wrappers with your environment that modify rewards,
    #       this will be reflected here. To evaluate with original rewards,
    #       wrap environment in a "Monitor" wrapper before other wrappers.
    mean_reward, std_reward = evaluate_policy(agent, agent.get_env(), n_eval_episodes=3)
    print("Reward: {} (avg) ± {} (std)".format(mean_reward, std_reward))

    # Run trained agent
    observation = env.reset()
    cumulative_reward = 0
    while True:
        env.render()

        action, _state = agent.predict(observation, deterministic=True)
        observation, reward, done, info = env.step(action)

        cumulative_reward += reward
        if reward != 0:
            print("Cumulative reward =", cumulative_reward)

        if done:
            observation = env.reset()
            break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()
How to run it:
diambra run python saving_loading_evaluating.py
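The evaluation step in this example reports the mean and the standard deviation of the rewards collected over the evaluation episodes. The sketch below reproduces that kind of summary on made-up numbers using only the standard library. It assumes the population standard deviation (NumPy's default for `std`); the `summarize_returns` helper and the sample returns are hypothetical.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Made-up returns for three evaluation episodes.
mean_reward, std_reward = summarize_returns([1.0, 2.0, 3.0])
print("Reward: {} (avg) +/- {} (std)".format(mean_reward, std_reward))
```

With only three episodes, as in the example, the standard deviation is a rough indication at best; a larger `n_eval_episodes` gives a more stable estimate at the cost of a longer evaluation.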
In addition to what was seen in the previous examples, this one demonstrates how to run multiple environments in parallel and how to print the policy network architecture.
In this example, the PPO algorithm is used, with the same MultiInputPolicy seen before. The policy architecture is also printed to the console output, showing how inputs are processed and “translated” into action probabilities.
This example also runs multiple environments, automatically detecting the number of instances created by the DIAMBRA CLI when running the script.
import numpy as np
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env, EnvironmentSettings, WrappersSettings
from stable_baselines3 import PPO

def main():
    # Settings
    settings = EnvironmentSettings()
    settings.frame_shape = (128, 128, 1)
    settings.characters = ("Kasumi")

    # Wrappers Settings
    wrappers_settings = WrappersSettings()
    wrappers_settings.normalize_reward = True
    wrappers_settings.stack_frames = 5
    wrappers_settings.add_last_action = True
    wrappers_settings.stack_actions = 12
    wrappers_settings.scale = True
    wrappers_settings.exclude_image_scaling = True
    wrappers_settings.role_relative = True
    wrappers_settings.flatten = True
    wrappers_settings.filter_keys = ["action", "own_health", "opp_health", "own_side", "opp_side", "opp_character", "stage", "timer"]

    # Create environment
    env, num_envs = make_sb3_env("doapp", settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    # Instantiate the agent
    agent = PPO("MultiInputPolicy", env, verbose=1)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    # Train the agent
    agent.learn(total_timesteps=200)

    # Run trained agent
    observation = env.reset()
    # One running total per environment; an array is used so that `+=` adds
    # element-wise (on a plain list, `+=` would append the rewards instead)
    cumulative_reward = np.zeros(num_envs)
    while True:
        action, _state = agent.predict(observation, deterministic=True)
        observation, reward, done, info = env.step(action)

        cumulative_reward += reward
        if np.any(reward != 0):
            print("Cumulative reward(s) =", cumulative_reward)

        if done.any():
            observation = env.reset()
            break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    main()
How to run it:
diambra run -s=2 python parallel_envs.py
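When several environments run in parallel, each step returns one reward per instance, so running totals have to be updated element-wise. The sketch below shows this with plain Python lists and made-up rewards, and also illustrates a common pitfall: `+=` on a list appends the new items instead of summing them. The `accumulate` helper is hypothetical.

```python
def accumulate(totals, rewards):
    """Return new per-environment totals after adding one reward per environment."""
    if len(totals) != len(rewards):
        raise ValueError("expected exactly one reward per environment")
    return [total + reward for total, reward in zip(totals, rewards)]


num_envs = 2
# Made-up rewards: one row per step, one column per environment.
step_rewards = [[0.0, 0.5], [1.0, 0.0], [0.25, 0.25]]

cumulative_reward = [0.0] * num_envs
for rewards in step_rewards:
    cumulative_reward = accumulate(cumulative_reward, rewards)
print("Cumulative reward(s) =", cumulative_reward)

# Pitfall: in-place `+=` on a list extends it with the new items.
wrong = [0.0] * num_envs
wrong += step_rewards[0]
print("Entries after list +=:", len(wrong))
```

Keeping the totals in a NumPy array avoids the pitfall, since `+=` on an array adds element-wise.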
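In addition to what was seen in the previous examples, this one demonstrates how to drive training from a configuration file, use scheduled hyperparameters, save checkpoints automatically during training, and resume training from a saved checkpoint.
This example shows exactly how we trained our own models on these environments. It should be considered a starting point from which to explore and experiment. The following are just a few of the most obvious options: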
import os
import yaml
import json
import argparse
from diambra.arena import load_settings_flat_dict, SpaceTypes
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env, EnvironmentSettings, WrappersSettings
from diambra.arena.stable_baselines3.sb3_utils import linear_schedule, AutoSave
from stable_baselines3 import PPO

# diambra run -s 8 python stable_baselines3/training.py --cfgFile $PWD/stable_baselines3/cfg_files/sfiii3n/sr6_128x4_das_nc.yaml

def main(cfg_file):
    # Read the cfg file
    with open(cfg_file) as yaml_file:
        params = yaml.load(yaml_file, Loader=yaml.FullLoader)
    print("Config parameters = ", json.dumps(params, sort_keys=True, indent=4))

    base_path = os.path.dirname(os.path.abspath(__file__))
    model_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                params["folders"]["model_name"], "model")
    tensor_board_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                       params["folders"]["model_name"], "tb")

    os.makedirs(model_folder, exist_ok=True)

    # Settings
    params["settings"]["action_space"] = SpaceTypes.DISCRETE if params["settings"]["action_space"] == "discrete" else SpaceTypes.MULTI_DISCRETE
    settings = load_settings_flat_dict(EnvironmentSettings, params["settings"])

    # Wrappers Settings
    wrappers_settings = load_settings_flat_dict(WrappersSettings, params["wrappers_settings"])

    # Create environment
    env, num_envs = make_sb3_env(settings.game_id, settings, wrappers_settings)
    print("Activated {} environment(s)".format(num_envs))

    # Policy param
    policy_kwargs = params["policy_kwargs"]

    # PPO settings
    ppo_settings = params["ppo_settings"]
    gamma = ppo_settings["gamma"]
    model_checkpoint = ppo_settings["model_checkpoint"]

    learning_rate = linear_schedule(ppo_settings["learning_rate"][0], ppo_settings["learning_rate"][1])
    clip_range = linear_schedule(ppo_settings["clip_range"][0], ppo_settings["clip_range"][1])
    clip_range_vf = clip_range
    batch_size = ppo_settings["batch_size"]
    n_epochs = ppo_settings["n_epochs"]
    n_steps = ppo_settings["n_steps"]

    if model_checkpoint == "0":
        # Initialize the agent
        agent = PPO("MultiInputPolicy", env, verbose=1,
                    gamma=gamma, batch_size=batch_size,
                    n_epochs=n_epochs, n_steps=n_steps,
                    learning_rate=learning_rate, clip_range=clip_range,
                    clip_range_vf=clip_range_vf, policy_kwargs=policy_kwargs,
                    tensorboard_log=tensor_board_folder)
    else:
        # Load the trained agent
        agent = PPO.load(os.path.join(model_folder, model_checkpoint), env=env,
                         gamma=gamma, learning_rate=learning_rate, clip_range=clip_range,
                         clip_range_vf=clip_range_vf, policy_kwargs=policy_kwargs,
                         tensorboard_log=tensor_board_folder)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    # Create the callback: autosave every USER DEF steps
    autosave_freq = ppo_settings["autosave_freq"]
    auto_save_callback = AutoSave(check_freq=autosave_freq, num_envs=num_envs,
                                  save_path=model_folder, filename_prefix=model_checkpoint + "_")

    # Train the agent
    time_steps = ppo_settings["time_steps"]
    agent.learn(total_timesteps=time_steps, callback=auto_save_callback)

    # Save the agent
    new_model_checkpoint = str(int(model_checkpoint) + time_steps)
    model_path = os.path.join(model_folder, new_model_checkpoint)
    agent.save(model_path)

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfgFile", type=str, required=True, help="Configuration file")
    opt = parser.parse_args()
    print(opt)

    main(opt.cfgFile)
How to run it:
diambra run python training.py --cfgFile /absolute/path/to/config.yaml
The configuration file to be used with this training script is reported below:
folders:
  parent_dir: "./results/"
  model_name: "sr6_128x4_das_nc"

settings:
  game_id: "doapp"
  step_ratio: 6
  frame_shape: !!python/tuple [128, 128, 1]
  continue_game: 0.0
  action_space: "multi_discrete"
  characters: "Kasumi"
  difficulty: 3
  outfits: 2

wrappers_settings:
  normalize_reward: true
  no_attack_buttons_combinations: true
  stack_frames: 4
  dilation: 1
  add_last_action: true
  stack_actions: 12
  scale: true
  exclude_image_scaling: true
  role_relative: true
  flatten: true
  filter_keys: ["action", "own_health", "opp_health", "own_side", "opp_side", "opp_character", "stage", "timer"]

policy_kwargs:
  #net_arch: [{ pi: [64, 64], vf: [32, 32] }]
  net_arch: [64, 64]

ppo_settings:
  gamma: 0.94
  model_checkpoint: "0"
  learning_rate: [2.5e-4, 2.5e-6] # To start
  clip_range: [0.15, 0.025] # To start
  #learning_rate: [5.0e-5, 2.5e-6] # Fine Tuning
  #clip_range: [0.075, 0.025] # Fine Tuning
  batch_size: 256 #8 #nminibatches gave different batch size depending on the number of environments: batch_size = (n_steps * n_envs) // nminibatches
  n_epochs: 4
  n_steps: 128
  autosave_freq: 256
  time_steps: 512
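The two-element `learning_rate` and `clip_range` entries above are the initial and final values handed to `linear_schedule` in the training script. Stable Baselines 3 accepts such hyperparameters as a function of the remaining training progress, which goes from 1 at the start of training to 0 at the end. The sketch below is one possible implementation of a linear interpolation between the two values; it is written from scratch for illustration, under the name `linear_schedule_sketch`, and is not the code shipped in `sb3_utils`.

```python
from typing import Callable


def linear_schedule_sketch(initial_value: float, final_value: float = 0.0) -> Callable[[float], float]:
    """Build a schedule that moves linearly from initial_value to final_value."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining is 1.0 at the start of training and 0.0 at the end.
        return final_value + progress_remaining * (initial_value - final_value)

    return schedule


# Values taken from the "To start" learning rate in the configuration file.
learning_rate = linear_schedule_sketch(2.5e-4, 2.5e-6)
for progress in (1.0, 0.5, 0.0):
    print("progress_remaining={:.1f} -> learning_rate={:.4e}".format(progress, learning_rate(progress)))
```

Returning a closure keeps the schedule stateless: the same function can be evaluated at any point of training, which is why the training script can reuse the clip range schedule for `clip_range_vf`.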
Finally, once training is complete, besides running the agent locally on your own machine, you may want to submit it to our Competition Platform! To do so, you can use the following script, which provides a ready-to-use, flexible example that can accommodate different models, games, and settings.
To submit your trained agent to our platform, compete for the top leaderboard positions, and unlock our achievements, follow the simple steps described in the “How to Submit an Agent” section.
import os
import yaml
import json
import argparse
from diambra.arena import Roles, SpaceTypes, load_settings_flat_dict
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env, EnvironmentSettings, WrappersSettings
from stable_baselines3 import PPO

"""This is an example agent based on stable baselines 3.

Usage:
diambra run python stable_baselines3/agent.py --cfgFile $PWD/stable_baselines3/cfg_files/doapp/sr6_128x4_das_nc.yaml --trainedModel "model_name"
"""

def main(cfg_file, trained_model, test=False):
    # Read the cfg file
    with open(cfg_file) as yaml_file:
        params = yaml.load(yaml_file, Loader=yaml.FullLoader)
    print("Config parameters = ", json.dumps(params, sort_keys=True, indent=4))

    base_path = os.path.dirname(os.path.abspath(__file__))
    model_folder = os.path.join(base_path, params["folders"]["parent_dir"], params["settings"]["game_id"],
                                params["folders"]["model_name"], "model")

    # Settings
    params["settings"]["action_space"] = SpaceTypes.DISCRETE if params["settings"]["action_space"] == "discrete" else SpaceTypes.MULTI_DISCRETE
    settings = load_settings_flat_dict(EnvironmentSettings, params["settings"])
    settings.role = Roles.P1

    # Wrappers Settings
    wrappers_settings = load_settings_flat_dict(WrappersSettings, params["wrappers_settings"])
    wrappers_settings.normalize_reward = False

    # Create environment
    env, num_envs = make_sb3_env(settings.game_id, settings, wrappers_settings, no_vec=True)
    print("Activated {} environment(s)".format(num_envs))

    # Load the trained agent
    model_path = os.path.join(model_folder, trained_model)
    agent = PPO.load(model_path)

    # Print policy network architecture
    print("Policy architecture:")
    print(agent.policy)

    obs, info = env.reset()

    while True:
        action, _ = agent.predict(obs, deterministic=False)

        obs, reward, terminated, truncated, info = env.step(action.tolist())

        if terminated or truncated:
            obs, info = env.reset()
            if info["env_done"] or test is True:
                break

    # Close the environment
    env.close()

    # Return success
    return 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfgFile", type=str, required=True, help="Configuration file")
    parser.add_argument("--trainedModel", type=str, default="model", help="Model checkpoint")
    parser.add_argument("--test", type=int, default=0, help="Test mode")
    opt = parser.parse_args()
    print(opt)

    main(opt.cfgFile, opt.trainedModel, bool(opt.test))
How to run it locally:
diambra run python agent.py --cfgFile /absolute/path/to/config.yaml --trainedModel "model_name"
The configuration file to be used is the same one that was used for training, like the one reported above.
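Both the training script and the agent script derive the model folder from the `folders` and `settings` entries of the configuration file, and the training script names the final checkpoint after the total number of steps trained. The sketch below mirrors that logic with the values from the configuration file reported for training. The base path is a placeholder, `resolve_model_path` is a hypothetical helper, and the call to `os.path.normpath` is added here only to make the printed path easier to read.

```python
import os


def resolve_model_path(base_path, params, checkpoint):
    """Build the checkpoint path the same way the scripts build `model_folder`."""
    model_folder = os.path.join(
        base_path,
        params["folders"]["parent_dir"],
        params["settings"]["game_id"],
        params["folders"]["model_name"],
        "model",
    )
    # normpath is not used by the scripts; it only collapses the "./" segment.
    return os.path.normpath(os.path.join(model_folder, checkpoint))


# Subset of the configuration values needed to locate the model.
params = {
    "folders": {"parent_dir": "./results/", "model_name": "sr6_128x4_das_nc"},
    "settings": {"game_id": "doapp"},
    "ppo_settings": {"model_checkpoint": "0", "time_steps": 512},
}

# Checkpoint name after training: starting checkpoint plus trained steps.
ppo_settings = params["ppo_settings"]
new_checkpoint = str(int(ppo_settings["model_checkpoint"]) + ppo_settings["time_steps"])

model_path = resolve_model_path("/path/to/scripts", params, new_checkpoint)
print("Checkpoint name:", new_checkpoint)
print("Model path:", model_path)
```

With this configuration, the value to pass as `--trainedModel` after a first training run would therefore be `512`.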