This page describes in detail all general aspects of DIAMBRA Arena environments. For game-specific details, visit the Games & Specifics page.
DIAMBRA Arena is a software package featuring a collection of high-quality environments for Reinforcement Learning research and experimentation. It provides a standard interface to popular arcade emulated video games, offering a Python API fully compliant with the OpenAI Gym format, which makes its adoption smooth and straightforward.
It supports all major operating systems (Linux, Windows and macOS) and can be installed via Python pip, as described in the installation section. It is completely free to use: the user only needs to register on the official website.
In addition, its GitHub repository provides a collection of examples covering main use cases of interest that can be run in just a few steps.
All environments are episodic Reinforcement Learning tasks, with discrete actions (gamepad buttons) and observations composed of screen pixels plus additional numerical data (RAM states such as characters' health bars or the side of the stage each character is on).
They all support both single player (1P) and two players (2P) modes, making them well suited to exploring all the following Reinforcement Learning subfields:
Interfaced games have been selected among the most popular retro fighting games. While sharing the same fundamental mechanics, they provide different challenges, with specific features such as different types and numbers of characters, how combos are performed, health bar recharging, etc.
Whenever possible, games are released with all hidden/bonus characters unlocked.
Additional details can be found in their dedicated section.
DIAMBRA Arena environments follow the standard RL interaction framework: the agent sends an action to the environment, which processes it and performs a transition accordingly, from the starting state to the new state, returning the observation and the reward to the agent to close the interaction loop. The figure below shows this typical interaction scheme and data flow.
The shortest snippet for a complete basic execution of an environment consists of just a few lines of code, and is presented in the code block below:
#!/usr/bin/env python3
import diambra.arena
if __name__ == '__main__':
    env = diambra.arena.make("doapp")
    observation = env.reset()
    while True:
        env.render()
        actions = env.action_space.sample()
        observation, reward, done, info = env.step(actions)
        if done:
            observation = env.reset()
            break
    env.close()
More complex and complete examples can be found in the Examples section.
All environments share a large set of options that control many different aspects of their behavior. They are set through key-value pairs in a Python dictionary passed to the environment creation method, as shown below:
env = diambra.arena.make("doapp", settings)
The first argument, the only mandatory one, is the game_id string, which specifies the game to execute among those available (see games list and info).
The following table summarizes and describes the general, game-independent settings, while the game-specific ones are presented in the game-dedicated pages.
Game-specific settings that are shared among all games are found in the table in the Game Specific Settings section below.
Key | Type | Default Value(s) | Value Range | Description |
---|---|---|---|---|
player | str | Random | 1P mode: P1 (left), P2 (right), Random (50% P1, 50% P2); 2P mode: P1P2 | Selects single player (1P) or two players (2P) mode and, in 1P mode, the side to play on (left/right) |
step_ratio | int | 6 | [1, 6] | Defines how many steps the game (emulator) performs for every environment step |
frame_shape | tuple of three int (H, W, C) | (0, 0, 0) | H, W: [0, 512]; C: 0 or 1 | If active, resizes the frame and/or converts it from RGB to grayscale. Combinations: (0, 0, 0) - Deactivated; (H, W, 0) - RGB frame resized to H x W; (0, 0, 1) - Grayscale frame; (H, W, 1) - Grayscale frame resized to H x W. |
continue_game | float | 0.0 | [0.0, 1.0]: probability of continuing game at game over; int(abs(-inf, -1.0]): number of continues at game over before the episode is considered done | Defines if and how to allow "Continue" when the agent is about to face the game over condition |
show_final | bool | True | True / False | Activates displaying of final animation when game is completed |
action_space | str | multi_discrete | discrete / multi_discrete | Defines the type of the action space |
attack_but_combination | bool | True | True / False | Activates attack buttons combinations |
hardcore | bool | False | True / False | Activates hardcore mode, in which the observation is only made of the game frame |
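As a minimal sketch of how the general settings above could be assembled, the snippet below builds a settings dictionary using only keys from the table and checks it against the listed value ranges. The `check_general_settings` helper is hypothetical (it is not part of DIAMBRA Arena), and the chosen values are illustrative only.

```python
# Settings dictionary using only keys from the general settings table.
# The specific values are illustrative choices, not recommendations.
settings = {
    "player": "P1",                # 1P mode, left side
    "step_ratio": 6,               # emulator steps per environment step
    "frame_shape": (128, 128, 1),  # grayscale frame resized to 128 x 128
    "continue_game": 0.0,          # never continue at game over
    "show_final": False,
    "action_space": "discrete",
    "attack_but_combination": False,
    "hardcore": False,
}


def check_general_settings(cfg):
    """Hypothetical helper (not a DIAMBRA Arena function): returns a list of
    problems found when comparing cfg with the ranges in the table above."""
    problems = []
    if cfg.get("player", "Random") not in ("P1", "P2", "Random", "P1P2"):
        problems.append("player must be P1, P2, Random or P1P2")
    if not 1 <= cfg.get("step_ratio", 6) <= 6:
        problems.append("step_ratio must be in [1, 6]")
    height, width, channels = cfg.get("frame_shape", (0, 0, 0))
    if not (0 <= height <= 512 and 0 <= width <= 512 and channels in (0, 1)):
        problems.append("frame_shape must have H, W in [0, 512] and C in {0, 1}")
    continue_game = cfg.get("continue_game", 0.0)
    if not (0.0 <= continue_game <= 1.0 or continue_game <= -1.0):
        problems.append("continue_game must be in [0.0, 1.0] or <= -1.0")
    if cfg.get("action_space", "multi_discrete") not in ("discrete", "multi_discrete"):
        problems.append("action_space must be discrete or multi_discrete")
    return problems


print(check_general_settings(settings))
# With DIAMBRA Arena installed, the dictionary would then be passed as:
# env = diambra.arena.make("doapp", settings)
```

Checking the dictionary before creating the environment is a design choice of this sketch; the library may perform its own validation.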
Environment settings that depend on the specific game but are shared among all games are reported in the table below. Additional ones (if present) are reported in game-dedicated pages.
Key | Type | Default Value(s) | Value Range | Description |
---|---|---|---|---|
difficulty | int | 3 | Game-specific min and max values allowed | Specifies game difficulty (1P only) |
characters | str or tuple of maximum three str | Random | Game-specific lists of characters that can be selected | Specifies character(s) to use |
char_outfits | int | 1 | Game-specific min and max values allowed | Defines the number of outfits to draw from at character selection |
Of these general settings, action_space, attack_but_combination, characters, and char_outfits need to be provided as tuples of two elements (the first for P1 and the second for P2) when using the environments in two players mode. The same applies to some game-specific settings, which are listed in the game-dedicated page.
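The tuple convention for two players mode can be sketched as follows. The `to_two_player_settings` helper is hypothetical and only illustrates how per-player choices might be paired; the keys and the P1P2 value come from the tables above.

```python
# Keys that, per the text above, take a (P1, P2) tuple in two players mode.
PER_PLAYER_KEYS = ("action_space", "attack_but_combination",
                   "characters", "char_outfits")


def to_two_player_settings(p1_choices, p2_choices):
    """Hypothetical helper (not a DIAMBRA Arena function): pairs per-player
    choices into tuples, first element for P1 and second for P2."""
    settings = {"player": "P1P2"}
    for key in PER_PLAYER_KEYS:
        settings[key] = (p1_choices[key], p2_choices[key])
    return settings


p1 = {"action_space": "discrete", "attack_but_combination": False,
      "characters": "Random", "char_outfits": 1}
p2 = {"action_space": "multi_discrete", "attack_but_combination": True,
      "characters": "Random", "char_outfits": 1}

settings_2p = to_two_player_settings(p1, p2)
print(settings_2p["action_space"])
```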
Actions of the interfaced games can be grouped in two categories: move actions (Up, Left, etc.) and attack ones (Punch, Kick, etc.). DIAMBRA Arena provides four different action spaces: the main distinction is between Discrete and MultiDiscrete ones. The former is a single list composed of the union of move and attack actions (of type gym.spaces.Discrete), while the latter consists of two sets combined, for move and attack actions respectively (of type gym.spaces.MultiDiscrete).
For each of the two options, there is an additional differentiation available: whether or not to use attack button combinations. This option is mainly available to reduce the action space size as much as possible, since combinations of attack buttons can be seen as additional attack buttons. The complete visual description of available action spaces is shown in the figure below, where all four choices are presented via the corresponding gamepad button configuration for Dead Or Alive ++.
When run in 2P mode, the environment is provided with a Dictionary action space (type gym.spaces.Dict) populated with two items, identified by keys P1 and P2, whose values are either gym.spaces.Discrete or gym.spaces.MultiDiscrete as described above.
Each game has specific action spaces since attack buttons (and their combinations) are, in general, game-dependent. For this reason, each game-dedicated page reports a table like the one found below, describing all four action spaces for the specific game.
In Discrete action spaces:
In MultiDiscrete action spaces:
All meaningful actions, and only those, are made available for each game: they are sufficient to cover the entire spectrum of moves and combos for all the available characters. If a specific game has Button-1 and Button-2 among its available actions, and not Button-1 + Button-2, it means that the latter has no effect in any circumstance, considering all characters in all conditions.
Some actions (especially attack buttons combinations) may have no effect for some of the characters: in some games combos requiring attack buttons combinations are valid only for a subset of characters.
For every game, a table containing the following info is reported. It provides numerical details about action spaces sizes.
Type | Attack Buttons Combination | Space Size (Number of Actions) |
---|---|---|
Discrete / MultiDiscrete | Active / Not active | Total number of actions available, divided in move and attack actions |
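To make the size bookkeeping concrete, here is a rough sketch of how the two space types relate to the number of move and attack actions. It assumes that the Discrete space counts the shared no-op action only once, which is an assumption of this sketch and not something stated in the text above. The counts passed in are hypothetical; the actual values are listed on each game-dedicated page.

```python
def action_space_sizes(n_moves, n_attacks):
    """Sketch of action space sizes for a game with n_moves move actions and
    n_attacks attack actions (each count including its own no-op).

    Assumption: the Discrete space is the union of both sets with the no-op
    counted once. MultiDiscrete keeps the two sets as separate dimensions."""
    discrete_size = n_moves + n_attacks - 1
    multi_discrete_sizes = (n_moves, n_attacks)
    return discrete_size, multi_discrete_sizes


# Hypothetical counts, for illustration only.
discrete_size, multi_discrete_sizes = action_space_sizes(n_moves=9, n_attacks=4)
print(discrete_size, multi_discrete_sizes)
```

Enabling attack button combinations would only change the attack count passed in, since combinations behave as additional attack buttons.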
Environment observations are composed of two main elements: a visual one (the game frame) and an aggregation of quantitative values called RAM states (stage number, health values, etc.). Both of them are exposed through an observation space of type gym.spaces.Dict. It consists of global elements and player-specific ones, which are presented and described in the tables below. To give additional context, the next figure shows an example of a Dead Or Alive ++ observation where some of the RAM states are highlighted, superimposed on the game frame.
Each game specifies and extends the set presented here with its custom one, described in the game-dedicated page.
Global elements of the observation space are unrelated to the player and they are currently limited to those presented and described in the following table. The same table is found on each game-dedicated page reporting its specs:
Key | Type | Value Range | Description |
---|---|---|---|
frame | Box | Game-specific min and max values for each dimension | Latest game frame (RGB pixel screen) |
stage | Box | Game-specific min and max values | Current stage of the game |
Player-specific observations can be accessed using key(s) P1 (1P and 2P Modes) and/or P2 (2P Mode only), as shown in the following snippet for the Side element:
own_side_var = observation["P1"]["ownSide"]
Typical values that are available for each game are reported and described in the table below. The same table is found in every game-dedicated page, specifying and extending (if needed) the observation elements set.
Key | Type | Value Range | Description |
---|---|---|---|
ownSide / oppSide | Discrete (Binary) | [0, 1] | Side of the stage where the player is (0: Left, 1: Right) |
ownWins / oppWins | Box | [0, max number of rounds] | Number of rounds won by the player |
ownChar / oppChar | Discrete | [0, max number of characters - 1] | Index of character in use |
ownHealth / oppHealth | Box | [0, max health value] | Health bar value |
actions + move | Discrete | [0, max number of move actions - 1] | Index of latest move action performed (no-move, left, left+up, up, etc.) |
actions + attack | Discrete | [0, max number of attack actions - 1] | Index of latest attack action performed (no-attack, hold, punch, etc.) |
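The nested access pattern can be sketched with a mock observation. The dictionary below is hand-written for illustration: its keys come from the tables above, but its values are made up, and a real environment may return NumPy arrays rather than plain integers for Box elements. The `health_advantage` function is a hypothetical helper.

```python
# Mock observation with a subset of the keys listed above (values invented).
observation = {
    "stage": 1,
    "P1": {
        "ownSide": 0, "oppSide": 1,
        "ownWins": 0, "oppWins": 1,
        "ownHealth": 120, "oppHealth": 80,
    },
}


def health_advantage(obs, player="P1"):
    """Hypothetical helper: own health minus opponent health for a player."""
    player_obs = obs[player]
    return player_obs["ownHealth"] - player_obs["oppHealth"]


side = "Left" if observation["P1"]["ownSide"] == 0 else "Right"
print(side, health_advantage(observation))
```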
The default reward is defined as a function of the characters' health values so that, qualitatively, damage suffered by the agent corresponds to a negative reward, and damage inflicted on the opponent corresponds to a positive reward. The quantitative, general and formal reward function definition is as follows:
$$ \begin{equation} R_t = \sum_i^{0,N_c}\left(\bar{H_i}^{t^-} - \bar{H_i}^{t} - \left(\hat{H_i}^{t^-} - \hat{H_i}^{t}\right)\right) \end{equation} $$
Where:
The lower and upper bounds for the episode total cumulative reward are defined in the equations (Eqs. 2) below. They consider the default reward function for game execution with Continue Game option set equal to 0.0 (Continue not allowed).
$$ \begin{equation} \begin{gathered} \min{\sum_t^{0,T_s}R_t} = - N_c \left( \left(N_s-1\right) \left(N_r-1\right) + N_r\right) \Delta H \\ \max{\sum_t^{0,T_s}R_t} = N_c N_s N_r \Delta H \end{gathered} \end{equation} $$
Where:
For 1P mode $N_s$ is game-dependent, while for 2P mode $N_s=1$, meaning the episode always ends after a single stage (so after $N_r$ rounds have been won / lost by the same player, either P1 or P2).
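The bounds in Eqs. 2 can be evaluated directly. The sketch below reads $N_c$ as the number of characters per round, $N_s$ as the number of stages, $N_r$ as the number of rounds needed to win a stage, and $\Delta H$ as the health range of a single character. These readings are inferred from the surrounding text and should be treated as assumptions, and the numbers used are hypothetical.

```python
def cumulative_reward_bounds(n_c, n_s, n_r, delta_h):
    """Lower and upper bounds of the episode cumulative reward (Eqs. 2),
    for the default reward with the Continue option disabled."""
    lower = -n_c * ((n_s - 1) * (n_r - 1) + n_r) * delta_h
    upper = n_c * n_s * n_r * delta_h
    return lower, upper


# Hypothetical 1P game: 1 character, 3 stages, 2 rounds to win, health range 100.
lower_1p, upper_1p = cumulative_reward_bounds(n_c=1, n_s=3, n_r=2, delta_h=100)
print(lower_1p, upper_1p)  # -400 600

# 2P mode uses n_s = 1, which makes the two bounds symmetric.
lower_2p, upper_2p = cumulative_reward_bounds(n_c=1, n_s=1, n_r=2, delta_h=100)
print(lower_2p, upper_2p)  # -200 200
```

The asymmetry in the 1P case reflects that the lower bound requires winning every stage but the last while losing as many rounds as possible along the way.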
For 2P mode, P1 reward is defined as $R$ in the reward Eq. 1 and P2 reward is equal to $-R$ (zero-sum games). Eq. 1 describes the default reward function. It is of course possible to tweak it at will by means of custom Reward Wrappers.
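As a minimal sketch of Eq. 1, the function below computes the reward from health values before and after a step. It reads $\bar{H}$ as the opponent's health and $\hat{H}$ as the agent's own health, an interpretation inferred from the qualitative description of the reward (damage inflicted is positive, damage suffered is negative). The function name and the sample values are hypothetical.

```python
def default_reward(own_prev, own_now, opp_prev, opp_now):
    """Sketch of Eq. 1. Each argument is a list of health values, one entry
    per character taking part in the round (N_c entries).

    Reward = health lost by the opponent minus health lost by the agent."""
    return sum(
        (o_prev - o_now) - (a_prev - a_now)
        for a_prev, a_now, o_prev, o_now
        in zip(own_prev, own_now, opp_prev, opp_now)
    )


# One character per side: the agent loses 10 health, the opponent loses 30.
reward_p1 = default_reward(own_prev=[100], own_now=[90],
                           opp_prev=[100], opp_now=[70])
reward_p2 = -reward_p1  # zero-sum convention in 2P mode
print(reward_p1, reward_p2)  # 20 -20
```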
The minimum and maximum total cumulative reward for the round can be different from $N_c\Delta H$ in some cases. This may happen because:
Lower and upper bounds of episode total cumulative reward may, in some cases, deviate from what is defined by Eqs. 2, because:
Note that the maximum cumulative reward (for 1P mode) is obtained when clearing the game by winning all rounds with a perfect ($\max{\sum_t^{0,T_s}R_t}\Rightarrow$ game completed), but the converse is not true. In fact, reaching a higher stage does not necessarily yield a higher total cumulative reward ($\max{\sum_t^{0,T_s}R_t}\not\propto$ stage reached, game completed $\nRightarrow\max{\sum_t^{0,T_s}R_t}$). Somewhat counterintuitively, to obtain the lowest possible total cumulative reward the agent has to reach the final stage (collecting negative rewards in all previous ones) before losing by $N_r$ perfects.
If a normalized reward is considered, the reward equation (Eq. 1) becomes:
$$ \begin{equation} R_t = \frac{\sum_i^{0,N_c}\left(\bar{H_i}^{t^-} - \bar{H_i}^{t} - \left(\hat{H_i}^{t^-} - \hat{H_i}^{t}\right)\right)}{N_k \Delta H} \end{equation} $$
With the following additional term at the denominator:
The normalization term at the denominator ensures that a round won with a perfect (i.e. without losing any health) always generates the same maximum total cumulative reward (for the round) across all games, equal to $N_c/N_k$.
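The $N_c/N_k$ claim can be checked numerically with a small sketch. Here $N_k$ is treated as a plain positive scaling constant and $\Delta H$ as the health range of one character; both readings, the function name, and the health trace are assumptions made for illustration.

```python
import math


def normalized_reward(own_prev, own_now, opp_prev, opp_now, n_k, delta_h):
    """Sketch of the normalized reward: the health-based reward of Eq. 1
    divided by n_k * delta_h. Arguments are per-character health lists."""
    raw = sum(
        (o_prev - o_now) - (a_prev - a_now)
        for a_prev, a_now, o_prev, o_now
        in zip(own_prev, own_now, opp_prev, opp_now)
    )
    return raw / (n_k * delta_h)


# Perfect round with one character per side (N_c = 1): the agent keeps full
# health while the opponent drops from 100 to 0 over three hits.
n_c, n_k, delta_h = 1, 2, 100
opponent_trace = [100, 70, 30, 0]
round_total = sum(
    normalized_reward([100], [100], [before], [after], n_k, delta_h)
    for before, after in zip(opponent_trace, opponent_trace[1:])
)
print(math.isclose(round_total, n_c / n_k))
```

The per-step rewards depend on how the damage is split across hits, but their sum over a perfect round does not, which is what makes the normalized total comparable across games.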