ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)

Hi HF community,

I built ARCUS-H, an open evaluation harness that measures behavioral stability under stress as a complement to reward-based RL evaluation. It’s built entirely on Stable-Baselines3 and Gymnasium, so it should be immediately familiar to anyone in this community.

The core problem it solves: Return tells you how well an agent performs in nominal conditions. It doesn’t tell you what happens when control authority is reduced, action execution is noisy, or reward feedback is corrupted. ARCUS-H standardizes stress evaluation so these comparisons are reproducible and algorithm-agnostic.

Main empirical finding:

r = +0.14, p = 0.364 between normalized reward and collapse rate under valence inversion — no significant correlation across 9 environments and 7 algorithms (PPO, A2C, TRPO, DQN, DDPG, SAC, TD3). The highest-reward agents (SAC/TD3 on MuJoCo) collapse most severely under stress.

What’s in the benchmark:

  • 4 stress schedules: concept drift, resource constraint, trust violation, valence inversion

  • PRE → SHOCK → POST phase structure (40 episodes each)

  • Adaptive calibration from pre-phase (FPR = 2.0%, target α = 0.05)

  • 5 behavioral channels: competence, coherence, continuity, integrity, meaning

  • 9 environments, 7 algorithms, 10 seeds, ~830 total runs

  • 15 benchmark plots (PNG + PDF)

Everything is open:

:page_facing_up: Paper: https://zenodo.org/records/19075167

:laptop: Code: https://github.com/karimzn00/ARCUSH_1.0

Questions:

I’d love feedback from this community specifically:

  • Does the SB3 integration feel clean?
  • Are there environments or algorithms on HF Hub that would make good additions to the benchmark suite?

For now, some quick feedback:



I took a close look at the public repo and docs, and the short version is: this feels like a real SB3-native benchmark, not a custom RL fork. The overall shape is strong. It stays inside the Stable-Baselines3 and Gymnasium workflow, keeps the agent interface standard, and adds a separate stress-evaluation layer with a clear PRE → SHOCK → POST protocol. That is a good design choice for adoption because people can understand it immediately if they already use SB3, RL Zoo, or the SB3 Hugging Face models. ARCUS-H’s current public scope also looks benchmark-sized rather than anecdotal: 9 environments, 7 algorithms, 4 stress schedules, 10 seeds, and 120-episode evaluation runs. (GitHub)

On the specific question of whether the SB3 integration feels clean, my answer is: yes at the user-facing level, but not fully yet at the benchmark-hygiene level. The training path is clearly aligned with normal SB3 practice. The code auto-selects CnnPolicy, MultiInputPolicy, or MlpPolicy from the observation space, uses SB3 core algorithms plus sb3-contrib for TRPO, and relies on standard env utilities rather than custom training logic. That is exactly the kind of interface the SB3 community tends to trust. (GitHub)

The main technical cleanup I would make before expanding the benchmark is Atari train/eval symmetry. In the current public code, the train path uses SB3’s Atari env utilities, while the eval path manually wraps Atari with AtariPreprocessing(..., frame_skip=1, terminal_on_life_loss=False, ...) and frame stacking. SB3’s Atari wrapper defaults are materially different: frame_skip=4, terminal_on_life_loss=True, and clipped rewards by default. On Atari, those are not cosmetic differences. They can change the effective task and the meaning of the evaluation result. I would make training and evaluation wrapper stacks match exactly by default, then apply ARCUS-H stress on top. (GitHub)
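One way to enforce that symmetry, sketched under the assumption that ARCUS-H builds its Atari envs from a helper (function names here are hypothetical, not the repo's API): make a single wrapper-config function the source of truth for both paths, so eval can never silently diverge from training. The values below are SB3's `AtariWrapper` defaults.

```python
def atari_wrapper_config():
    # SB3 AtariWrapper defaults, which the training path already uses
    return {
        "frame_skip": 4,
        "terminal_on_life_loss": True,
        "clip_reward": True,
        "screen_size": 84,
    }

def make_train_wrapper_config():
    return atari_wrapper_config()

def make_eval_wrapper_config():
    # Eval consumes the identical stack; ARCUS-H stress is applied on top,
    # never by editing the wrapper configuration itself.
    return atari_wrapper_config()
```

In the real harness these kwargs would be passed to `AtariWrapper` (with frame stacking kept identical too); the point is simply that the eval path can no longer hand-roll `frame_skip=1` without the divergence being visible in one place.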

The second thing I would tighten is metric semantics around reward corruption. ARCUS-H’s valence inversion stressor flips reward during SHOCK, but the evaluator is still driving fixed SB3 policies through model.predict(obs, deterministic=...). In standard SB3 inference, action selection is observation-driven; reward is not an input to predict(). So for frozen policies, reward inversion is not on the same footing as action attenuation, action permutation, or observation drift. It is still a useful track, but I would probably present it as a reward-channel corruption track rather than mix it directly with execution-side stressors in the strongest headline claim. That would make the benchmark story cleaner and preempt an obvious criticism. (GitHub)
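To make that concrete, here is a toy frozen-policy rollout (stand-in functions, not ARCUS-H code) showing that flipping the reward channel leaves action selection untouched:

```python
def frozen_policy(obs):
    # stand-in for model.predict(obs, deterministic=True): the action depends
    # only on the observation, never on the reward
    return 1 if obs > 0 else 0

def rollout(policy, observations, invert_reward=False):
    # toy eval loop; under valence inversion only the LOGGED reward flips sign
    actions, rewards = [], []
    for obs in observations:
        actions.append(policy(obs))
        reward = float(obs)  # toy reward signal
        rewards.append(-reward if invert_reward else reward)
    return actions, rewards

obs_seq = [0.5, -0.2, 1.3]
a_nominal, r_nominal = rollout(frozen_policy, obs_seq)
a_inverted, r_inverted = rollout(frozen_policy, obs_seq, invert_reward=True)
```

Running both variants over the same observation sequence yields identical action traces, which is exactly why VI belongs in a reward-channel track rather than alongside execution-side stressors.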

I would also do one pass on reproducibility and packaging. The repo README currently advertises Python 3.9+, while SB3’s current stable docs say 2.7.1 is the last release supporting Python 3.9 and recommend Python 3.10 or newer. SB3’s reproducibility guidance also explicitly says that deterministic results on a fixed setup require passing a seed when creating the model, and that exact reproducibility is still not guaranteed across platforms or PyTorch versions. For a benchmark, that means version pinning, explicit model seeding, and logging wrapper stacks and package versions are worth treating as first-class metadata. (GitHub)
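A minimal sketch of what first-class eval metadata could look like (this helper is hypothetical, not ARCUS-H's current API): bundle the seed, exact wrapper order, and pinned package versions, and serialize it next to the run's results.

```python
import json
import platform

def eval_metadata(seed, wrapper_stack, packages=None):
    # Hypothetical helper: capture everything needed to reproduce an eval run
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,                    # the seed passed at model creation, per SB3 guidance
        "wrapper_stack": wrapper_stack,  # exact wrapper order; must match training
        "packages": packages or {},      # pinned versions of SB3, Gymnasium, PyTorch, etc.
    }

# Store alongside results so cross-run comparisons stay auditable.
record = json.dumps(eval_metadata(0, ["AtariWrapper", "VecFrameStack"]))
```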

On the second question, there are definitely good additions on the Hugging Face Hub, and I would add them in waves rather than all at once. My first picks would be MiniGrid FourRooms, MiniGrid Unlock, LunarLander-v3, and QR-DQN on Acrobot-v1. FourRooms and Unlock are good because they add longer-horizon, interpretable discrete behavior where continuity and integrity failures are easier to see than in some classic-control tasks. LunarLander-v3 is a strong benchmark choice because Gymnasium explicitly notes that v3 fixed reset determinism and episode-to-episode wind independence issues. QR-DQN on Acrobot is the cleanest algorithm-side addition because it lets you test a stronger discrete off-policy baseline without changing the environment family at the same time. All of these already exist in the SB3 organization on the Hub. (Hugging Face)

My second wave would be BipedalWalkerHardcore and TQC. BipedalWalkerHardcore is a good stress-test environment because the harder terrain creates richer degradation modes than simpler continuous-control tasks. TQC is the most informative next continuous-control algorithm because it is an SB3-Contrib method designed to improve over SAC-style critic behavior, which makes it especially relevant if the benchmark is already finding that high nominal reward can coexist with brittle stress behavior. The SB3 Hub also has TQC models for robotics tasks, but I would only add those after the evaluator is ready for more complex observation structures. (Hugging Face)

I would leave RecurrentPPO and the Panda robotics tasks for later. They are good additions, but only after the evaluator explicitly supports recurrent state handling and the observation-side stress logic is ready for more structured inputs. SB3-Contrib’s own docs are very clear that recurrent inference needs lstm_states and episode_start passed into predict() correctly, so I would not put recurrent models on the main leaderboard until that path is implemented cleanly. (Hugging Face)
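The state-threading contract can be sketched with a stub that mimics RecurrentPPO's `predict()` calling pattern (the stub's internals are toy; only the calling convention matters):

```python
def recurrent_predict(obs, lstm_state=None, episode_start=True):
    # Stub with the same calling contract as sb3-contrib RecurrentPPO's
    # model.predict(obs, state=..., episode_start=...): the hidden state is
    # returned to the caller and must be reset at episode boundaries.
    if lstm_state is None or episode_start:
        lstm_state = 0          # fresh hidden state at episode start
    else:
        lstm_state += 1         # stand-in for the LSTM update
    action = (obs + lstm_state) % 2  # action depends on obs AND memory
    return action, lstm_state

def eval_episode(observations):
    lstm_state, episode_start = None, True
    actions = []
    for obs in observations:
        action, lstm_state = recurrent_predict(obs, lstm_state, episode_start)
        episode_start = False   # only True on the first step of the episode
        actions.append(action)
    return actions
```

If the evaluator forgets to thread `lstm_state` or to clear `episode_start` after the first step, every call behaves like an episode start and the policy silently loses its memory, which is exactly the failure mode to guard against before putting recurrent models on the leaderboard.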

So the overall maintainer-style take is: the benchmark idea is strong, the SB3 integration is mostly clean, and the project feels worth following. The main things I would fix before broadening the suite are:

  1. make Atari train/eval wrappers identical,
  2. separate reward-corruption semantics from execution-side stress in the main story, and
  3. harden reproducibility metadata and dependency pinning.

After that, I would expand first with MiniGrid, LunarLander-v3, and QR-DQN, then add harder continuous-control and memory-dependent models in later benchmark versions. (GitHub)

Thank you for your feedback; this is genuinely the most detailed feedback I’ve received so far, and the three priority items are well-targeted.

On the Atari train/eval wrapper mismatch: you’re right, this is a real inconsistency. Training uses SB3’s AtariWrapper defaults (frame_skip=4, terminal_on_life_loss=True) while eval uses frame_skip=1 and terminal_on_life_loss=False. My original reasoning was that frame_skip=1 gives finer-grained stress measurement per step, but you’re correct that this changes the effective task and makes the Pong results not directly comparable to standard Atari benchmarks. I’ll fix this in v1.1 — train and eval wrappers will match exactly, with stress applied on top.

On valence inversion semantics: this is the critique I expected and you’ve framed it precisely. VI doesn’t affect model.predict() at all for frozen policies, it only affects the logged reward signal. You’re right that it belongs in a separate reward-channel corruption track rather than being presented on equal footing with execution-side stressors. I’ll restructure the stressor taxonomy in the paper revision accordingly.

On reproducibility metadata: agreed on version pinning and explicit model seeding. I’ll add a requirements-lock.txt and log wrapper stacks + package versions as first-class eval metadata in v1.1.

On environment additions: MiniGrid FourRooms and LunarLander-v3 are the right first wave; they give interpretable discrete behavior where continuity and integrity failures are visible. I’ll hold RecurrentPPO until lstm_states handling in predict() is properly implemented.

One genuine question back: on the reward-corruption reframing, would you present concept drift and valence inversion as a separate “observation/reward corruption” axis distinct from the “execution-side” axis (RC + TV), or keep all four in one taxonomy with clearer semantic labeling?


I’m glad that was helpful. :blush:


I would not merge concept drift and valence inversion into one shared “observation/reward corruption” axis.

I would use a hierarchical taxonomy:

Recommended taxonomy

1. Perception / input-side stress

  • Concept drift

2. Execution / control-side stress

  • Resource constraint
  • Trust violation

3. Feedback / objective-side stress

  • Valence inversion

4. Later, if you add it

  • Environment / dynamics-side stress

    • latent dynamics shift
    • delay
    • actuator lag
    • hidden-parameter shift

That structure is the cleanest fit for your case. It also matches how the broader robustness-benchmark literature is organized. Real-World RL Suite separates perturbations on action, observation, and reward channels, while Robust-Gymnasium organizes disruptions across observed state and reward, actions, and the environment. (GitHub)

Why I would not combine CD and VI into one axis

Because they are semantically different in a frozen-policy SB3 benchmark.

Your README defines concept drift as an additive shift applied to the executed observation, s_t^exec = s_t + δ_t. That directly changes the policy input. So CD is a true behavioral stressor for a frozen SB3 policy, because predict() acts on observation. SB3’s base API defines predict(observation, state=None, episode_start=None, deterministic=False) as getting the policy action from an observation and optional hidden state. (GitHub)
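A toy illustration of why CD is behaviorally active for a frozen policy (stand-in functions, not the harness code): the shifted observation is exactly what predict() consumes, so the same underlying state can produce a different action.

```python
def frozen_policy(obs):
    # stand-in for model.predict(obs, deterministic=True): for a frozen SB3
    # policy the action is a pure function of the observation
    return 1 if obs >= 0.0 else 0

def concept_drift(obs, delta):
    # CD per the README: s_t^exec = s_t + delta_t, applied to the policy INPUT
    return obs + delta

nominal = frozen_policy(-0.1)                       # action on the clean observation
drifted = frozen_policy(concept_drift(-0.1, 0.5))   # same state, shifted input
```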

Valence inversion is different. Your README defines it as r_t^exec = -r_t. For a standard frozen SB3 policy, reward is not an input to predict(). So VI does not perturb action selection in the same direct way. It perturbs the feedback channel and the semantics of logged task success. That makes it important, but different. If you put CD and VI into one shared “corruption” axis, you risk blurring the exact distinction you just clarified. (GitHub)

So my answer is:

  • CD belongs with input-side / perception-side stress.
  • VI belongs with feedback-side / objective-side stress.
  • They should not be collapsed into one joint axis, except perhaps visually under a very broad umbrella like “non-execution-side perturbations.” Even then, I would keep them as clearly separate sub-axes.

Why RC and TV belong together

Your README defines:

  • RC as reduced control authority, either attenuating continuous actions or replacing discrete actions with a default action with some probability.
  • TV as action-execution mismatch, either mixing continuous actions with a matrix/noise or permuting discrete actions. (GitHub)

Both of those act on the action actually executed by the environment, not on the observation seen by the policy and not on the reward signal recorded afterward. So they are naturally the same top-level class: execution-side stress. Real-World RL Suite’s action-delay framing supports this kind of decomposition, because it also treats the action channel as a distinct failure surface. (GitHub)
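The two execution-side stressors can be sketched as pure functions over the executed action, following the README's definitions (parameters and defaults here are illustrative, not ARCUS-H's actual values):

```python
import random

def resource_constraint(action, attenuation=0.5, p_default=0.3, default_action=0, rng=random):
    # RC: reduced control authority. Continuous actions are attenuated toward
    # zero; discrete actions are replaced by a default action with probability p_default.
    if isinstance(action, float):
        return attenuation * action
    return default_action if rng.random() < p_default else action

def trust_violation(action, permutation=(1, 0, 3, 2), noise=0.0):
    # TV: action-execution mismatch. Continuous actions are mixed with noise;
    # discrete actions are permuted before execution.
    if isinstance(action, float):
        return action + noise
    return permutation[action]
```

Neither function ever touches the observation or the reward, which is what makes RC and TV a natural single top-level class.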

The framing I would use in the paper

I would keep all four stressors in one benchmark taxonomy, but not as four peers without structure.

Instead, present them like this:

ARCUS-H v1.1 covers three perturbed RL components:

  • Perception-side: concept drift
  • Execution-side: resource constraint, trust violation
  • Feedback-side: valence inversion

Then add one sentence:

A fourth class, environment/dynamics-side perturbation, is reserved for future versions.

That gives you the simplicity of one taxonomy while preserving the semantic distinctions that matter most. It also aligns well with Robust-Gymnasium’s component-based framing and Real-World RL Suite’s separate treatment of action, observation, and reward perturbations. (GitHub)
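If the taxonomy also ships in the harness, a small mapping keeps it machine-readable and makes each stressor's axis explicit (the names below are illustrative, not ARCUS-H identifiers):

```python
STRESSOR_TAXONOMY = {
    "perception": ["concept_drift"],                          # perturbs what the agent sees
    "execution": ["resource_constraint", "trust_violation"],  # perturbs executed actions
    "feedback": ["valence_inversion"],                        # perturbs the logged reward channel
    "environment": [],  # reserved: dynamics shift, delay, actuator lag, hidden-parameter shift
}

def axis_of(stressor):
    # look up the top-level class a stressor belongs to
    for axis, stressors in STRESSOR_TAXONOMY.items():
        if stressor in stressors:
            return axis
    raise KeyError(stressor)
```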

Why this is the best choice for ARCUS-H specifically

This framing helps you in three ways.

First, it makes your benchmark easier to explain:

  • what the agent sees
  • what the agent tries to do
  • what the benchmark says happened

Second, it protects you from the strongest criticism of VI:

  • VI is still valuable
  • but now it is clearly labeled as feedback corruption, not a direct execution stressor for frozen policies

Third, it gives you a natural roadmap:

  • v1.1: perception, execution, feedback
  • v1.2 or later: environment/dynamics

That is a very maintainable benchmark story.

My direct recommendation

Use this exact top-level split:

  • Perception-side: CD
  • Execution-side: RC, TV
  • Feedback-side: VI
  • Future environment-side: dynamics shift, delay, etc.

That is cleaner than either of the two alternatives you proposed:

  • cleaner than a single flat taxonomy
  • cleaner than a merged “observation/reward corruption” axis

Because observation corruption and reward corruption are not equivalent once the evaluated policy is frozen. (Stable Baselines3 Docs)