For now, quick feedback:
Here is a paste-ready maintainer-style review. Remove the citations before posting if you want a cleaner version.
I took a close look at the public repo and docs, and the short version is: this feels like a real SB3-native benchmark, not a custom RL fork. The overall shape is strong. It stays inside the Stable-Baselines3 and Gymnasium workflow, keeps the agent interface standard, and adds a separate stress-evaluation layer with a clear PRE → SHOCK → POST protocol. That is a good design choice for adoption because people can understand it immediately if they already use SB3, RL Zoo, or the SB3 Hugging Face models. ARCUS-H’s current public scope also looks benchmark-sized rather than anecdotal: 9 environments, 7 algorithms, 4 stress schedules, 10 seeds, and 120-episode evaluation runs. (GitHub)
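The PRE → SHOCK → POST protocol is simple enough to sketch as an episode-phase schedule. This is a stdlib-only illustration of the idea, not ARCUS-H's actual code; the names `phase`, `shock_start`, and `shock_end` are hypothetical.

```python
def phase(step: int, shock_start: int, shock_end: int) -> str:
    """Return the stress phase for a given environment step.

    PRE:   nominal dynamics; baseline behavior is measured.
    SHOCK: the stressor (e.g. action attenuation) is active.
    POST:  stressor removed; recovery is measured.
    """
    if step < shock_start:
        return "PRE"
    if step < shock_end:
        return "SHOCK"
    return "POST"


# Example: a 300-step episode with the stressor active in the middle third.
schedule = [phase(t, shock_start=100, shock_end=200) for t in range(300)]
```

The appeal of this shape is that the stressor is a pure function of the step index, so the same frozen SB3 policy can be driven through all three phases without touching the agent interface.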
On the specific question of whether the SB3 integration feels clean, my answer is: yes at the user-facing level, but not fully yet at the benchmark-hygiene level. The training path is clearly aligned with normal SB3 practice. The code auto-selects CnnPolicy, MultiInputPolicy, or MlpPolicy from the observation space, uses SB3 core algorithms plus sb3-contrib for TRPO, and relies on standard env utilities rather than custom training logic. That is exactly the kind of interface the SB3 community tends to trust. (GitHub)
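For readers less familiar with that heuristic, the auto-selection logic can be mirrored in a few lines. This is a stdlib-only stand-in, assuming the same rules SB3 applies (Dict spaces get MultiInputPolicy, image-like uint8 Box spaces get CnnPolicy, everything else MlpPolicy); the `SpaceInfo` descriptor is a hypothetical substitute for a real Gymnasium space.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SpaceInfo:
    """Minimal stand-in for a Gymnasium observation space (hypothetical)."""
    kind: str                               # "box" or "dict"
    shape: Optional[Tuple[int, ...]] = None
    dtype: str = "float32"


def select_policy(space: SpaceInfo) -> str:
    """Mirror SB3's policy auto-selection heuristic."""
    if space.kind == "dict":
        return "MultiInputPolicy"
    # Roughly SB3's is_image_space check: a 3D uint8 tensor (H, W, C).
    if space.kind == "box" and space.dtype == "uint8" \
            and space.shape is not None and len(space.shape) == 3:
        return "CnnPolicy"
    return "MlpPolicy"
```

Keeping this selection in one function, as the repo does, is exactly what makes the training path feel like standard SB3 rather than a fork.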
The main technical cleanup I would make before expanding the benchmark is Atari train/eval symmetry. In the current public code, the train path uses SB3’s Atari env utilities, while the eval path manually wraps Atari with AtariPreprocessing(..., frame_skip=1, terminal_on_life_loss=False, ...) and frame stacking. SB3’s Atari wrapper defaults are materially different: frame_skip=4, terminal_on_life_loss=True, and clipped rewards by default. On Atari, those are not cosmetic differences. They can change the effective task and the meaning of the evaluation result. I would make training and evaluation wrapper stacks match exactly by default, then apply ARCUS-H stress on top. (GitHub)
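The fix I have in mind is structural: define the wrapper stack once and have both the train and eval builders consume it. In real code that would mean both paths going through SB3's `make_atari_env` plus `VecFrameStack`; the sketch below uses plain callables and dicts as stand-in wrappers to show the single-source-of-truth shape.

```python
from functools import reduce

# One source of truth for the Atari wrapper stack. The dict entries mimic
# SB3's AtariWrapper defaults (frame_skip=4, terminal_on_life_loss=True,
# clipped rewards) plus 4-frame stacking; real code would use the actual
# wrapper classes here.
def frame_skip(env):  return {**env, "frame_skip": 4}
def life_loss(env):   return {**env, "terminal_on_life_loss": True}
def clip_reward(env): return {**env, "clip_reward": True}
def frame_stack(env): return {**env, "n_stack": 4}

WRAPPER_STACK = [frame_skip, life_loss, clip_reward, frame_stack]


def build_env(env_id: str) -> dict:
    """Apply the shared wrapper stack; train and eval both call this."""
    return reduce(lambda env, wrap: wrap(env), WRAPPER_STACK, {"id": env_id})


train_env = build_env("BreakoutNoFrameskip-v4")
eval_env = build_env("BreakoutNoFrameskip-v4")  # identical by construction
```

With the stacks identical by construction, ARCUS-H stress can then be layered on top of `eval_env` without any ambiguity about what the unstressed baseline task is.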
The second thing I would tighten is metric semantics around reward corruption. ARCUS-H’s valence inversion stressor flips reward during SHOCK, but the evaluator is still driving fixed SB3 policies through model.predict(obs, deterministic=...). In standard SB3 inference, action selection is observation-driven; reward is not an input to predict(). So for frozen policies, reward inversion is not on the same footing as action attenuation, action permutation, or observation drift. It is still a useful track, but I would probably present it as a reward-channel corruption track rather than mix it directly with execution-side stressors in the strongest headline claim. That would make the benchmark story cleaner and preempt an obvious criticism. (GitHub)
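The asymmetry is easy to demonstrate. In the toy rollout below, a frozen policy (a stand-in for `model.predict()`, whose signature takes observations but not rewards) produces identical actions whether or not the reward channel is inverted; the policy and reward functions are invented for illustration.

```python
def predict(obs: float, deterministic: bool = True) -> int:
    """Stand-in for model.predict(): action depends only on the observation.
    Note the signature: reward is not an argument, mirroring SB3 inference."""
    return 1 if obs > 0 else 0


def rollout(observations, invert_reward: bool = False):
    """Drive the frozen policy; optionally invert the reward channel."""
    actions, rewards = [], []
    for obs in observations:
        actions.append(predict(obs))
        r = float(obs)          # toy reward signal
        if invert_reward:
            r = -r              # valence inversion during SHOCK
        rewards.append(r)
    return actions, rewards


obs_seq = [-1.0, 0.5, 2.0, -0.3]
a_nominal, r_nominal = rollout(obs_seq)
a_inverted, r_inverted = rollout(obs_seq, invert_reward=True)
# a_nominal == a_inverted: for a frozen policy, flipping reward changes the
# recorded return, not the behavior.
```

That is why presenting valence inversion as a reward-channel corruption track, scored on measurement integrity rather than behavioral degradation, tells a more honest story than grouping it with action- and observation-side stressors.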
I would also do one pass on reproducibility and packaging. The repo README currently advertises Python 3.9+, while SB3’s current stable docs say 2.7.1 is the last release supporting Python 3.9 and recommend Python 3.10 or newer. SB3’s reproducibility guidance also explicitly says that deterministic results on a fixed setup require passing a seed when creating the model, and that exact reproducibility is still not guaranteed across platforms or PyTorch versions. For a benchmark, that means version pinning, explicit model seeding, and logging wrapper stacks and package versions are worth treating as first-class metadata. (GitHub)
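Treating that metadata as first-class could look something like the sketch below: a per-run record of seed, interpreter, platform, wrapper stack, and package versions, emitted alongside every result. The `run_metadata` helper and its field names are my invention, not ARCUS-H's; the point is that the seed passed here should be the same one passed to the SB3 model constructor.

```python
import importlib.metadata
import platform
import sys


def run_metadata(seed: int, wrapper_stack: list) -> dict:
    """Collect reproducibility metadata to log with every benchmark run."""
    def version_of(pkg: str) -> str:
        try:
            return importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            return "not-installed"

    return {
        "seed": seed,                       # also pass seed=... to the SB3 model
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "wrapper_stack": wrapper_stack,     # exact eval wrapper order
        "packages": {pkg: version_of(pkg)
                     for pkg in ("stable-baselines3", "gymnasium", "torch")},
    }


meta = run_metadata(seed=42, wrapper_stack=["AtariWrapper", "VecFrameStack(4)"])
```

Even with all of this logged, SB3's own caveat stands: bit-exact reproducibility across platforms and PyTorch versions is not guaranteed, so the metadata is for auditability, not a determinism promise.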
On the second question, there are definitely good additions on the Hugging Face Hub, and I would add them in waves rather than all at once. My first picks would be MiniGrid FourRooms, MiniGrid Unlock, LunarLander-v3, and QR-DQN on Acrobot-v1. FourRooms and Unlock are good because they add longer-horizon, interpretable discrete behavior where continuity and integrity failures are easier to see than in some classic-control tasks. LunarLander-v3 is a strong benchmark choice because Gymnasium explicitly notes that v3 fixed reset determinism and episode-to-episode wind independence issues. QR-DQN on Acrobot is the cleanest algorithm-side addition because it lets you test a stronger discrete off-policy baseline without changing the environment family at the same time. All of these already exist in the SB3 organization on the Hub. (Hugging Face)
My second wave would be BipedalWalkerHardcore and TQC. BipedalWalkerHardcore is a good stress-test environment because the harder terrain creates richer degradation modes than simpler continuous-control tasks. TQC is the most informative next continuous-control algorithm because it is an SB3-Contrib method designed to improve over SAC-style critic behavior, which makes it especially relevant if the benchmark is already finding that high nominal reward can coexist with brittle stress behavior. The SB3 Hub also has TQC models for robotics tasks, but I would only add those after the evaluator is ready for more complex observation structures. (Hugging Face)
I would leave RecurrentPPO and the Panda robotics tasks for later. They are good additions, but only after the evaluator explicitly supports recurrent state handling and the observation-side stress logic is ready for more structured inputs. SB3-Contrib’s own docs are very clear that recurrent inference needs lstm_states and episode_start passed into predict() correctly, so I would not put recurrent models on the main leaderboard until that path is implemented cleanly. (Hugging Face)
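The recurrent path is worth sketching because the failure mode is subtle: if `episode_start` is not threaded through correctly, hidden state leaks across episode boundaries and silently corrupts the evaluation. The stub below mimics the shape of SB3-Contrib's `model.predict(obs, state=..., episode_start=..., deterministic=True)` with an invented toy recurrence.

```python
def recurrent_predict(obs: float, state, episode_start: bool):
    """Stand-in for sb3_contrib RecurrentPPO inference. The hidden state
    must be reset whenever a new episode begins, exactly as the real
    predict() does when episode_start is True."""
    if episode_start or state is None:
        state = 0.0              # reset hidden state at the episode boundary
    state = 0.9 * state + obs    # toy recurrent update
    action = 1 if state > 0 else 0
    return action, state


# Two episodes back to back; episode_start flags the boundary at step 3.
obs_seq = [1.0, 1.0, 1.0, -1.0, -1.0]
starts = [True, False, False, True, False]

state, states = None, []
for obs, start in zip(obs_seq, starts):
    action, state = recurrent_predict(obs, state, start)
    states.append(state)
```

Until the evaluator carries this state threading through PRE, SHOCK, and POST correctly, keeping recurrent models off the main leaderboard seems right.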
So the overall maintainer-style take is: the benchmark idea is strong, the SB3 integration is mostly clean, and the project feels worth following. The main things I would fix before broadening the suite are:
- make Atari train/eval wrappers identical,
- separate reward-corruption semantics from execution-side stress in the main story, and
- harden reproducibility metadata and dependency pinning.
After that, I would expand first with MiniGrid, LunarLander-v3, and QR-DQN, then add harder continuous-control and memory-dependent models in later benchmark versions. (GitHub)