JustinTX committed on
Commit
16dd578
·
verified ·
1 Parent(s): e530698

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +5 -0
  2. assets/architecture.png +3 -0
  3. assets/benchmarks.png +3 -0
  4. assets/comparison.png +3 -0
  5. assets/logo_vector.png +3 -0
  6. assets/scaling_comparison.png +3 -0
  7. benchmarks/README.md +383 -0
  8. benchmarks/__init__.py +1 -0
  9. benchmarks/gpu_mode/README.md +92 -0
  10. benchmarks/gpu_mode/grayscale/README.md +34 -0
  11. benchmarks/gpu_mode/grayscale/config.yaml +176 -0
  12. benchmarks/gpu_mode/grayscale/evaluator.py +13 -0
  13. benchmarks/gpu_mode/grayscale/initial_program.py +57 -0
  14. benchmarks/gpu_mode/grayscale/reference.py +103 -0
  15. benchmarks/gpu_mode/grayscale/requirements.txt +2 -0
  16. benchmarks/gpu_mode/mla_decode/README.md +36 -0
  17. benchmarks/gpu_mode/mla_decode/config.yaml +355 -0
  18. benchmarks/gpu_mode/mla_decode/evaluator.py +13 -0
  19. benchmarks/gpu_mode/mla_decode/initial_program.py +245 -0
  20. benchmarks/gpu_mode/mla_decode/reference.py +520 -0
  21. benchmarks/gpu_mode/mla_decode/requirements.txt +2 -0
  22. benchmarks/gpu_mode/modal_eval.py +259 -0
  23. benchmarks/gpu_mode/shared_eval.py +421 -0
  24. benchmarks/gpu_mode/trimul/README.md +34 -0
  25. benchmarks/gpu_mode/trimul/config.yaml +219 -0
  26. benchmarks/gpu_mode/trimul/evaluator.py +13 -0
  27. benchmarks/gpu_mode/trimul/initial_program.py +84 -0
  28. benchmarks/gpu_mode/trimul/reference.py +286 -0
  29. benchmarks/gpu_mode/trimul/requirements.txt +2 -0
  30. benchmarks/gpu_mode/vecadd/README.md +34 -0
  31. benchmarks/gpu_mode/vecadd/config.yaml +50 -0
  32. benchmarks/gpu_mode/vecadd/evaluator.py +13 -0
  33. benchmarks/gpu_mode/vecadd/initial_program.py +39 -0
  34. benchmarks/gpu_mode/vecadd/reference.py +96 -0
  35. benchmarks/gpu_mode/vecadd/requirements.txt +2 -0
  36. benchmarks/kernelbench/README.md +211 -0
  37. benchmarks/kernelbench/config.yaml +86 -0
  38. benchmarks/kernelbench/evaluator/Dockerfile +25 -0
  39. benchmarks/kernelbench/evaluator/evaluate.sh +6 -0
  40. benchmarks/kernelbench/evaluator/evaluator.py +227 -0
  41. benchmarks/kernelbench/evaluator/requirements.txt +2 -0
  42. benchmarks/kernelbench/evaluator/wrapper.py +98 -0
  43. benchmarks/kernelbench/resolver.py +136 -0
  44. benchmarks/math/README.md +43 -0
  45. benchmarks/math/circle_packing/README.md +38 -0
  46. benchmarks/math/circle_packing/codebase/reference/hex_grid.py +43 -0
  47. benchmarks/math/circle_packing/codebase/reference/optimization_patterns.py +94 -0
  48. benchmarks/math/circle_packing/codebase/reference/packing_strategies.md +45 -0
  49. benchmarks/math/circle_packing/config.yaml +54 -0
  50. benchmarks/math/circle_packing/evaluator/Dockerfile +11 -0
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/scaling_comparison.png filter=lfs diff=lfs merge=lfs -text
37
+ assets/architecture.png filter=lfs diff=lfs merge=lfs -text
38
+ assets/logo_vector.png filter=lfs diff=lfs merge=lfs -text
39
+ assets/comparison.png filter=lfs diff=lfs merge=lfs -text
40
+ assets/benchmarks.png filter=lfs diff=lfs merge=lfs -text
assets/architecture.png ADDED

Git LFS Details

  • SHA256: 3b10c6bfb1734211abab7fa2e53b36931428d842ade3c96cbef255543b3889d8
  • Pointer size: 131 Bytes
  • Size of remote file: 278 kB
assets/benchmarks.png ADDED

Git LFS Details

  • SHA256: 42a69cb4c8119b79901ecfcdf93088e932643d6e0890d3c984dead40c407dc5b
  • Pointer size: 131 Bytes
  • Size of remote file: 758 kB
assets/comparison.png ADDED

Git LFS Details

  • SHA256: 8d68074ff5106764b1328b23ef5e949332aab3541172f8d91e2580d6f168e184
  • Pointer size: 131 Bytes
  • Size of remote file: 399 kB
assets/logo_vector.png ADDED

Git LFS Details

  • SHA256: d74ce6a1024e519a5afc85706133e31bafeb06b48b603a11284845b549cb586e
  • Pointer size: 131 Bytes
  • Size of remote file: 891 kB
assets/scaling_comparison.png ADDED

Git LFS Details

  • SHA256: d2aa00d9f59b5e14fc10d2569b872632fb992ab61fcfbba2ae946bef9deb22d8
  • Pointer size: 131 Bytes
  • Size of remote file: 297 kB
benchmarks/README.md ADDED
@@ -0,0 +1,383 @@
1
+ # Benchmarks
2
+
3
+ ~200 optimization tasks across math, systems, algorithms, and reasoning.
4
+
5
+ ## Quick Start
6
+
7
+ Install dependencies first:
8
+
9
+ ```bash
10
+ # Base
11
+ uv sync
12
+
13
+ # Choose extras based on what you run
14
+ uv sync --extra external # openevolve/gepa/shinkaevolve
15
+ uv sync --extra math # math benchmarks
16
+ uv sync --extra adrs # ADRS benchmarks
17
+ uv sync --extra frontier-cs # frontier-cs-eval benchmark
18
+ uv sync --extra prompt-optimization # HotPotQA prompt benchmark
19
+ ```
20
+
21
+ If a benchmark directory has `requirements.txt`, also run:
22
+
23
+ ```bash
24
+ uv pip install -r benchmarks/<task>/requirements.txt
25
+ ```
26
+
27
+ Then run:
28
+
29
+ ```bash
30
+ export OPENAI_API_KEY="..."
31
+
32
+ # Containerized benchmark (recommended — evaluator runs in Docker)
33
+ uv run skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
34
+ benchmarks/math/circle_packing_rect/evaluator \
35
+ -c benchmarks/math/circle_packing_rect/config.yaml \
36
+ -s best_of_n -i 50
37
+
38
+ # Plain Python evaluator (runs on host)
39
+ uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
40
+ benchmarks/math/circle_packing/evaluator.py \
41
+ -c benchmarks/math/circle_packing/config.yaml \
42
+ -s best_of_n -i 100
43
+ ```
44
+
45
+ ## Tasks
46
+
47
+ | Benchmark | Domain | Tasks | What it tests |
48
+ |-----------|--------|-------|---------------|
49
+ | [`math/`](math/) | Math | 14 | Circle packing, Erdos problems, autocorrelation inequalities, geometric optimization |
50
+ | [`ADRS/`](ADRS/) | Systems | 5 | Cloud scheduling, MoE load balancing, model placement, column reordering, transaction scheduling |
51
+ | [`gpu_mode/`](gpu_mode/) | GPU | 4 | Triton kernel optimization (vecadd, grayscale, trimul, MLA decode) |
52
+ | [`frontier-cs-eval/`](frontier-cs-eval/) | Algorithms | 172 | Competitive programming (Frontier-CS benchmark, Docker judge) |
53
+ | [`arc_benchmark/`](arc_benchmark/) | Reasoning | — | ARC-AGI visual reasoning tasks |
54
+ | [`ale_bench/`](ale_bench/) | Algorithms | 10 | Algorithmic contest problems (C++, ALE-Bench) |
55
+ | [`image_gen/`](image_gen/) | Creative | 1 | AI image generation evolution |
56
+ | [`prompt_optimization/`](prompt_optimization/) | Prompts | 1 | Evolve natural-language prompts, not code (HotPotQA) |
57
+
58
+ Each benchmark directory has its own README with setup and run instructions.
59
+
60
+ ## Structure
61
+
62
+ There are three ways to set up a benchmark: a **containerized evaluator** (recommended for new benchmarks), a **Harbor task** (for external benchmark suites), or a **plain Python evaluator** (simplest).
63
+
64
+ ### Containerized evaluator (recommended)
65
+
66
+ ```
67
+ <task>/
68
+ ├── initial_program.py # Starting solution
69
+ ├── config.yaml # System prompt + search/evaluator settings
70
+ └── evaluator/ # Self-contained Docker benchmark
71
+     ├── Dockerfile
72
+     ├── evaluate.sh       # Entrypoint (receives solution path + mode)
73
+     ├── evaluator.py      # Scoring logic
74
+     ├── requirements.txt  # Python dependencies
75
+     └── ...               # Any other data/files the evaluator needs
76
+ ```
77
+
78
+ The `evaluator/` directory is the Docker build context. Everything inside it gets copied into the image — data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when `evaluation_file` points to a directory containing a `Dockerfile` and `evaluate.sh`.
79
+
80
+ ### Plain Python evaluator
81
+
82
+ ```
83
+ <task>/
84
+ ├── initial_program.py # Starting solution
85
+ ├── evaluator.py # Scoring function (returns combined_score)
86
+ └── config.yaml # System prompt + search/evaluator settings
87
+ ```
88
+
89
+ Simpler but runs evaluator code directly on the host. Fine for pure-Python tasks with no system dependencies.
90
+
91
+ ### Benchmark resolvers (dynamic problem loading)
92
+
93
+ Some benchmarks support **dynamic problem loading** through a resolver pattern. Instead of providing a static `initial_program.py`, the resolver fetches problems from an external dataset based on configuration parameters.
94
+
95
+ This is useful for benchmark suites with many problems (e.g., KernelBench has hundreds of GPU kernel optimization tasks). The resolver pattern allows you to:
96
+
97
+ 1. Select specific problems via config parameters (e.g., difficulty level, problem ID)
98
+ 2. Automatically generate the initial program from the benchmark dataset
99
+ 3. Configure evaluator settings based on the problem specification
100
+
101
+ #### Using a benchmark with a resolver
102
+
103
+ Benchmarks that support resolvers include a `benchmark` section in their `config.yaml`:
104
+
105
+ ```yaml
106
+ benchmark:
107
+   enabled: true # Enable benchmark loader
108
+   name: kernelbench # Benchmark name (for logging)
109
+   resolver: benchmarks.kernelbench.resolver # Python module path
110
+
111
+   # Benchmark-specific parameters
112
+   level: 2 # Example: difficulty level
113
+   problem_id: 5 # Example: specific problem ID
114
+ ```
115
+
116
+ When running such a benchmark, you don't need to provide an `initial_program` argument:
117
+
118
+ ```bash
119
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
120
+ -c benchmarks/kernelbench/config.yaml \
121
+ --search adaevolve \
122
+ --iterations 50
123
+ ```
124
+
125
+ The resolver automatically fetches the problem and generates the initial program based on the config parameters.
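
As a rough illustration of the resolver pattern, the sketch below shows one way a loader could turn the `resolver:` module path from `config.yaml` into a resolver object. It relies only on what this README states: the module exposes a module-level `resolver` instance with a `resolve(config, output_dir)` method. The `load_resolver` helper and the stand-in module are hypothetical and are not SkyDiscover's actual loader.

```python
import importlib
import sys
import types
from pathlib import Path
from typing import Any, Dict


def load_resolver(benchmark_cfg: Dict[str, Any]):
    """Import the module named by `resolver` and return its `resolver` object.

    Hypothetical helper: SkyDiscover's real loader may differ.
    """
    if not benchmark_cfg.get("enabled", False):
        raise ValueError("benchmark loader is disabled in config")
    module = importlib.import_module(benchmark_cfg["resolver"])
    return module.resolver  # module-level instance, as this README describes


# Stand-in module so the sketch runs without any real benchmark installed.
class _DemoResolver:
    def resolve(self, config: Dict[str, Any], output_dir: Path) -> Dict[str, str]:
        return {
            "initial_program_path": str(output_dir / "initial_program.py"),
            "level": str(config["level"]),
        }


_demo = types.ModuleType("demo_resolver_module")
_demo.resolver = _DemoResolver()
sys.modules["demo_resolver_module"] = _demo

cfg = {"enabled": True, "name": "demo", "resolver": "demo_resolver_module", "level": 2}
resolution = load_resolver(cfg).resolve(cfg, Path("out"))
print(resolution["initial_program_path"])
```

Keeping the lookup to a dotted module path means a new benchmark only has to ship a `resolver.py`; nothing in the loader needs to know benchmark-specific parameters such as `level`.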
126
+
127
+ #### Implementing a benchmark resolver
128
+
129
+ To add resolver support to a new benchmark:
130
+
131
+ 1. **Create `benchmarks/your_benchmark/resolver.py`** implementing the `BenchmarkResolver` interface:
132
+
133
+ ```python
134
+ from pathlib import Path
135
+ from typing import Any, Dict
136
+ from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver  # assumes BenchmarkResolution is exported by the same module
137
+
138
+ class YourBenchmarkResolver(BenchmarkResolver):
139
+     def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
140
+         """
141
+         Fetch problem and generate initial program.
142
+
143
+         Args:
144
+             config: The benchmark section from config.yaml
145
+             output_dir: Directory where generated files should be placed
146
+
147
+         Returns:
148
+             BenchmarkResolution containing:
149
+                 - initial_program_path: Path to generated initial program
150
+                 - evaluator_path: Path to evaluator
151
+                 - evaluator_env_vars: Dict of environment variables for the evaluator
152
+         """
153
+         # 1. Fetch problem from dataset based on config parameters
154
+         # 2. Generate initial_program.py with EVOLVE-BLOCK markers
155
+         # 3. Prepare evaluator environment variables (returned, not set globally)
156
+         # 4. Return BenchmarkResolution with paths and env vars
157
+
158
+         initial_program_path = output_dir / "initial_program.py"
159
+         evaluator_path = Path(__file__).parent / "evaluator"
160
+         evaluator_env_vars = {
161
+             "BENCHMARK_PARAM": "value",
162
+             # Add benchmark-specific configuration here
163
+         }
164
+
165
+         return BenchmarkResolution(
166
+             initial_program_path=str(initial_program_path),
167
+             evaluator_path=str(evaluator_path),
168
+             evaluator_env_vars=evaluator_env_vars,
169
+         )
170
+
171
+ # Module-level resolver instance
172
+ resolver = YourBenchmarkResolver()
173
+ ```
174
+
175
+ 2. **Add `benchmark` section to `config.yaml`** with your resolver module path and all benchmark-specific parameters
176
+
177
+ 3. **Use the same CLI pattern** (no initial_program argument needed)
178
+
179
+ See the implementation in:
180
+ - `skydiscover/benchmarks/base.py` - Base resolver interface
181
+ - `benchmarks/kernelbench/resolver.py` - KernelBench example implementation
182
+
183
+ ## Adding a Benchmark
184
+
185
+ ### Option 1: Containerized evaluator (recommended)
186
+
187
+ Containerized evaluators run inside Docker, so they can have arbitrary dependencies, system packages, data files, etc. without polluting the host. Only two files are **required**: `Dockerfile` and `evaluate.sh`.
188
+
189
+ #### `evaluate.sh`
190
+
191
+ The entrypoint that SkyDiscover calls. It receives two arguments:
192
+
193
+ ```bash
194
+ #!/usr/bin/env bash
195
+ set -euo pipefail
196
+
197
+ PROGRAM="$1" # Path to the candidate solution inside the container
198
+ MODE="$2" # "train" (fast, iterative) or "test" (authoritative, final)
199
+
200
+ python /benchmark/evaluator.py "$PROGRAM"
201
+ ```
202
+
203
+ - **train** mode is called during the optimization loop — should be relatively fast.
204
+ - **test** mode is called once at the end for the best solution — should be the full, authoritative evaluation.
205
+
206
+ Evaluators that don't need the distinction can ignore `$MODE`.
207
+
208
+ #### `evaluate.sh` output (JSON protocol)
209
+
210
+ `evaluate.sh` must write a **single JSON object to stdout**:
211
+
212
+ ```json
213
+ {
214
+ "status": "success",
215
+ "combined_score": 0.73,
216
+ "metrics": {"combined_score": 0.73, "accuracy": 0.85, "speed": 1.2},
217
+ "artifacts": {"error": "...", "details": "..."}
218
+ }
219
+ ```
220
+
221
+ - `combined_score` (float, required): the primary optimization target.
222
+ - `metrics` (dict of string → float): all numeric scores. Must include `combined_score`.
223
+ - `artifacts` (dict of string → string, optional): non-numeric context (errors, diagnostics).
224
+ - `status`: `"success"`, `"error"`, or `"timeout"`.
225
+
226
+ Any output to **stderr** is captured for debugging but does not affect scoring. If your evaluator prints debug output, make sure it goes to stderr, not stdout.
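
To make the protocol concrete, here is a minimal sketch of how an `evaluator.py` might assemble and emit the JSON object described above. The helper names `build_result` and `emit` are made up for this example; only the field names and allowed `status` values come from this README.

```python
import json
import sys
from typing import Dict, Optional


def build_result(combined_score: float,
                 extra_metrics: Optional[Dict[str, float]] = None,
                 artifacts: Optional[Dict[str, str]] = None,
                 status: str = "success") -> dict:
    """Assemble a result dict in the shape the JSON protocol describes."""
    if status not in {"success", "error", "timeout"}:
        raise ValueError(f"unknown status: {status!r}")
    # metrics must contain combined_score alongside any other numeric scores.
    metrics = {"combined_score": float(combined_score)}
    metrics.update({k: float(v) for k, v in (extra_metrics or {}).items()})
    result = {
        "status": status,
        "combined_score": metrics["combined_score"],
        "metrics": metrics,
    }
    if artifacts:  # optional, string-valued context
        result["artifacts"] = {k: str(v) for k, v in artifacts.items()}
    return result


def emit(result: dict) -> None:
    """Write exactly one JSON object to stdout; diagnostics go to stderr."""
    print("debug: emitting result", file=sys.stderr)
    sys.stdout.write(json.dumps(result) + "\n")


result = build_result(0.73, {"accuracy": 0.85, "speed": 1.2})
emit(result)
```

Building the payload in one place makes it harder to forget that `metrics` has to repeat `combined_score`.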
227
+
228
+ #### `Dockerfile`
229
+
230
+ A standard Dockerfile. The only requirement is that `evaluate.sh` is executable:
231
+
232
+ ```dockerfile
233
+ FROM python:3.12-slim
234
+ WORKDIR /benchmark
235
+
236
+ COPY requirements.txt .
237
+ RUN pip install --no-cache-dir -r requirements.txt
238
+
239
+ COPY . .
240
+ RUN chmod +x evaluate.sh
241
+
242
+ ENTRYPOINT ["./evaluate.sh"]
243
+ ```
244
+
245
+ #### Migrating an existing Python evaluator
246
+
247
+ If you have an existing `evaluate(program_path) -> dict` function, you can wrap it with the backwards-compatibility wrapper:
248
+
249
+ 1. Copy `skydiscover/evaluation/wrapper.py` into your `evaluator/` directory.
250
+ 2. Add this to the bottom of your `evaluator.py`:
251
+
252
+ ```python
253
+ if __name__ == "__main__":
254
+     from wrapper import run
255
+     run(evaluate)
256
+ ```
257
+
258
+ The wrapper handles stdout redirection (so debug prints don't corrupt JSON), error formatting, and metric/artifact separation.
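
The sketch below illustrates the general idea of such a wrapper under stated assumptions. It is not the contents of `skydiscover/evaluation/wrapper.py`; `run_safely` and `noisy_evaluate` are hypothetical names, and the rule used to split metrics from artifacts (numbers versus everything else) is a guess at one reasonable policy.

```python
import contextlib
import io
import json
import sys
import traceback


def run_safely(evaluate, program_path: str) -> dict:
    """Call evaluate() with stdout captured so stray prints cannot corrupt the JSON."""
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            raw = evaluate(program_path)
        # Numeric values become metrics; everything else becomes a string artifact.
        metrics = {k: float(v) for k, v in raw.items()
                   if isinstance(v, (int, float)) and not isinstance(v, bool)}
        artifacts = {k: str(v) for k, v in raw.items() if k not in metrics}
        metrics.setdefault("combined_score", 0.0)
        status = "success"
    except Exception:
        metrics = {"combined_score": 0.0}
        artifacts = {"error": traceback.format_exc()}
        status = "error"
    if captured.getvalue():
        artifacts["stdout"] = captured.getvalue()
    return {"status": status, "combined_score": metrics["combined_score"],
            "metrics": metrics, "artifacts": artifacts}


def noisy_evaluate(program_path: str) -> dict:
    print("loading", program_path)  # would break the protocol if it reached stdout
    return {"combined_score": 0.5, "note": "demo"}


payload = run_safely(noisy_evaluate, "candidate.py")
sys.stdout.write(json.dumps(payload) + "\n")
```

Captured prints are kept as an artifact here so they remain available for debugging instead of being discarded.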
259
+
260
+ #### Running a containerized benchmark
261
+
262
+ Point `evaluation_file` at the `evaluator/` directory:
263
+
264
+ ```bash
265
+ skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
266
+ benchmarks/math/circle_packing_rect/evaluator \
267
+ -c benchmarks/math/circle_packing_rect/config.yaml \
268
+ -s best_of_n -i 50
269
+ ```
270
+
271
+ SkyDiscover will automatically build the Docker image, start a persistent container, and run evaluations inside it.
272
+
273
+ #### Example to copy
274
+
275
+ Simple containerized benchmark: [`math/heilbronn_triangle/`](math/heilbronn_triangle/)
276
+
277
+ ### Option 2: Harbor tasks (external benchmarks)
278
+
279
+ SkyDiscover natively supports [Harbor](https://harborframework.com/)-format tasks. This lets you run external benchmark suites like [AlgoTune](https://github.com/oripress/AlgoTune) (154 algorithm optimization tasks) without any conversion.
280
+
281
+ A Harbor task directory looks like this:
282
+
283
+ ```
284
+ task_dir/
285
+ ├── task.toml # Metadata, timeouts
286
+ ├── instruction.md # Problem description (shown to the LLM as context)
287
+ ├── environment/
288
+ │ └── Dockerfile # Container image definition
289
+ ├── tests/
290
+ │ ├── test.sh # Verification entrypoint
291
+ │ └── ... # Supporting test files (evaluator.py, test data, etc.)
292
+ └── solution/ # Reference solution (optional, not shown to LLM)
293
+     └── solve.sh
294
+ ```
295
+
296
+ SkyDiscover auto-detects Harbor tasks when the directory contains `instruction.md`, `tests/`, and `environment/Dockerfile`. The `instruction.md` is used as LLM context, solutions are injected at the path extracted from `solution/solve.sh` (or `instruction.md` as fallback), and rewards are read from `/logs/verifier/reward.txt` or `reward.json`.
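
The detection rule in the paragraph above can be sketched as a small predicate. This is an illustration of the three stated markers only; `looks_like_harbor_task` is a hypothetical name and SkyDiscover's own check may be stricter.

```python
import tempfile
from pathlib import Path


def looks_like_harbor_task(task_dir: Path) -> bool:
    """True if the directory has instruction.md, tests/, and environment/Dockerfile."""
    return ((task_dir / "instruction.md").is_file()
            and (task_dir / "tests").is_dir()
            and (task_dir / "environment" / "Dockerfile").is_file())


with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "algotune-demo"
    (task / "tests").mkdir(parents=True)
    (task / "environment").mkdir()
    (task / "instruction.md").write_text("# Demo task\n")
    before = looks_like_harbor_task(task)  # Dockerfile still missing
    (task / "environment" / "Dockerfile").write_text("FROM python:3.12-slim\n")
    after = looks_like_harbor_task(task)

print(before, after)
```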
297
+
298
+ #### Tested Harbor datasets
299
+
300
+ SkyDiscover has been tested with the following Harbor registry benchmarks:
301
+
302
+ | Dataset | Tasks | Domain | Language | Install |
303
+ |---------|-------|--------|----------|---------|
304
+ | [algotune](https://github.com/oripress/AlgoTune) | 154 | Algorithm optimization (speedup scoring) | Python | `harbor datasets download algotune@1.0` |
305
+ | [evoeval](https://github.com/evo-eval/evoeval) | 100 | Code generation (evolved from HumanEval) | Python | `harbor datasets download evoeval@1.0` |
306
+ | [humanevalfix](https://github.com/bigcode-project/octopack) | 164 | Code repair (fix buggy functions) | Python | `harbor datasets download humanevalfix@1.0` |
307
+ | [bigcodebench-hard-complete](https://github.com/bigcode-project/bigcodebench) | 145 | Python programming (reward-based) | Python | `harbor datasets download bigcodebench-hard-complete@1.0.0` |
308
+ | [livecodebench](https://livecodebench.github.io/) | 100 | Competitive programming (stdin/stdout) | Python | `harbor datasets download livecodebench@6.0` |
309
+ | [codepde](https://github.com/LithiumDA/CodePDE) | 5 | Scientific computing (PDE solvers) | Python | `harbor datasets download codepde@1.0` |
310
+ | [crustbench](https://github.com/AInfinity/CRUSTBench) | 100 | C-to-safe-Rust transpilation | Rust | `harbor datasets download crustbench@1.0` |
311
+ | [usaco](https://usaco.org/) | 304 | Competitive programming (USACO) | Python | `harbor datasets download usaco@2.0` |
312
+
313
+ Any Harbor-compatible dataset should work — the evaluator automatically extracts the solution path from the task's `solution/solve.sh` script.
314
+
315
+ #### Running a Harbor task
316
+
317
+ 1. **Install the Harbor CLI and download a dataset:**
318
+
319
+ ```bash
320
+ pip install harbor
321
+ harbor datasets download algotune@1.0 -o /tmp/algotune
322
+ ```
323
+
324
+ This downloads all 154 AlgoTune tasks. Each task is in a subdirectory like `/tmp/algotune/<id>/algotune-<name>/`.
325
+
326
+ 2. **Run SkyDiscover**, pointing at the task directory. The LLM uses `instruction.md` as context and generates solutions from scratch:
327
+
328
+ ```bash
329
+ # AlgoTune (algorithm optimization)
330
+ TASK=/tmp/algotune/2HHbpvzVPo2qakaoGyAVS2/algotune-set-cover
331
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 10
332
+
333
+ # EvoEval (code generation)
334
+ harbor datasets download evoeval@1.0 -o /tmp/evoeval
335
+ TASK=/tmp/evoeval/<id>/<task-name>
336
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5
337
+
338
+ # HumanEvalFix (code repair)
339
+ harbor datasets download humanevalfix@1.0 -o /tmp/humanevalfix
340
+ TASK=/tmp/humanevalfix/<id>/<task-name>
341
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5
342
+ ```
343
+
344
+ SkyDiscover will build the Docker image from `environment/Dockerfile`, upload `tests/` into the container, and start optimizing.
345
+
346
+ > **Note:** Some datasets have heavy Dockerfiles. AlgoTune needs ~10GB disk and 16GB RAM (torch, jax, scipy). BigCodeBench installs R, GDAL, and many system packages. First builds are slow; subsequent runs use Docker cache.
347
+
348
+ #### Other Harbor datasets
349
+
350
+ Any Harbor-compatible dataset works the same way. Run `harbor datasets list` to see all available datasets, then `harbor datasets download <name>` to fetch them.
351
+
352
+ ### Option 3: Plain Python evaluator
353
+
354
+ For simple tasks with no system dependencies, you can use a plain Python evaluator that runs on the host.
355
+
356
+ **Evaluator** (`evaluator.py`) scores whatever the LLM produces:
357
+
358
+ ```python
359
+ def evaluate(program_path: str) -> dict:
360
+     # load and run the program, compute a score
361
+     return {"combined_score": 0.73}  # combined_score is required; other metrics may be added as extra keys
362
+ ```
363
+
364
+ `program_path` is a `.py` file for code tasks or a `.txt` file for prompt tasks. On failure, return `{"combined_score": 0.0, "error": "..."}` instead of raising.
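
A fuller sketch of this contract, assuming a toy objective (the candidate's `solve()` should sort a list) that is invented purely for illustration. The point is the error handling: any failure in the candidate is converted into a zero score with an `error` entry instead of an exception.

```python
import importlib.util
import tempfile
from pathlib import Path


def evaluate(program_path: str) -> dict:
    """Load the candidate, call its solve(), and never raise."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Toy objective (an assumption for this sketch): solve() should sort a list.
        data = [3, 1, 2]
        correct = module.solve(list(data)) == sorted(data)
        return {"combined_score": 1.0 if correct else 0.0}
    except Exception as exc:
        return {"combined_score": 0.0, "error": f"{type(exc).__name__}: {exc}"}


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    good.write_text("def solve(xs):\n    return sorted(xs)\n")
    bad = Path(tmp) / "bad.py"
    bad.write_text("def solve(xs):\n    raise RuntimeError('boom')\n")
    good_result = evaluate(str(good))
    bad_result = evaluate(str(bad))

print(good_result, bad_result)
```

Because this style of evaluator executes candidate code in the host process, it fits the pure-Python tasks this option is meant for.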
365
+
366
+ ### Seed program
367
+
368
+ **Seed** (`initial_program.py` or `initial_prompt.txt`) is the starting solution. Mark the region for the LLM to evolve:
369
+
370
+ ```python
371
+ # EVOLVE-BLOCK-START
372
+ def solve(input_data):
373
+     return input_data  # LLM will improve this
374
+ # EVOLVE-BLOCK-END
375
+ ```
376
+
377
+ For prompt optimization, use a plain `.txt` file with no markers.
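
To show what the markers delimit, the following sketch pulls out the text between each `EVOLVE-BLOCK-START` / `EVOLVE-BLOCK-END` pair with a regular expression. It is an assumption about how such markers could be parsed, not SkyDiscover's parser, and `extract_evolve_blocks` is a hypothetical helper.

```python
import re

# Marker comments must sit on their own lines; leading indentation is allowed.
EVOLVE_BLOCK = re.compile(
    r"^[ \t]*# EVOLVE-BLOCK-START[ \t]*\n(.*?)^[ \t]*# EVOLVE-BLOCK-END",
    re.DOTALL | re.MULTILINE,
)


def extract_evolve_blocks(source: str) -> list:
    """Return the text between each START/END marker pair."""
    return [m.group(1) for m in EVOLVE_BLOCK.finditer(source)]


seed = (
    "import math\n"
    "# EVOLVE-BLOCK-START\n"
    "def solve(input_data):\n"
    "    return input_data\n"
    "# EVOLVE-BLOCK-END\n"
    "print(solve(1))\n"
)
blocks = extract_evolve_blocks(seed)
print(blocks[0])
```

Code outside the markers, such as the import and the final `print` in this seed, is left untouched by the extraction.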
378
+
379
+ ### Config
380
+
381
+ **Config** (`config.yaml`) sets the system prompt and search settings. For prompt optimization, set `language: text` and `diff_based_generation: false`.
382
+
383
+ Simple prompt example to copy: [`prompt_optimization/hotpot_qa/`](prompt_optimization/hotpot_qa/)
benchmarks/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """SkyDiscover benchmarks package. This is used for importing benchmark resolver modules."""
benchmarks/gpu_mode/README.md ADDED
@@ -0,0 +1,92 @@
1
+ # GPU Mode: Triton Kernel Optimization
2
+
3
+ Evolve high-performance GPU kernels using SkyDiscover. Each benchmark provides a reference PyTorch implementation and scores submissions by runtime — faster is better. Pure PyTorch submissions are accepted; Triton is not required.
4
+
5
+ ## Benchmarks
6
+
7
+ | Benchmark | Operation | Tolerance | GPU |
8
+ |-----------|-----------|-----------|-----|
9
+ | [`vecadd`](vecadd/) | Float16 element-wise `C = A + B` | rtol/atol=1e-3 | H100 |
10
+ | [`grayscale`](grayscale/) | RGB → Grayscale (`0.2989R + 0.5870G + 0.1140B`) | rtol/atol=1e-4 | H100 |
11
+ | [`trimul`](trimul/) | Triangle multiplicative update (AlphaFold3/Chai/Protenix) | rtol/atol=0.02 | H100 |
12
+ | [`mla_decode`](mla_decode/) | Multi-head latent attention decode (DeepSeek-V2/V3) | rtol/atol=0.06 (bfloat16) | **H200** |
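
For readers unfamiliar with `rtol`/`atol`, the sketch below applies the element-wise rule that `torch.allclose` uses, `|actual - expected| <= atol + rtol * |expected|`, to plain Python lists. Whether the shared evaluator uses exactly this rule is an assumption; the sample values are made up.

```python
def allclose(actual, expected, rtol, atol):
    """Element-wise |a - e| <= atol + rtol * |e|, the rule torch.allclose uses."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


reference = [1.0, 2.0, 100.0]
candidate = [1.0005, 2.001, 100.05]

# Same candidate, checked against the vecadd and grayscale tolerances in the table.
passes_vecadd_tol = allclose(candidate, reference, rtol=1e-3, atol=1e-3)
passes_grayscale_tol = allclose(candidate, reference, rtol=1e-4, atol=1e-4)
print(passes_vecadd_tol, passes_grayscale_tol)
```

The relative term scales with the magnitude of the reference value, which is why the same absolute error can pass for large outputs and fail for small ones.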
13
+
14
+ ## Quick Start
15
+
16
+ ```bash
17
+ # Run on local GPU
18
+ uv run skydiscover-run \
19
+ benchmarks/gpu_mode/trimul/initial_program.py \
20
+ benchmarks/gpu_mode/trimul/evaluator.py \
21
+ -c benchmarks/gpu_mode/trimul/config.yaml \
22
+ -s [your_algorithm] \
23
+ -i 50
24
+
25
+ # Run on Modal cloud GPU (set GPU type per benchmark)
26
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
27
+ uv run skydiscover-run \
28
+ benchmarks/gpu_mode/trimul/initial_program.py \
29
+ benchmarks/gpu_mode/trimul/evaluator.py \
30
+ -c benchmarks/gpu_mode/trimul/config.yaml \
31
+ -s [your_algorithm] \
32
+ -i 50
33
+ ```
34
+
35
+ > **Note:** `mla_decode` requires `GPUMODE_MODAL_GPU=H200` — H100 (80GB) does not have enough VRAM.
36
+
37
+ ## Writing a Submission
38
+
39
+ Your program must define a `custom_kernel(data)` function. The `data` argument is problem-specific (see each benchmark's `reference.py` for the exact type). Return the computed result.
40
+
41
+ ```python
42
+ # EVOLVE-BLOCK-START
43
+ import torch
44
+ import triton
45
+ import triton.language as tl
46
+
47
+ def custom_kernel(data):
48
+     # data is a problem-specific input (tensor, dataclass, etc.)
49
+     # return the computed result
50
+     ...
51
+ # EVOLVE-BLOCK-END
52
+ ```
53
+
54
+ ## Scoring
55
+
56
+ All benchmarks except `vecadd` use the same formula:
57
+
58
+ ```
59
+ combined_score = SCORE_SCALE / geom_mean_us
60
+ ```
61
+
62
+ `geom_mean_us` is the geometric mean of kernel runtimes in microseconds across all benchmark cases. Higher score = faster kernel. `SCORE_SCALE` is `3000.0` for all current benchmarks.
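
The formula can be worked through in a few lines. The runtimes below are invented for illustration; only `SCORE_SCALE = 3000.0` and the formula itself come from this README.

```python
import math

SCORE_SCALE = 3000.0  # value stated in this README for the current benchmarks


def geometric_mean(values):
    """Geometric mean via logs, which avoids overflow on long products."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("runtimes must be positive and non-empty")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combined_score(runtimes_us):
    """SCORE_SCALE divided by the geometric mean runtime in microseconds."""
    return SCORE_SCALE / geometric_mean(runtimes_us)


# Made-up runtimes in microseconds, purely for illustration.
runtimes = [10.0, 1000.0]
score = combined_score(runtimes)
print(round(score, 6))
```

Because the mean is geometric, halving every runtime doubles the score regardless of how the individual cases are distributed, so no single large case dominates.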
63
+
64
+ `vecadd` uses a different combined formula (`0.3 * correctness + speedup`) — see its README for details.
65
+
66
+ ## Evaluation Pipeline
67
+
68
+ The shared evaluator (`shared_eval.py`) handles both local and Modal paths:
69
+
70
+ 1. **Correctness** — runs all `TEST_CASES` from `reference.py`, checks output against reference within tolerance
71
+ 2. **Warmup** — runs one benchmark case briefly to trigger Triton JIT compilation
72
+ 3. **Benchmark** — times `BENCHMARK_CASES` using CUDA events, repeats until error < 0.1% or time budget is exhausted
73
+ 4. **Score** — geometric mean of benchmark runtimes → `SCORE_SCALE / geom_mean_us`
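
Step 3 of the pipeline above can be sketched as an adaptive timing loop. Two assumptions are made here: "error" is read as the relative standard error of the mean, and `time.perf_counter` stands in for CUDA events so the example runs without a GPU. The function and parameter names are hypothetical and do not come from `shared_eval.py`.

```python
import math
import statistics
import time


def benchmark(fn, rel_err_target=0.001, max_runs=200, time_budget_s=1.0, min_runs=3):
    """Time fn() repeatedly; stop when the estimate is stable or the budget runs out."""
    samples_us = []
    start = time.perf_counter()
    for _ in range(max_runs):
        t0 = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - t0) * 1e6)
        if len(samples_us) >= min_runs:
            mean = statistics.mean(samples_us)
            # Standard error of the mean, relative to the mean.
            sem = statistics.stdev(samples_us) / math.sqrt(len(samples_us))
            if mean > 0 and sem / mean < rel_err_target:
                break
        if time.perf_counter() - start > time_budget_s:
            break
    return statistics.mean(samples_us), len(samples_us)


mean_us, runs = benchmark(lambda: sum(range(10_000)), time_budget_s=0.2)
print(f"{mean_us:.1f} us over {runs} runs")
```

On a real GPU the timer would need to account for asynchronous kernel launches, which is the reason the pipeline uses CUDA events instead of wall-clock time.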
74
+
75
+ ## Directory Structure
76
+
77
+ ```
78
+ gpu_mode/
79
+ ├── shared_eval.py # Shared evaluator (correctness + benchmarking logic)
80
+ ├── modal_eval.py # Modal cloud GPU runners (H100, A100, L40S, T4, H200)
81
+ ├── vecadd/ # Float16 vector addition
82
+ ├── grayscale/ # RGB → grayscale conversion
83
+ ├── trimul/ # Triangle multiplicative update
84
+ └── mla_decode/ # MLA decode (DeepSeek attention)
85
+
86
+ # Each benchmark contains:
87
+ # initial_program.py — starting kernel
88
+ # evaluator.py — imports shared_eval, exposes evaluate()
89
+ # reference.py — reference kernel, test/benchmark cases, SCORE_SCALE
90
+ # config.yaml — search config
91
+ # requirements.txt — dependencies
92
+ ```
benchmarks/gpu_mode/grayscale/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: RGB to Grayscale
2
+
3
+ Evolve a Triton kernel for RGB to Grayscale conversion using SkyDiscover.
4
+
5
+ **Formula:** `Y = 0.2989 * R + 0.5870 * G + 0.1140 * B`
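
As a plain-Python reference for the formula, the sketch below converts a tiny image stored as nested lists. It is meant only to make the arithmetic explicit; the benchmark itself operates on `(H, W, 3)` float32 tensors, and `to_grayscale` is a name chosen for this example.

```python
def to_grayscale(image):
    """image: rows of (R, G, B) tuples -> rows of luminance values."""
    return [[0.2989 * r + 0.5870 * g + 0.1140 * b for (r, g, b) in row]
            for row in image]


image = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
]
gray = to_grayscale(image)
print(gray)
```

The three weights sum to 0.9999, so a pure white pixel maps to slightly less than 1.0. A kernel that normalizes the weights to sum to exactly 1 would drift from the reference.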
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/grayscale/initial_program.py \
14
+ benchmarks/gpu_mode/grayscale/evaluator.py \
15
+ -c benchmarks/gpu_mode/grayscale/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Must pass all test cases (rtol/atol=1e-4 vs PyTorch reference)
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/grayscale/initial_program.py \
31
+ benchmarks/gpu_mode/grayscale/evaluator.py \
32
+ -c benchmarks/gpu_mode/grayscale/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
benchmarks/gpu_mode/grayscale/config.yaml ADDED
@@ -0,0 +1,176 @@
1
+ # GPU Mode: Grayscale Triton Kernel Optimization
2
+
3
+ max_iterations: 100
4
+ checkpoint_interval: 1
5
+ log_level: "INFO"
6
+
7
+ llm:
8
+ models:
9
+ - name: "gpt-5"
10
+ weight: 1.0
11
+ api_base: https://api.openai.com/v1
12
+ temperature: 1.0
13
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject both temperature and top_p
14
+ max_tokens: 32000
15
+ timeout: 600
16
+
17
+ prompt:
18
+ system_message: |
19
+ You are an expert Triton engineer tasked with translating PyTorch code into highly optimized Triton kernel code.
20
+
21
+ You will be implementing a Grayscale conversion kernel that converts RGB images to grayscale using the luminance formula:
22
+ Y = 0.2989 * R + 0.5870 * G + 0.1140 * B
23
+
24
+ Your task:
25
+ - Implement the grayscale conversion as a highly optimized Triton kernel.
26
+ - The input is an (H, W, 3) float32 tensor and a pre-allocated (H, W) float32 output tensor.
27
+ - Your function receives `data = (rgb, output)` and should write the result into `output` and return it.
28
+
29
+ Your function should be defined as 'custom_kernel' with the following signature:
30
+ Input:
31
+ - `data`: Tuple of (rgb: torch.Tensor, output: torch.Tensor)
32
+ - rgb: Input tensor of shape [H, W, 3] (float32, contiguous)
33
+ - output: Pre-allocated output tensor of shape [H, W] (float32, contiguous)
34
+
35
+ Output:
36
+ - output: Grayscale tensor [H, W] (write in-place to the provided output tensor and return it)
37
+
38
+ Here is the reference PyTorch implementation:
39
+ ```python
40
+ import torch
41
+
42
+ # Reference code in PyTorch
43
+ def ref_kernel(data):
44
+ rgb, output = data
45
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
46
+ output[...] = torch.sum(rgb * weights, dim=-1)
47
+ return output
48
+ ```
49
+
50
+ Here is an example of a basic Triton implementation:
51
+ ```python
52
+ import torch
53
+ import triton
54
+ import triton.language as tl
55
+
56
+ @triton.jit
57
+ def grayscale_kernel(
58
+ rgb_ptr, out_ptr,
59
+ H, W,
60
+ stride_h, stride_w, stride_c,
61
+ BLOCK_SIZE: tl.constexpr,
62
+ ):
63
+ pid = tl.program_id(0)
64
+ n_pixels = H * W
65
+ block_start = pid * BLOCK_SIZE
66
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
67
+ mask = offsets < n_pixels
68
+
69
+ h_idx = offsets // W
70
+ w_idx = offsets % W
71
+
72
+ r = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 0 * stride_c, mask=mask)
73
+ g = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 1 * stride_c, mask=mask)
74
+ b = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 2 * stride_c, mask=mask)
75
+
76
+ gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
77
+
78
+ out_offsets = h_idx * W + w_idx
79
+ tl.store(out_ptr + out_offsets, gray, mask=mask)
80
+
81
+ def custom_kernel(data):
82
+ rgb, output = data
83
+ H, W, C = rgb.shape
84
+ assert C == 3
85
+ rgb = rgb.contiguous()
86
+ stride_h, stride_w, stride_c = rgb.stride()
87
+ n_pixels = H * W
88
+ BLOCK_SIZE = 1024
89
+ grid = (triton.cdiv(n_pixels, BLOCK_SIZE),)
90
+ grayscale_kernel[grid](
91
+ rgb, output, H, W,
92
+ stride_h, stride_w, stride_c,
93
+ BLOCK_SIZE=BLOCK_SIZE,
94
+ )
95
+ return output
96
+ ```
97
+
98
+ To help you understand which triton version we are using, here is some example triton code for an unrelated task:
99
+ ```python
100
+ import triton
101
+ import triton.language as tl
102
+
103
+ @triton.jit
104
+ def matmul_persistent_ws_kernel(
105
+ a_ptr, b_ptr, c_ptr, M, N, K,
106
+ stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
107
+ BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
108
+ ):
109
+ pid = tl.program_id(axis=0) # async_task 0, 1, 2
110
+ num_pid_m = tl.cdiv(M, BLOCK_M) # async_task 0, 1, 2
111
+ num_pid_n = tl.cdiv(N, BLOCK_N) # async_task 0, 1, 2
112
+ pid_m = pid // num_pid_n # async_task 0, 1, 2
113
+ pid_n = pid % num_pid_n # async_task 0, 1, 2
114
+ offs_m_1 = pid_m * BLOCK_M + tl.arange(0, BLOCK_M // 2) # async_task 0, 1, 2
115
+ offs_m_2 = pid_m * BLOCK_M + tl.arange(BLOCK_M // 2, BLOCK_M) # async_task 0, 1, 2
116
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) # async_task 0, 1, 2
117
+ offs_k = tl.arange(0, BLOCK_K) # async_task 0
118
+ a_ptrs_1 = a_ptr + (offs_m_1[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
119
+ a_ptrs_2 = a_ptr + (offs_m_2[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
120
+ b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) # async_task 0
121
+ acc_1 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 1
122
+ acc_2 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 2
123
+ for k in range(0, tl.cdiv(K, BLOCK_K)): # async_task 0, 1, 2
124
+ a_1 = tl.load(a_ptrs_1) # async_task 0
125
+ a_2 = tl.load(a_ptrs_2) # async_task 0
126
+ b = tl.load(b_ptrs) # async_task 0
127
+ acc_1 += tl.dot(a_1, b) # async_task 1
128
+ acc_2 += tl.dot(a_2, b) # async_task 2
129
+ a_ptrs_1 += BLOCK_K * stride_ak # async_task 0
130
+ a_ptrs_2 += BLOCK_K * stride_ak # async_task 0
131
+ b_ptrs += BLOCK_K * stride_bk # async_task 0
132
+ c_1 = acc_1.to(tl.float16) # async_task 1
133
+ c_2 = acc_2.to(tl.float16) # async_task 2
134
+ c_ptrs_1 = c_ptr + stride_cm * offs_m_1[:, None] + stride_cn * offs_n[None, :] # async_task 1
135
+ c_ptrs_2 = c_ptr + stride_cm * offs_m_2[:, None] + stride_cn * offs_n[None, :] # async_task 2
136
+ tl.store(c_ptrs_1, c_1) # async_task 1
137
+ tl.store(c_ptrs_2, c_2) # async_task 2
138
+ ```
139
+
140
+ A few general triton tips:
141
+ - tl.arange only takes in constexpr arguments (static or tl.constexpr)
142
+ - You cannot use continue in your kernel code
143
+ - tl.dot can only take in two input tensors
144
+ - There is no tl.mean
145
+
146
+ Here are the different configs that your kernel will be benchmarked on (optimize runtime for these):
147
+
148
+ Benchmark Cases:
149
+ - {"size": 1024} (1024x1024 RGB image)
150
+ - {"size": 2048} (2048x2048 RGB image)
151
+ - {"size": 4096} (4096x4096 RGB image)
152
+ - {"size": 8192} (8192x8192 RGB image)
153
+
154
+ Key optimization strategies to consider:
155
+ - Memory coalescing: the RGB data is (H, W, 3), so adjacent pixels in a row are stride-3 apart. Consider vectorized loads or layout transformations.
156
+ - Block size tuning: larger blocks amortize launch overhead but may reduce occupancy.
157
+ - Use of shared memory or register-level optimizations for the weighted sum.
158
+ - Vectorized loads (e.g., loading 3 floats at once per pixel).
159
+
160
+ Rules:
161
+ - The tensor arguments passed in will already be on your CUDA device.
162
+ - Define all of your code in one final ```python ``` block.
163
+ - We will test the correctness of your kernel on multiple input shapes, so make sure it supports different potential test cases.
164
+ - Your final output must be in float32.
165
+ - You must use Triton 3.3.1, and these kernels will be run on an H100.
166
+ - You do not have to implement everything in triton, you may choose to have some of the operations done in pytorch. However, you must implement at least part of the operations in a kernel.
167
+ - Include a short docstring at the top summarizing your algorithm.
168
+ evaluator:
169
+ timeout: 600
170
+ max_retries: 3
171
+ cascade_evaluation: true
172
+ cascade_thresholds: [0.4, 0.3]
173
+
174
+ diff_based_generation: true
175
+ max_solution_length: 60000
176
+ random_seed: 42
benchmarks/gpu_mode/grayscale/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for Grayscale — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/grayscale/initial_program.py ADDED
@@ -0,0 +1,57 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial Grayscale submission with Triton kernel.
4
+ Y = 0.2989 R + 0.5870 G + 0.1140 B
5
+ """
6
+
7
+ import torch
8
+ import triton
9
+ import triton.language as tl
10
+
11
+
12
+ @triton.jit
13
+ def grayscale_kernel(
14
+ rgb_ptr, out_ptr,
15
+ H, W,
16
+ stride_h, stride_w, stride_c,
17
+ BLOCK_SIZE: tl.constexpr,
18
+ ):
19
+ pid = tl.program_id(0)
20
+ n_pixels = H * W
21
+ block_start = pid * BLOCK_SIZE
22
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
23
+ mask = offsets < n_pixels
24
+
25
+ h_idx = offsets // W
26
+ w_idx = offsets % W
27
+
28
+ r_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 0 * stride_c
29
+ g_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 1 * stride_c
30
+ b_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 2 * stride_c
31
+
32
+ r = tl.load(r_ptr, mask=mask)
33
+ g = tl.load(g_ptr, mask=mask)
34
+ b = tl.load(b_ptr, mask=mask)
35
+
36
+ gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
37
+
38
+ out_offsets = h_idx * W + w_idx
39
+ tl.store(out_ptr + out_offsets, gray, mask=mask)
40
+
41
+
42
+ def custom_kernel(data):
43
+ rgb, output = data
44
+ H, W, C = rgb.shape
45
+ assert C == 3
46
+ rgb = rgb.contiguous()
47
+ stride_h, stride_w, stride_c = rgb.stride()
48
+ n_pixels = H * W
49
+ BLOCK_SIZE = 1024
50
+ grid = (triton.cdiv(n_pixels, BLOCK_SIZE),)
51
+ grayscale_kernel[grid](
52
+ rgb, output, H, W,
53
+ stride_h, stride_w, stride_c,
54
+ BLOCK_SIZE=BLOCK_SIZE,
55
+ )
56
+ return output
57
+ # EVOLVE-BLOCK-END
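The baseline kernel above is a pure streaming operation, so its runtime is mostly a question of how many bytes move through memory. As a rough sketch (not part of the benchmark), the following computes the launch grid and the minimum memory traffic for each benchmark size, assuming the float32 `(H, W, 3)` input and `(H, W)` output from `reference.py` and the baseline's `BLOCK_SIZE = 1024`. The helper names `cdiv` and `grayscale_traffic` are hypothetical.

```python
BYTES_PER_FLOAT32 = 4
BLOCK_SIZE = 1024  # same value as the baseline custom_kernel


def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic triton.cdiv performs on the host."""
    return (a + b - 1) // b


def grayscale_traffic(size: int) -> dict:
    """Grid size and minimum bytes moved for a size x size RGB image."""
    n_pixels = size * size
    bytes_read = n_pixels * 3 * BYTES_PER_FLOAT32   # R, G, B per pixel
    bytes_written = n_pixels * BYTES_PER_FLOAT32    # one gray value per pixel
    return {
        "n_pixels": n_pixels,
        "programs": cdiv(n_pixels, BLOCK_SIZE),
        "bytes_moved": bytes_read + bytes_written,
    }


if __name__ == "__main__":
    for size in (1024, 2048, 4096, 8192):
        info = grayscale_traffic(size)
        mib = info["bytes_moved"] / 2**20
        print(f"size={size}: programs={info['programs']}, traffic={mib:.0f} MiB")
```

Dividing `bytes_moved` by a measured runtime gives the achieved bandwidth, which is one way to judge how close a candidate kernel is to the hardware limit.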
benchmarks/gpu_mode/grayscale/reference.py ADDED
@@ -0,0 +1,103 @@
1
+ """
2
+ Reference implementation for Grayscale Triton kernel.
3
+ Y = 0.2989 R + 0.5870 G + 0.1140 B
4
+
5
+ Input: (H, W, 3) float32 RGB tensor
6
+ Output: (H, W) float32 grayscale tensor
7
+ """
8
+
9
+ import torch
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Scoring and benchmark configuration (read by shared_eval.py)
13
+ # ---------------------------------------------------------------------------
14
+
15
+ SCORE_SCALE = 3000.0
16
+
17
+ # grayscale uses CUDA events timing, 0.1% rel error, 120s wall clock timeout
18
+ BENCH_USE_CUDA_EVENTS = True
19
+ BENCH_REL_ERROR = 0.001
20
+ BENCH_WALL_TIMEOUT_NS = 120e9
21
+ BENCH_NO_GRAD = False
22
+ BENCH_MAX_REPEATS = 100
23
+ BENCH_MAX_TIME_NS = 10e9
24
+ BENCH_WARMUP_STYLE = 'tiny_benchmark'
25
+
26
+ # ---------------------------------------------------------------------------
27
+ # Test / benchmark cases
28
+ # ---------------------------------------------------------------------------
29
+
30
+ TEST_CASES = [
31
+ {"size": 256, "seed": 42},
32
+ {"size": 512, "seed": 123},
33
+ {"size": 1024, "seed": 456},
34
+ {"size": 2048, "seed": 789},
35
+ ]
36
+
37
+ BENCHMARK_CASES = [
38
+ {"size": 1024, "seed": 1001},
39
+ {"size": 2048, "seed": 1002},
40
+ {"size": 4096, "seed": 1003},
41
+ {"size": 8192, "seed": 1004},
42
+ ]
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Reference kernel
46
+ # ---------------------------------------------------------------------------
47
+
48
+
49
+ def ref_kernel(data):
50
+ """Reference: Y = 0.2989 R + 0.5870 G + 0.1140 B"""
51
+ rgb, output = data
52
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
53
+ output[...] = torch.sum(rgb * weights, dim=-1)
54
+ return output
55
+
56
+
57
+ def generate_input(size, seed):
58
+ gen = torch.Generator(device="cuda")
59
+ gen.manual_seed(seed)
60
+ x = torch.rand(size, size, 3, device="cuda", dtype=torch.float32, generator=gen).contiguous()
61
+ y = torch.empty(size, size, device="cuda", dtype=torch.float32).contiguous()
62
+ return x, y
63
+
64
+
65
+ def check_implementation(data, submission_output, rtol=1e-4, atol=1e-4):
66
+ ref_output = ref_kernel(data)
67
+ if submission_output.shape != ref_output.shape:
68
+ return False, f"Shape mismatch: expected {ref_output.shape}, got {submission_output.shape}"
69
+ if torch.allclose(submission_output, ref_output, rtol=rtol, atol=atol):
70
+ return True, "Match"
71
+ diff = torch.abs(submission_output.float() - ref_output.float())
72
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
73
+
74
+
75
+ # ---------------------------------------------------------------------------
76
+ # Self-contained reference code for Modal remote execution
77
+ # ---------------------------------------------------------------------------
78
+
79
+ MODAL_REFERENCE_CODE = r'''
80
+ import torch
81
+
82
+ def ref_kernel(data):
83
+ rgb, output = data
84
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
85
+ output[...] = torch.sum(rgb * weights, dim=-1)
86
+ return output
87
+
88
+ def generate_input(size, seed):
89
+ gen = torch.Generator(device="cuda")
90
+ gen.manual_seed(seed)
91
+ x = torch.rand(size, size, 3, device="cuda", dtype=torch.float32, generator=gen).contiguous()
92
+ y = torch.empty(size, size, device="cuda", dtype=torch.float32).contiguous()
93
+ return x, y
94
+
95
+ def check_implementation(data, submission_output, rtol=1e-4, atol=1e-4):
96
+ ref_output = ref_kernel(data)
97
+ if submission_output.shape != ref_output.shape:
98
+ return False, f"Shape mismatch: expected {ref_output.shape}, got {submission_output.shape}"
99
+ if torch.allclose(submission_output, ref_output, rtol=rtol, atol=atol):
100
+ return True, "Match"
101
+ diff = torch.abs(submission_output.float() - ref_output.float())
102
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
103
+ '''
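`check_implementation` above relies on `torch.allclose`, which accepts an element when `|actual - expected| <= atol + rtol * |expected|`. The sketch below restates that criterion in plain Python with the same default tolerances (`rtol=1e-4`, `atol=1e-4`) so the pass/fail boundary is easy to see. The helper names `is_close` and `all_close` are hypothetical and are not used by the evaluator.

```python
def is_close(actual: float, expected: float, rtol: float = 1e-4, atol: float = 1e-4) -> bool:
    """Element-wise criterion used by torch.allclose (expected plays the role of `other`)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual_values, expected_values, rtol: float = 1e-4, atol: float = 1e-4) -> bool:
    """True only if every pair satisfies the criterion."""
    return all(
        is_close(a, e, rtol, atol)
        for a, e in zip(actual_values, expected_values)
    )


if __name__ == "__main__":
    # For an expected gray value of 0.5 the allowed error is 1e-4 + 1e-4 * 0.5 = 1.5e-4.
    print(is_close(0.5001, 0.5))  # error of about 1e-4, inside the tolerance
    print(is_close(0.5003, 0.5))  # error of about 3e-4, outside the tolerance
```

Because the inputs come from `torch.rand` and the weights sum to just under 1, outputs stay below 1, so the combined tolerance never exceeds about `2e-4` per element.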
benchmarks/gpu_mode/grayscale/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/gpu_mode/mla_decode/README.md ADDED
@@ -0,0 +1,36 @@
1
+ # GPU Mode: Multi-Head Latent Attention (MLA) Decode
2
+
3
+ Evolve a Triton kernel for the MLA decode operator using SkyDiscover.
4
+
5
+ MLA is the core attention mechanism of DeepSeek-V2/V3. It enables efficient inference by keeping a compressed KV cache, using LoRA projections and RoPE.
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/mla_decode/initial_program.py \
14
+ benchmarks/gpu_mode/mla_decode/evaluator.py \
15
+ -c benchmarks/gpu_mode/mla_decode/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Must match reference MLA output (rtol=0.06, atol=0.06 in bfloat16)
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
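The scoring rule above can be sketched in a few lines of Python. This is an illustration, not the evaluator's code: it assumes `geom_mean_us` is the geometric mean of the per-case runtimes in microseconds, and the helper names `geometric_mean` and `score_from_runtimes_us` are hypothetical.

```python
import math

SCORE_SCALE = 3000.0  # value stated in the Scoring section


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(v))), which avoids overflow in a running product."""
    values = list(values)
    if not values:
        raise ValueError("need at least one runtime")
    if any(v <= 0 for v in values):
        raise ValueError("runtimes must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def score_from_runtimes_us(runtimes_us):
    """Score = SCORE_SCALE / geometric mean of runtimes (assumed to be in microseconds)."""
    return SCORE_SCALE / geometric_mean(runtimes_us)


if __name__ == "__main__":
    # Halving every runtime doubles the score.
    print(score_from_runtimes_us([200.0, 200.0]))
    print(score_from_runtimes_us([100.0, 100.0]))
```

With a single benchmark case the geometric mean is just that case's runtime, so the score reduces to `SCORE_SCALE / runtime_us`.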
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ **Note:** This benchmark requires an H200 GPU (141GB VRAM). The H100 (80GB) does not have enough memory.
28
+
29
+ ```bash
30
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H200 \
31
+ uv run skydiscover-run \
32
+ benchmarks/gpu_mode/mla_decode/initial_program.py \
33
+ benchmarks/gpu_mode/mla_decode/evaluator.py \
34
+ -c benchmarks/gpu_mode/mla_decode/config.yaml \
35
+ -s [your_algorithm] -i 50
36
+ ```
benchmarks/gpu_mode/mla_decode/config.yaml ADDED
@@ -0,0 +1,355 @@
1
+ # GPU Mode: MLA Decode (Multi-Head Latent Attention) Triton Kernel
2
+
3
+ max_iterations: 100
4
+ checkpoint_interval: 1
5
+ log_level: "INFO"
6
+
7
+ llm:
8
+ models:
9
+ - name: "gpt-5"
10
+ weight: 1.0
11
+ api_base: https://api.openai.com/v1
12
+ temperature: 1.0
13
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject both temperature and top_p
14
+ max_tokens: 32000
15
+ timeout: 600
16
+
17
+ prompt:
18
+ system_message: |
19
+ You are an expert Triton engineer tasked with translating PyTorch code into highly optimized Triton kernel code.
20
+
21
+ Below is a pytorch implementation of the multi-head latent attention (MLA) module. You will want to implement a Triton kernel for the operations in the forward call:
22
+
23
+ ```python
24
+ import math
25
+ from dataclasses import dataclass
26
+ import torch
27
+ from torch import nn
28
+ import torch.nn.functional as F
29
+
30
+ class RoPE(nn.Module):
31
+ def __init__(self, d_model: int):
32
+ super().__init__()
33
+ self.d_model = d_model
34
+ theta = 10000 ** (-torch.arange(0, d_model//2,dtype=torch.bfloat16) / (d_model//2))
35
+ self.register_buffer("theta", theta)
36
+
37
+ def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
38
+ x1, x2 = x.chunk(2, dim=-1)
39
+ return torch.cat((-x2, x1), dim=-1)
40
+
41
+ def forward(self, x: torch.Tensor, start_pos: int = 0) -> torch.Tensor:
42
+ seq_len = x.size(-2)
43
+ d_model = x.size(-1)
44
+ assert d_model == self.d_model
45
+ seq_idx = torch.arange(start_pos, start_pos + seq_len, device=x.device)
46
+ idx_theta = torch.einsum('s,d->sd', seq_idx, self.theta)
47
+ idx_theta2 = torch.cat([idx_theta, idx_theta], dim=-1)
48
+ cos = idx_theta2.cos().to(torch.bfloat16)
49
+ sin = idx_theta2.sin().to(torch.bfloat16)
50
+ return x * cos + self.rotate_half(x) * sin
51
+
52
+ class KVCache(nn.Module):
53
+ def __init__(self, kv_cache_shape: tuple) -> None:
54
+ super().__init__()
55
+ self.register_buffer('data', torch.zeros(kv_cache_shape, dtype=torch.bfloat16, device='cuda'))
56
+ self.seq_len = 0
57
+ self.zero()
58
+
59
+ def zero(self) -> None:
60
+ self.data.zero_()
61
+
62
+ def get_data(self) -> torch.Tensor:
63
+ return self.data
64
+
65
+ def forward(self, c_kv: torch.Tensor) -> torch.Tensor:
66
+ assert self.seq_len + c_kv.size(1) <= self.data.size(1), "KV Cache Exceeded"
67
+
68
+ self.data = self.data.to(c_kv.dtype)
69
+ self.data[
70
+ :, self.seq_len : self.seq_len + c_kv.size(1), :
71
+ ] = c_kv
72
+ self.seq_len += c_kv.size(1)
73
+
74
+ return self.data[:, :self.seq_len], self.seq_len
75
+
76
+ @dataclass
77
+ class Config:
78
+ batch_size: int
79
+ dim: int
80
+ n_heads: int
81
+ q_lora_rank: int
82
+ kv_lora_rank: int
83
+ qk_nope_head_dim: int
84
+ qk_rope_head_dim: int
85
+ v_head_dim: int
86
+ seq_len: int
87
+ max_seq_len: int
88
+ kv_cache_shape: tuple
89
+ Q_proj_down_weight: torch.Tensor
90
+ Q_proj_up_weight: torch.Tensor
91
+ KV_proj_down_weight: torch.Tensor
92
+ KV_proj_up_weight: torch.Tensor
93
+ wo_weight: torch.Tensor
94
+
95
+ class MLA(nn.Module):
96
+ def __init__(self, config: Config):
97
+ super().__init__()
98
+ self.dim = config.dim
99
+ self.n_heads = config.n_heads
100
+ self.q_lora_rank = config.q_lora_rank
101
+ self.kv_lora_rank = config.kv_lora_rank
102
+ self.nope_head_dim = config.qk_nope_head_dim
103
+ self.rope_head_dim = config.qk_rope_head_dim
104
+ self.v_head_dim = config.v_head_dim
105
+ # Down-projection matrices
106
+ self.Q_proj_down = nn.Linear(self.dim, self.q_lora_rank, bias=False, dtype=torch.bfloat16)
107
+ self.KV_proj_down = nn.Linear(self.dim, self.kv_lora_rank + self.rope_head_dim, bias=False, dtype=torch.bfloat16)
108
+
109
+ # Up-projection and rope projection matrices
110
+ self.Q_proj_up = nn.Linear(self.q_lora_rank, (self.nope_head_dim + self.rope_head_dim) * self.n_heads, bias=False, dtype=torch.bfloat16)
111
+ self.KV_proj_up = nn.Linear(self.kv_lora_rank, (self.nope_head_dim + self.v_head_dim) * self.n_heads, bias=False, dtype=torch.bfloat16)
112
+
113
+ # RoPE on half embeddings
114
+ self.q_rope = RoPE(self.rope_head_dim)
115
+ self.k_rope = RoPE(self.rope_head_dim)
116
+
117
+ # Output projection
118
+ self.wo = nn.Linear(self.v_head_dim * self.n_heads, self.dim, dtype=torch.bfloat16, bias=False)
119
+ self.eps = 1e-6
120
+
121
+ def forward(self, x: torch.Tensor, kv_cache: KVCache) -> torch.Tensor:
122
+ # seq_len = 1 always here
123
+ batch_size, seq_len, model_dim = x.size()
124
+
125
+ ## Step 1: Handle down-projection + KV cache ##
126
+
127
+ q_lora = self.Q_proj_down(x)
128
+ kv_lora = self.KV_proj_down(x)
129
+ kv_lora, kv_len = kv_cache(kv_lora)
130
+ query_pos = kv_len - 1
131
+
132
+ ## Step 2: Up-project and prepare NoPE + RoPE ##
133
+
134
+ # Handle queries Q first
135
+ q_nope_and_rope = self.Q_proj_up(q_lora).view(
136
+ batch_size, seq_len, self.n_heads, self.nope_head_dim + self.rope_head_dim)
137
+ q_nope, q_rope = torch.split(q_nope_and_rope, [self.nope_head_dim, self.rope_head_dim], dim=-1)
138
+
139
+ # Handle keys and values K/V. V does not need RoPE
140
+ kv_nope, k_rope = torch.split(kv_lora, [self.kv_lora_rank, self.rope_head_dim], dim=-1)
141
+ kv_nope = self.KV_proj_up(kv_nope).view(
142
+ batch_size, kv_len, self.n_heads, self.nope_head_dim + self.v_head_dim)
143
+ k_nope, v = torch.split(kv_nope, [self.nope_head_dim, self.v_head_dim], dim=-1)
144
+
145
+ ## Step 3: Handle RoPE Stream ##
146
+
147
+ # Compute RoPE for queries and combine with no-RoPE part
148
+ q_rope = q_rope.permute(0, 2, 1, 3) # bs x n_heads x seq_len x rope_head_dim
149
+ q_rope = self.q_rope(q_rope, start_pos=query_pos)
150
+
151
+ q_nope = q_nope.permute(0, 2, 1, 3) # bs x n_heads x seq_len x nope_head_dim
152
+ q = torch.concat([q_nope, q_rope], dim=-1)
153
+
154
+ # Compute RoPE for keys and combine with no-RoPE part
155
+ k_rope = k_rope[:, None, :, :]
156
+ k_rope = self.k_rope(k_rope).expand(-1,self.n_heads,-1,-1)
157
+ k_nope = k_nope.permute(0, 2, 1, 3) # bs x n_heads x kv_len x nope_head_dim
158
+ k = torch.concat([k_nope, k_rope], dim=-1)
159
+
160
+ ## Step 4: Compute Multi-head Attention ##
161
+
162
+ v = v.permute(0, 2, 1, 3) # bs x n_heads x kv_len x v_head_dim
163
+ scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.rope_head_dim + self.nope_head_dim)
164
+ attn = F.softmax(scores, dim=-1).to(torch.bfloat16)
165
+ y = torch.matmul(attn, v).view(batch_size, 1, -1)
166
+ y = self.wo(y)
167
+
168
+ return y, kv_cache.get_data()
169
+ ```
170
+
171
+ Your function should be defined as 'custom_kernel' (skeleton provided below)
172
+
173
+ ```python
174
+ ### DO NOT CHANGE THIS IMPORT STATEMENTS BLOCK ###
175
+ import os
176
+ import math
177
+ from typing import Tuple
178
+ import torch
179
+ import torch.nn.functional as F
180
+ import triton
181
+ from reference import KVCache, Config # Definition of KVCache and Config classes are shown above. Must import this way. Do not rewrite yourself.
182
+ ### END OF IMPORT STATEMENTS BLOCK ###
183
+
184
+ ### Import other packages here if needed
185
+
186
+ def custom_kernel(data: Tuple[Config, torch.Tensor, KVCache]) -> Tuple[torch.Tensor, torch.Tensor]:
187
+ config, x, kv_cache = data
188
+
189
+ bs = config.batch_size
190
+ sl = config.seq_len
191
+ pl = kv_cache.seq_len
192
+ msl = config.max_seq_len
193
+ nh = config.n_heads
194
+ d = config.dim
195
+ dq = config.q_lora_rank
196
+ dkv = config.kv_lora_rank
197
+ dnope = config.qk_nope_head_dim
198
+ drope = config.qk_rope_head_dim
199
+ dv = config.v_head_dim
200
+
201
+ wDQ = config.Q_proj_down_weight
202
+ wDKV = config.KV_proj_down_weight
203
+ wUQ = config.Q_proj_up_weight
204
+ wUKV = config.KV_proj_up_weight
205
+ wO = config.wo_weight
206
+
207
+ # Perform MLA operations to process data into output and updated kv_cache
208
+
209
+ return output, kv_cache.data
210
+ ```
211
+
212
+ with the following signature:
213
+
214
+ Input:
215
+ - `data`: Tuple of (config: Config, x: torch.Tensor, kv_cache: KVCache)
216
+ - config: An instance of class `Config` containing model configurations and weights
217
+ - x: Input tensor of shape [batch_size, seq_len, dim]
218
+ - kv_cache: An instance of KVCache class for caching the keys and values
219
+
220
+ Output:
221
+ - output: Output tensor [batch_size, seq_len, dim]
222
+ - kv_cache.data: The data field of the updated `KVCache` instance with the new keys and values added
223
+
224
+ To warm you up in writing optimized triton code, here is an example code which is correct for your task but very unoptimized. Your code should be as optimized as possible but still correct.
225
+
226
+ ```python
227
+ import os
228
+ import math
229
+ from typing import Tuple
230
+ import torch
231
+ import torch.nn.functional as F
232
+ import triton
233
+ import triton.language as tl
234
+ from reference import KVCache, Config
235
+
236
+ @triton.jit
237
+ def rope_swap_halves_kernel(
238
+ x_ptr,
239
+ cos_ptr, sin_ptr,
240
+ B: tl.constexpr,
241
+ T: tl.constexpr,
242
+ D: tl.constexpr,
243
+ stride_xb, stride_xt, stride_xd,
244
+ stride_cos_t, stride_cos_d,
245
+ stride_sin_t, stride_sin_d,
246
+ BLOCK_HALF: tl.constexpr,
247
+ ):
248
+ pid = tl.program_id(0)
249
+ bt = pid
250
+ b = bt // T
251
+ t = bt - b * T
252
+ half = D // 2
253
+ off = tl.arange(0, BLOCK_HALF)
254
+ mask = off < half
255
+ x_base = x_ptr + b * stride_xb + t * stride_xt
256
+ x0_ptr = x_base + off * stride_xd
257
+ x1_ptr = x_base + (half + off) * stride_xd
258
+ cos_base = cos_ptr + t * stride_cos_t
259
+ sin_base = sin_ptr + t * stride_sin_t
260
+ c_ptr = cos_base + off * stride_cos_d
261
+ s_ptr = sin_base + off * stride_sin_d
262
+ x0 = tl.load(x0_ptr, mask=mask, other=0.0).to(tl.float32)
263
+ x1 = tl.load(x1_ptr, mask=mask, other=0.0).to(tl.float32)
264
+ c = tl.load(c_ptr, mask=mask, other=0.0).to(tl.float32)
265
+ s = tl.load(s_ptr, mask=mask, other=0.0).to(tl.float32)
266
+ out0 = x0 * c - x1 * s
267
+ out1 = x1 * c + x0 * s
268
+ tl.store(x0_ptr, out0.to(tl.bfloat16), mask=mask)
269
+ tl.store(x1_ptr, out1.to(tl.bfloat16), mask=mask)
270
+
271
+ # ... (see initial_program.py for full working baseline)
272
+ ```
273
+
274
+ Below are the different configs that your kernel will be tested on:
275
+
276
+ Common configs:
277
+ - {"batch_size": 128, "seq_len": 1, "kv_lora_rank": 512, "qk_rope_head_dim": 64, "v_head_dim": 128, "n_heads": 128, "dim": 7168, "q_lora_rank": 1536, "max_seq_len": 8192}
278
+
279
+ For correctness check:
280
+ - {"prefill": 128}
281
+ - {"prefill": 512}
282
+ - {"prefill": 1024}
283
+ - {"prefill": 2048}
284
+
285
+ For performance benchmark (optimize runtime for these):
286
+ - {"prefill": 6144}
287
+
288
+ Rules:
289
+ - The tensor arguments passed in will already be on your CUDA device.
290
+ - The weights for all parameters in the MLA will be given as input.
291
+ - All weights and data will be in `torch.bfloat16` format.
292
+ - Define all of your code in one final ```python ``` block.
293
+ - The entrypoint to your code must be named 'custom_kernel'.
294
+ - You will be using Triton 3.4.0 and your kernels will be run on an Nvidia H200 GPU.
295
+ - Consider optimizing multiple operations with triton, not just limited to softmax. E.g., rope, attention, etc.
296
+ - You are allowed to use torch.compile().
297
+
298
+ Important rules in triton 3.4.0:
299
+ - `tl.load` does not have an argument called `dtype`. Never use it like `tl.load(..., dtype=...)`.
300
+ - Triton dtypes are not callable, so never use them like `tl.float16(1.0)`, `tl.float32(0.0)`.
301
+ - `tl.arange(start, end)`:
302
+ - range length (end - start) must be power-of-2
303
+ - start, end must be of type `tl.constexpr`
304
+ - `tl.range(start, end, step, num_stages)`:
305
+ - keep loop index type stable, don't reassign it
306
+ - start, end, step do not have to be `tl.constexpr` but must stay scalar integer types
307
+ - num_stages must be `tl.constexpr`
308
+ - Do not do something like x[0] or offs[0] inside a Triton kernel. Triton tensors are SIMD vectors; scalar indexing like [0] is not generally supported.
309
+
310
+ Here's a simple example correctly following these rules:
311
+
312
+ ```python
313
+ import torch
314
+ import triton
315
+ import triton.language as tl
316
+
317
+ @triton.jit
318
+ def kernel_right(
319
+ x_ptr, y_ptr, out_ptr,
320
+ n_elements: tl.constexpr,
321
+ BLOCK: tl.constexpr,
322
+ ROW_STEP: tl.constexpr,
323
+ NUM_STAGES: tl.constexpr,
324
+ ):
325
+ pid = tl.program_id(axis=0)
326
+ offs = pid * BLOCK + tl.arange(0, BLOCK)
327
+ mask = offs < n_elements
328
+ x = tl.load(x_ptr + offs, mask=mask, other=0.0)
329
+ y = tl.load(y_ptr + offs, mask=mask, other=0.0)
330
+ one_f32 = tl.full([], 1.0, tl.float32)
331
+ acc = tl.zeros((BLOCK,), dtype=tl.float32)
332
+ acc = tl.cast(x, tl.float32) + tl.cast(y, tl.float32) + one_f32
333
+ base = tl.full([], pid * BLOCK, tl.int32)
334
+ x0 = tl.load(x_ptr + base, mask=(base < n_elements), other=0.0)
335
+ x0_vec = tl.full((BLOCK,), x0, tl.float32)
336
+ out_vec = acc + x0_vec
337
+ n_rows = tl.full([], 4, tl.int32)
338
+ extra = tl.zeros((BLOCK,), dtype=tl.float32)
339
+ for r in tl.range(0, n_rows, ROW_STEP, num_stages=NUM_STAGES):
340
+ shift = r * tl.full([], 1, tl.int32)
341
+ offs_r = offs + shift
342
+ xr = tl.load(x_ptr + offs_r, mask=(offs_r < n_elements), other=0.0)
343
+ extra += tl.cast(xr, tl.float32)
344
+ out_vec = out_vec + extra
345
+ tl.store(out_ptr + offs, tl.cast(out_vec, tl.float16), mask=mask)
346
+ ```
347
+ evaluator:
348
+ timeout: 600
349
+ max_retries: 3
350
+ cascade_evaluation: true
351
+ cascade_thresholds: [0.4, 0.3]
352
+
353
+ diff_based_generation: true
354
+ max_solution_length: 60000
355
+ random_seed: 42
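The prompt above requires the length of a `tl.arange` range to be a power of 2, so the host code usually rounds a runtime dimension up before launching and lets a mask discard the padding. A minimal host-side sketch of that rounding follows, using the same bit trick as `initial_program.py` (`1 << (n - 1).bit_length()`). The function name `next_power_of_two` is hypothetical.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 by that amount gives the first power of two not below n.
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 32, 33, 6144):
        print(n, "->", next_power_of_two(n))
```

Inside the kernel, the padded lanes are then excluded with a mask such as `off < half`, as `rope_swap_halves_kernel` does.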
benchmarks/gpu_mode/mla_decode/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for MLA Decode — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/mla_decode/initial_program.py ADDED
@@ -0,0 +1,245 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial MLA Decode submission — optimised baseline with Triton softmax and RoPE kernels.
4
+ """
5
+
6
+ import os
7
+ import math
8
+ from typing import Tuple
9
+ import torch
10
+ import torch.nn.functional as F
11
+ import triton
12
+ import triton.language as tl
13
+ from reference import KVCache, Config
14
+
15
+
16
+ @triton.jit
17
+ def rope_swap_halves_kernel(
18
+ x_ptr,
19
+ cos_ptr, sin_ptr,
20
+ B: tl.constexpr,
21
+ T: tl.constexpr,
22
+ D: tl.constexpr,
23
+ stride_xb, stride_xt, stride_xd,
24
+ stride_cos_t, stride_cos_d,
25
+ stride_sin_t, stride_sin_d,
26
+ BLOCK_HALF: tl.constexpr,
27
+ ):
28
+ pid = tl.program_id(0)
29
+ bt = pid
30
+ b = bt // T
31
+ t = bt - b * T
32
+
33
+ half = D // 2
34
+
35
+ off = tl.arange(0, BLOCK_HALF)
36
+ mask = off < half
37
+
38
+ x_base = x_ptr + b * stride_xb + t * stride_xt
39
+ x0_ptr = x_base + off * stride_xd
40
+ x1_ptr = x_base + (half + off) * stride_xd
41
+
42
+ cos_base = cos_ptr + t * stride_cos_t
43
+ sin_base = sin_ptr + t * stride_sin_t
44
+
45
+ c_ptr = cos_base + off * stride_cos_d
46
+ s_ptr = sin_base + off * stride_sin_d
47
+
48
+ x0 = tl.load(x0_ptr, mask=mask, other=0.0).to(tl.float32)
49
+ x1 = tl.load(x1_ptr, mask=mask, other=0.0).to(tl.float32)
50
+ c = tl.load(c_ptr, mask=mask, other=0.0).to(tl.float32)
51
+ s = tl.load(s_ptr, mask=mask, other=0.0).to(tl.float32)
52
+
53
+ out0 = x0 * c - x1 * s
54
+ out1 = x1 * c + x0 * s
55
+
56
+ tl.store(x0_ptr, out0.to(tl.bfloat16), mask=mask)
57
+ tl.store(x1_ptr, out1.to(tl.bfloat16), mask=mask)
58
+
59
+
60
+ def rope_inplace_query(q_rope: torch.Tensor, cos_q: torch.Tensor, sin_q: torch.Tensor):
61
+ assert q_rope.is_cuda
62
+ assert q_rope.shape[-1] % 2 == 0
63
+ bs, nh, d_rope = q_rope.shape
64
+
65
+ half = d_rope // 2
66
+ BLOCK_HALF = 1 << (half - 1).bit_length()
67
+
68
+ grid = (bs * nh,)
69
+
70
+ rope_swap_halves_kernel[grid](
71
+ q_rope,
72
+ cos_q, sin_q,
73
+ B=bs, T=nh, D=d_rope,
74
+ stride_xb=q_rope.stride(0),
75
+ stride_xt=q_rope.stride(1),
76
+ stride_xd=q_rope.stride(2),
77
+ stride_cos_t=0, stride_cos_d=cos_q.stride(0),
78
+ stride_sin_t=0, stride_sin_d=sin_q.stride(0),
79
+ BLOCK_HALF=BLOCK_HALF,
80
+ num_warps=4,
81
+ )
82
+
83
+
84
+ _rope_cache = {}
85
+
86
+
87
+ def _rotate_half(x: torch.Tensor) -> torch.Tensor:
88
+ half = x.shape[-1] // 2
89
+ return torch.cat((-x[..., half:], x[..., :half]), dim=-1)
90
+
91
+
92
+ def _get_rope_tables(dim: int, max_seq_len: int, device: torch.device):
93
+ key = (dim, max_seq_len, device)
94
+ if key not in _rope_cache:
95
+ half = dim // 2
96
+ theta = (10000.0 ** (-torch.arange(half, dtype=torch.float32, device=device) / half)).to(
97
+ torch.bfloat16
98
+ )
99
+ pos = torch.arange(max_seq_len, dtype=torch.int64, device=device).unsqueeze_(1)
100
+ idx = pos * theta[None, :]
101
+ idx = torch.cat([idx, idx], dim=-1)
102
+ _rope_cache[key] = (idx.cos().to(torch.bfloat16), idx.sin().to(torch.bfloat16))
103
+ return _rope_cache[key]
104
+
105
+
106
+ @triton.jit
107
+ def _softmax_kernel(
108
+ out_ptr, in_ptr,
109
+ stride_out, stride_in,
110
+ n_cols,
111
+ BLOCK_SIZE: tl.constexpr,
112
+ NUM_STAGES: tl.constexpr,
113
+ ):
114
+ row = tl.program_id(0)
115
+ row_off_in = row * stride_in
116
+ row_off_out = row * stride_out
117
+
118
+ max_val = tl.full([BLOCK_SIZE], -float("inf"), tl.float32)
119
+ col = tl.arange(0, BLOCK_SIZE)
120
+ for start in range(0, n_cols, BLOCK_SIZE):
121
+ cur = start + col
122
+ mask = cur < n_cols
123
+ val = tl.load(in_ptr + row_off_in + cur, mask=mask, other=-float('inf'))
124
+ max_val = tl.maximum(max_val, tl.cast(val, tl.float32))
125
+ row_max = tl.max(max_val)
126
+
127
+ sum_val = tl.full([BLOCK_SIZE], 0.0, tl.float32)
128
+ for start in range(0, n_cols, BLOCK_SIZE):
129
+ cur = start + col
130
+ mask = cur < n_cols
131
+ val = tl.load(in_ptr + row_off_in + cur, mask=mask, other=-float('inf'))
132
+ exp_val = tl.exp(tl.cast(val, tl.float32) - row_max)
133
+ tl.store(out_ptr + row_off_out + cur, tl.cast(exp_val, tl.bfloat16), mask=mask)
134
+ sum_val += exp_val
135
+ row_sum = tl.sum(sum_val)
136
+
137
+ for start in range(0, n_cols, BLOCK_SIZE):
138
+ cur = start + col
139
+ mask = cur < n_cols
140
+ val = tl.load(out_ptr + row_off_out + cur, mask=mask, other=0.0)
141
+ norm = tl.cast(val, tl.float32) / row_sum
142
+ tl.store(out_ptr + row_off_out + cur, tl.cast(norm, tl.bfloat16), mask=mask)
143
+
144
+
145
+ def _triton_softmax(x: torch.Tensor) -> torch.Tensor:
146
+ assert x.is_cuda and x.dtype == torch.bfloat16
147
+ n_rows, n_cols = x.shape
148
+
149
+ if n_cols <= 32:
150
+ BLOCK_SIZE = 32
151
+ elif n_cols <= 64:
152
+ BLOCK_SIZE = 64
153
+ elif n_cols <= 128:
154
+ BLOCK_SIZE = 128
155
+ else:
156
+ BLOCK_SIZE = 1 << (n_cols - 1).bit_length()
157
+ BLOCK_SIZE = min(BLOCK_SIZE, 1024)
158
+
159
+ out = torch.empty_like(x)
160
+ grid = (n_rows,)
161
+ _softmax_kernel[grid](
162
+ out, x,
163
+ out.stride(0), x.stride(0),
164
+ n_cols,
165
+ BLOCK_SIZE=BLOCK_SIZE,
166
+ NUM_STAGES=2,
167
+ num_warps=4,
168
+ )
169
+ return out
170
+
171
+
172
+ def custom_kernel(data: Tuple[Config, torch.Tensor, KVCache]) -> Tuple[torch.Tensor, torch.Tensor]:
173
+ """
174
+ Optimised forward step of the Multi-head Latent Attention (MLA) module.
175
+ """
176
+ config, x, kv_cache = data
177
+
178
+ bs = config.batch_size
179
+ sl = config.seq_len
180
+ nh = config.n_heads
181
+ dq = config.q_lora_rank
182
+ dkv = config.kv_lora_rank
183
+ d_nope = config.qk_nope_head_dim
184
+ d_rope = config.qk_rope_head_dim
185
+ dv = config.v_head_dim
186
+ msl = config.max_seq_len
187
+
188
+ wDQ = config.Q_proj_down_weight
189
+ wDKV = config.KV_proj_down_weight
190
+ wUQ = config.Q_proj_up_weight
191
+ wUKV = config.KV_proj_up_weight
192
+ wO = config.wo_weight
193
+
194
+ q_lora = F.linear(x, wDQ)
195
+ kv_lora_input = F.linear(x, wDKV)
196
+
197
+ kv_lora, kv_len = kv_cache(kv_lora_input)
198
+ query_pos = kv_len - 1
199
+
200
+ q_up = F.linear(q_lora.squeeze(1), wUQ)
201
+ q_up = q_up.view(bs, nh, d_nope + d_rope)
202
+ q_nope = q_up[..., :d_nope]
203
+ q_rope = q_up[..., d_nope:]
204
+
205
+ kv_nope_input = kv_lora[..., :dkv]
206
+ k_rope_input = kv_lora[..., dkv:]
207
+
208
+ cos_table, sin_table = _get_rope_tables(d_rope, msl, x.device)
209
+
210
+ cos_q = cos_table[query_pos].view(d_rope).contiguous()
211
+ sin_q = sin_table[query_pos].view(d_rope).contiguous()
212
+ rope_inplace_query(q_rope, cos_q, sin_q)
213
+
214
+ cos_k = cos_table[:kv_len]
215
+ sin_k = sin_table[:kv_len]
216
+ k_rope = k_rope_input * cos_k + _rotate_half(k_rope_input) * sin_k
217
+
218
+ wUKV_view = wUKV.view(nh, d_nope + dv, dkv)
219
+ wK = wUKV_view[:, :d_nope, :]
220
+ q_nope_latent = torch.einsum('bhd,hdk->bhk', q_nope, wK)
221
+
222
+ kv_nope_T = kv_nope_input.transpose(1, 2)
223
+ scores_nope = torch.matmul(q_nope_latent, kv_nope_T)
224
+
225
+ scores_rope = torch.matmul(q_rope, k_rope.transpose(-2, -1))
226
+
227
+ scale = 1.0 / math.sqrt(d_nope + d_rope)
228
+ scores = (scores_nope + scores_rope) * scale
229
+
230
+ scores_flat = scores.reshape(bs * nh, kv_len)
231
+ attn_flat = _triton_softmax(scores_flat)
232
+ attn = attn_flat.view(bs, nh, kv_len)
233
+
234
+ M = torch.matmul(attn, kv_nope_input)
235
+
236
+ wV = wUKV_view[:, d_nope:, :]
237
+ wV_T = wV.permute(0, 2, 1)
238
+ y_head = torch.einsum('bhd,hdk->bhk', M, wV_T)
239
+
240
+ y = y_head.reshape(bs, nh * dv)
241
+ y = y.unsqueeze(1)
242
+ output = F.linear(y, wO)
243
+
244
+ return output, kv_cache.data
245
+ # EVOLVE-BLOCK-END
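Reviewer sketch, not part of the commit: the `_softmax_kernel` above walks each row three times in `BLOCK_SIZE` chunks (running maximum, exponentiate and sum, normalise), and `_triton_softmax` picks the block size as the next power of two, floored at 32 and capped at 1024. The pure-Python version below mirrors that control flow so it can be checked without a GPU. The names `pick_block_size` and `blocked_softmax` are hypothetical, and the bfloat16 rounding of the intermediate store is deliberately not modelled.

```python
import math


def pick_block_size(n_cols: int, cap: int = 1024) -> int:
    """Next power of two >= n_cols, floored at 32 and capped at `cap`.

    Equivalent to the if/elif ladder in _triton_softmax (hypothetical helper).
    """
    if n_cols <= 32:
        return 32
    return min(1 << (n_cols - 1).bit_length(), cap)


def blocked_softmax(row, block_size):
    """Three-pass softmax over one row, processed in chunks of block_size."""
    n = len(row)

    # Pass 1: running maximum across all chunks (for numerical stability).
    row_max = -math.inf
    for start in range(0, n, block_size):
        row_max = max(row_max, max(row[start:start + block_size]))

    # Pass 2: store exp(x - max) and accumulate the row sum.
    out = [0.0] * n
    total = 0.0
    for start in range(0, n, block_size):
        for i in range(start, min(start + block_size, n)):
            out[i] = math.exp(row[i] - row_max)
            total += out[i]

    # Pass 3: re-read the stored exponentials and normalise them in place.
    for start in range(0, n, block_size):
        for i in range(start, min(start + block_size, n)):
            out[i] /= total
    return out
```

Because the block size is capped at 1024 while the benchmark rows hold thousands of scores, each row takes several loop iterations per pass, which is why the kernel needs the chunked loops rather than a single load.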
benchmarks/gpu_mode/mla_decode/reference.py ADDED
@@ -0,0 +1,520 @@
1
+ """
2
+ Reference implementation for MLA Decode (Multi-Head Latent Attention) Triton kernel.
3
+ Same test cases, benchmarks, generate_input, ref_kernel, and check_implementation.
4
+ """
5
+
6
+ import math
7
+ from dataclasses import dataclass
8
+ import torch
9
+ from torch import nn
10
+ import torch.nn.functional as F
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Scoring and benchmark configuration (read by shared_eval.py)
14
+ # ---------------------------------------------------------------------------
15
+
16
+ SCORE_SCALE = 3000.0
17
+
18
+ # MLA uses wall-clock timing, a 1% relative-error stopping threshold, no wall-clock timeout, and torch.no_grad()
19
+ BENCH_USE_CUDA_EVENTS = False
20
+ BENCH_REL_ERROR = 0.01
21
+ BENCH_WALL_TIMEOUT_NS = None
22
+ BENCH_NO_GRAD = True
23
+ BENCH_MAX_REPEATS = 100
24
+ BENCH_MAX_TIME_NS = 10e9
25
+ BENCH_WARMUP_STYLE = 'timed_calls'
26
+
27
+ # ---------------------------------------------------------------------------
28
+ # Model classes (needed by both reference and submissions)
29
+ # ---------------------------------------------------------------------------
30
+
31
+
32
+ class RoPE(nn.Module):
33
+ def __init__(self, d_model: int):
34
+ super().__init__()
35
+ self.d_model = d_model
36
+ theta = 10000 ** (-torch.arange(0, d_model // 2, dtype=torch.bfloat16) / (d_model // 2))
37
+ self.register_buffer("theta", theta)
38
+
39
+ def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
40
+ x1, x2 = x.chunk(2, dim=-1)
41
+ return torch.cat((-x2, x1), dim=-1)
42
+
43
+ def forward(self, x: torch.Tensor, start_pos: int = 0) -> torch.Tensor:
44
+ seq_len = x.size(-2)
45
+ d_model = x.size(-1)
46
+ assert d_model == self.d_model
47
+ seq_idx = torch.arange(start_pos, start_pos + seq_len, device=x.device)
48
+ idx_theta = torch.einsum('s,d->sd', seq_idx, self.theta)
49
+ idx_theta2 = torch.cat([idx_theta, idx_theta], dim=-1)
50
+ cos = idx_theta2.cos().to(torch.bfloat16)
51
+ sin = idx_theta2.sin().to(torch.bfloat16)
52
+ return x * cos + self.rotate_half(x) * sin
53
+
54
+
55
+ class KVCache(nn.Module):
56
+ def __init__(self, kv_cache_shape: tuple, **kwargs) -> None:
57
+ super().__init__(**kwargs)
58
+ self.register_buffer('data', torch.zeros(kv_cache_shape, dtype=torch.bfloat16))
59
+ self.seq_len = 0
60
+ self.zero()
61
+
62
+ def zero(self) -> None:
63
+ self.data.zero_()
64
+
65
+ def get_data(self) -> torch.Tensor:
66
+ return self.data
67
+
68
+ def forward(self, c_kv: torch.Tensor) -> tuple[torch.Tensor, int]:
69
+ assert self.seq_len + c_kv.size(1) <= self.data.size(1), "KV Cache Exceeded"
70
+
71
+ self.data = self.data.to(c_kv.dtype)
72
+ self.data[
73
+ :, self.seq_len: self.seq_len + c_kv.size(1), :
74
+ ] = c_kv
75
+ self.seq_len += c_kv.size(1)
76
+
77
+ return self.data[:, :self.seq_len], self.seq_len
78
+
79
+
80
+ @dataclass
81
+ class Config:
82
+ batch_size: int
83
+ dim: int
84
+ n_heads: int
85
+ q_lora_rank: int
86
+ kv_lora_rank: int
87
+ qk_nope_head_dim: int
88
+ qk_rope_head_dim: int
89
+ v_head_dim: int
90
+ seq_len: int
91
+ max_seq_len: int
92
+ kv_cache_shape: tuple
93
+ Q_proj_down_weight: torch.Tensor
94
+ Q_proj_up_weight: torch.Tensor
95
+ KV_proj_down_weight: torch.Tensor
96
+ KV_proj_up_weight: torch.Tensor
97
+ wo_weight: torch.Tensor
98
+
99
+
100
+ class MLA(nn.Module):
101
+ def __init__(self, config: Config):
102
+ super().__init__()
103
+ self.dim = config.dim
104
+ self.n_heads = config.n_heads
105
+ self.q_lora_rank = config.q_lora_rank
106
+ self.kv_lora_rank = config.kv_lora_rank
107
+ self.nope_head_dim = config.qk_nope_head_dim
108
+ self.rope_head_dim = config.qk_rope_head_dim
109
+ self.v_head_dim = config.v_head_dim
110
+ self.Q_proj_down = nn.Linear(self.dim, self.q_lora_rank, dtype=torch.bfloat16, bias=False)
111
+ self.KV_proj_down = nn.Linear(self.dim, self.kv_lora_rank + self.rope_head_dim, dtype=torch.bfloat16, bias=False)
112
+ self.Q_proj_up = nn.Linear(self.q_lora_rank, (self.nope_head_dim + self.rope_head_dim) * self.n_heads, dtype=torch.bfloat16, bias=False)
113
+ self.KV_proj_up = nn.Linear(self.kv_lora_rank, (self.nope_head_dim + self.v_head_dim) * self.n_heads, dtype=torch.bfloat16, bias=False)
114
+ self.q_rope = RoPE(self.rope_head_dim)
115
+ self.k_rope = RoPE(self.rope_head_dim)
116
+ self.wo = nn.Linear(self.v_head_dim * self.n_heads, self.dim, dtype=torch.bfloat16, bias=False)
117
+ self.eps = 1e-6
118
+
119
+ def forward(self, x: torch.Tensor, kv_cache: KVCache) -> tuple[torch.Tensor, torch.Tensor]:
120
+ batch_size, seq_len, model_dim = x.size()
121
+
122
+ q_lora = self.Q_proj_down(x)
123
+ kv_lora = self.KV_proj_down(x)
124
+ kv_lora, kv_len = kv_cache(kv_lora)
125
+ query_pos = kv_len - 1
126
+
127
+ q_nope_and_rope = self.Q_proj_up(q_lora).view(
128
+ batch_size, seq_len, self.n_heads, self.nope_head_dim + self.rope_head_dim)
129
+ q_nope, q_rope = torch.split(q_nope_and_rope, [self.nope_head_dim, self.rope_head_dim], dim=-1)
130
+
131
+ kv_nope, k_rope = torch.split(kv_lora, [self.kv_lora_rank, self.rope_head_dim], dim=-1)
132
+ kv_nope = self.KV_proj_up(kv_nope).view(
133
+ batch_size, kv_len, self.n_heads, self.nope_head_dim + self.v_head_dim)
134
+ k_nope, v = torch.split(kv_nope, [self.nope_head_dim, self.v_head_dim], dim=-1)
135
+
136
+ q_rope = q_rope.permute(0, 2, 1, 3)
137
+ q_rope = self.q_rope(q_rope, start_pos=query_pos)
138
+
139
+ q_nope = q_nope.permute(0, 2, 1, 3)
140
+ q = torch.concat([q_nope, q_rope], dim=-1)
141
+
142
+ k_rope = k_rope[:, None, :, :]
143
+ k_rope = self.k_rope(k_rope).expand(-1, self.n_heads, -1, -1)
144
+ k_nope = k_nope.permute(0, 2, 1, 3)
145
+ k = torch.concat([k_nope, k_rope], dim=-1)
146
+
147
+ v = v.permute(0, 2, 1, 3)
148
+ scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.rope_head_dim + self.nope_head_dim)
149
+ attn = F.softmax(scores, dim=-1).to(torch.bfloat16)
150
+ y = torch.matmul(attn, v).view(batch_size, 1, -1)
151
+ y = self.wo(y)
152
+
153
+ return y, kv_cache.get_data()
154
+
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # Test / benchmark cases — from discover task.yml
158
+ # ---------------------------------------------------------------------------
159
+
160
+ TEST_CASES = [
161
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 128, "seed": 9247},
162
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 512, "seed": 2197},
163
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 1024, "seed": 9107},
164
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 2048, "seed": 5291},
165
+ ]
166
+
167
+ BENCHMARK_CASES = [
168
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 4096, "seed": 9817},
169
+ {"batchsize": 128, "dim": 7168, "dq": 1536, "prefill": 6144, "seed": 5291},
170
+ ]
171
+
172
+
173
+ # ---------------------------------------------------------------------------
174
+ # Input generation
175
+ # ---------------------------------------------------------------------------
176
+
177
+
178
+ def generate_input(batchsize, dim, dq, prefill, seed):
179
+ gen = torch.Generator(device='cuda')
180
+ gen.manual_seed(seed)
181
+
182
+ Q_proj_down_weight = torch.randn((dq, dim), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dim)
183
+ KV_proj_down_weight = torch.randn((512 + 64, dim), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dim)
184
+ Q_proj_up_weight = torch.randn(((128 + 64) * 128, dq), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dq)
185
+ KV_proj_up_weight = torch.randn(((128 + 128) * 128, 512), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(512)
186
+ wo_weight = torch.randn((dim, 128 * 128), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(128 * 128)
187
+
188
+ config = Config(
189
+ batch_size=batchsize,
190
+ dim=dim,
191
+ q_lora_rank=dq,
192
+ n_heads=128,
193
+ kv_lora_rank=512,
194
+ qk_nope_head_dim=128,
195
+ qk_rope_head_dim=64,
196
+ v_head_dim=128,
197
+ seq_len=1,
198
+ max_seq_len=8192,
199
+ kv_cache_shape=(batchsize, 8192, 512 + 64),
200
+ Q_proj_down_weight=Q_proj_down_weight,
201
+ Q_proj_up_weight=Q_proj_up_weight,
202
+ KV_proj_down_weight=KV_proj_down_weight,
203
+ KV_proj_up_weight=KV_proj_up_weight,
204
+ wo_weight=wo_weight,
205
+ )
206
+ x = torch.randn((config.batch_size, 1, config.dim), dtype=torch.bfloat16, generator=gen, device='cuda')
207
+
208
+ kv_cache = KVCache((config.batch_size, config.max_seq_len, config.kv_lora_rank + config.qk_rope_head_dim)).to('cuda')
209
+ pre_filled_cache = torch.randn(
210
+ (config.batch_size, prefill, config.kv_lora_rank + config.qk_rope_head_dim),
211
+ dtype=torch.bfloat16, generator=gen, device='cuda')
212
+ kv_cache(pre_filled_cache)
213
+
214
+ return config, x, kv_cache
215
+
216
+
217
+ # ---------------------------------------------------------------------------
218
+ # Reference kernel
219
+ # ---------------------------------------------------------------------------
220
+
221
+
222
+ def ref_kernel(data):
223
+ config, x, kv_cache = data
224
+
225
+ model = MLA(config).to('cuda')
226
+ model.Q_proj_down.weight = nn.Parameter(config.Q_proj_down_weight)
227
+ model.Q_proj_up.weight = nn.Parameter(config.Q_proj_up_weight)
228
+ model.KV_proj_down.weight = nn.Parameter(config.KV_proj_down_weight)
229
+ model.KV_proj_up.weight = nn.Parameter(config.KV_proj_up_weight)
230
+ model.wo.weight = nn.Parameter(config.wo_weight)
231
+
232
+ output, kv_data = model(x, kv_cache)
233
+ return output, kv_data
234
+
235
+
236
+ # ---------------------------------------------------------------------------
237
+ # Correctness checking
238
+ # ---------------------------------------------------------------------------
239
+
240
+
241
+ @torch.no_grad()
242
+ def _verbose_allclose(received, expected, rtol=1e-05, atol=1e-08, max_print=5):
243
+ if received.shape != expected.shape:
244
+ return False, [f"SIZE MISMATCH. received shape: {received.shape}, expected shape: {expected.shape}"]
245
+
246
+ diff = torch.abs(received.to(torch.float32) - expected.to(torch.float32))
247
+ tolerance = atol + rtol * torch.abs(expected.to(torch.float32))
248
+ tol_mismatched = diff > tolerance
249
+ nan_mismatched = torch.logical_xor(torch.isnan(received), torch.isnan(expected))
250
+ posinf_mismatched = torch.logical_xor(torch.isposinf(received), torch.isposinf(expected))
251
+ neginf_mismatched = torch.logical_xor(torch.isneginf(received), torch.isneginf(expected))
252
+ mismatched = torch.logical_or(
253
+ torch.logical_or(tol_mismatched, nan_mismatched),
254
+ torch.logical_or(posinf_mismatched, neginf_mismatched),
255
+ )
256
+
257
+ mismatched_indices = torch.nonzero(mismatched)
258
+ num_mismatched = mismatched.count_nonzero().item()
259
+
260
+ if num_mismatched >= 1:
261
+ mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
262
+ for index in mismatched_indices[:max_print]:
263
+ i = tuple(index.tolist())
264
+ mismatch_details.append(f"ERROR at {i}: {received[i]} {expected[i]}")
265
+ if num_mismatched > max_print:
266
+ mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
267
+ return False, mismatch_details
268
+
269
+ return True, [f"Maximum error: {torch.max(diff)}"]
270
+
271
+
272
+ def check_implementation(data, submission_output, rtol=2e-2, atol=8e-3):
273
+ """Check submission output against reference. Returns (passed: bool, msg: str)."""
274
+ import gc
275
+ output_mla, output_kv = submission_output
276
+
277
+ # Move submission output to CPU and free GPU memory before running ref kernel
278
+ output_mla_cpu = output_mla.cpu()
279
+ output_kv_cpu = output_kv.cpu()
280
+ del output_mla, output_kv
281
+ gc.collect()
282
+ torch.cuda.empty_cache()
283
+
284
+ config, x, kv_cache = data
285
+ with torch.no_grad():
286
+ expected_mla, expected_kv = ref_kernel((config, x, kv_cache))
287
+
288
+ # Move ref output to CPU and free GPU memory before comparison
289
+ expected_mla_cpu = expected_mla.cpu()
290
+ expected_kv_cpu = expected_kv.cpu()
291
+ del expected_mla, expected_kv
292
+ gc.collect()
293
+ torch.cuda.empty_cache()
294
+
295
+ good_mla, reasons_mla = _verbose_allclose(output_mla_cpu, expected_mla_cpu, rtol=rtol, atol=atol)
296
+ good_kv, reasons_kv = _verbose_allclose(output_kv_cpu, expected_kv_cpu, rtol=rtol, atol=atol)
297
+
298
+ if not good_mla:
299
+ return False, "MLA output mismatch: " + " ".join(reasons_mla)
300
+ if not good_kv:
301
+ return False, "KV cache mismatch: " + " ".join(reasons_kv)
302
+
303
+ return True, "Match"
304
+
305
+
306
+ # ---------------------------------------------------------------------------
307
+ # Self-contained reference code for Modal remote execution
308
+ # ---------------------------------------------------------------------------
309
+
310
+ MODAL_REFERENCE_CODE = r'''
311
+ import math
312
+ from dataclasses import dataclass
313
+ import torch
314
+ from torch import nn
315
+ import torch.nn.functional as F
316
+
317
+
318
+ class RoPE(nn.Module):
319
+ def __init__(self, d_model: int):
320
+ super().__init__()
321
+ self.d_model = d_model
322
+ theta = 10000 ** (-torch.arange(0, d_model // 2, dtype=torch.bfloat16) / (d_model // 2))
323
+ self.register_buffer("theta", theta)
324
+
325
+ def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
326
+ x1, x2 = x.chunk(2, dim=-1)
327
+ return torch.cat((-x2, x1), dim=-1)
328
+
329
+ def forward(self, x: torch.Tensor, start_pos: int = 0) -> torch.Tensor:
330
+ seq_len = x.size(-2)
331
+ d_model = x.size(-1)
332
+ assert d_model == self.d_model
333
+ seq_idx = torch.arange(start_pos, start_pos + seq_len, device=x.device)
334
+ idx_theta = torch.einsum('s,d->sd', seq_idx, self.theta)
335
+ idx_theta2 = torch.cat([idx_theta, idx_theta], dim=-1)
336
+ cos = idx_theta2.cos().to(torch.bfloat16)
337
+ sin = idx_theta2.sin().to(torch.bfloat16)
338
+ return x * cos + self.rotate_half(x) * sin
339
+
340
+
341
+ class KVCache(nn.Module):
342
+ def __init__(self, kv_cache_shape: tuple, **kwargs) -> None:
343
+ super().__init__(**kwargs)
344
+ self.register_buffer('data', torch.zeros(kv_cache_shape, dtype=torch.bfloat16))
345
+ self.seq_len = 0
346
+ self.zero()
347
+
348
+ def zero(self) -> None:
349
+ self.data.zero_()
350
+
351
+ def get_data(self) -> torch.Tensor:
352
+ return self.data
353
+
354
+ def forward(self, c_kv: torch.Tensor) -> tuple[torch.Tensor, int]:
355
+ assert self.seq_len + c_kv.size(1) <= self.data.size(1), "KV Cache Exceeded"
356
+ self.data = self.data.to(c_kv.dtype)
357
+ self.data[:, self.seq_len: self.seq_len + c_kv.size(1), :] = c_kv
358
+ self.seq_len += c_kv.size(1)
359
+ return self.data[:, :self.seq_len], self.seq_len
360
+
361
+
362
+ @dataclass
363
+ class Config:
364
+ batch_size: int
365
+ dim: int
366
+ n_heads: int
367
+ q_lora_rank: int
368
+ kv_lora_rank: int
369
+ qk_nope_head_dim: int
370
+ qk_rope_head_dim: int
371
+ v_head_dim: int
372
+ seq_len: int
373
+ max_seq_len: int
374
+ kv_cache_shape: tuple
375
+ Q_proj_down_weight: torch.Tensor
376
+ Q_proj_up_weight: torch.Tensor
377
+ KV_proj_down_weight: torch.Tensor
378
+ KV_proj_up_weight: torch.Tensor
379
+ wo_weight: torch.Tensor
380
+
381
+
382
+ class MLA(nn.Module):
383
+ def __init__(self, config: Config):
384
+ super().__init__()
385
+ self.dim = config.dim
386
+ self.n_heads = config.n_heads
387
+ self.q_lora_rank = config.q_lora_rank
388
+ self.kv_lora_rank = config.kv_lora_rank
389
+ self.nope_head_dim = config.qk_nope_head_dim
390
+ self.rope_head_dim = config.qk_rope_head_dim
391
+ self.v_head_dim = config.v_head_dim
392
+ self.Q_proj_down = nn.Linear(self.dim, self.q_lora_rank, dtype=torch.bfloat16, bias=False)
393
+ self.KV_proj_down = nn.Linear(self.dim, self.kv_lora_rank + self.rope_head_dim, dtype=torch.bfloat16, bias=False)
394
+ self.Q_proj_up = nn.Linear(self.q_lora_rank, (self.nope_head_dim + self.rope_head_dim) * self.n_heads, dtype=torch.bfloat16, bias=False)
395
+ self.KV_proj_up = nn.Linear(self.kv_lora_rank, (self.nope_head_dim + self.v_head_dim) * self.n_heads, dtype=torch.bfloat16, bias=False)
396
+ self.q_rope = RoPE(self.rope_head_dim)
397
+ self.k_rope = RoPE(self.rope_head_dim)
398
+ self.wo = nn.Linear(self.v_head_dim * self.n_heads, self.dim, dtype=torch.bfloat16, bias=False)
399
+ self.eps = 1e-6
400
+
401
+ def forward(self, x: torch.Tensor, kv_cache: KVCache) -> tuple[torch.Tensor, torch.Tensor]:
402
+ batch_size, seq_len, model_dim = x.size()
403
+ q_lora = self.Q_proj_down(x)
404
+ kv_lora = self.KV_proj_down(x)
405
+ kv_lora, kv_len = kv_cache(kv_lora)
406
+ query_pos = kv_len - 1
407
+ q_nope_and_rope = self.Q_proj_up(q_lora).view(
408
+ batch_size, seq_len, self.n_heads, self.nope_head_dim + self.rope_head_dim)
409
+ q_nope, q_rope = torch.split(q_nope_and_rope, [self.nope_head_dim, self.rope_head_dim], dim=-1)
410
+ kv_nope, k_rope = torch.split(kv_lora, [self.kv_lora_rank, self.rope_head_dim], dim=-1)
411
+ kv_nope = self.KV_proj_up(kv_nope).view(
412
+ batch_size, kv_len, self.n_heads, self.nope_head_dim + self.v_head_dim)
413
+ k_nope, v = torch.split(kv_nope, [self.nope_head_dim, self.v_head_dim], dim=-1)
414
+ q_rope = q_rope.permute(0, 2, 1, 3)
415
+ q_rope = self.q_rope(q_rope, start_pos=query_pos)
416
+ q_nope = q_nope.permute(0, 2, 1, 3)
417
+ q = torch.concat([q_nope, q_rope], dim=-1)
418
+ k_rope = k_rope[:, None, :, :]
419
+ k_rope = self.k_rope(k_rope).expand(-1, self.n_heads, -1, -1)
420
+ k_nope = k_nope.permute(0, 2, 1, 3)
421
+ k = torch.concat([k_nope, k_rope], dim=-1)
422
+ v = v.permute(0, 2, 1, 3)
423
+ scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.rope_head_dim + self.nope_head_dim)
424
+ attn = F.softmax(scores, dim=-1).to(torch.bfloat16)
425
+ y = torch.matmul(attn, v).view(batch_size, 1, -1)
426
+ y = self.wo(y)
427
+ return y, kv_cache.get_data()
428
+
429
+
430
+ def ref_kernel(data):
431
+ config, x, kv_cache = data
432
+ model = MLA(config).to('cuda')
433
+ model.Q_proj_down.weight = nn.Parameter(config.Q_proj_down_weight)
434
+ model.Q_proj_up.weight = nn.Parameter(config.Q_proj_up_weight)
435
+ model.KV_proj_down.weight = nn.Parameter(config.KV_proj_down_weight)
436
+ model.KV_proj_up.weight = nn.Parameter(config.KV_proj_up_weight)
437
+ model.wo.weight = nn.Parameter(config.wo_weight)
438
+ output, kv_data = model(x, kv_cache)
439
+ return output, kv_data
440
+
441
+
442
+ def generate_input(batchsize, dim, dq, prefill, seed):
443
+ gen = torch.Generator(device='cuda')
444
+ gen.manual_seed(seed)
445
+ Q_proj_down_weight = torch.randn((dq, dim), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dim)
446
+ KV_proj_down_weight = torch.randn((512 + 64, dim), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dim)
447
+ Q_proj_up_weight = torch.randn(((128 + 64) * 128, dq), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(dq)
448
+ KV_proj_up_weight = torch.randn(((128 + 128) * 128, 512), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(512)
449
+ wo_weight = torch.randn((dim, 128 * 128), dtype=torch.bfloat16, generator=gen, device='cuda') / math.sqrt(128 * 128)
450
+ config = Config(
451
+ batch_size=batchsize, dim=dim, q_lora_rank=dq, n_heads=128,
452
+ kv_lora_rank=512, qk_nope_head_dim=128, qk_rope_head_dim=64,
453
+ v_head_dim=128, seq_len=1, max_seq_len=8192,
454
+ kv_cache_shape=(batchsize, 8192, 512 + 64),
455
+ Q_proj_down_weight=Q_proj_down_weight, Q_proj_up_weight=Q_proj_up_weight,
456
+ KV_proj_down_weight=KV_proj_down_weight, KV_proj_up_weight=KV_proj_up_weight,
457
+ wo_weight=wo_weight,
458
+ )
459
+ x = torch.randn((config.batch_size, 1, config.dim), dtype=torch.bfloat16, generator=gen, device='cuda')
460
+ kv_cache = KVCache((config.batch_size, config.max_seq_len, config.kv_lora_rank + config.qk_rope_head_dim)).to('cuda')
461
+ pre_filled_cache = torch.randn(
462
+ (config.batch_size, prefill, config.kv_lora_rank + config.qk_rope_head_dim),
463
+ dtype=torch.bfloat16, generator=gen, device='cuda')
464
+ kv_cache(pre_filled_cache)
465
+ return config, x, kv_cache
466
+
467
+
468
+ @torch.no_grad()
469
+ def _verbose_allclose(received, expected, rtol=1e-05, atol=1e-08, max_print=5):
470
+ if received.shape != expected.shape:
471
+ return False, [f"SIZE MISMATCH. received shape: {received.shape}, expected shape: {expected.shape}"]
472
+ diff = torch.abs(received.to(torch.float32) - expected.to(torch.float32))
473
+ tolerance = atol + rtol * torch.abs(expected.to(torch.float32))
474
+ tol_mismatched = diff > tolerance
475
+ nan_mismatched = torch.logical_xor(torch.isnan(received), torch.isnan(expected))
476
+ posinf_mismatched = torch.logical_xor(torch.isposinf(received), torch.isposinf(expected))
477
+ neginf_mismatched = torch.logical_xor(torch.isneginf(received), torch.isneginf(expected))
478
+ mismatched = torch.logical_or(
479
+ torch.logical_or(tol_mismatched, nan_mismatched),
480
+ torch.logical_or(posinf_mismatched, neginf_mismatched),
481
+ )
482
+ mismatched_indices = torch.nonzero(mismatched)
483
+ num_mismatched = mismatched.count_nonzero().item()
484
+ if num_mismatched >= 1:
485
+ mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
486
+ for index in mismatched_indices[:max_print]:
487
+ i = tuple(index.tolist())
488
+ mismatch_details.append(f"ERROR at {i}: {received[i]} {expected[i]}")
489
+ if num_mismatched > max_print:
490
+ mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
491
+ return False, mismatch_details
492
+ return True, [f"Maximum error: {torch.max(diff)}"]
493
+
494
+
495
+ def check_implementation(data, submission_output, rtol=2e-2, atol=8e-3):
496
+ import gc
497
+ output_mla, output_kv = submission_output
498
+ # Move submission output to CPU and free GPU memory before running ref kernel
499
+ output_mla_cpu = output_mla.cpu()
500
+ output_kv_cpu = output_kv.cpu()
501
+ del output_mla, output_kv
502
+ gc.collect()
503
+ torch.cuda.empty_cache()
504
+ config, x, kv_cache = data
505
+ with torch.no_grad():
506
+ expected_mla, expected_kv = ref_kernel((config, x, kv_cache))
507
+ # Move ref output to CPU and free GPU memory before comparison
508
+ expected_mla_cpu = expected_mla.cpu()
509
+ expected_kv_cpu = expected_kv.cpu()
510
+ del expected_mla, expected_kv
511
+ gc.collect()
512
+ torch.cuda.empty_cache()
513
+ good_mla, reasons_mla = _verbose_allclose(output_mla_cpu, expected_mla_cpu, rtol=rtol, atol=atol)
514
+ good_kv, reasons_kv = _verbose_allclose(output_kv_cpu, expected_kv_cpu, rtol=rtol, atol=atol)
515
+ if not good_mla:
516
+ return False, "MLA output mismatch: " + " ".join(reasons_mla)
517
+ if not good_kv:
518
+ return False, "KV cache mismatch: " + " ".join(reasons_kv)
519
+ return True, "Match"
520
+ '''
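Reviewer sketch, not part of the commit: `_verbose_allclose` in reference.py flags an element when `|received - expected| > atol + rtol * |expected|`, or when the two values disagree about being NaN, +inf, or -inf. `check_implementation` calls it with `rtol=2e-2` and `atol=8e-3`. The scalar version below restates that rule on plain Python floats; `mismatches` and `_kind` are hypothetical names, and the tensor handling and message formatting of the original are omitted.

```python
import math


def _kind(v: float) -> str:
    """Classify a float the way the reference separates NaN/+inf/-inf."""
    if math.isnan(v):
        return "nan"
    if math.isinf(v):
        return "+inf" if v > 0 else "-inf"
    return "finite"


def mismatches(received, expected, rtol=2e-2, atol=8e-3):
    """Indices where received and expected disagree under the allclose rule."""
    bad = []
    for i, (r, e) in enumerate(zip(received, expected)):
        if _kind(r) != _kind(e):
            # One side is NaN/inf and the other is not (or the signs differ).
            bad.append(i)
        elif _kind(r) == "finite" and abs(r - e) > atol + rtol * abs(e):
            # Both finite, but the difference exceeds the mixed tolerance.
            bad.append(i)
    return bad
```

With these defaults an expected value of 1.0 tolerates a deviation of up to 0.028, while values near zero are governed almost entirely by the absolute term of 0.008.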
benchmarks/gpu_mode/mla_decode/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
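Reviewer sketch, not part of the commit: the modal_eval.py file that follows documents the score as `score_scale / geom_mean_runtime_us`, and its timing loop stops early once the standard error of the mean drops below `bench_rel_error` times the mean, or once the accumulated time exceeds `bench_max_time_ns`. The snippet below restates both rules on plain floats. `stats` follows the helper in the diff; `should_stop` and `score` are hypothetical names, and the exact geometric-mean computation is an assumption because that part of the file is not shown in this view.

```python
import math


def stats(durations):
    """Mean, sample standard deviation, and standard error of the mean."""
    n = len(durations)
    mean = sum(durations) / n
    if n > 1:
        var = sum((x - mean) ** 2 for x in durations) / (n - 1)
        std = math.sqrt(var)
        err = std / math.sqrt(n)
    else:
        std, err = 0.0, 0.0
    return {"runs": n, "mean": mean, "std": std, "err": err}


def should_stop(durations_ns, rel_error=0.01, max_time_ns=10e9):
    """Early-exit test applied after each timed run (hypothetical helper)."""
    st = stats(durations_ns)
    if st["mean"] > 0 and st["err"] / st["mean"] < rel_error:
        return True  # the mean is already known precisely enough
    return st["mean"] * st["runs"] > max_time_ns  # time budget exhausted


def score(bench_means_ns, score_scale=3000.0):
    """score_scale divided by the geometric mean runtime in microseconds.

    Assumption: the geometric mean is taken over per-benchmark mean runtimes.
    """
    means_us = [m / 1e3 for m in bench_means_ns]
    geom_mean_us = math.exp(sum(math.log(m) for m in means_us) / len(means_us))
    return score_scale / geom_mean_us
```

In the diff the early-exit checks only run once `t > 1`, so at least three samples are collected before the loop can stop. The wall-clock timeout is a separate third condition, which the MLA task disables by setting `BENCH_WALL_TIMEOUT_NS = None`.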
benchmarks/gpu_mode/modal_eval.py ADDED
@@ -0,0 +1,259 @@
1
+ """
2
+ Shared Modal app for evaluating Triton kernels on cloud GPUs.
3
+ Scoring: score = score_scale / geom_mean_runtime_us.
4
+
5
+ Usage:
6
+ Set GPUMODE_USE_MODAL=true and GPUMODE_MODAL_GPU=H100 (or A100, L40S, T4, H200)
7
+ in environment variables, then call eval functions from evaluators.
8
+ """
9
+
10
+ import modal
11
+
12
+ app = modal.App("gpu-mode-triton-eval")
13
+
14
+ cuda_image = (
15
+ modal.Image.debian_slim(python_version="3.11")
16
+ .pip_install(
17
+ "torch>=2.2.0",
18
+ "triton>=3.0.0",
19
+ "numpy",
20
+ )
21
+ )
22
+
23
+
24
+ def _eval_triton_impl(
25
+ submission_code: str,
26
+ reference_code: str,
27
+ test_cases: list,
28
+ benchmark_cases: list,
29
+ score_scale: float = 3000.0,
30
+ bench_use_cuda_events: bool = True,
31
+ bench_rel_error: float = 0.001,
32
+ bench_wall_timeout_ns: float = 120e9,
33
+ bench_no_grad: bool = False,
34
+ bench_max_repeats: int = 100,
35
+ bench_max_time_ns: float = 10e9,
36
+ bench_warmup_style: str = 'tiny_benchmark',
37
+ ) -> dict:
38
+ """
39
+ Core evaluation logic that runs inside a Modal GPU container.
40
+
41
+ Returns dict with: combined_score, correctness, geom_mean_us, error
42
+ """
43
+ import os
44
+ import sys
45
+ import gc
46
+ import copy
47
+ import math
48
+ import time
49
+ import contextlib
50
+ import dataclasses
51
+ import tempfile
52
+
53
+ # Help with memory fragmentation for large models (MLA bs=128)
54
+ os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
55
+ import importlib.util
56
+ import traceback
57
+
58
+ import torch
59
+ import torch.cuda
60
+
61
+ def clone_data(data):
62
+ if isinstance(data, tuple):
63
+ return tuple(clone_data(x) for x in data)
64
+ elif isinstance(data, list):
65
+ return [clone_data(x) for x in data]
66
+ elif isinstance(data, dict):
67
+ return {k: clone_data(v) for k, v in data.items()}
68
+ elif isinstance(data, torch.Tensor):
69
+ return data.clone()
70
+ elif dataclasses.is_dataclass(data) and not isinstance(data, type):
71
+ fields = {f.name: clone_data(getattr(data, f.name)) for f in dataclasses.fields(data)}
72
+ return type(data)(**fields)
73
+ elif isinstance(data, torch.nn.Module):
74
+ cloned = copy.deepcopy(data)
75
+ if hasattr(data, 'seq_len'):
76
+ cloned.seq_len = data.seq_len
77
+ return cloned
78
+ return data
79
+
80
+ def stats(durations):
81
+ n = len(durations)
82
+ avg = sum(durations) / n
83
+ if n > 1:
84
+ var = sum((x - avg) ** 2 for x in durations) / (n - 1)
85
+ std = math.sqrt(var)
86
+ err = std / math.sqrt(n)
87
+ else:
88
+ std, err = 0.0, 0.0
89
+ return {"runs": n, "mean": avg, "std": std, "err": err}
90
+
91
+ tmpdir = tempfile.mkdtemp()
92
+
93
+ try:
94
+ ref_path = os.path.join(tmpdir, "reference.py")
95
+ sub_path = os.path.join(tmpdir, "submission.py")
96
+
97
+ with open(ref_path, "w") as f:
98
+ f.write(reference_code)
99
+ with open(sub_path, "w") as f:
100
+ f.write(submission_code)
101
+
102
+ sys.path.insert(0, tmpdir)
103
+
104
+ spec = importlib.util.spec_from_file_location("reference", ref_path)
105
+ reference = importlib.util.module_from_spec(spec)
106
+ spec.loader.exec_module(reference)
107
+
108
+ generate_input = reference.generate_input
109
+ check_implementation = reference.check_implementation
110
+
111
+ spec = importlib.util.spec_from_file_location("submission", sub_path)
112
+ submission = importlib.util.module_from_spec(spec)
113
+ spec.loader.exec_module(submission)
114
+ custom_kernel = submission.custom_kernel
115
+
116
+ # Correctness tests (use no_grad to reduce memory from autograd)
117
+ for i, test_args in enumerate(test_cases):
118
+ data = generate_input(**test_args)
119
+ data_copy = clone_data(data)
120
+ torch.cuda.synchronize()
121
+ with torch.no_grad():
122
+ output = custom_kernel(data)
123
+ torch.cuda.synchronize()
124
+ # Aggressively free GPU memory before ref kernel runs
125
+ del data
126
+ gc.collect()
127
+ torch.cuda.empty_cache()
128
+ passed, msg = check_implementation(data_copy, output)
129
+ del data_copy, output
130
+ gc.collect()
131
+ torch.cuda.empty_cache()
132
+ if not passed:
133
+ return {"combined_score": 0.0, "correctness": 0.0,
134
+ "error": f"Test {i} failed: {msg}"}
135
+
136
+ # Warmup
137
+ wb = benchmark_cases[0]
138
+ if bench_warmup_style == 'timed_calls':
139
+ wdata = generate_input(**wb)
140
+ start = time.perf_counter()
141
+ while time.perf_counter() - start < 0.2:
142
+ custom_kernel(wdata)
143
+ torch.cuda.synchronize()
144
+ else:
145
+ # tiny_benchmark: quick run to trigger compilation
146
+ wdata = generate_input(**wb)
147
+ for _ in range(3):
148
+ custom_kernel(wdata)
149
+ torch.cuda.synchronize()
150
+
151
+ # Benchmarks — collect mean runtimes in nanoseconds
152
+ ctx = torch.no_grad() if bench_no_grad else contextlib.nullcontext()
153
+ bench_means_ns = []
154
+
155
+ for bench_args in benchmark_cases:
156
+ data = generate_input(**bench_args)
157
+ data_copy = clone_data(data)
158
+
159
+ # Correctness check
160
+ with ctx:
161
+ output = custom_kernel(data)
162
+ torch.cuda.synchronize()
163
+ # Aggressively free GPU memory before ref kernel runs
164
+ del data
165
+ gc.collect()
166
+ torch.cuda.empty_cache()
167
+ passed, msg = check_implementation(data_copy, output)
168
+ del data_copy, output
169
+ gc.collect()
170
+ torch.cuda.empty_cache()
171
+ if not passed:
172
+ return {"combined_score": 0.0, "correctness": 1.0,
173
+ "error": f"Benchmark correctness: {msg}"}
174
+
175
+ # Regenerate data for timed runs (was freed during correctness check)
176
+ data = generate_input(**bench_args)
177
+
178
+ # Timed runs
179
+ durations_ns = []
180
+ bm_start = time.perf_counter_ns()
181
+
182
+ with ctx:
183
+ for t in range(bench_max_repeats):
184
+ torch.cuda.synchronize()
185
+
186
+ if bench_use_cuda_events:
187
+ s = torch.cuda.Event(enable_timing=True)
188
+ e = torch.cuda.Event(enable_timing=True)
189
+ s.record()
190
+ output = custom_kernel(data)
191
+ e.record()
192
+ torch.cuda.synchronize()
193
+ duration_ns = s.elapsed_time(e) * 1e6 # ms -> ns
194
+ else:
195
+ start_ns = time.perf_counter_ns()
196
+ output = custom_kernel(data)
197
+ torch.cuda.synchronize()
198
+ duration_ns = time.perf_counter_ns() - start_ns
199
+
200
+ del output
201
+ durations_ns.append(duration_ns)
202
+
203
+ if t > 1:
204
+ st = stats(durations_ns)
205
+ if st["mean"] > 0 and st["err"] / st["mean"] < bench_rel_error:
206
+ break
207
+ if st["mean"] * st["runs"] > bench_max_time_ns:
208
+ break
209
+ if bench_wall_timeout_ns is not None and \
210
+ (time.perf_counter_ns() - bm_start) > bench_wall_timeout_ns:
211
+ break
212
+
213
+ bench_means_ns.append(stats(durations_ns)["mean"])
214
+
215
+ # Scoring: geometric mean → microseconds → score
216
+ means_seconds = [ns / 1e9 for ns in bench_means_ns]
217
+ geom_mean_s = math.pow(math.prod(means_seconds), 1.0 / len(means_seconds))
218
+ geom_mean_us = geom_mean_s * 1e6
219
+ score = score_scale / geom_mean_us
220
+
221
+ bench_means_us = [ns / 1e3 for ns in bench_means_ns]
222
+ return {
223
+ "combined_score": score,
224
+ "correctness": 1.0,
225
+ "geom_mean_us": geom_mean_us,
226
+ "bench_means_us": bench_means_us,
227
+ }
228
+ except Exception as e:
229
+ return {"combined_score": 0.0, "correctness": 0.0,
230
+ "error": f"{e}\n{traceback.format_exc()}"}
231
+ finally:
232
+ if tmpdir in sys.path: sys.path.remove(tmpdir)
233
+ import shutil
234
+ shutil.rmtree(tmpdir, ignore_errors=True)
235
+
236
+
237
+ @app.function(image=cuda_image, gpu="H100", timeout=600)
238
+ def eval_triton_h100(**kwargs) -> dict:
239
+ return _eval_triton_impl(**kwargs)
240
+
241
+
242
+ @app.function(image=cuda_image, gpu="A100", timeout=600)
243
+ def eval_triton_a100(**kwargs) -> dict:
244
+ return _eval_triton_impl(**kwargs)
245
+
246
+
247
+ @app.function(image=cuda_image, gpu="L40S", timeout=600)
248
+ def eval_triton_l40s(**kwargs) -> dict:
249
+ return _eval_triton_impl(**kwargs)
250
+
251
+
252
+ @app.function(image=cuda_image, gpu="T4", timeout=600)
253
+ def eval_triton_t4(**kwargs) -> dict:
254
+ return _eval_triton_impl(**kwargs)
255
+
256
+
257
+ @app.function(image=cuda_image, gpu="H200", timeout=600)
258
+ def eval_triton_h200(**kwargs) -> dict:
259
+ return _eval_triton_impl(**kwargs)
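The early-exit test in the timed loop above can be read in isolation. The sketch below mirrors it with plain Python lists; `should_stop` is a hypothetical helper name introduced here for illustration (the diff inlines this logic in the loop), and the wall-clock timeout branch is omitted for brevity. The defaults are the ones documented for `BENCH_REL_ERROR` and `BENCH_MAX_TIME_NS`.

```python
import math


def stats(durations_ns):
    """Mean, sample std, and standard error of the mean for a list of timings."""
    n = len(durations_ns)
    mean = sum(durations_ns) / n
    if n > 1:
        var = sum((x - mean) ** 2 for x in durations_ns) / (n - 1)
        std = math.sqrt(var)
        err = std / math.sqrt(n)
    else:
        std, err = 0.0, 0.0
    return {"runs": n, "mean": mean, "std": std, "err": err}


def should_stop(durations_ns, rel_error=0.001, max_time_ns=10e9):
    """Hypothetical helper: True once the mean is tight enough or the budget is spent."""
    if len(durations_ns) < 3:  # the loop only checks from the third run on (t > 1)
        return False
    st = stats(durations_ns)
    # Stop when the standard error is below rel_error as a fraction of the mean.
    if st["mean"] > 0 and st["err"] / st["mean"] < rel_error:
        return True
    # Otherwise stop when the estimated total kernel time exceeds the budget.
    return st["mean"] * st["runs"] > max_time_ns


steady = [1000.0, 1000.5, 999.5, 1000.0]  # made-up timings in ns
noisy = [1000.0, 1500.0, 700.0]           # made-up timings in ns
print(should_stop(steady), should_stop(noisy))
```

With stable timings the loop ends after a handful of runs, while noisy kernels keep running until `bench_max_repeats` or one of the time limits is reached.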
benchmarks/gpu_mode/shared_eval.py ADDED
@@ -0,0 +1,421 @@
1
+ """
2
+ Shared evaluator for GPU Mode Triton kernel optimization.
3
+
4
+ No @triton.jit requirement — pure PyTorch submissions are allowed.
5
+ Supports local GPU and Modal cloud GPU evaluation.
6
+ Set GPUMODE_USE_MODAL=true and GPUMODE_MODAL_GPU=H100 for Modal.
7
+
8
+ Scoring: combined_score = SCORE_SCALE / geom_mean_us (higher is better).
9
+ The geom_mean_us metric is also reported for absolute runtime tracking.
10
+
11
+ Each problem provides a reference.py module with:
12
+ - ref_kernel(data)
13
+ - generate_input(**kwargs)
14
+ - check_implementation(data, output) -> (bool, str)
15
+ - TEST_CASES: list of dicts
16
+ - BENCHMARK_CASES: list of dicts
17
+ - SCORE_SCALE: float
18
+
19
+ Optional benchmark configuration in reference.py:
20
+ - BENCH_USE_CUDA_EVENTS: bool (default True)
21
+ - BENCH_REL_ERROR: float (default 0.001)
22
+ - BENCH_WALL_TIMEOUT_NS: float or None (default 120e9)
23
+ - BENCH_NO_GRAD: bool (default False)
24
+ - BENCH_MAX_REPEATS: int (default 100)
25
+ - BENCH_MAX_TIME_NS: float (default 10e9)
26
+ - BENCH_WARMUP_STYLE: str ('tiny_benchmark' or 'timed_calls', default 'tiny_benchmark')
27
+ """
28
+
29
+ import os
30
+ import sys
31
+ import copy
32
+ import time
33
+ import math
34
+ import contextlib
35
+ import dataclasses
36
+ import traceback
37
+ import importlib.util
38
+
39
+ import torch
40
+ import torch.cuda
41
+
42
+ from skydiscover.evaluation.evaluation_result import EvaluationResult
43
+
44
+ # Import problem-specific reference (the problem dir is already on sys.path
45
+ # because SkyDiscover adds the evaluator file's directory before loading it).
46
+ import reference
47
+
48
+ # ---------------------------------------------------------------------------
49
+ # Environment configuration
50
+ # ---------------------------------------------------------------------------
51
+
52
+ USE_MODAL = os.environ.get("GPUMODE_USE_MODAL", "false").lower() == "true"
53
+ MODAL_GPU = os.environ.get("GPUMODE_MODAL_GPU", "H100")
54
+
55
+ # Read benchmark configuration from reference module with defaults
56
+ SCORE_SCALE = getattr(reference, 'SCORE_SCALE', 3000.0)
57
+ BENCH_USE_CUDA_EVENTS = getattr(reference, 'BENCH_USE_CUDA_EVENTS', True)
58
+ BENCH_REL_ERROR = getattr(reference, 'BENCH_REL_ERROR', 0.001)
59
+ BENCH_WALL_TIMEOUT_NS = getattr(reference, 'BENCH_WALL_TIMEOUT_NS', 120e9)
60
+ BENCH_NO_GRAD = getattr(reference, 'BENCH_NO_GRAD', False)
61
+ BENCH_MAX_REPEATS = getattr(reference, 'BENCH_MAX_REPEATS', 100)
62
+ BENCH_MAX_TIME_NS = getattr(reference, 'BENCH_MAX_TIME_NS', 10e9)
63
+ BENCH_WARMUP_STYLE = getattr(reference, 'BENCH_WARMUP_STYLE', 'tiny_benchmark')
64
+
65
+ # ---------------------------------------------------------------------------
66
+ # Helpers
67
+ # ---------------------------------------------------------------------------
68
+
69
+
70
+ def _clone(data):
71
+ """Recursively clone data, handling tensors, dataclasses, and nn.Modules."""
72
+ if isinstance(data, tuple):
73
+ return tuple(_clone(x) for x in data)
74
+ if isinstance(data, list):
75
+ return [_clone(x) for x in data]
76
+ if isinstance(data, dict):
77
+ return {k: _clone(v) for k, v in data.items()}
78
+ if isinstance(data, torch.Tensor):
79
+ return data.clone()
80
+ if dataclasses.is_dataclass(data) and not isinstance(data, type):
81
+ fields = {f.name: _clone(getattr(data, f.name)) for f in dataclasses.fields(data)}
82
+ return type(data)(**fields)
83
+ if isinstance(data, torch.nn.Module):
84
+ cloned = copy.deepcopy(data)
85
+ if hasattr(data, 'seq_len'):
86
+ cloned.seq_len = data.seq_len
87
+ return cloned
88
+ return data
89
+
90
+
91
+ def _stats(durations):
92
+ """Compute statistics from a list of durations (in nanoseconds)."""
93
+ n = len(durations)
94
+ avg = sum(durations) / n
95
+ if n > 1:
96
+ var = sum((x - avg) ** 2 for x in durations) / (n - 1)
97
+ std = math.sqrt(var)
98
+ err = std / math.sqrt(n)
99
+ else:
100
+ std, err = 0.0, 0.0
101
+ return {"runs": n, "mean": avg, "std": std, "err": err}
102
+
103
+
104
+ def _warmup(kernel_fn, bench_args):
105
+ """Warmup the kernel to trigger Triton compilation."""
106
+ if BENCH_WARMUP_STYLE == 'timed_calls':
107
+ # MLA-style: run repeatedly for 200ms
108
+ data = reference.generate_input(**bench_args)
109
+ start = time.perf_counter()
110
+ while time.perf_counter() - start < 0.2:
111
+ kernel_fn(data)
112
+ torch.cuda.synchronize()
113
+ else:
114
+ # trimul-style: run first benchmark with a small time budget (10e7 ns = 100 ms)
115
+ _bench_single(kernel_fn, bench_args, max_time_ns=10e7)
116
+
117
+
118
+ def _bench_single(kernel_fn, bench_args, max_time_ns=None):
119
+ """Benchmark a kernel on a single case.
120
+
121
+ Returns (stats_dict_or_None, error_str_or_None).
122
+ Stats dict has durations in nanoseconds.
123
+ """
124
+ if max_time_ns is None:
125
+ max_time_ns = BENCH_MAX_TIME_NS
126
+
127
+ data = reference.generate_input(**bench_args)
128
+ data_copy = _clone(data)
129
+
130
+ # Correctness check first
131
+ ctx = torch.no_grad() if BENCH_NO_GRAD else contextlib.nullcontext()
132
+ with ctx:
133
+ output = kernel_fn(data)
134
+ torch.cuda.synchronize()
135
+ passed, msg = reference.check_implementation(data_copy, output)
136
+ if not passed:
137
+ return None, f"Benchmark correctness: {msg}"
138
+ del output
139
+
140
+ # Timed runs — durations in nanoseconds
141
+ durations_ns = []
142
+ bm_start = time.perf_counter_ns()
143
+
144
+ with ctx:
145
+ for i in range(BENCH_MAX_REPEATS):
146
+ torch.cuda.synchronize()
147
+
148
+ if BENCH_USE_CUDA_EVENTS:
149
+ s = torch.cuda.Event(enable_timing=True)
150
+ e = torch.cuda.Event(enable_timing=True)
151
+ s.record()
152
+ output = kernel_fn(data)
153
+ e.record()
154
+ torch.cuda.synchronize()
155
+ duration_ns = s.elapsed_time(e) * 1e6 # ms -> ns
156
+ else:
157
+ start_ns = time.perf_counter_ns()
158
+ output = kernel_fn(data)
159
+ torch.cuda.synchronize()
160
+ duration_ns = time.perf_counter_ns() - start_ns
161
+
162
+ del output
163
+ durations_ns.append(duration_ns)
164
+
165
+ if i > 1:
166
+ st = _stats(durations_ns)
167
+ if st["mean"] > 0 and st["err"] / st["mean"] < BENCH_REL_ERROR:
168
+ break
169
+ if st["mean"] * st["runs"] > max_time_ns:
170
+ break
171
+ if BENCH_WALL_TIMEOUT_NS is not None and \
172
+ (time.perf_counter_ns() - bm_start) > BENCH_WALL_TIMEOUT_NS:
173
+ break
174
+
175
+ return _stats(durations_ns), None
176
+
177
+
178
+ # ---------------------------------------------------------------------------
179
+ # Modal path
180
+ # ---------------------------------------------------------------------------
181
+
182
+
183
+ def _evaluate_modal(submission_code):
184
+ parent_dir = os.path.dirname(os.path.abspath(__file__))
185
+ if parent_dir not in sys.path:
186
+ sys.path.insert(0, parent_dir)
187
+ from modal_eval import (
188
+ eval_triton_h100, eval_triton_a100, eval_triton_l40s, eval_triton_t4,
189
+ eval_triton_h200, app as modal_app,
190
+ )
191
+
192
+ gpu_fns = {
193
+ "H100": eval_triton_h100,
194
+ "A100": eval_triton_a100,
195
+ "L40S": eval_triton_l40s,
196
+ "T4": eval_triton_t4,
197
+ "H200": eval_triton_h200,
198
+ }
199
+ eval_fn = gpu_fns.get(MODAL_GPU, eval_triton_h100)
200
+
201
+ ref_code = getattr(reference, 'MODAL_REFERENCE_CODE', None)
202
+ if ref_code is None:
203
+ return EvaluationResult(
204
+ metrics={"combined_score": 0.0, "correctness": 0.0},
205
+ artifacts={"error": "MODAL_REFERENCE_CODE not defined in reference.py",
206
+ "failure_stage": "modal_setup"},
207
+ )
208
+
209
+ with modal_app.run():
210
+ result = eval_fn.remote(
211
+ submission_code=submission_code,
212
+ reference_code=ref_code,
213
+ test_cases=reference.TEST_CASES,
214
+ benchmark_cases=reference.BENCHMARK_CASES,
215
+ score_scale=SCORE_SCALE,
216
+ bench_use_cuda_events=BENCH_USE_CUDA_EVENTS,
217
+ bench_rel_error=BENCH_REL_ERROR,
218
+ bench_wall_timeout_ns=BENCH_WALL_TIMEOUT_NS,
219
+ bench_no_grad=BENCH_NO_GRAD,
220
+ bench_max_repeats=BENCH_MAX_REPEATS,
221
+ bench_max_time_ns=BENCH_MAX_TIME_NS,
222
+ bench_warmup_style=BENCH_WARMUP_STYLE,
223
+ )
224
+
225
+ if isinstance(result, dict):
226
+ error = result.get("error")
227
+ score = float(result.get("combined_score", 0.0))
228
+ metrics = {"combined_score": score, "correctness": float(result.get("correctness", 0.0))}
229
+ if "geom_mean_us" in result:
230
+ metrics["geom_mean_us"] = float(result["geom_mean_us"])
231
+ artifacts = {}
232
+ if error:
233
+ artifacts["error"] = str(error)
234
+ artifacts["failure_stage"] = "modal_eval"
235
+ if "bench_means_us" in result:
236
+ for i, us in enumerate(result["bench_means_us"]):
237
+ artifacts[f"bench_{i}_mean_us"] = f"{us:.2f}"
238
+ artifacts["hardware"] = MODAL_GPU
239
+ return EvaluationResult(metrics=metrics, artifacts=artifacts)
240
+
241
+ return EvaluationResult(
242
+ metrics={"combined_score": 0.0, "correctness": 0.0},
243
+ artifacts={"error": "Modal returned unexpected type", "failure_stage": "modal_eval"},
244
+ )
245
+
246
+
247
+ # ---------------------------------------------------------------------------
248
+ # Local path
249
+ # ---------------------------------------------------------------------------
250
+
251
+
252
+ def _evaluate_local(program_path):
253
+ try:
254
+ spec = importlib.util.spec_from_file_location("submission", program_path)
255
+ mod = importlib.util.module_from_spec(spec)
256
+ sys.modules["submission"] = mod
257
+ spec.loader.exec_module(mod)
258
+ custom_kernel = mod.custom_kernel
259
+ except Exception as exc:
260
+ return EvaluationResult(
261
+ metrics={"combined_score": 0.0, "correctness": 0.0},
262
+ artifacts={
263
+ "error": f"Failed to load submission: {exc}",
264
+ "traceback": traceback.format_exc(),
265
+ "failure_stage": "import",
266
+ },
267
+ )
268
+
269
+ # Correctness
270
+ for i, tc in enumerate(reference.TEST_CASES):
271
+ try:
272
+ data = reference.generate_input(**tc)
273
+ data_copy = _clone(data)
274
+ torch.cuda.synchronize()
275
+ output = custom_kernel(data)
276
+ torch.cuda.synchronize()
277
+ passed, msg = reference.check_implementation(data_copy, output)
278
+ if not passed:
279
+ return EvaluationResult(
280
+ metrics={"combined_score": 0.0, "correctness": 0.0},
281
+ artifacts={
282
+ "error": f"Test {i} failed: {msg}",
283
+ "failure_stage": "correctness",
284
+ "test_index": str(i),
285
+ },
286
+ )
287
+ except Exception as exc:
288
+ return EvaluationResult(
289
+ metrics={"combined_score": 0.0, "correctness": 0.0},
290
+ artifacts={
291
+ "error": f"Test {i} error: {exc}",
292
+ "traceback": traceback.format_exc(),
293
+ "failure_stage": "correctness",
294
+ "test_index": str(i),
295
+ },
296
+ )
297
+
298
+ # Warmup
299
+ _warmup(custom_kernel, reference.BENCHMARK_CASES[0])
300
+
301
+ # Benchmarks — collect mean runtimes in nanoseconds
302
+ bench_means_ns = []
303
+ for bench_args in reference.BENCHMARK_CASES:
304
+ st, err = _bench_single(custom_kernel, bench_args)
305
+ if err:
306
+ return EvaluationResult(
307
+ metrics={"combined_score": 0.0, "correctness": 1.0},
308
+ artifacts={"error": err, "failure_stage": "benchmark"},
309
+ )
310
+ bench_means_ns.append(st["mean"])
311
+
312
+ # Scoring: geometric mean of benchmark means → microseconds → score
313
+ means_seconds = [ns / 1e9 for ns in bench_means_ns]
314
+ geom_mean_s = math.pow(math.prod(means_seconds), 1.0 / len(means_seconds))
315
+ geom_mean_us = geom_mean_s * 1e6
316
+ score = SCORE_SCALE / geom_mean_us
317
+
318
+ metrics = {
319
+ "combined_score": score,
320
+ "correctness": 1.0,
321
+ "geom_mean_us": geom_mean_us,
322
+ }
323
+ artifacts = {
324
+ "hardware": "local",
325
+ }
326
+ for i, ns in enumerate(bench_means_ns):
327
+ artifacts[f"bench_{i}_mean_us"] = f"{ns / 1e3:.2f}"
328
+
329
+ return EvaluationResult(
330
+ metrics=metrics,
331
+ artifacts=artifacts,
332
+ )
333
+
334
+
335
+ # ---------------------------------------------------------------------------
336
+ # Public API (used by SkyDiscover)
337
+ # ---------------------------------------------------------------------------
338
+
339
+
340
+ def evaluate(program_path):
341
+ try:
342
+ with open(program_path, "r") as f:
343
+ code = f.read()
344
+ except Exception as exc:
345
+ return EvaluationResult(
346
+ metrics={"combined_score": 0.0, "correctness": 0.0},
347
+ artifacts={"error": f"Failed to read file: {exc}", "failure_stage": "file_read"},
348
+ )
349
+
350
+ if USE_MODAL:
351
+ try:
352
+ return _evaluate_modal(code)
353
+ except Exception as exc:
354
+ return EvaluationResult(
355
+ metrics={"combined_score": 0.0, "correctness": 0.0},
356
+ artifacts={
357
+ "error": f"Modal evaluation failed: {exc}",
358
+ "traceback": traceback.format_exc(),
359
+ "failure_stage": "modal_eval",
360
+ },
361
+ )
362
+
363
+ return _evaluate_local(program_path)
364
+
365
+
366
+ def evaluate_stage1(program_path):
367
+ try:
368
+ with open(program_path, "r") as f:
369
+ code = f.read()
370
+ except Exception as exc:
371
+ return EvaluationResult(
372
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
373
+ artifacts={"error": f"Failed to read file: {exc}", "failure_stage": "file_read"},
374
+ )
375
+
376
+ if "custom_kernel" not in code:
377
+ return EvaluationResult(
378
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
379
+ artifacts={"error": "Missing custom_kernel function", "failure_stage": "validation"},
380
+ )
381
+
382
+ try:
383
+ compile(code, program_path, "exec")
384
+ except SyntaxError as exc:
385
+ return EvaluationResult(
386
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
387
+ artifacts={
388
+ "error": f"Syntax error at line {exc.lineno}: {exc.msg}",
389
+ "failure_stage": "syntax_check",
390
+ },
391
+ )
392
+
393
+ # When using Modal, skip local import check (triton may not be installed locally).
394
+ if not USE_MODAL:
395
+ try:
396
+ spec = importlib.util.spec_from_file_location("submission_check", program_path)
397
+ mod = importlib.util.module_from_spec(spec)
398
+ spec.loader.exec_module(mod)
399
+ if not hasattr(mod, "custom_kernel"):
400
+ return EvaluationResult(
401
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
402
+ artifacts={"error": "custom_kernel not found after import", "failure_stage": "import"},
403
+ )
404
+ except Exception as exc:
405
+ return EvaluationResult(
406
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
407
+ artifacts={
408
+ "error": f"Import failed: {exc}",
409
+ "traceback": traceback.format_exc(),
410
+ "failure_stage": "import",
411
+ },
412
+ )
413
+
414
+ return EvaluationResult(
415
+ metrics={"combined_score": 0.5, "stage1_passed": 1.0},
416
+ artifacts={},
417
+ )
418
+
419
+
420
+ def evaluate_stage2(program_path):
421
+ return evaluate(program_path)
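`evaluate_stage1` above accepts any file that contains the substring `custom_kernel` and compiles, so a mention in a comment would pass that first check when the local import step is skipped under Modal. A stricter variant is sketched below as an assumption, not as what the diff does: it swaps the substring test for an `ast` walk, and `defines_custom_kernel` is a hypothetical name.

```python
import ast


def defines_custom_kernel(source: str) -> bool:
    """Hypothetical check: True if `source` parses and defines a
    top-level function named `custom_kernel`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == "custom_kernel"
        for node in tree.body
    )


good = "def custom_kernel(data):\n    return data\n"
comment_only = "# custom_kernel is only mentioned here\nx = 1\n"
broken = "def custom_kernel(data)\n    return data\n"
print(defines_custom_kernel(good), defines_custom_kernel(comment_only),
      defines_custom_kernel(broken))
```

One trade-off: this would reject a submission that binds `custom_kernel` by assignment (for example to a wrapped function), which the substring check and the import-time `hasattr` check both accept.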
benchmarks/gpu_mode/trimul/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: Triangle Multiplicative Update (TriMul)
2
+
3
+ Evolve a Triton kernel for the TriMul operator using SkyDiscover.
4
+
5
+ TriMul is a core operation in protein structure models such as AlphaFold3, Chai, and Protenix.
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/trimul/initial_program.py \
14
+ benchmarks/gpu_mode/trimul/evaluator.py \
15
+ -c benchmarks/gpu_mode/trimul/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Must match reference output (rtol=0.02, atol=0.02 vs PyTorch reference)
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/trimul/initial_program.py \
31
+ benchmarks/gpu_mode/trimul/evaluator.py \
32
+ -c benchmarks/gpu_mode/trimul/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
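The score formula from the README's Scoring section can be checked by hand. The sketch below repeats the arithmetic used in `shared_eval.py` (nanoseconds to seconds, geometric mean, microseconds, then `SCORE_SCALE / geom_mean_us`); `combined_score` is a hypothetical function name and the two timings are made-up values, not measurements.

```python
import math

SCORE_SCALE = 3000.0  # value stated in the README


def combined_score(bench_means_ns, score_scale=SCORE_SCALE):
    """Hypothetical helper: return (score, geom_mean_us) from per-benchmark means in ns."""
    means_s = [ns / 1e9 for ns in bench_means_ns]
    geom_mean_s = math.pow(math.prod(means_s), 1.0 / len(means_s))
    geom_mean_us = geom_mean_s * 1e6
    return score_scale / geom_mean_us, geom_mean_us


# Made-up example: two benchmarks averaging 1 ms and 4 ms.
score, geom_us = combined_score([1_000_000.0, 4_000_000.0])
print(f"geom_mean_us={geom_us:.1f} score={score:.3f}")
```

Because the geometric mean is used, halving the runtime of any single benchmark case raises the score by the same factor, regardless of how long that case takes in absolute terms.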
benchmarks/gpu_mode/trimul/config.yaml ADDED
@@ -0,0 +1,219 @@
1
+ # GPU Mode: Triangle Multiplicative Update (TriMul) Triton Kernel
2
+
3
+ max_iterations: 100
4
+ checkpoint_interval: 1
5
+ log_level: "INFO"
6
+
7
+ llm:
8
+ models:
9
+ - name: "gpt-5"
10
+ weight: 1.0
11
+ api_base: https://api.openai.com/v1
12
+ temperature: 1.0
13
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject both temperature and top_p
14
+ max_tokens: 32000
15
+ timeout: 600
16
+
17
+ prompt:
18
+ system_message: |
19
+ You are an expert Triton engineer tasked with translating PyTorch code into highly optimized Triton kernel code.
20
+
21
+ You will be implementing a Triangle Multiplicative Update (TriMul) module that is a core operation
22
+ for AlphaFold3, Chai, Protenix, and other protein structure prediction models in BioML.
23
+
24
+ The TriMul operator operates over a 4D tensor of shape [B, N, N, C].
25
+
26
+ Your task:
27
+ - Implement the "outgoing" version of the TriMul operator from the AlphaFold3 paper.
28
+ - You will not have to compute or store gradients for this version. You will only need to implement the forward pass.
29
+
30
+ Your function should be defined as 'custom_kernel' with the following signature:
31
+ Input:
32
+ - `data`: Tuple of (input: torch.Tensor, mask: torch.Tensor, weights: Dict[str, torch.Tensor], config: Dict)
33
+ - input: Input tensor of shape [bs, seq_len, seq_len, dim]
34
+ - mask: Mask tensor of shape [bs, seq_len, seq_len]
35
+ - weights: Dictionary containing model weights
36
+ - config: Dictionary containing model configuration parameters
37
+
38
+ Output:
39
+ - output: Processed tensor [bs, seq_len, seq_len, dim]
40
+
41
+ **Problem Constraints:**
42
+ - B ∈ {1,2}, N ∈ {128,256,512,1024}, c ∈ {128}, c_z ∈ {128,384,768}
43
+ - The input distribution will be sampled from a standard Normal distribution, or a heavy-tailed Cauchy distribution (gamma = 2).
44
+ - There will either be no mask, or a randomly sampled mask over the inputs.
45
+
46
+ **Remarks.** So why is this problem so annoying? Because you have to choose whether to load / deal with either the channel dimensions c,c_z that the LayerNorms require (otherwise you have to do a synchronize to compute the statistics like mean / variance) or the sequence dimension N.
47
+ The sequence dimension is particularly annoying because it's quite large, but also because we compute pair-wise operations at the last operation that sum over another sequence dimension (this is N^3!).
48
+ However, I really like this kernel because it only consists of "simple" operations, and is really easy to understand. It is a true test of "fusions" that torch.compile() doesn't do that well.
49
+
50
+ Here is a pytorch implementation of the TriMul module. You will want to implement a kernel for the operations in the forward call:
51
+
52
+ ```python
53
+ import torch
54
+ from torch import nn, einsum
55
+ import math
56
+
57
+ # Reference code in PyTorch
58
+ class TriMul(nn.Module):
59
+ def __init__(
60
+ self,
61
+ dim: int,
62
+ hidden_dim: int,
63
+ ):
64
+ super().__init__()
65
+
66
+ self.norm = nn.LayerNorm(dim)
67
+
68
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False)
69
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False)
70
+
71
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False)
72
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False)
73
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False)
74
+
75
+ self.to_out_norm = nn.LayerNorm(hidden_dim)
76
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False)
77
+
78
+ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
79
+ """
80
+ x: [bs, seq_len, seq_len, dim]
81
+ mask: [bs, seq_len, seq_len]
82
+
83
+ Returns:
84
+ output: [bs, seq_len, seq_len, dim]
85
+ """
86
+ batch_size, seq_len, _, dim = x.shape
87
+
88
+ x = self.norm(x)
89
+
90
+ left = self.left_proj(x)
91
+ right = self.right_proj(x)
92
+
93
+ mask = mask.unsqueeze(-1)
94
+ left = left * mask
95
+ right = right * mask
96
+
97
+ left_gate = self.left_gate(x).sigmoid()
98
+ right_gate = self.right_gate(x).sigmoid()
99
+ out_gate = self.out_gate(x).sigmoid()
100
+
101
+ left = left * left_gate
102
+ right = right * right_gate
103
+
104
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
105
+ # This einsum is the same as the following:
106
+ # out = torch.zeros(batch_size, seq_len, seq_len, dim, device=x.device)
107
+
108
+ # # Compute using nested loops
109
+ # for b in range(batch_size):
110
+ # for i in range(seq_len):
111
+ # for j in range(seq_len):
112
+ # # Compute each output element
113
+ # for k in range(seq_len):
114
+ # out[b, i, j] += left[b, i, k, :] * right[b, j, k, :]
115
+
116
+ out = self.to_out_norm(out)
117
+ out = out * out_gate
118
+ return self.to_out(out)
119
+ ```
120
+
121
+ Here is some example skeleton code of the entrypoint function you will create:
122
+ ```python
123
+ def custom_kernel(data):
124
+ input_tensor, mask, weights, config = data
125
+ dim, hidden_dim = config["dim"], config["hidden_dim"]
126
+
127
+ # Access the given weights of the model
128
+ norm_weight = weights["norm.weight"]
129
+ norm_bias = weights["norm.bias"]
130
+ left_proj_weight = weights["left_proj.weight"]
131
+ right_proj_weight = weights["right_proj.weight"]
132
+ left_gate_weight = weights["left_gate.weight"]
133
+ right_gate_weight = weights["right_gate.weight"]
134
+ out_gate_weight = weights["out_gate.weight"]
135
+ to_out_norm_weight = weights["to_out_norm.weight"]
136
+ to_out_norm_bias = weights["to_out_norm.bias"]
137
+ to_out_weight = weights["to_out.weight"]
138
+
139
+ # Perform TriMul
140
+
141
+ return out
142
+ ```
143
+
144
+ To help you understand which triton version we are using, here is some example triton code for an unrelated task:
145
+ ```python
146
+ import triton
147
+ import triton.language as tl
148
+
149
+ @triton.jit
150
+ def matmul_persistent_ws_kernel(
151
+ a_ptr, b_ptr, c_ptr, M, N, K,
152
+ stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
153
+ BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
154
+ ):
155
+ pid = tl.program_id(axis=0) # async_task 0, 1, 2
156
+ num_pid_m = tl.cdiv(M, BLOCK_M) # async_task 0, 1, 2
157
+ num_pid_n = tl.cdiv(N, BLOCK_N) # async_task 0, 1, 2
158
+ pid_m = pid // num_pid_n # async_task 0, 1, 2
159
+ pid_n = pid % num_pid_n # async_task 0, 1, 2
160
+ offs_m_1 = pid_m * BLOCK_M + tl.arange(0, BLOCK_M // 2) # async_task 0, 1, 2
161
+ offs_m_2 = pid_m * BLOCK_M + tl.arange(BLOCK_M // 2, BLOCK_M) # async_task 0, 1, 2
162
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) # async_task 0, 1, 2
163
+ offs_k = tl.arange(0, BLOCK_K) # async_task 0
164
+ a_ptrs_1 = a_ptr + (offs_m_1[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
165
+ a_ptrs_2 = a_ptr + (offs_m_2[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
166
+ b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) # async_task 0
167
+ acc_1 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 1
168
+ acc_2 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 2
169
+ for k in range(0, tl.cdiv(K, BLOCK_K)): # async_task 0, 1, 2
170
+ a_1 = tl.load(a_ptrs_1) # async_task 0
171
+ a_2 = tl.load(a_ptrs_2) # async_task 0
172
+ b = tl.load(b_ptrs) # async_task 0
173
+ acc_1 += tl.dot(a_1, b) # async_task 1
174
+ acc_2 += tl.dot(a_2, b) # async_task 2
175
+ a_ptrs_1 += BLOCK_K * stride_ak # async_task 0
176
+ a_ptrs_2 += BLOCK_K * stride_ak # async_task 0
177
+ b_ptrs += BLOCK_K * stride_bk # async_task 0
178
+ c_1 = acc_1.to(tl.float16) # async_task 1
179
+ c_2 = acc_2.to(tl.float16) # async_task 2
180
+ c_ptrs_1 = c_ptr + stride_cm * offs_m_1[:, None] + stride_cn * offs_n[None, :] # async_task 1
181
+ c_ptrs_2 = c_ptr + stride_cm * offs_m_2[:, None] + stride_cn * offs_n[None, :] # async_task 2
182
+ tl.store(c_ptrs_1, c_1) # async_task 1
183
+ tl.store(c_ptrs_2, c_2) # async_task 2
184
+ ```
185
+
186
+ A few general triton tips:
187
+ - tl.arange only takes in constexpr arguments (static or tl.constexpr)
188
+ - You cannot use continue in your kernel code
189
+ - tl.dot can only take in two input tensors
190
+ - There is no tl.mean
191
+
192
+ Here are the different configs that your kernel will be tested on ("nomask" sets whether there will be no mask, or a randomly sampled mask over the inputs):
193
+
194
+ Test Cases for correctness and runtime (optimize runtime for these):
195
+ - {"seqlen": 256, "bs": 2, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
196
+ - {"seqlen": 768, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "cauchy"}
197
+ - {"seqlen": 256, "bs": 2, "dim": 384, "hidden_dim": 128, "nomask": False, "distribution": "normal"}
198
+ - {"seqlen": 512, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
199
+ - {"seqlen": 1024, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "cauchy"}
200
+ - {"seqlen": 768, "bs": 1, "dim": 384, "hidden_dim": 128, "nomask": False, "distribution": "normal"}
201
+ - {"seqlen": 1024, "bs": 1, "dim": 384, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
202
+
203
+ Rules:
204
+ - The tensor arguments passed in will already be on your CUDA device.
205
+ - Define all of your code in one final ```python ``` block.
206
+ - We will test the correctness of your kernel on multiple input shapes, so make sure to support different potential test cases.
207
+ - You are allowed to use mixed precision computations, but make sure your final output is in float32.
208
+ - You must use Triton 3.3.1, and these kernels will be run on an H100.
209
+ - You do not have to implement everything in triton, you may choose to have some of the operations done in pytorch. However, you must implement at least part of the operations in a kernel.
210
+ - Include a short docstring at the top summarizing your algorithm.
211
+ evaluator:
212
+ timeout: 600
213
+ max_retries: 3
214
+ cascade_evaluation: true
215
+ cascade_thresholds: [0.4, 0.3]
216
+
217
+ diff_based_generation: true
218
+ max_solution_length: 60000
219
+ random_seed: 42
benchmarks/gpu_mode/trimul/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for TriMul — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/trimul/initial_program.py ADDED
@@ -0,0 +1,84 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial TriMul submission — PyTorch baseline with dummy Triton kernel.
4
+ """
5
+
6
+ import torch
7
+ from torch import nn, einsum
8
+ import triton
9
+ import triton.language as tl
10
+
11
+
12
+ @triton.jit
13
+ def _dummy_kernel(x_ptr, BLOCK_SIZE: tl.constexpr):
14
+ pid = tl.program_id(0)
15
+ pass
16
+
17
+
18
+ class TriMul(nn.Module):
19
+ def __init__(
20
+ self,
21
+ dim: int,
22
+ hidden_dim: int,
23
+ ):
24
+ super().__init__()
25
+
26
+ self.norm = nn.LayerNorm(dim)
27
+
28
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False, dtype=torch.float32)
29
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False, dtype=torch.float32)
30
+
31
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False, dtype=torch.float32)
32
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False, dtype=torch.float32)
33
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False, dtype=torch.float32)
34
+
35
+ self.to_out_norm = nn.LayerNorm(hidden_dim)
36
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False, dtype=torch.float32)
37
+
38
+ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
39
+ batch_size, seq_len, _, dim = x.shape
40
+
41
+ x = self.norm(x)
42
+ x = x.to(torch.float32)
43
+
44
+ left = self.left_proj(x.to(torch.float32))
45
+ right = self.right_proj(x.to(torch.float32))
46
+
47
+ mask = mask.unsqueeze(-1)
48
+ left = left * mask
49
+ right = right * mask
50
+
51
+ left_gate = self.left_gate(x.to(torch.float32)).sigmoid()
52
+ right_gate = self.right_gate(x.to(torch.float32)).sigmoid()
53
+ out_gate = self.out_gate(x.to(torch.float32)).sigmoid()
54
+
55
+ left = left * left_gate
56
+ right = right * right_gate
57
+
58
+ out = einsum('... i k d, ... j k d -> ... i j d', left.to(torch.bfloat16), right.to(torch.bfloat16))
59
+
60
+ out = out.to(torch.float32)
61
+ out = self.to_out_norm(out)
62
+ out = out * out_gate
63
+ return self.to_out(out)
64
+
65
+
66
+ def custom_kernel(data):
67
+ input_tensor, mask, weights, config = data
68
+ trimul = TriMul(config["dim"], config["hidden_dim"]).to(input_tensor.device)
69
+
70
+ trimul.norm.weight = nn.Parameter(weights['norm.weight'].to(torch.float32))
71
+ trimul.left_proj.weight = nn.Parameter(weights['left_proj.weight'].to(torch.float32))
72
+ trimul.right_proj.weight = nn.Parameter(weights['right_proj.weight'].to(torch.float32))
73
+ trimul.left_gate.weight = nn.Parameter(weights['left_gate.weight'].to(torch.float32))
74
+ trimul.right_gate.weight = nn.Parameter(weights['right_gate.weight'].to(torch.float32))
75
+ trimul.out_gate.weight = nn.Parameter(weights['out_gate.weight'].to(torch.float32))
76
+ trimul.to_out_norm.weight = nn.Parameter(weights['to_out_norm.weight'].to(torch.float32))
77
+ trimul.to_out.weight = nn.Parameter(weights['to_out.weight'].to(torch.float32))
78
+ trimul.norm.bias = nn.Parameter(weights['norm.bias'].to(torch.float32))
79
+ trimul.to_out_norm.bias = nn.Parameter(weights['to_out_norm.bias'].to(torch.float32))
80
+
81
+ output = trimul(input_tensor, mask).to(torch.float32)
82
+
83
+ return output
84
+ # EVOLVE-BLOCK-END
benchmarks/gpu_mode/trimul/reference.py ADDED
@@ -0,0 +1,286 @@
1
+ """
2
+ Reference implementation for Triangle Multiplicative Update (TriMul) Triton kernel.
3
+ Core operation for AlphaFold3, Chai, Protenix protein structure models.
4
+ Same test cases, benchmarks, generate_input, ref_kernel, and check_implementation.
5
+ """
6
+
7
+ import math
8
+ import torch
9
+ from torch import nn, einsum
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Scoring and benchmark configuration (read by shared_eval.py)
13
+ # ---------------------------------------------------------------------------
14
+
15
+ SCORE_SCALE = 3000.0
16
+
17
+ # trimul uses CUDA events timing, 0.1% rel error, 120s wall clock timeout
18
+ BENCH_USE_CUDA_EVENTS = True
19
+ BENCH_REL_ERROR = 0.001
20
+ BENCH_WALL_TIMEOUT_NS = 120e9
21
+ BENCH_NO_GRAD = False
22
+ BENCH_MAX_REPEATS = 100
23
+ BENCH_MAX_TIME_NS = 10e9
24
+ BENCH_WARMUP_STYLE = 'tiny_benchmark'
25
+
26
+ # ---------------------------------------------------------------------------
27
+ # Test / benchmark cases — full set from discover task.yml
28
+ # ---------------------------------------------------------------------------
29
+
30
+ TEST_CASES = [
31
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 9371, "nomask": True, "distribution": "normal"},
32
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 1092, "nomask": False, "distribution": "normal"},
33
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 2291, "nomask": True, "distribution": "normal"},
34
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 210284, "nomask": False, "distribution": "normal"},
35
+ {"seqlen": 128, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 81934, "nomask": True, "distribution": "normal"},
36
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 1932, "nomask": True, "distribution": "normal"},
37
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 10432, "nomask": False, "distribution": "normal"},
38
+ {"seqlen": 768, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 731, "nomask": True, "distribution": "normal"},
39
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 53121, "nomask": False, "distribution": "normal"},
40
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 31, "nomask": True, "distribution": "normal"},
41
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 4921, "nomask": False, "distribution": "normal"},
42
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 937321, "nomask": True, "distribution": "cauchy"},
43
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 2291, "nomask": True, "distribution": "cauchy"},
44
+ {"seqlen": 128, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 8134, "nomask": True, "distribution": "cauchy"},
45
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 932, "nomask": True, "distribution": "cauchy"},
46
+ {"seqlen": 768, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 31, "nomask": True, "distribution": "cauchy"},
47
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 5321, "nomask": False, "distribution": "cauchy"},
48
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 491, "nomask": False, "distribution": "cauchy"},
49
+ ]
50
+
51
+ BENCHMARK_CASES = [
52
+ {"seqlen": 256, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 9371, "nomask": True, "distribution": "normal"},
53
+ {"seqlen": 768, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 381, "nomask": True, "distribution": "cauchy"},
54
+ {"seqlen": 256, "bs": 2, "dim": 384, "hiddendim": 128, "seed": 2301, "nomask": False, "distribution": "normal"},
55
+ {"seqlen": 512, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 12819, "nomask": True, "distribution": "normal"},
56
+ {"seqlen": 1024, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 381, "nomask": True, "distribution": "cauchy"},
57
+ {"seqlen": 768, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 481, "nomask": False, "distribution": "normal"},
58
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 23291, "nomask": True, "distribution": "normal"},
59
+ ]
60
+
61
+ # ---------------------------------------------------------------------------
62
+ # Reference kernel
63
+ # ---------------------------------------------------------------------------
64
+
65
+
66
+ class _TriMul(nn.Module):
67
+ def __init__(self, dim, hidden_dim, device="cuda"):
68
+ super().__init__()
69
+ self.norm = nn.LayerNorm(dim, device=device)
70
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
71
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
72
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
73
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
74
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
75
+ self.to_out_norm = nn.LayerNorm(hidden_dim, device=device)
76
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False, device=device)
77
+
78
+ def forward(self, x, mask):
79
+ x = self.norm(x)
80
+ left = self.left_proj(x)
81
+ right = self.right_proj(x)
82
+ mask = mask.unsqueeze(-1)
83
+ left = left * mask
84
+ right = right * mask
85
+ left = left * self.left_gate(x).sigmoid()
86
+ right = right * self.right_gate(x).sigmoid()
87
+ out_gate = self.out_gate(x).sigmoid()
88
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
89
+ out = self.to_out_norm(out)
90
+ out = out * out_gate
91
+ return self.to_out(out)
92
+
93
+
94
+ def ref_kernel(data):
95
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
96
+ old_cudnn = torch.backends.cudnn.allow_tf32
97
+ torch.backends.cuda.matmul.allow_tf32 = False
98
+ torch.backends.cudnn.allow_tf32 = False
99
+ try:
100
+ input_tensor, mask, weights, config = data
101
+ trimul = _TriMul(dim=config["dim"], hidden_dim=config["hidden_dim"],
102
+ device=input_tensor.device)
103
+ trimul.norm.weight = nn.Parameter(weights['norm.weight'])
104
+ trimul.norm.bias = nn.Parameter(weights['norm.bias'])
105
+ trimul.left_proj.weight = nn.Parameter(weights['left_proj.weight'])
106
+ trimul.right_proj.weight = nn.Parameter(weights['right_proj.weight'])
107
+ trimul.left_gate.weight = nn.Parameter(weights['left_gate.weight'])
108
+ trimul.right_gate.weight = nn.Parameter(weights['right_gate.weight'])
109
+ trimul.out_gate.weight = nn.Parameter(weights['out_gate.weight'])
110
+ trimul.to_out_norm.weight = nn.Parameter(weights['to_out_norm.weight'])
111
+ trimul.to_out_norm.bias = nn.Parameter(weights['to_out_norm.bias'])
112
+ trimul.to_out.weight = nn.Parameter(weights['to_out.weight'])
113
+ return trimul(input_tensor, mask)
114
+ finally:
115
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
116
+ torch.backends.cudnn.allow_tf32 = old_cudnn
117
+
118
+
119
+ def generate_input(seqlen, bs, dim, hiddendim, seed, nomask, distribution="normal"):
120
+ hidden_dim = hiddendim
121
+ config = {"hidden_dim": hidden_dim, "dim": dim}
122
+ gen = torch.Generator(device='cuda')
123
+ gen.manual_seed(seed)
124
+
125
+ if distribution == "cauchy":
126
+ u = torch.empty((bs, seqlen, seqlen, dim), device="cuda", dtype=torch.float32)
127
+ u.uniform_(0.0, 1.0, generator=gen)
128
+ input_tensor = 2.0 * torch.tan(math.pi * (u - 0.5))
129
+ else:
130
+ input_tensor = torch.randn(
131
+ (bs, seqlen, seqlen, dim), device='cuda', dtype=torch.float32, generator=gen
132
+ ).contiguous()
133
+
134
+ if nomask:
135
+ mask = torch.ones(bs, seqlen, seqlen, device="cuda")
136
+ else:
137
+ mask = torch.randint(0, 2, (bs, seqlen, seqlen), device="cuda", generator=gen).float()
138
+
139
+ weights = {
140
+ "norm.weight": torch.randn(dim, device="cuda"),
141
+ "norm.bias": torch.randn(dim, device="cuda"),
142
+ "left_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
143
+ "right_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
144
+ "left_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
145
+ "right_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
146
+ "out_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
147
+ "to_out_norm.weight": torch.randn(hidden_dim, device="cuda"),
148
+ "to_out_norm.bias": torch.randn(hidden_dim, device="cuda"),
149
+ "to_out.weight": torch.randn(dim, hidden_dim, device="cuda") / math.sqrt(dim),
150
+ }
151
+ return (input_tensor, mask, weights, config)
152
+
153
+
154
+ def check_implementation(data, submission_output, rtol=2e-2, atol=2e-2):
155
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
156
+ old_cudnn = torch.backends.cudnn.allow_tf32
157
+ torch.backends.cuda.matmul.allow_tf32 = False
158
+ torch.backends.cudnn.allow_tf32 = False
159
+ try:
160
+ ref_output = ref_kernel(data)
161
+ if ref_output.shape != submission_output.shape:
162
+ return False, f"Shape mismatch: {ref_output.shape} vs {submission_output.shape}"
163
+ if torch.allclose(ref_output.float(), submission_output.float(), rtol=rtol, atol=atol):
164
+ return True, "Match"
165
+ diff = torch.abs(ref_output.float() - submission_output.float())
166
+ return False, f"max_diff={diff.max().item():.6f}, avg_diff={diff.mean().item():.6f}"
167
+ finally:
168
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
169
+ torch.backends.cudnn.allow_tf32 = old_cudnn
170
+
171
+
172
+ # ---------------------------------------------------------------------------
173
+ # Self-contained reference code for Modal remote execution
174
+ # ---------------------------------------------------------------------------
175
+
176
+ MODAL_REFERENCE_CODE = r'''
177
+ import math
178
+ import torch
179
+ from torch import nn, einsum
180
+
181
+
182
+ class _TriMul(nn.Module):
183
+ def __init__(self, dim, hidden_dim, device="cuda"):
184
+ super().__init__()
185
+ self.norm = nn.LayerNorm(dim, device=device)
186
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
187
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
188
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
189
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
190
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
191
+ self.to_out_norm = nn.LayerNorm(hidden_dim, device=device)
192
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False, device=device)
193
+
194
+ def forward(self, x, mask):
195
+ x = self.norm(x)
196
+ left = self.left_proj(x)
197
+ right = self.right_proj(x)
198
+ mask = mask.unsqueeze(-1)
199
+ left = left * mask
200
+ right = right * mask
201
+ left = left * self.left_gate(x).sigmoid()
202
+ right = right * self.right_gate(x).sigmoid()
203
+ out_gate = self.out_gate(x).sigmoid()
204
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
205
+ out = self.to_out_norm(out)
206
+ out = out * out_gate
207
+ return self.to_out(out)
208
+
209
+
210
+ def ref_kernel(data):
211
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
212
+ old_cudnn = torch.backends.cudnn.allow_tf32
213
+ torch.backends.cuda.matmul.allow_tf32 = False
214
+ torch.backends.cudnn.allow_tf32 = False
215
+ try:
216
+ input_tensor, mask, weights, config = data
217
+ trimul = _TriMul(dim=config["dim"], hidden_dim=config["hidden_dim"],
218
+ device=input_tensor.device)
219
+ trimul.norm.weight = nn.Parameter(weights['norm.weight'])
220
+ trimul.norm.bias = nn.Parameter(weights['norm.bias'])
221
+ trimul.left_proj.weight = nn.Parameter(weights['left_proj.weight'])
222
+ trimul.right_proj.weight = nn.Parameter(weights['right_proj.weight'])
223
+ trimul.left_gate.weight = nn.Parameter(weights['left_gate.weight'])
224
+ trimul.right_gate.weight = nn.Parameter(weights['right_gate.weight'])
225
+ trimul.out_gate.weight = nn.Parameter(weights['out_gate.weight'])
226
+ trimul.to_out_norm.weight = nn.Parameter(weights['to_out_norm.weight'])
227
+ trimul.to_out_norm.bias = nn.Parameter(weights['to_out_norm.bias'])
228
+ trimul.to_out.weight = nn.Parameter(weights['to_out.weight'])
229
+ return trimul(input_tensor, mask)
230
+ finally:
231
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
232
+ torch.backends.cudnn.allow_tf32 = old_cudnn
233
+
234
+
235
+ def generate_input(seqlen, bs, dim, hiddendim, seed, nomask, distribution="normal"):
236
+ hidden_dim = hiddendim
237
+ config = {"hidden_dim": hidden_dim, "dim": dim}
238
+ gen = torch.Generator(device='cuda')
239
+ gen.manual_seed(seed)
240
+
241
+ if distribution == "cauchy":
242
+ u = torch.empty((bs, seqlen, seqlen, dim), device="cuda", dtype=torch.float32)
243
+ u.uniform_(0.0, 1.0, generator=gen)
244
+ input_tensor = 2.0 * torch.tan(math.pi * (u - 0.5))
245
+ else:
246
+ input_tensor = torch.randn(
247
+ (bs, seqlen, seqlen, dim), device='cuda', dtype=torch.float32, generator=gen
248
+ ).contiguous()
249
+
250
+ if nomask:
251
+ mask = torch.ones(bs, seqlen, seqlen, device="cuda")
252
+ else:
253
+ mask = torch.randint(0, 2, (bs, seqlen, seqlen), device="cuda", generator=gen).float()
254
+
255
+ weights = {
256
+ "norm.weight": torch.randn(dim, device="cuda"),
257
+ "norm.bias": torch.randn(dim, device="cuda"),
258
+ "left_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
259
+ "right_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
260
+ "left_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
261
+ "right_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
262
+ "out_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
263
+ "to_out_norm.weight": torch.randn(hidden_dim, device="cuda"),
264
+ "to_out_norm.bias": torch.randn(hidden_dim, device="cuda"),
265
+ "to_out.weight": torch.randn(dim, hidden_dim, device="cuda") / math.sqrt(dim),
266
+ }
267
+ return (input_tensor, mask, weights, config)
268
+
269
+
270
+ def check_implementation(data, submission_output, rtol=2e-2, atol=2e-2):
271
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
272
+ old_cudnn = torch.backends.cudnn.allow_tf32
273
+ torch.backends.cuda.matmul.allow_tf32 = False
274
+ torch.backends.cudnn.allow_tf32 = False
275
+ try:
276
+ ref_output = ref_kernel(data)
277
+ if ref_output.shape != submission_output.shape:
278
+ return False, f"Shape mismatch: {ref_output.shape} vs {submission_output.shape}"
279
+ if torch.allclose(ref_output.float(), submission_output.float(), rtol=rtol, atol=atol):
280
+ return True, "Match"
281
+ diff = torch.abs(ref_output.float() - submission_output.float())
282
+ return False, f"max_diff={diff.max().item():.6f}, avg_diff={diff.mean().item():.6f}"
283
+ finally:
284
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
285
+ torch.backends.cudnn.allow_tf32 = old_cudnn
286
+ '''
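The `cauchy` branch of `generate_input` above draws its inputs by inverse-CDF sampling: a uniform `u` in (0, 1) is mapped through `2.0 * tan(pi * (u - 0.5))`, which gives a Cauchy distribution with location 0 and scale 2. A minimal pure-Python sketch of the same transform follows; the helper names `cauchy_from_uniform` and `sample_cauchy` are hypothetical and not part of the benchmark, and the standard-library RNG stands in for the CUDA generator.

```python
import math
import random


def cauchy_from_uniform(u: float, scale: float = 2.0) -> float:
    """Map a uniform sample u in (0, 1) to a Cauchy sample via the inverse CDF."""
    return scale * math.tan(math.pi * (u - 0.5))


def sample_cauchy(n: int, seed: int, scale: float = 2.0) -> list:
    """Draw n Cauchy samples reproducibly from a seeded RNG (illustrative only)."""
    rng = random.Random(seed)
    return [cauchy_from_uniform(rng.random(), scale) for _ in range(n)]


if __name__ == "__main__":
    # The median of the distribution sits at u = 0.5; the upper quartile equals the scale.
    print(cauchy_from_uniform(0.5), cauchy_from_uniform(0.75))
    print(sample_cauchy(5, seed=31))
```

Heavy-tailed inputs like these occasionally produce very large magnitudes, which presumably is why the test set includes them alongside the normal cases: they exercise a kernel's numerical behaviour under mixed precision.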
benchmarks/gpu_mode/trimul/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/gpu_mode/vecadd/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: Float16 Vector Addition
2
+
3
+ Evolve a Triton kernel for float16 vector addition using SkyDiscover.
4
+
5
+ **Operation:** `C = A + B` (element-wise, float16)
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/vecadd/initial_program.py \
14
+ benchmarks/gpu_mode/vecadd/evaluator.py \
15
+ -c benchmarks/gpu_mode/vecadd/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness weight:** 0.3 (must return float16, rtol/atol=1e-3)
22
+ - **Speedup weight:** 1.0 (geometric mean vs PyTorch reference, capped at 10x)
23
+ - **Combined:** `0.3 * correctness + speedup`
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/vecadd/initial_program.py \
31
+ benchmarks/gpu_mode/vecadd/evaluator.py \
32
+ -c benchmarks/gpu_mode/vecadd/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
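The rule in the Scoring section above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shared evaluator's actual code: the constant names mirror those in `vecadd/reference.py`, while `combined_score`, the choice to cap the geometric mean (rather than each per-case speedup), and the zero score for an incorrect kernel are all assumptions.

```python
import math

# Values taken from the Scoring section; names mirror vecadd/reference.py.
CORRECTNESS_WEIGHT = 0.3
SPEED_WEIGHT = 1.0
SPEED_MAX_REWARD = 10.0


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combined_score(correct: bool, speedups: list) -> float:
    """Hypothetical combined score: 0.3 * correctness + capped speedup.

    Assumption: an incorrect kernel earns nothing, and the cap applies to
    the geometric mean of the per-case speedups.
    """
    if not correct or not speedups:
        return 0.0
    speedup = min(geometric_mean(speedups), SPEED_MAX_REWARD)
    return CORRECTNESS_WEIGHT * 1.0 + SPEED_WEIGHT * speedup


if __name__ == "__main__":
    # Per-case speedups of 2x and 8x have a geometric mean of 4x.
    print(combined_score(True, [2.0, 8.0]))
```

Using a geometric rather than arithmetic mean keeps one very large speedup on a single shape from dominating the score.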
benchmarks/gpu_mode/vecadd/config.yaml ADDED
@@ -0,0 +1,50 @@
1
+ # GPU Mode: Float16 Vector Addition
2
+ # Usage: uv run skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy> -i 50
3
+
4
+ max_iterations: 100
5
+ checkpoint_interval: 10
6
+ log_level: INFO
7
+
8
+ llm:
9
+ models:
10
+ - name: "gpt-5"
11
+ weight: 1.0
12
+ api_base: https://api.openai.com/v1
13
+ temperature: 0.7
14
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject both temperature and top_p
15
+ max_tokens: 32000
16
+ timeout: 600
17
+
18
+ prompt:
19
+ system_message: |
20
+ You are an expert Triton kernel engineer. Output ONLY Python code - no explanations.
21
+
22
+ REQUIRED OUTPUT STRUCTURE:
23
+ 1. Imports: torch, triton, triton.language as tl
24
+ 2. @triton.jit kernel function(s)
25
+ 3. def custom_kernel(data) wrapper - REQUIRED entry point
26
+
27
+ Task: Optimize float16 vector addition kernel. C = A + B
28
+ Input: Tuple of (A, B) tensors of shape (N, N) and dtype torch.float16
29
+ Output: Tensor of shape (N, N) and dtype torch.float16
30
+ N can be: 256, 512, 1024, 2048, 4096, 8192
31
+
32
+ Optimization tips:
33
+ - Block size tuning (512, 1024, 2048, 4096)
34
+ - Use @triton.autotune for automatic parameter tuning
35
+ - Vectorized loads for memory operations
36
+ - Grid configuration for occupancy
37
+ - Memory coalescing for sequential access patterns
38
+
39
+ MUST use @triton.jit decorator. MUST return float16 tensor.
40
+ Output complete, working code in a single ```python``` block.
41
+
42
+ evaluator:
43
+ timeout: 600
44
+ max_retries: 3
45
+ cascade_evaluation: true
46
+ cascade_thresholds: [0.4, 0.3]
47
+
48
+ diff_based_generation: true
49
+ max_solution_length: 60000
50
+ random_seed: 42
benchmarks/gpu_mode/vecadd/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for float16 Vector Addition — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/vecadd/initial_program.py ADDED
@@ -0,0 +1,39 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial float16 vector addition with Triton kernel.
4
+ """
5
+
6
+ import torch
7
+ import triton
8
+ import triton.language as tl
9
+
10
+
11
+ @triton.jit
12
+ def vecadd_kernel(
13
+ a_ptr, b_ptr, c_ptr,
14
+ n_elements,
15
+ BLOCK_SIZE: tl.constexpr,
16
+ ):
17
+ pid = tl.program_id(0)
18
+ block_start = pid * BLOCK_SIZE
19
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
20
+ mask = offsets < n_elements
21
+
22
+ a = tl.load(a_ptr + offsets, mask=mask)
23
+ b = tl.load(b_ptr + offsets, mask=mask)
24
+ c = a + b
25
+
26
+ tl.store(c_ptr + offsets, c, mask=mask)
27
+
28
+
29
+ def custom_kernel(data):
30
+ a, b = data
31
+ a = a.contiguous()
32
+ b = b.contiguous()
33
+ c = torch.empty_like(a)
34
+ n_elements = a.numel()
35
+ BLOCK_SIZE = 1024
36
+ grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
37
+ vecadd_kernel[grid](a, b, c, n_elements, BLOCK_SIZE=BLOCK_SIZE)
38
+ return c
39
+ # EVOLVE-BLOCK-END
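The launch arithmetic in `custom_kernel` above (one program per `BLOCK_SIZE` elements, with a mask on the final block) can be checked without a GPU. The sketch below reimplements the ceiling division that `triton.cdiv` performs in plain Python; `cdiv` and `launch_plan` are hypothetical helper names used only for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def launch_plan(n: int, block_size: int = 1024):
    """Return (number of programs, active lanes in the last program)
    for an (n, n) input flattened to n * n elements."""
    n_elements = n * n
    num_programs = cdiv(n_elements, block_size)
    # Lanes in the last program that pass the `offsets < n_elements` mask.
    tail = n_elements - (num_programs - 1) * block_size
    return num_programs, tail


if __name__ == "__main__":
    for n in (256, 1024, 8192):
        print(n, launch_plan(n))
```

For the power-of-two sizes listed in the task, every block is full, so the mask only matters if the kernel is reused on sizes that are not multiples of the block size.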
benchmarks/gpu_mode/vecadd/reference.py ADDED
@@ -0,0 +1,96 @@
1
+ """
2
+ Reference implementation for float16 vector addition Triton kernel.
3
+ C = A + B
4
+ """
5
+
6
+ import math
7
+ try:
8
+ import torch
9
+ except ImportError:
10
+ torch = None # Modal-only mode — functions below won't be called locally
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Reward parameters
14
+ # ---------------------------------------------------------------------------
15
+
16
+ CORRECTNESS_WEIGHT = 0.3
17
+ SPEED_WEIGHT = 1.0
18
+ SPEED_MAX_REWARD = 10.0
19
+
20
+ # ---------------------------------------------------------------------------
21
+ # Test / benchmark cases
22
+ # ---------------------------------------------------------------------------
23
+
24
+ TEST_CASES = [
25
+ {"N": 256, "seed": 42},
26
+ {"N": 512, "seed": 123},
27
+ {"N": 1024, "seed": 456},
28
+ {"N": 2048, "seed": 789},
29
+ ]
30
+
31
+ BENCHMARK_CASES = [
32
+ {"N": 1024, "seed": 1001},
33
+ {"N": 2048, "seed": 1002},
34
+ {"N": 4096, "seed": 1003},
35
+ {"N": 8192, "seed": 1004},
36
+ ]
37
+
38
+ # ---------------------------------------------------------------------------
39
+ # Reference kernel
40
+ # ---------------------------------------------------------------------------
41
+
42
+
43
+ def ref_kernel(data):
44
+ a, b = data
45
+ return a + b
46
+
47
+
48
+ def generate_input(N, seed):
49
+ gen = torch.Generator(device="cuda")
50
+ gen.manual_seed(seed)
51
+ a = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
52
+ b = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
53
+ return (a, b)
54
+
55
+
56
+ def check_implementation(data, output, rtol=1e-3, atol=1e-3):
57
+ ref_out = ref_kernel(data)
58
+ if output.shape != ref_out.shape:
59
+ return False, f"Shape mismatch: expected {ref_out.shape}, got {output.shape}"
60
+ if output.dtype != torch.float16:
61
+ return False, f"Dtype mismatch: expected float16, got {output.dtype}"
62
+ if torch.allclose(output, ref_out, rtol=rtol, atol=atol):
63
+ return True, "Match"
64
+ diff = torch.abs(output.float() - ref_out.float())
65
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
66
+
67
+
68
+ # ---------------------------------------------------------------------------
69
+ # Self-contained reference code for Modal execution
70
+ # ---------------------------------------------------------------------------
71
+
72
+ MODAL_REFERENCE_CODE = '''
73
+ import torch
74
+
75
+ def ref_kernel(data):
76
+ a, b = data
77
+ return a + b
78
+
79
+ def generate_input(N, seed):
80
+ gen = torch.Generator(device="cuda")
81
+ gen.manual_seed(seed)
82
+ a = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
83
+ b = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
84
+ return (a, b)
85
+
86
+ def check_implementation(data, output, rtol=1e-3, atol=1e-3):
87
+ ref_out = ref_kernel(data)
88
+ if output.shape != ref_out.shape:
89
+ return False, f"Shape mismatch: expected {ref_out.shape}, got {output.shape}"
90
+ if output.dtype != torch.float16:
91
+ return False, f"Dtype mismatch: expected float16, got {output.dtype}"
92
+ if torch.allclose(output, ref_out, rtol=rtol, atol=atol):
93
+ return True, "Match"
94
+ diff = torch.abs(output.float() - ref_out.float())
95
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
96
+ '''
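`check_implementation` above relies on `torch.allclose(output, ref_out, rtol, atol)`, which accepts an element when `|output - ref| <= atol + rtol * |ref|`. A scalar, pure-Python sketch of that rule with the vecadd tolerances is shown below; `is_close` and `all_close` are hypothetical names, and NaN handling is left out.

```python
def is_close(output: float, reference: float,
             rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """Elementwise acceptance rule used by torch.allclose (NaNs not handled)."""
    return abs(output - reference) <= atol + rtol * abs(reference)


def all_close(outputs, references, rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """True when every pair of values satisfies the tolerance rule."""
    return all(is_close(o, r, rtol, atol) for o, r in zip(outputs, references))


if __name__ == "__main__":
    print(is_close(1.0015, 1.0))   # within atol + rtol * |ref|
    print(is_close(0.0011, 0.0))   # near zero only atol applies
```

Near zero the absolute term dominates, while for large values the relative term does, so a fixed absolute error can pass on large outputs and still fail on small ones.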
benchmarks/gpu_mode/vecadd/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/kernelbench/README.md ADDED
@@ -0,0 +1,211 @@
1
+ # KernelBench Integration with SkyDiscover
2
+
3
+ GPU kernel optimization tasks using the [KernelBench](https://github.com/ScalingIntelligence/KernelBench) dataset and evaluation protocol.
4
+
5
+ ## Overview
6
+
7
+ The KernelBench integration allows you to run SkyDiscover on any problem from the KernelBench dataset. The framework automatically:
8
+
9
+ 1. Fetches the reference implementation of the target kernel from KernelBench
10
+ 2. Creates an initial_program.py with EVOLVE-BLOCK markers
11
+ 3. Configures the evaluator with problem-specific parameters
12
+ 4. Runs the optimization using either a containerized or native Python evaluator
13
+
14
+ The evaluator uses the KernelBench evaluation infrastructure to measure speedup over PyTorch eager execution.
15
+
16
+ ### Evaluator Modes
17
+
18
+ - **Containerized (Docker)**: Runs evaluation inside a Docker container (default)
19
+ - **Native Python**: Runs evaluation directly as Python code (for clusters without Docker/Podman)
20
+
21
+ ## Directory Structure
22
+
23
+ ```
24
+ benchmarks/kernelbench/
25
+ ├── config.yaml # System prompt + search/evaluator settings
26
+ ├── resolver.py # Benchmark loader (fetches target problems from KernelBench)
27
+ ├── requirements.txt # Resolver dependencies (kernelbench library)
28
+ └── evaluator/ # Self-contained Docker benchmark
29
+ ├── Dockerfile # Container image definition
30
+ ├── evaluate.sh # Entrypoint (receives solution path)
31
+ ├── evaluator.py # Scoring logic using KernelBench
32
+ ├── requirements.txt # Evaluator dependencies (kernelbench[gpu])
33
+ └── wrapper.py # JSON protocol wrapper
34
+ ```
35
+
36
+ **Note:** The `run_and_check.py` script is downloaded directly from the KernelBench repository during Docker build (pinned to commit `423217d` for reproducibility). To update, modify the `KERNELBENCH_COMMIT` build arg in the Dockerfile.
37
+
38
+ ## Installation
39
+
40
+ Before using the KernelBench integration, install the required dependencies:
41
+
42
+ ```bash
43
+ # Install KernelBench library (required for problem fetching)
44
+ uv pip install -r benchmarks/kernelbench/requirements.txt
45
+ ```
46
+
47
+ **Note:** The resolver (problem fetching) only needs the base `kernelbench` package. The containerized evaluator installs `kernelbench[gpu]` for GPU support.
48
+
49
+ ## Quick Start
50
+
51
+ ### Using Docker (Default)
52
+
53
+ Edit `benchmarks/kernelbench/config.yaml` to select a target kernel from the [KernelBench database](https://huggingface.co/datasets/ScalingIntelligence/KernelBench):
54
+
55
+ ```yaml
56
+ benchmark:
57
+ # KernelBench problem specification
58
+ level: 2 # Problem difficulty level (1, 2, 3 or 4)
59
+ problem_id: 5 # Specific problem ID within the level
60
+ ```
61
+
62
+ Then, run optimization on this problem:
63
+
64
+ ```bash
65
+ # algo can be "adaevolve", "evox", "topk", "beam_search", "best_of_n", etc.
66
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
67
+ -c benchmarks/kernelbench/config.yaml \
68
+ --search <algo> \
69
+ --iterations 50
70
+ ```
71
+
72
+ ### Using Native Python (No Docker Required)
73
+
74
+ For clusters without Docker/Podman privileges, you can run the evaluator as native Python code.
75
+
76
+ #### 1. Install Dependencies
77
+
78
+ ```bash
79
+ # Install KernelBench with GPU support
80
+ pip install -r benchmarks/kernelbench/evaluator/requirements.txt
81
+ ```
82
+
83
+ #### 2. Configure Native Mode
84
+
85
+ Edit `benchmarks/kernelbench/config.yaml`:
86
+
87
+ ```yaml
88
+ benchmark:
89
+ enabled: true
90
+ name: kernelbench
91
+ resolver: benchmarks.kernelbench.resolver
92
+
93
+ # Set to false to use native Python evaluator (no Docker)
94
+ use_docker: false
95
+
96
+ level: 2
97
+ problem_id: 11
98
+ # ... rest of config
99
+ ```
100
+
101
+ #### 3. Run Optimization
102
+
103
+ ```bash
104
+ # algo can be "adaevolve", "evox", "topk", "beam_search", "best_of_n", etc.
105
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
106
+ -c benchmarks/kernelbench/config.yaml \
107
+ --search <algo> \
108
+ --iterations 50
109
+ ```
110
+
111
+ **Note:** The `run_and_check.py` script from KernelBench will be automatically downloaded on first run.
112
+
113
+ **Note:** No initial_program argument is needed - it is fetched automatically based on the `benchmark` section in config.yaml.
114
+
115
+ ## Configuration Reference
116
+
117
+ ### Benchmark Section
118
+
119
+ The `benchmark` section in `config.yaml` controls problem loading:
120
+
121
+ ```yaml
122
+ benchmark:
123
+ enabled: true # Enable benchmark loader
124
+ name: kernelbench # Benchmark name (for logging)
125
+ resolver: benchmarks.kernelbench.resolver # Python module path
126
+
127
+ # Evaluator mode
128
+ use_docker: true # true: containerized (Docker), false: native Python
129
+
130
+ # Problem specification
131
+ level: 1 # Difficulty: 1 (easy), 2 (medium), 3 (hard), 4 (very hard)
132
+ problem_id: 1 # Problem ID within the level
133
+
134
+ # Dataset source
135
+ dataset_src: huggingface # 'huggingface' or 'local'
136
+ dataset_name: ScalingIntelligence/KernelBench # HF dataset name
137
+
138
+ # Evaluation settings
139
+ eval_mode: local # 'local' or 'modal'
140
+ gpu: H100 # GPU type: H100, A100, etc.
141
+ num_correct_trials: 5 # Correctness validation runs
142
+ num_perf_trials: 100 # Performance measurement runs
143
+ ```
144
+
145
+ ### Environment Variables
146
+
147
+ The resolver provides these environment variables to the evaluator:
148
+
149
+ - `KERNELBENCH_LEVEL`: Problem difficulty level (1, 2, 3, or 4)
150
+ - `KERNELBENCH_PROBLEM_ID`: Specific problem within the level
151
+ - `KERNELBENCH_EVAL_MODE`: Evaluation mode (local, modal)
152
+ - `KERNELBENCH_GPU`: GPU type (H100, A100, etc.)
153
+ - `KERNELBENCH_NUM_CORRECT_TRIALS`: Number of correctness validation runs
154
+ - `KERNELBENCH_NUM_PERF_TRIALS`: Number of performance measurement runs
155
+ - `KERNELBENCH_TIMEOUT`: Timeout per evaluation in seconds (default: 300)
156
+
157
+ These variables are passed directly to the evaluator (not set globally), ensuring isolation between concurrent runs.
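The isolation described above can be illustrated with a minimal sketch. It assumes the variables are handed to the evaluator process through a per-process environment mapping; the helper name `read_in_child` is hypothetical and is not part of SkyDiscover.

```python
import os
import subprocess
import sys


def read_in_child(name: str, extra_env: dict[str, str]) -> str:
    """Run a child Python process that prints one environment variable.

    The extra variables are merged into a copy of the current environment,
    so the parent's os.environ is never modified.
    """
    child_env = os.environ.copy()
    child_env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ[{name!r}])"],
        env=child_env,          # only this child sees the extra variables
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Because each run gets its own copy of the environment, two concurrent runs with different `KERNELBENCH_LEVEL` values would not overwrite each other's settings.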
158
+
159
+ ### Evaluation Modes
160
+
161
+ - **local**: Run evaluation on your local machine (requires GPU)
162
+ - **modal**: Run evaluation on Modal's cloud GPUs (requires Modal setup)
163
+
164
+ ### GPU Types
165
+
166
+ The list of currently supported GPU types can be found [here](https://github.com/ScalingIntelligence/KernelBench/blob/423217d9fda91e0c2d67e4a43bf62f96f6d104f1/scripts/run_and_check.py#L16).
167
+
168
+ ## Metrics
169
+
170
+ The evaluator returns:
171
+
172
+ - **combined_score**: Speedup over PyTorch eager execution (primary metric)
173
+ - **speedup_over_eager**: Same as combined_score
174
+ - **speedup_over_compile**: Speedup over torch.compile()
175
+ - **kernel_time_ms**: Execution time of optimized kernel
176
+ - **ref_eager_time_ms**: Reference eager execution time
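As a rough sketch of where these metrics come from: the evaluator in this commit scans the text output of `run_and_check.py` for marker strings and pulls the numbers out with regular expressions. The marker strings and patterns below are taken from `evaluator.py`; the `[Eval]` prefix and the exact layout of `SAMPLE_OUTPUT` are assumptions made for illustration, and `parse_metrics` is a hypothetical helper name.

```python
import re

# Assumed sample; real run_and_check.py output may be laid out differently.
SAMPLE_OUTPUT = """\
[Eval] Custom Kernel exec time: 1.25 ms
[Eval] PyTorch Reference Eager exec time: 2.50 ms
[Eval] Speedup over eager: 2.00x
[Eval] Speedup over torch.compile: 1.10x
"""

# metric name -> (marker substring, regex capturing the number)
PATTERNS = {
    "speedup_over_eager": ("Speedup over eager:", r"([0-9.]+)x"),
    "speedup_over_compile": ("Speedup over torch.compile:", r"([0-9.]+)x"),
    "kernel_time_ms": ("Custom Kernel exec time:", r"([0-9.]+) ms"),
    "ref_eager_time_ms": ("PyTorch Reference Eager exec time:", r"([0-9.]+) ms"),
}


def parse_metrics(stdout: str) -> dict[str, float]:
    """Extract the metrics listed above from evaluation output."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        for key, (marker, pattern) in PATTERNS.items():
            if marker in line:
                match = re.search(pattern, line)
                if match:
                    metrics[key] = float(match.group(1))
    # combined_score mirrors speedup_over_eager, as described in the list above.
    if "speedup_over_eager" in metrics:
        metrics["combined_score"] = metrics["speedup_over_eager"]
    return metrics
```

If no eager speedup line is found, the sketch simply omits `combined_score`; the actual evaluator instead reports a failure score in that case.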
177
+
178
+
179
+ ## Traditional Usage (Manual Initial Program)
180
+
181
+ You can still provide an initial program manually if needed:
182
+
183
+ ```bash
184
+ # Run with explicit initial program
185
+ uv run skydiscover-run my_kernel.py benchmarks/kernelbench/evaluator/ \
186
+ -c benchmarks/kernelbench/config.yaml \
187
+ --search <algo>
188
+ ```
189
+
190
+ ## Troubleshooting
191
+
192
+ ### Error: "kernelbench package not found"
193
+
194
+ Install KernelBench:
195
+ ```bash
196
+ pip install "kernelbench[gpu] @ git+https://github.com/ScalingIntelligence/KernelBench.git"
197
+ ```
198
+
199
+ ### Error: "Failed to resolve benchmark problem"
200
+
201
+ Check that:
202
+ 1. `benchmark.enabled` is `true` in config
203
+ 2. `level` and `problem_id` are valid
204
+ 3. KernelBench package is installed
205
+ 4. You have internet access (for HuggingFace dataset)
206
+
207
+ ### Generated Files Location
208
+
209
+ The framework creates temporary files in `/tmp/skydiscover_kernelbench_*/`:
210
+ - `initial_program.py`: Generated initial program
211
+ - The evaluator is not copied there; the existing `benchmarks/kernelbench/evaluator/` directory is used in place
benchmarks/kernelbench/config.yaml ADDED
@@ -0,0 +1,86 @@
1
+ # KernelBench optimization benchmark configuration
2
+ # Usage: skydiscover-run evaluator/ -c config.yaml -s <strategy>
3
+ # Note: initial_program is automatically fetched from KernelBench dataset, based on the `level` and `problem_id` fields.
4
+
5
+ language: python
6
+
7
+ # Benchmark loader configuration
8
+ benchmark:
9
+ enabled: true
10
+ name: kernelbench
11
+ resolver: benchmarks.kernelbench.resolver
12
+
13
+ # Evaluator mode: set to false for native Python (no Docker), true for containerized
14
+ use_docker: true # Set to false when running on clusters without Docker/Podman privileges
15
+
16
+ # KernelBench problem specification
17
+ level: 1 # Problem difficulty level (1, 2, 3 or 4)
18
+ problem_id: 1 # Specific problem ID within the level
19
+
20
+ dataset_src: huggingface # 'huggingface' or 'local'
21
+ dataset_name: ScalingIntelligence/KernelBench
22
+
23
+ # Evaluation configuration
24
+ eval_mode: local # 'local' or 'modal'
25
+ gpu: H100 # GPU type for evaluation
26
+ num_correct_trials: 5 # Number of correctness validation runs
27
+ num_perf_trials: 100 # Number of performance measurement runs
28
+
29
+ diff_based_generation: true
30
+ max_iterations: 100
31
+ checkpoint_interval: 10
32
+ max_solution_length: 60000
33
+
34
+ llm:
35
+ api_base: "${BASE_URL}"
36
+ api_key: "${API_KEY}"
37
+ models:
38
+ - name: "gpt-5"
39
+ weight: 1.0
40
+ max_tokens: 32000
41
+ timeout: 600
42
+
43
+ prompt:
44
+ system_message: |-
45
+ You are an expert in GPU kernel optimization and PyTorch performance engineering with deep expertise
46
+ in writing high-performance CUDA kernels, Triton kernels, and optimized PyTorch operations.
47
+
48
+ PROBLEM SPECIFICATION:
49
+
50
+ Your task is to optimize a PyTorch neural network operation to achieve maximum speedup
51
+ over the baseline execution. The execution is evaluated on GPU hardware and compared against:
52
+ 1. PyTorch eager mode (baseline)
53
+ 2. torch.compile() optimization
54
+
55
+ PERFORMANCE METRICS:
56
+
57
+ 1. **speedup_over_eager**: Speedup compared to PyTorch eager execution (PRIMARY OBJECTIVE - maximize)
58
+ 2. **combined_score**: Same as speedup_over_eager (used for optimization)
59
+ 3. **speedup_over_compile**: Speedup compared to torch.compile() (SECONDARY - maximize)
60
+ 4. **kernel_time_ms**: Execution time of your optimized kernel in milliseconds (minimize)
61
+ 5. **ref_eager_time_ms**: Reference eager execution time in milliseconds (for comparison)
62
+
63
+ OPTIMIZATION STRATEGIES:
64
+
65
+ - Consider writing custom kernels in CUDA or Triton
66
+ - Use efficient memory access patterns (coalesced reads/writes)
67
+ - Minimize memory transfers between CPU and GPU
68
+ - Leverage tensor cores when applicable
69
+ - Use fused operations to reduce kernel launches
70
+ - Optimize for the specific GPU architecture (H100, A100, etc.)
71
+ - Use appropriate data types (fp16, bf16, fp32)
72
+ - Minimize synchronization points
73
+
74
+ TECHNICAL REQUIREMENTS:
75
+
76
+ - **Correctness**: Your implementation must produce numerically correct results
77
+ - **Determinism**: Use fixed random seeds if employing stochastic methods
78
+ - **Error handling**: Graceful handling of edge cases and invalid inputs
79
+ - **GPU compatibility**: Code must run on the specified GPU hardware
80
+
81
+ # Raise this from the SkyDiscover default of 500, which causes the model to focus only on simplification
82
+ suggest_simplification_after_chars: 5000
83
+
84
+ evaluator:
85
+ timeout: 600
86
+ max_retries: 3
benchmarks/kernelbench/evaluator/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM python:3.10-slim
2
+ WORKDIR /benchmark
3
+
4
+ # Install system dependencies
5
+ RUN apt-get update && apt-get install -y \
6
+ git \
7
+ curl \
8
+ && rm -rf /var/lib/apt/lists/*
9
+
10
+ COPY requirements.txt .
11
+ RUN pip install --no-cache-dir -r requirements.txt
12
+
13
+ # wrapper.py provides backwards compatibility for old Python-based evaluators
14
+ # that define evaluate(program_path) -> dict. Bridges them to the container
15
+ # JSON protocol. Source of truth: skydiscover/evaluation/wrapper.py
16
+ COPY . .
17
+
18
+ # Download run_and_check.py from KernelBench repository (pinned to specific commit)
19
+ ARG KERNELBENCH_COMMIT=423217d
20
+ RUN curl -fsSL -o run_and_check.py \
21
+ "https://raw.githubusercontent.com/ScalingIntelligence/KernelBench/${KERNELBENCH_COMMIT}/scripts/run_and_check.py"
22
+
23
+ RUN chmod +x evaluate.sh
24
+
25
+ ENTRYPOINT ["./evaluate.sh"]
benchmarks/kernelbench/evaluator/evaluate.sh ADDED
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ PROGRAM="$1"
5
+
6
+ python /benchmark/evaluator.py "$PROGRAM"
benchmarks/kernelbench/evaluator/evaluator.py ADDED
@@ -0,0 +1,227 @@
1
+ """
2
+ Evaluator for KernelBench problems using kernelbench evaluation logic.
3
+
4
+ This evaluator can run inside a Docker container or as a native Python script,
5
+ and evaluates candidate kernel programs against KernelBench reference implementations.
6
+ """
7
+
8
+ import os
9
+ import re
10
+ import subprocess
11
+ import sys
12
+ import tempfile
13
+ import traceback
14
+ from pathlib import Path
15
+
16
+
17
+ def ensure_run_and_check(evaluator_dir: Path):
18
+ """Download run_and_check.py if not present.
19
+
20
+ This allows the evaluator to work in native Python mode without Docker,
21
+ automatically fetching the KernelBench evaluation script on first use.
22
+
23
+ Args:
24
+ evaluator_dir: Directory where the evaluator is located
25
+
26
+ Returns:
27
+ Path to run_and_check.py
28
+ """
29
+ run_and_check_path = evaluator_dir / "run_and_check.py"
30
+
31
+ if not run_and_check_path.exists():
32
+ import urllib.request
33
+
34
+ commit = "423217d"
35
+ url = f"https://raw.githubusercontent.com/ScalingIntelligence/KernelBench/{commit}/scripts/run_and_check.py"
36
+
37
+ print(
38
+ f"[INFO] Downloading run_and_check.py from KernelBench (commit {commit})...",
39
+ file=sys.stderr,
40
+ )
41
+ try:
42
+ urllib.request.urlretrieve(url, run_and_check_path)
43
+ print(f"[INFO] Downloaded to {run_and_check_path}", file=sys.stderr)
44
+ except Exception as e:
45
+ raise RuntimeError(f"Failed to download run_and_check.py: {e}") from e
46
+
47
+ return run_and_check_path
48
+
49
+
50
+ def evaluate(program_path: str):
51
+ """
52
+ Evaluate a candidate kernel program against the reference using run_and_check.py.
53
+
54
+ Args:
55
+ program_path: Path to the candidate program file
56
+
57
+ Returns:
58
+ Dictionary with combined_score (higher is better) and optional artifacts
59
+ """
60
+ try:
61
+ # Read configuration from environment variables
62
+ # These are injected by the benchmark setup
63
+ level = int(os.environ.get("KERNELBENCH_LEVEL", "1"))
64
+ problem_id = int(os.environ.get("KERNELBENCH_PROBLEM_ID", "1"))
65
+ eval_mode = os.environ.get("KERNELBENCH_EVAL_MODE", "local")
66
+ gpu = os.environ.get("KERNELBENCH_GPU", "H100")
67
+ num_correct_trials = int(os.environ.get("KERNELBENCH_NUM_CORRECT_TRIALS", "5"))
68
+ num_perf_trials = int(os.environ.get("KERNELBENCH_NUM_PERF_TRIALS", "100"))
69
+ timeout = int(os.environ.get("KERNELBENCH_TIMEOUT", "300"))
70
+
71
+ # Read the program and wrap it in ModelNew class for KernelBench format
72
+ with open(program_path, "r") as f:
73
+ program_content = f.read()
74
+
75
+ is_triton = bool(
76
+ re.search(r"^(import triton|from triton)", program_content, flags=re.MULTILINE)
77
+ )
78
+
79
+ # Create a temporary file with ModelNew wrapper
80
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as tmp_file:
81
+ # Replace class Model with class ModelNew (if not already ModelNew)
82
+ converted_content = program_content
83
+ if "class ModelNew" not in converted_content:
84
+ converted_content = re.sub(
85
+ r"^class Model(?=[(:])", "class ModelNew", converted_content, flags=re.MULTILINE
86
+ )
87
+ # Fix super() calls - use modern Python 3 super() without arguments
88
+ converted_content = re.sub(r"super\(Model,\s*self\)", "super()", converted_content)
89
+ converted_content = re.sub(r"super\(Model,\s*cls\)", "super()", converted_content)
90
+
91
+ tmp_file.write(converted_content)
92
+ kernel_src_path = tmp_file.name
93
+
94
+ try:
95
+ # Ensure run_and_check.py is available (downloads if needed)
96
+ evaluator_dir = Path(__file__).parent
97
+ run_and_check_path = ensure_run_and_check(evaluator_dir)
98
+
99
+ # Build command to run run_and_check.py
100
+ cmd = [
101
+ sys.executable,
102
+ str(run_and_check_path),
103
+ "ref_origin=kernelbench",
104
+ f"level={level}",
105
+ f"problem_id={problem_id}",
106
+ f"kernel_src_path={kernel_src_path}",
107
+ f"eval_mode={eval_mode}",
108
+ f"gpu={gpu}",
109
+ f"num_correct_trials={num_correct_trials}",
110
+ f"num_perf_trials={num_perf_trials}",
111
+ f"timeout={timeout}",
112
+ "check_kernel=False", # Disable static checker to allow reference code
113
+ ]
114
+
115
+ # Setting the backend is important for KernelBench triton evaluation to work
116
+ if is_triton:
117
+ cmd.append("backend=triton")
118
+
119
+ # Set up environment
120
+ env = os.environ.copy()
121
+
122
+ # Run the evaluation from the evaluator directory
123
+ print(f"[INFO] Running evaluation command: {' '.join(cmd)}", file=sys.stderr)
124
+ result = subprocess.run(
125
+ cmd,
126
+ capture_output=True,
127
+ text=True,
128
+ timeout=timeout,
129
+ cwd=str(evaluator_dir),
130
+ env=env,
131
+ )
132
+ finally:
133
+ # Clean up temporary file
134
+ try:
135
+ os.unlink(kernel_src_path)
136
+ except Exception:
137
+ pass
138
+
139
+ # Parse the output to extract speedup
140
+ stdout = result.stdout
141
+ stderr = result.stderr
142
+
143
+ if result.returncode != 0:
144
+ print(
145
+ f"[ERROR] Evaluation failed with return code {result.returncode}", file=sys.stderr
146
+ )
147
+ print(f"[ERROR] stdout: {stdout}", file=sys.stderr)
148
+ print(f"[ERROR] stderr: {stderr}", file=sys.stderr)
149
+ return {
150
+ "combined_score": -100.0,
151
+ "error": f"Evaluation subprocess failed: {stderr[:500]}",
152
+ "return_code": result.returncode,
153
+ }
154
+
155
+ # Extract speedup from output
156
+ speedup_eager = None
157
+ speedup_compile = None
158
+ kernel_time = None
159
+ ref_eager_time = None
160
+
161
+ for line in stdout.split("\n"):
162
+ if "Speedup over eager:" in line:
163
+ match = re.search(r"([0-9.]+)x", line)
164
+ if match:
165
+ speedup_eager = float(match.group(1))
166
+ elif "Speedup over torch.compile:" in line:
167
+ match = re.search(r"([0-9.]+)x", line)
168
+ if match:
169
+ speedup_compile = float(match.group(1))
170
+ elif "Custom Kernel exec time:" in line:
171
+ match = re.search(r"([0-9.]+) ms", line)
172
+ if match:
173
+ kernel_time = float(match.group(1))
174
+ elif "PyTorch Reference Eager exec time:" in line:
175
+ match = re.search(r"([0-9.]+) ms", line)
176
+ if match:
177
+ ref_eager_time = float(match.group(1))
178
+
179
+ # If we found speedup, use it as the score
180
+ if speedup_eager is not None and speedup_eager > 0:
181
+ return {
182
+ "combined_score": float(speedup_eager),
183
+ "speedup_over_eager": speedup_eager,
184
+ "speedup_over_compile": speedup_compile,
185
+ "kernel_time_ms": kernel_time,
186
+ "ref_eager_time_ms": ref_eager_time,
187
+ "eval_mode": eval_mode,
188
+ "gpu": gpu,
189
+ }
190
+ else:
191
+ # Kernel failed correctness or didn't compile
192
+ # Extract only relevant output starting from [Eval]
193
+ stdout_excerpt = stdout
194
+ if "[Eval]" in stdout:
195
+ eval_start = stdout.find("[Eval]")
196
+ stdout_excerpt = stdout[eval_start:]
197
+
198
+ # Take last 5000 chars if too long
199
+ if len(stdout_excerpt) > 5000:
200
+ stdout_excerpt = stdout_excerpt[-5000:]
201
+
202
+ return {
203
+ "combined_score": -100.0,
204
+ "error": "Kernel failed correctness check or did not compile",
205
+ "stdout_excerpt": stdout_excerpt,
206
+ }
207
+
208
+ except subprocess.TimeoutExpired:
209
+ return {
210
+ "combined_score": -1.0,
211
+ "error": f"Evaluation timed out after {timeout} seconds",
212
+ }
213
+ except Exception as e:
214
+ traceback.print_exc()
215
+ return {
216
+ "combined_score": -100.0,
217
+ "error": f"Error during evaluation: {str(e)}",
218
+ "error_type": type(e).__name__,
219
+ }
220
+
221
+
222
+ if __name__ == "__main__":
223
+ # Backwards-compat: bridges old evaluate() -> dict to the container JSON
224
+ # protocol. wrapper.py is copied from skydiscover/evaluation/wrapper.py.
225
+ from wrapper import run
226
+
227
+ run(evaluate)
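The class-renaming step in `evaluate()` can be tried in isolation. The sketch below reuses the regular expressions from the file above; `to_model_new` is a hypothetical helper name, and folding the two `super(...)` rewrites into one pattern is a simplification for illustration, not the file's exact code.

```python
import re


def to_model_new(source: str) -> str:
    """Rename a top-level `class Model` to `class ModelNew`.

    The lookahead `(?=[(:])` only matches when `Model` is followed by
    `(` or `:`, so names such as `ModelHelper` are left alone.
    """
    if "class ModelNew" in source:
        return source  # already in the expected form
    source = re.sub(
        r"^class Model(?=[(:])", "class ModelNew", source, flags=re.MULTILINE
    )
    # Old-style super(Model, self) would reference a name that no longer
    # exists after the rename, so switch to the zero-argument form.
    return re.sub(r"super\(Model,\s*(?:self|cls)\)", "super()", source)


SAMPLE = (
    "class Model(nn.Module):\n"
    "    def __init__(self):\n"
    "        super(Model, self).__init__()\n"
    "\n"
    "class ModelHelper:\n"
    "    pass\n"
)
```

Calling `to_model_new(SAMPLE)` renames only the first class and leaves `ModelHelper` untouched.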
benchmarks/kernelbench/evaluator/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ # KernelBench library with GPU support
2
+ kernelbench[gpu] @ git+https://github.com/ScalingIntelligence/KernelBench.git
benchmarks/kernelbench/evaluator/wrapper.py ADDED
@@ -0,0 +1,98 @@
1
+ """Backwards-compat wrapper for old Python-based evaluators.
2
+
3
+ Old-style evaluators define ``evaluate(program_path) -> dict``. This module
4
+ bridges that interface to the container JSON protocol expected by
5
+ ContainerizedEvaluator.
6
+
7
+ Usage — add this to the bottom of your evaluator.py::
8
+
9
+ if __name__ == "__main__":
10
+ from wrapper import run
11
+ run(evaluate)
12
+ """
13
+
14
+ import json
15
+ import sys
16
+ import traceback
17
+
18
+
19
+ def run(evaluate_fn):
20
+ """Call *evaluate_fn*, format the result as container-protocol JSON on stdout.
21
+
22
+ * Reads ``sys.argv[1]`` as the program path.
23
+ * Redirects stdout → stderr while *evaluate_fn* runs so that debug prints
24
+ don't contaminate the JSON output.
25
+ * Separates numeric metrics from non-numeric artifacts.
26
+ * Guarantees ``combined_score`` is always present in metrics.
27
+ """
28
+ if len(sys.argv) < 2:
29
+ print("Usage: evaluator.py <program_path>", file=sys.stderr)
30
+ sys.exit(1)
31
+
32
+ program_path = sys.argv[1]
33
+
34
+ # Redirect stdout → stderr during evaluation so debug prints from
35
+ # the evaluator don't contaminate the JSON output on stdout.
36
+ real_stdout = sys.stdout
37
+ sys.stdout = sys.stderr
38
+ try:
39
+ result = evaluate_fn(program_path)
40
+ except Exception as e:
41
+ sys.stdout = real_stdout
42
+ print(
43
+ json.dumps(
44
+ {
45
+ "status": "error",
46
+ "combined_score": 0.0,
47
+ "metrics": {"combined_score": 0.0},
48
+ "artifacts": {
49
+ "error": str(e),
50
+ "traceback": traceback.format_exc(),
51
+ },
52
+ }
53
+ )
54
+ )
55
+ return
56
+ sys.stdout = real_stdout
57
+
58
+ if not isinstance(result, dict):
59
+ print(
60
+ json.dumps(
61
+ {
62
+ "status": "error",
63
+ "combined_score": 0.0,
64
+ "metrics": {"combined_score": 0.0},
65
+ "artifacts": {
66
+ "error": f"evaluate() returned {type(result).__name__}, expected dict"
67
+ },
68
+ }
69
+ )
70
+ )
71
+ return
72
+
73
+ # Separate numeric metrics from non-numeric artifacts.
74
+ metrics = {}
75
+ artifacts = {}
76
+ for k, v in result.items():
77
+ if isinstance(v, bool):
78
+ metrics[k] = float(v)
79
+ elif isinstance(v, (int, float)):
80
+ metrics[k] = float(v)
81
+ elif isinstance(v, str):
82
+ artifacts[k] = v
83
+ elif isinstance(v, (list, dict)):
84
+ artifacts[k] = json.dumps(v)
85
+
86
+ if "combined_score" not in metrics:
87
+ metrics["combined_score"] = 0.0
88
+
89
+ status = "error" if "error" in artifacts else "success"
90
+ output = {
91
+ "status": status,
92
+ "combined_score": metrics["combined_score"],
93
+ "metrics": metrics,
94
+ }
95
+ if artifacts:
96
+ output["artifacts"] = artifacts
97
+
98
+ print(json.dumps(output))
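To see what the wrapper emits for a given result, the metric/artifact split can be reproduced as a pure function. This is a sketch that mirrors the logic of `run()` above without the `sys.argv` and stdout handling; `split_result` is a hypothetical name.

```python
import json


def split_result(result: dict) -> dict:
    """Build the container-protocol payload from an evaluate() result."""
    metrics: dict[str, float] = {}
    artifacts: dict[str, str] = {}
    for key, value in result.items():
        if isinstance(value, (bool, int, float)):
            metrics[key] = float(value)          # numeric values become metrics
        elif isinstance(value, str):
            artifacts[key] = value               # strings are kept as artifacts
        elif isinstance(value, (list, dict)):
            artifacts[key] = json.dumps(value)   # containers are serialized
        # Any other type, including None, is silently dropped.

    metrics.setdefault("combined_score", 0.0)
    output = {
        "status": "error" if "error" in artifacts else "success",
        "combined_score": metrics["combined_score"],
        "metrics": metrics,
    }
    if artifacts:
        output["artifacts"] = artifacts
    return output
```

One consequence worth noting: a metric the evaluator could not parse, such as `speedup_over_compile` set to `None`, does not appear in the output at all.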
benchmarks/kernelbench/resolver.py ADDED
@@ -0,0 +1,136 @@
1
+ """KernelBench problem resolver for SkyDiscover.
2
+
3
+ This resolver fetches GPU kernel optimization problems from the KernelBench
4
+ dataset and generates the necessary files for SkyDiscover to run optimization.
5
+ """
6
+
7
+ import logging
8
+ from pathlib import Path
9
+ from typing import Any, Dict, Tuple
10
+
11
+ from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver
12
+ from skydiscover.utils.prepare import prepare_program
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
+ class KernelBenchResolver(BenchmarkResolver):
18
+ """Resolves KernelBench problems by fetching from dataset and generating files.
19
+
20
+ The resolver:
21
+ 1. Fetches the reference implementation from KernelBench dataset
22
+ 2. Generates initial_program.py with EVOLVE-BLOCK markers
23
+ 3. Builds the environment variables that are passed to the evaluator
24
+ 4. Returns paths to the generated initial program and existing evaluator
25
+
26
+ Required config parameters:
27
+ - level: Problem difficulty level (1, 2, 3, or 4)
28
+ - problem_id: Specific problem ID within the level
29
+
30
+ Optional config parameters:
31
+ - dataset_src: 'huggingface' (default) or 'local'
32
+ - dataset_name: HuggingFace dataset name (default: 'ScalingIntelligence/KernelBench')
33
+ - eval_mode: 'local' (default) or 'modal'
34
+ - gpu: GPU type for evaluation (default: 'H100')
35
+ - num_correct_trials: Number of correctness validation runs (default: 5)
36
+ - num_perf_trials: Number of performance measurement runs (default: 100)
37
+ """
38
+
39
+ def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
40
+ """Fetch KernelBench problem and generate initial_program + configure evaluator.
41
+
42
+ Args:
43
+ config: Configuration dictionary with 'level', 'problem_id', and optional params
44
+ output_dir: Directory where generated files will be placed
45
+
46
+ Returns:
47
+ BenchmarkResolution with initial program, evaluator path, and evaluator env vars
48
+ """
49
+ # Validate required parameters
50
+ level = config.get("level")
51
+ problem_id = config.get("problem_id")
52
+
53
+ if level is None or problem_id is None:
54
+ raise ValueError(
55
+ "KernelBench resolver requires 'level' and 'problem_id' in config. "
56
+ f"Got: level={level}, problem_id={problem_id}"
57
+ )
58
+
59
+ # Extract optional parameters with defaults
60
+ dataset_src = config.get("dataset_src", "huggingface")
61
+ dataset_name = config.get("dataset_name", "ScalingIntelligence/KernelBench")
62
+ eval_mode = config.get("eval_mode", "local")
63
+ gpu = config.get("gpu", "H100")
64
+ num_correct_trials = config.get("num_correct_trials", 5)
65
+ num_perf_trials = config.get("num_perf_trials", 100)
66
+
67
+ logger.info(f"Resolving KernelBench problem: level={level}, problem_id={problem_id}")
68
+ logger.info(f"Eval mode: {eval_mode}, GPU: {gpu}")
69
+
70
+ # Import KernelBench dataset utilities
71
+ try:
72
+ from kernelbench.dataset import construct_kernelbench_dataset
73
+ except ImportError as e:
74
+ raise ImportError(
75
+ "KernelBench package not found. Install with: "
76
+ "uv pip install 'kernelbench @ git+https://github.com/ScalingIntelligence/KernelBench.git'"
77
+ ) from e
78
+
79
+ # Fetch the problem from KernelBench dataset
80
+ try:
81
+ dataset = construct_kernelbench_dataset(
82
+ level=level,
83
+ source=dataset_src,
84
+ dataset_name=dataset_name,
85
+ )
86
+ problem = dataset.get_problem_by_id(problem_id)
87
+ except Exception as e:
88
+ raise RuntimeError(
89
+ f"Failed to fetch KernelBench problem (level={level}, id={problem_id}): {e}"
90
+ ) from e
91
+
92
+ logger.info(f"Fetched problem: {problem.name} (ID: {problem.problem_id})")
93
+
94
+ # Generate initial_program.py with EVOLVE-BLOCK markers using prepare_program
95
+ output_dir.mkdir(parents=True, exist_ok=True)
96
+ initial_program_path = prepare_program(
97
+ initial_program=problem.code, temp_dir=str(output_dir), temp_files=[]
98
+ )
99
+ logger.info(f"Generated initial program: {initial_program_path}")
100
+
101
+ use_docker = config.get("use_docker", True)
102
+
103
+ # Use evaluator.py file for native mode, directory for container mode
104
+ if use_docker:
105
+ evaluator_path = Path(__file__).parent / "evaluator"
106
+ logger.info("Using containerized evaluator (Docker required)")
107
+ else:
108
+ evaluator_path = Path(__file__).parent / "evaluator" / "evaluator.py"
109
+ logger.info("Using native Python evaluator (no Docker required)")
110
+
111
+ evaluator_env_vars = {
112
+ "KERNELBENCH_LEVEL": str(level),
113
+ "KERNELBENCH_PROBLEM_ID": str(problem_id),
114
+ "KERNELBENCH_EVAL_MODE": eval_mode,
115
+ "KERNELBENCH_GPU": gpu,
116
+ "KERNELBENCH_NUM_CORRECT_TRIALS": str(num_correct_trials),
117
+ "KERNELBENCH_NUM_PERF_TRIALS": str(num_perf_trials),
118
+ "KERNELBENCH_TIMEOUT": str(config.get("timeout", 300)),
119
+ }
120
+
121
+ mode_desc = "container" if use_docker else "native evaluator"
122
+ logger.info(f"Prepared evaluator environment for {mode_desc}:")
123
+ logger.info(f" KERNELBENCH_LEVEL={level}")
124
+ logger.info(f" KERNELBENCH_PROBLEM_ID={problem_id}")
125
+ logger.info(f" KERNELBENCH_EVAL_MODE={eval_mode}")
126
+ logger.info(f" KERNELBENCH_GPU={gpu}")
127
+
128
+ return BenchmarkResolution(
129
+ initial_program_path=str(initial_program_path),
130
+ evaluator_path=str(evaluator_path),
131
+ evaluator_env_vars=evaluator_env_vars,
132
+ )
133
+
134
+
135
+ # Module-level resolver instance
136
+ resolver = KernelBenchResolver()
benchmarks/math/README.md ADDED
@@ -0,0 +1,43 @@
1
+ # Math Benchmarks
2
+
3
+ Mathematical optimization and algorithm evolution problems.
4
+
5
+ ## Problems
6
+
7
+ ### Signal processing & geometry (from SkyDiscover demos)
8
+
9
+ - [signal_processing](signal_processing/) — Real-time adaptive filtering for non-stationary time series
10
+ - [circle_packing](circle_packing/) — Pack 26 circles in a unit square to maximize sum of radii (AlphaEvolve B.12)
11
+
12
+ ### AlphaEvolve mathematical problems
13
+
14
+ 12 problems from [AlphaEvolve Appendices A and B](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf). All evaluators are normalized to **maximize** the target metric.
15
+
16
+ **Appendix A:**
17
+ - [matmul](matmul/) — Faster algorithm for matrix multiplication (A)
18
+
19
+ **Appendix B:**
20
+ 1. [first_autocorr_ineq](first_autocorr_ineq/) — Upper bound on autoconvolution constant (B.1)
21
+ 2. [second_autocorr_ineq](second_autocorr_ineq/) — Lower bound on autoconvolution norm constant (B.2)
22
+ 3. [third_autocorr_ineq](third_autocorr_ineq/) — Upper bound on absolute autoconvolution constant (B.3)
23
+ 4. [uncertainty_ineq](uncertainty_ineq/) — Upper bound on Fourier uncertainty constant (B.4)
24
+ 5. [erdos_min_overlap](erdos_min_overlap/) — Upper bound on Erdos minimum overlap constant (B.5)
25
+ 6. [sums_diffs_finite_sets](sums_diffs_finite_sets/) — Lower bound on sums/differences of finite sets (B.6)
26
+ 7. [hexagon_packing](hexagon_packing/) — Pack unit hexagons in a regular hexagon, n=11,12 (B.7)
27
+ 8. [minimizing_max_min_dist](minimizing_max_min_dist/) — Minimize max/min distance ratio, n=16 d=2 and n=14 d=3 (B.8)
28
+ 9. [heilbronn_triangle](heilbronn_triangle/) — Heilbronn problem for triangles, n=11 (B.9)
29
+ 10. [heilbronn_convex](heilbronn_convex/) — Heilbronn problem for convex regions, n=13,14 (B.10)
30
+ 11. [circle_packing_rect](circle_packing_rect/) — Pack circles in a rectangle of perimeter 4 (B.13)
31
+
32
+ ## Run
33
+
34
+ ```bash
35
+ uv run skydiscover-run \
36
+ benchmarks/math/signal_processing/initial_program.py \
37
+ benchmarks/math/signal_processing/evaluator.py \
38
+ -c benchmarks/math/signal_processing/config.yaml \
39
+ -s [your_algorithm] \
40
+ -i 100
41
+ ```
42
+
43
+ Each problem directory contains `initial_program.py`, `evaluator.py`, and either `config.yaml` or per-search configs. Some multi-variant problems have numbered subdirectories (e.g., `heilbronn_convex/13/`, `hexagon_packing/11/`).
benchmarks/math/circle_packing/README.md ADDED
@@ -0,0 +1,38 @@
1
+ # Circle Packing
2
+
3
+ Pack 26 non-overlapping circles in a unit square to maximize the sum of their radii (AlphaEvolve B.12). Target: 2.635.
4
+
5
+ ## Problem
6
+
7
+ - Pack exactly 26 circles inside a unit square
8
+ - No circles may overlap
9
+ - Each circle must lie entirely within the square
10
+ - Maximize the sum of all radii
11
+
12
+ ## Run
13
+
14
+ ```bash
15
+ # From repo root
16
+ uv run skydiscover-run \
17
+ benchmarks/math/circle_packing/initial_program.py \
18
+ benchmarks/math/circle_packing/evaluator.py \
19
+ -c benchmarks/math/circle_packing/config.yaml \
20
+ -s [your_algorithm] \
21
+ -i 100
22
+ ```
23
+
24
+ A `codebase/reference/` directory is provided with geometric insights (hex grids, optimization patterns, packing strategies) that can be used with agentic mode (`--agentic`).
25
+
26
+ ## Scoring
27
+
28
+ - **combined_score**: `sum_of_radii / 2.635` (ratio to AlphaEvolve target)
29
+ - Evaluator validates no overlaps and boundary constraints
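The scoring rule above can be sketched in a few lines of plain Python. This is not the benchmark's `evaluator.py`: the function name, the tolerance, and returning `0.0` for an invalid packing are assumptions, and the check that exactly 26 circles are supplied is left out for brevity.

```python
import math

TARGET = 2.635  # AlphaEvolve target for the sum of radii


def score_packing(circles: list[tuple[float, float, float]], tol: float = 1e-9) -> float:
    """Return sum_of_radii / TARGET, or 0.0 if any constraint is violated.

    Each circle is (x, y, r) with its center inside the unit square.
    """
    for i, (x, y, r) in enumerate(circles):
        # Boundary containment: the whole circle must lie within [0, 1]^2.
        if r <= 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return 0.0
        # Non-overlap: center distance must be at least the sum of radii.
        for x2, y2, r2 in circles[i + 1:]:
            if math.hypot(x - x2, y - y2) < r + r2 - tol:
                return 0.0
    return sum(r for _, _, r in circles) / TARGET
```

For example, four circles of radius 0.25 centered at the quarter points of the square touch each other and the boundary without overlapping, so they pass both checks.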
30
+
31
+ ## Files
32
+
33
+ | File | Description |
34
+ |------|-------------|
35
+ | `initial_program.py` | Seed: simple ring-based circle arrangement |
36
+ | `evaluator.py` | Validates constraints, computes sum-of-radii ratio to target |
37
+ | `config.yaml` | LLM and evaluator settings |
38
+ | `codebase/reference/` | Geometric reference material for agentic mode |
benchmarks/math/circle_packing/codebase/reference/hex_grid.py ADDED
@@ -0,0 +1,43 @@
1
+ """
2
+ Hexagonal grid initialization for circle packing.
3
+
4
+ A hexagonal (offset) grid provides a good starting arrangement
5
+ because it's the densest regular packing pattern. Odd-indexed rows are
6
+ offset by half the spacing, which reduces wasted space.
7
+ """
8
+
9
+ import numpy as np
10
+
11
+
12
+ def hexagonal_grid(n, margin=0.1):
13
+ """
14
+ Generate n points on a hexagonal grid inside [margin, 1-margin]^2.
15
+
16
+ Args:
17
+ n: number of points to generate
18
+ margin: distance from edges to keep clear
19
+
20
+ Returns:
21
+ np.array of shape (n, 2) with (x, y) coordinates
22
+ """
23
+ usable = 1.0 - 2 * margin
24
+ cols = int(np.ceil(np.sqrt(n * 2 / np.sqrt(3))))
25
+ rows = int(np.ceil(n / cols))
26
+
27
+ dx = usable / max(cols - 1, 1)
28
+ dy = usable / max(rows - 1, 1)
29
+
30
+ points = []
31
+ for row in range(rows):
32
+ for col in range(cols):
33
+ if len(points) >= n:
34
+ break
35
+ x = margin + col * dx
36
+ if row % 2 == 1:
37
+ x += dx / 2 # offset for hex pattern
38
+ y = margin + row * dy
39
+ x = np.clip(x, margin, 1 - margin)
40
+ y = np.clip(y, margin, 1 - margin)
41
+ points.append([x, y])
42
+
43
+ return np.array(points[:n])
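Grid points like these only become a packing once radii are assigned. One simple option, sketched here in plain Python as an assumption rather than as part of the reference module, is to give every circle the largest common radius that keeps all circles inside the unit square and non-overlapping. The name `max_uniform_radius` is hypothetical.

```python
import math


def max_uniform_radius(points: list[tuple[float, float]]) -> float:
    """Largest radius r such that equal circles at `points` are feasible.

    r is limited by the closest distance from any center to the square's
    boundary and by half the smallest distance between two centers.
    """
    edge_limit = min(min(x, 1 - x, y, 1 - y) for x, y in points)
    pair_limit = min(
        (math.dist(p, q) / 2 for i, p in enumerate(points) for q in points[i + 1:]),
        default=math.inf,  # a single point has no pairwise limit
    )
    return min(edge_limit, pair_limit)
```

A uniform radius is usually far from optimal for this problem, but it gives a feasible starting point that a constrained optimizer can then improve.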
benchmarks/math/circle_packing/codebase/reference/optimization_patterns.py ADDED
@@ -0,0 +1,94 @@
+ """
+ Common patterns for constrained geometric optimization using scipy.
+
+ This module shows how to use scipy.optimize.minimize with inequality
+ constraints and the SLSQP solver. The pattern is useful for any problem
+ where you need to maximize/minimize an objective subject to geometric
+ constraints.
+ """
+
+ import numpy as np
+ from scipy.optimize import minimize
+
+
+ def example_constrained_optimization():
+     """
+     Template: pack n objects by optimizing positions + sizes jointly.
+
+     Decision vector: x = [pos_0, pos_1, ..., pos_{n-1}, size_0, ..., size_{n-1}]
+     Objective: maximize sum(sizes) => minimize -sum(sizes)
+     Constraints: non-overlap + boundary containment (all >= 0)
+     """
+     n = 10  # number of objects
+
+     # --- Objective: negative sum of sizes (we minimize, so negate to maximize) ---
+     def objective(x):
+         sizes = x[2 * n:]
+         return -np.sum(sizes)
+
+     # --- Constraints as a single function returning array of values >= 0 ---
+     def constraints_fn(x):
+         positions = x[:2 * n].reshape(n, 2)
+         sizes = x[2 * n:]
+
+         c = []
+         # Pairwise non-overlap: dist(i,j) - size_i - size_j >= 0
+         for i in range(n):
+             for j in range(i + 1, n):
+                 dist = np.linalg.norm(positions[i] - positions[j])
+                 c.append(dist - sizes[i] - sizes[j])
+
+         # Boundary: each object stays inside [0, 1] x [0, 1]
+         for i in range(n):
+             c.append(positions[i, 0] - sizes[i])      # left
+             c.append(1 - positions[i, 0] - sizes[i])  # right
+             c.append(positions[i, 1] - sizes[i])      # bottom
+             c.append(1 - positions[i, 1] - sizes[i])  # top
+
+         return np.array(c)
+
+     # --- Initial guess ---
+     x0_pos = np.random.rand(n, 2) * 0.6 + 0.2  # avoid edges
+     x0_sizes = np.full(n, 0.05)
+     x0 = np.concatenate([x0_pos.flatten(), x0_sizes])
+
+     # --- Bounds ---
+     pos_bounds = [(0, 1)] * (2 * n)
+     size_bounds = [(0.01, 0.25)] * n
+     bounds = pos_bounds + size_bounds
+
+     # --- Solve ---
+     result = minimize(
+         objective,
+         x0,
+         method="SLSQP",
+         bounds=bounds,
+         constraints={"type": "ineq", "fun": constraints_fn},
+         options={"maxiter": 1000, "ftol": 1e-9},
+     )
+
+     opt_positions = result.x[:2 * n].reshape(n, 2)
+     opt_sizes = result.x[2 * n:]
+     return opt_positions, opt_sizes, -result.fun  # return positive sum
+
+
+ def multi_start_optimization(objective, constraint_fn, bounds, n_starts=5):
+     """
+     Run SLSQP from multiple random starts and keep the best.
+
+     This helps escape local optima: the solver is gradient-based
+     and sensitive to the initial guess. Returns None if no start
+     converges.
+     """
+     best_result = None
+     for _ in range(n_starts):
+         x0 = np.array([np.random.uniform(lo, hi) for lo, hi in bounds])
+         result = minimize(
+             objective,
+             x0,
+             method="SLSQP",
+             bounds=bounds,
+             constraints={"type": "ineq", "fun": constraint_fn},
+             options={"maxiter": 500, "ftol": 1e-8},
+         )
+         # Skip runs that SLSQP reports as failed: their result.x may
+         # violate the constraints while still having a low objective.
+         if not result.success:
+             continue
+         if best_result is None or result.fun < best_result.fun:
+             best_result = result
+     return best_result
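Review note: the nested Python loops in `constraints_fn` are re-run on every solver iteration, which gets slow as n grows. A hedged sketch of a vectorized equivalent follows; `packing_constraints` is a hypothetical name, not part of this commit, and it is written to return the entries in the same order as the loop version (pairs in row-major order, then left/right/bottom/top per object).

```python
import numpy as np


def packing_constraints(x, n):
    """Vectorized constraint vector; every entry must be >= 0 for a valid packing."""
    positions = x[:2 * n].reshape(n, 2)
    sizes = x[2 * n:]

    # All index pairs (i, j) with i < j, in the same order as the nested loops
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(positions[i] - positions[j], axis=1)
    non_overlap = dist - sizes[i] - sizes[j]

    # Per-object boundary slack, ordered left, right, bottom, top
    boundary = np.column_stack([
        positions[:, 0] - sizes,
        1 - positions[:, 0] - sizes,
        positions[:, 1] - sizes,
        1 - positions[:, 1] - sizes,
    ]).ravel()

    return np.concatenate([non_overlap, boundary])


if __name__ == "__main__":
    # Two objects of size 0.2 centred at (0.25, 0.5) and (0.75, 0.5)
    x = np.array([0.25, 0.5, 0.75, 0.5, 0.2, 0.2])
    print(packing_constraints(x, n=2))
```

With scipy, the extra argument could be supplied through the constraint dictionary, for example `{"type": "ineq", "fun": packing_constraints, "args": (n,)}`.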
benchmarks/math/circle_packing/codebase/reference/packing_strategies.md ADDED
@@ -0,0 +1,45 @@
+ # Circle Packing Strategies for n=26 in a Unit Square
+
+ ## Key Insight
+ Naive geometric placement (rings, grids) gives sum_radii ~ 1.0.
+ Using numerical optimization (scipy.optimize) with proper constraint formulation
+ can push sum_radii above 2.5.
+
+ ## Why Optimization Works Better Than Manual Placement
+
+ Manual placement fixes circle positions, then computes maximum radii.
+ This leaves gaps because positions aren't optimized for the radii they produce.
+
+ **Joint optimization** treats both positions (x,y for each circle) AND radii
+ as decision variables, optimizing them simultaneously. This is the key insight.
+
+ Decision vector: [x0, y0, x1, y1, ..., x25, y25, r0, r1, ..., r25]
+ Total variables: 26*2 + 26 = 78
+
+ ## Constraint Formulation
+
+ 1. **Non-overlap**: For every pair (i,j): distance(center_i, center_j) >= r_i + r_j
+ 2. **Boundary**: For every circle i: x_i - r_i >= 0, x_i + r_i <= 1, y_i - r_i >= 0, y_i + r_i <= 1
+ 3. **Positive radii**: r_i > 0 for all i (use bounds, not constraints)
+
+ ## Recommended Solver
+
+ scipy.optimize.minimize with method="SLSQP":
+ - Handles inequality constraints natively
+ - Works with bounds on variables
+ - Good for smooth, continuous problems like circle packing
+ - Sensitive to initial guess: use multiple starts or a good heuristic
+
+ ## Initial Guess Strategy
+
+ A hexagonal grid initial guess works well:
+ - Place circles on offset rows (hex pattern)
+ - Start with equal small radii (e.g., 0.05)
+ - Let the optimizer adjust both positions and radii
+
+ ## Performance Tips
+
+ - Set maxiter=1000 or higher for 26 circles
+ - Use ftol=1e-8 or smaller for precise solutions
+ - Radii bounds: (0.01, 0.2) is a reasonable range for n=26
+ - The objective is -sum(radii) (minimize negative to maximize)
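Review note: for n=26 the formulation above has 78 variables, 26*25/2 = 325 non-overlap constraints and 4*26 = 104 boundary constraints, 429 inequalities in total. Because SLSQP only satisfies them up to a tolerance, it may be worth checking a returned packing independently. The sketch below is one way to do that; `problem_size`, `check_packing` and the default `tol` are hypothetical and not taken from the commit or its evaluator.

```python
import numpy as np


def problem_size(n):
    """Return (number of variables, number of inequality constraints) for n circles."""
    return 3 * n, n * (n - 1) // 2 + 4 * n


def check_packing(centers, radii, tol=1e-9):
    """Return (valid, worst_slack) for circles in the unit square.

    worst_slack is the smallest constraint value; negative means a violation.
    """
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)

    # Non-overlap slack for every pair i < j
    i, j = np.triu_indices(len(radii), k=1)
    pair_slack = np.linalg.norm(centers[i] - centers[j], axis=1) - radii[i] - radii[j]

    # Boundary slack: distance to the nearest wall minus the radius, per axis
    wall_slack = np.minimum(centers, 1.0 - centers) - radii[:, None]

    worst = min(pair_slack.min(initial=np.inf), wall_slack.min())
    valid = bool(worst >= -tol and radii.min() > 0)
    return valid, float(worst)


if __name__ == "__main__":
    print(problem_size(26))
    print(check_packing([[0.25, 0.5], [0.75, 0.5]], [0.25, 0.25]))
```

If a solver result fails this check by a small margin, shrinking all radii by the magnitude of the worst slack is one simple repair, at the cost of a slightly lower sum.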
benchmarks/math/circle_packing/config.yaml ADDED
@@ -0,0 +1,54 @@
+ # Math benchmark: circle_packing
+ # Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
+ language: python
+ diff_based_generation: true
+ max_iterations: 100
+ checkpoint_interval: 10
+ max_solution_length: 60000
+ llm:
+   api_base: https://api.openai.com/v1
+   models:
+     - name: "gpt-5"
+       weight: 1.0
+   max_tokens: 16384
+   timeout: 600
+ prompt:
+   system_message: 'You are an expert mathematician specializing in circle packing problems and computational geometry. Your
+     task is to improve a constructor function that directly produces a specific arrangement of 26 circles in a unit square,
+     maximizing the sum of their radii. The AlphaEvolve paper achieved a sum of 2.635 for n=26.
+
+
+     Key geometric insights:
+
+     - Circle packings often follow hexagonal patterns in the densest regions
+
+     - Maximum density for infinite circle packing is pi/(2*sqrt(3)) ≈ 0.9069
+
+     - Edge effects make square container packing harder than infinite packing
+
+     - Circles can be placed in layers or shells when confined to a square
+
+     - Similar radius circles often form regular patterns, while varied radii allow better space utilization
+
+     - Perfect symmetry may not yield the optimal packing due to edge effects
+
+
+     Focus on designing an explicit constructor that places each circle in a specific position, rather than an iterative search
+     algorithm.
+
+     '
+ evaluator:
+   timeout: 360
+   cascade_evaluation: true
+   cascade_thresholds:
+     - 0.3
+     - 0.6
+
+ # Live monitor dashboard
+ monitor:
+   enabled: true
+   port: 8765
+   host: "127.0.0.1"
+
+ # Human feedback
+ human_feedback_enabled: true
benchmarks/math/circle_packing/evaluator/Dockerfile ADDED
@@ -0,0 +1,11 @@
+ FROM python:3.12-slim
+ WORKDIR /benchmark
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY evaluator.py .
+ COPY evaluate.sh .
+ RUN chmod +x evaluate.sh
+
+ ENTRYPOINT ["./evaluate.sh"]