
[STF] Combine stackable_ctx and python support #8165

Draft
caugonnet wants to merge 1018 commits into NVIDIA:main from
caugonnet:stf_python_stackable_v2

Conversation

@caugonnet
Contributor

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Use regular stf_task_* functions for set_symbol/start/get/end/destroy
(only create and add_dep have stackable variants that take ctx).
Remove spurious ctx argument from stf_stackable_logical_data_set_symbol
and stf_stackable_logical_data_destroy.

Made-with: Cursor
is_invocable_v<Fun, host_launch_deps&> is true for unconstrained
generic lambdas (auto param), incorrectly routing them to the untyped
path. Use a private canary type to detect generic lambdas: if Fun also
accepts the canary, it is generic and should use the typed path.

This allows combining typed deps with add_deps while still correctly
dispatching lambdas that specifically take host_launch_deps&.

Made-with: Cursor
nvcc eagerly instantiates generic-lambda bodies during is_invocable_v
checks, causing hard errors when the lambda body uses members that
don't exist on host_launch_deps (e.g. data_handle()).

Use std::conjunction to short-circuit: is_invocable<Fun, host_launch_deps&>
is only instantiated when sizeof...(Deps) == 0, i.e. when deps are
added dynamically via add_deps() (the C/Python binding path).
When typed deps are present, the typed path is always used.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test 1948670

graph_ctx_node::finalize() was not updating the graph_ctx's cache_stats,
so CUDASTF_DISPLAY_GRAPH_STATS always showed zeros.

For non-nested graphs (top-level graph scopes): query nnodes/nedges and
track cache hit/miss like graph_ctx::instantiate() does.

For nested graphs (while/repeat conditional bodies): report the body
subgraph's node/edge counts. These are not independently cached so
instantiate/update counts remain at 0, which is correct.

Made-with: Cursor
Use index_copy_ in a pytorch_task to store per-iteration snapshots into
a pre-allocated buffer, instead of host_launch.  This avoids host
callback nodes which are not supported inside CUDA conditional graph
bodies, and allows promoting the outer Python for loop to ctx.repeat()
— making the entire solver fully graph-captured with 5 nesting levels.

Made-with: Cursor
caugonnet and others added 3 commits March 25, 2026 16:11
…alysis

Add test_burger_stackable_fast.py: same physics and 5-level graph nesting
as the PyTorch version, but with fused Numba @cuda.jit kernels and band
storage for the tridiagonal Jacobian. Includes GPU-side iteration counters
(Newton/CG) and bandwidth analysis showing 80% of peak at N=1M, and a
31x speedup from conditional CUDA graphs vs host-synced loops at N=100K.

Also fix stackable graph DOT dumping to honour CUDASTF_DUMP_GRAPHS
(same env var as graph_ctx) for both top-level and nested graphs, and
add missing logical_data.cuh include in host_launch_scope.cuh.

Made-with: Cursor
Use ctx.host_launch() instead of the task(exec_place.host()) +
stream synchronize + numba_arguments pattern for host-side data
reads in test_fhe.py and test_fhe_decorator.py.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test e2d225d

caugonnet and others added 2 commits March 25, 2026 18:48
Remove unused per_step_ms variable in test_burger_stackable_fast.py,
add a stream assertion in test_custom_exec_place duck-typed task test,
and use _t for the intentionally-unused variable in the error-path test.
Apply ruff-format line-length reformatting across touched files.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test 1ba3bd8


Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
caugonnet and others added 3 commits March 30, 2026 10:55
…ret_cast

Align with PR 8174 (stf_host_launch_untyped): change stf_host_launch_handle
and stf_host_launch_deps_handle from void* to opaque struct pointers for
type safety. Switch all C API handle casts in stf.cu from static_cast to
reinterpret_cast (host_launch, stackable, and task/logical_data handles).
Update Cython bindings accordingly. Restore documentation comments in
host_launch_scope.cuh and context.cuh.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test a2271de

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
caugonnet and others added 3 commits March 30, 2026 12:10
Complete the void* elimination in the C API: change stf_while_scope_handle
and stf_repeat_scope_handle from void* to opaque struct pointers, matching
the pattern used by all other STF handles. Update reinterpret_cast in stf.cu
and Cython typedefs accordingly.

Made-with: Cursor
Align with the restructuring in stf_c_api (PR 5315): STF Python source,
tests, and CI scripts now live under python/cuda_cccl_experimental/.
Apply stackable-specific additions (bindings, tests) on top of the
stf_c_api baseline in the new location, and remove the stale
python/cuda_cccl/cuda/stf/ tree.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test b8b7a0b

@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 47m: Pass: 97%/448 | Total: 13d 15h | Max: 2h 37m | Hits: 82%/514825

See results here.


Labels

stf Sequential Task Flow programming model

Projects

Status: In Progress


2 participants