[STF] Combine stackable_ctx and python support #8165
Draft
caugonnet wants to merge 1018 commits into NVIDIA:main from
Conversation
Use regular stf_task_* functions for set_symbol/start/get/end/destroy (only create and add_dep have stackable variants that take ctx). Remove spurious ctx argument from stf_stackable_logical_data_set_symbol and stf_stackable_logical_data_destroy. Made-with: Cursor
is_invocable_v<Fun, host_launch_deps&> is true for unconstrained generic lambdas (auto param), incorrectly routing them to the untyped path. Use a private canary type to detect generic lambdas: if Fun also accepts the canary, it is generic and should use the typed path. This allows combining typed deps with add_deps while still correctly dispatching lambdas that specifically take host_launch_deps&. Made-with: Cursor
nvcc eagerly instantiates generic-lambda bodies during is_invocable_v checks, causing hard errors when the lambda body uses members that don't exist on host_launch_deps (e.g. data_handle()). Use std::conjunction to short-circuit: is_invocable<Fun, host_launch_deps&> is only instantiated when sizeof...(Deps) == 0, i.e. when deps are added dynamically via add_deps() (the C/Python binding path). When typed deps are present, the typed path is always used. Made-with: Cursor
Contributor
Author
/ok to test 1948670
Made-with: Cursor
graph_ctx_node::finalize() was not updating the graph_ctx's cache_stats, so CUDASTF_DISPLAY_GRAPH_STATS always showed zeros. For non-nested graphs (top-level graph scopes): query nnodes/nedges and track cache hit/miss like graph_ctx::instantiate() does. For nested graphs (while/repeat conditional bodies): report the body subgraph's node/edge counts. These are not independently cached so instantiate/update counts remain at 0, which is correct. Made-with: Cursor
Use index_copy_ in a pytorch_task to store per-iteration snapshots into a pre-allocated buffer, instead of host_launch. This avoids host callback nodes which are not supported inside CUDA conditional graph bodies, and allows promoting the outer Python for loop to ctx.repeat() — making the entire solver fully graph-captured with 5 nesting levels. Made-with: Cursor
…alysis Add test_burger_stackable_fast.py: same physics and 5-level graph nesting as the PyTorch version, but with fused Numba @cuda.jit kernels and band storage for the tridiagonal Jacobian. Includes GPU-side iteration counters (Newton/CG) and bandwidth analysis showing 80% of peak at N=1M, and a 31x speedup from conditional CUDA graphs vs host-synced loops at N=100K. Also fix stackable graph DOT dumping to honour CUDASTF_DUMP_GRAPHS (same env var as graph_ctx) for both top-level and nested graphs, and add missing logical_data.cuh include in host_launch_scope.cuh. Made-with: Cursor
Use ctx.host_launch() instead of the task(exec_place.host()) + stream synchronize + numba_arguments pattern for host-side data reads in test_fhe.py and test_fhe_decorator.py. Made-with: Cursor
Contributor
Author
/ok to test e2d225d
Remove unused per_step_ms variable in test_burger_stackable_fast.py, add a stream assertion in test_custom_exec_place duck-typed task test, and use _t for the intentionally-unused variable in the error-path test. Apply ruff-format line-length reformatting across touched files. Made-with: Cursor
Contributor
Author
/ok to test 1ba3bd8
caugonnet
commented
Mar 30, 2026
…ret_cast Align with PR 8174 (stf_host_launch_untyped): change stf_host_launch_handle and stf_host_launch_deps_handle from void* to opaque struct pointers for type safety. Switch all C API handle casts in stf.cu from static_cast to reinterpret_cast (host_launch, stackable, and task/logical_data handles). Update Cython bindings accordingly. Restore documentation comments in host_launch_scope.cuh and context.cuh. Made-with: Cursor
Contributor
Author
/ok to test a2271de
caugonnet
commented
Mar 30, 2026
Complete the void* elimination in the C API: change stf_while_scope_handle and stf_repeat_scope_handle from void* to opaque struct pointers, matching the pattern used by all other STF handles. Update reinterpret_cast in stf.cu and Cython typedefs accordingly. Made-with: Cursor
Align with the restructuring in stf_c_api (PR 5315): STF Python source, tests, and CI scripts now live under python/cuda_cccl_experimental/. Apply stackable-specific additions (bindings, tests) on top of the stf_c_api baseline in the new location, and remove the stale python/cuda_cccl/cuda/stf/ tree. Made-with: Cursor
Contributor
Author
/ok to test b8b7a0b
Contributor
😬 CI Workflow Results
🟥 Finished in 5h 47m: Pass: 97%/448 | Total: 13d 15h | Max: 2h 37m | Hits: 82%/514825
See results here.
Description
closes
Checklist