
[STF] Combine stackable_ctx and python support #8165

Draft
caugonnet wants to merge 1018 commits into NVIDIA:main from
caugonnet:stf_python_stackable_v2

Conversation

@caugonnet
Contributor

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Use regular stf_task_* functions for set_symbol/start/get/end/destroy
(only create and add_dep have stackable variants that take ctx).
Remove spurious ctx argument from stf_stackable_logical_data_set_symbol
and stf_stackable_logical_data_destroy.

Made-with: Cursor
is_invocable_v<Fun, host_launch_deps&> is true for unconstrained
generic lambdas (auto param), incorrectly routing them to the untyped
path. Use a private canary type to detect generic lambdas: if Fun also
accepts the canary, it is generic and should use the typed path.

This allows combining typed deps with add_deps while still correctly
dispatching lambdas that specifically take host_launch_deps&.

Made-with: Cursor
nvcc eagerly instantiates generic-lambda bodies during is_invocable_v
checks, causing hard errors when the lambda body uses members that
don't exist on host_launch_deps (e.g. data_handle()).

Use std::conjunction to short-circuit: is_invocable<Fun, host_launch_deps&>
is only instantiated when sizeof...(Deps) == 0, i.e. when deps are
added dynamically via add_deps() (the C/Python binding path).
When typed deps are present, the typed path is always used.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test 1948670

graph_ctx_node::finalize() was not updating the graph_ctx's cache_stats,
so CUDASTF_DISPLAY_GRAPH_STATS always showed zeros.

For non-nested graphs (top-level graph scopes): query nnodes/nedges and
track cache hit/miss like graph_ctx::instantiate() does.

For nested graphs (while/repeat conditional bodies): report the body
subgraph's node/edge counts. These are not independently cached so
instantiate/update counts remain at 0, which is correct.

Made-with: Cursor
Use index_copy_ in a pytorch_task to store per-iteration snapshots into
a pre-allocated buffer, instead of host_launch.  This avoids host
callback nodes which are not supported inside CUDA conditional graph
bodies, and allows promoting the outer Python for loop to ctx.repeat()
— making the entire solver fully graph-captured with 5 nesting levels.

Made-with: Cursor
caugonnet and others added 3 commits March 25, 2026 16:11
…alysis

Add test_burger_stackable_fast.py: same physics and 5-level graph nesting
as the PyTorch version, but with fused Numba @cuda.jit kernels and band
storage for the tridiagonal Jacobian. Includes GPU-side iteration counters
(Newton/CG) and bandwidth analysis showing 80% of peak at N=1M, and a
31x speedup from conditional CUDA graphs vs host-synced loops at N=100K.

Also fix stackable graph DOT dumping to honour CUDASTF_DUMP_GRAPHS
(same env var as graph_ctx) for both top-level and nested graphs, and
add missing logical_data.cuh include in host_launch_scope.cuh.

Made-with: Cursor
Use ctx.host_launch() instead of the task(exec_place.host()) +
stream synchronize + numba_arguments pattern for host-side data
reads in test_fhe.py and test_fhe_decorator.py.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test e2d225d

caugonnet and others added 2 commits March 25, 2026 18:48
Remove unused per_step_ms variable in test_burger_stackable_fast.py,
add a stream assertion in test_custom_exec_place duck-typed task test,
and use _t for the intentionally-unused variable in the error-path test.
Apply ruff-format line-length reformatting across touched files.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test 1ba3bd8


Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
caugonnet and others added 3 commits March 30, 2026 10:55
…ret_cast

Align with PR 8174 (stf_host_launch_untyped): change stf_host_launch_handle
and stf_host_launch_deps_handle from void* to opaque struct pointers for
type safety. Switch all C API handle casts in stf.cu from static_cast to
reinterpret_cast (host_launch, stackable, and task/logical_data handles).
Update Cython bindings accordingly. Restore documentation comments in
host_launch_scope.cuh and context.cuh.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test a2271de

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
caugonnet and others added 3 commits March 30, 2026 12:10
Complete the void* elimination in the C API: change stf_while_scope_handle
and stf_repeat_scope_handle from void* to opaque struct pointers, matching
the pattern used by all other STF handles. Update reinterpret_cast in stf.cu
and Cython typedefs accordingly.

Made-with: Cursor
Align with the restructuring in stf_c_api (PR 5315): STF Python source,
tests, and CI scripts now live under python/cuda_cccl_experimental/.
Apply stackable-specific additions (bindings, tests) on top of the
stf_c_api baseline in the new location, and remove the stale
python/cuda_cccl/cuda/stf/ tree.

Made-with: Cursor
@caugonnet
Contributor Author

/ok to test b8b7a0b

@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 47m: Pass: 97%/448 | Total: 13d 15h | Max: 2h 37m | Hits: 82%/514825

See results here.


Labels

stf Sequential Task Flow programming model

Projects

Status: In Progress


2 participants