Commit 62ba7ad

WIP
1 parent 3ec6c8e commit 62ba7ad

4 files changed: +55 −54 lines changed

docs/source/learn/xla-advanced.md

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,8 @@
-# Advanced PyTorch XLA
+# Advanced Topics in PyTorch XLA
 
-- HLO is fed to the XLA compiler
+## Compilation, caching and execution
+
+HLO is fed to the XLA compiler
 for compilation and optimization. Compilation is then cached by PyTorch
 XLA to be reused later if/when needed. The compilation of the graph is
 done on the host (CPU), which is the machine that runs the Python code.
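
The caching behaviour described in the hunk above can be made concrete with a
small, hypothetical sketch (illustrative only, not part of this commit; it
assumes the long-standing `xm.mark_step()` and `torch_xla.debug.metrics` APIs):

``` python
# Keep shapes and computation identical across steps so the cached
# compilation is reused instead of recompiling on the host every iteration.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
w = torch.randn(128, 128, device=device)

for step in range(5):
    x = torch.randn(64, 128, device=device)  # same shape every step
    y = (x @ w).relu().sum()                 # same computation every step
    xm.mark_step()  # cut the graph: compile on the first step, then reuse

# CompileTime in the report should show only a handful of compilations,
# not one per loop iteration, because the executable was cached.
print(met.metrics_report())
```
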
@@ -14,6 +16,8 @@ host does the compilation for XLA devices it is attached to. If SPMD is
 used, then the code is compiled only once (for given shapes and
 computations) on each host for all the devices.
 
+## Synchronous execution and blocking
+
 The *synchronous* operations in PyTorch XLA, like printing, logging,
 checkpointing or callbacks block tracing and result in slower execution.
 In the case when an operation requires a specific value of an XLA
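
The blocking behaviour this new section describes can be sketched roughly as
follows (a hypothetical example, not taken from the docs; it assumes the
`xm.xla_device()` API):

``` python
# Fetching a concrete value mid-trace forces an early compile-and-execute.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, 1024, device=device)
loss = (x @ x).sum()

# .item() needs the materialized value, so tracing stops here and the graph
# built so far is compiled and executed synchronously before Python continues.
print(loss.item())

# A cheaper pattern is to keep such fetches out of the hot loop, e.g. log the
# loss only every N steps so tracing is not interrupted on every iteration.
```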

docs/source/learn/xla-examples.md

Lines changed: 5 additions & 0 deletions
@@ -256,3 +256,8 @@ gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
   --worker=all \
   --command="python3 your_script.py"
 ```
+
+## Reference implementations
+
+The [AI-Hypercomputer/tpu-recipes](https://github.com/AI-Hypercomputer/tpu-recipes)
+repository contains examples for training and serving many LLM and diffusion models.

docs/source/learn/xla-overview.md

Lines changed: 44 additions & 5 deletions
@@ -9,7 +9,7 @@ models with minimal code changes from their existing PyTorch workflows.
 
 At its core, PyTorch/XLA acts as a bridge between the familiar PyTorch Python
 frontend and the XLA compiler. When you run PyTorch operations on XLA
-devices using this library, the following steps are performed:
+devices using this library, you get the following key features:
 
 1. **Lazy Evaluation**: Operations are not executed immediately. Instead,
    PyTorch/XLA records these operations in an intermediate representation (IR)
@@ -34,7 +34,7 @@ devices using this library, the following steps are performed:
 
 ![img](../_static/img/pytorchXLA_flow.svg)
 
-This process allows PyTorch/XLA to unlock significant performance benefits,
+This process allows PyTorch/XLA to provide significant performance benefits,
 especially for large models and distributed training scenarios. For a deeper
 dive into the lazy tensor system, see our
 [LazyTensor guide](https://pytorch.org/blog/understanding-lazytensor-system-performance-with-pytorch-xla-on-cloud-tpu/).
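
The lazy-evaluation flow summarized in this hunk can be illustrated with a
minimal, hypothetical sketch (not part of this commit; it assumes the
`torch_xla.core.xla_model` API):

``` python
# Operations on XLA tensors are only recorded; nothing runs on the device
# until the graph is cut and handed to the XLA compiler.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

a = torch.ones(2, 2, device=device)
b = torch.ones(2, 2, device=device)
c = a + b       # recorded in the IR graph, not executed yet

xm.mark_step()  # the recorded graph is lowered to HLO, compiled and executed

print(c)        # the materialized result is now available on the host
```
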
@@ -44,7 +44,7 @@ dive into the lazy tensor system, see our
 * **High Performance on TPUs**: PyTorch/XLA is optimized to deliver exceptional performance for training and inference on Google Cloud TPUs, which are custom-designed AI accelerators.
 * **Scalability**: Seamlessly scale your models from a single device to large TPU Pods with minimal code changes, enabling you to tackle more ambitious projects.
 * **Familiar PyTorch Experience**: Continue using the PyTorch APIs and ecosystem you know and love. PyTorch/XLA aims to make the transition to XLA devices as smooth as possible, often requiring only minor modifications to existing PyTorch code.
-* **Cost-Efficiency**: TPUs offer a compelling price/performance ratio for many AI workloads. PyTorch/XLA helps you harness this efficiency.
+* **Cost-Efficiency**: TPUs offer a compelling price/performance ratio for many AI workloads. PyTorch/XLA helps you take advantage of this efficiency.
 * **Versatility**: Accelerate a wide range of AI workloads, including chatbots, code generation, media content generation, vision services, and recommendation engines.
 * **Support for Leading Frameworks**: While focused on PyTorch, XLA itself is a compiler backend used by other major frameworks like JAX and TensorFlow.
 
@@ -55,10 +55,49 @@ While PyTorch/XLA can theoretically run on any XLA-compatible backend, its prima
 * **Google Cloud TPUs**: Including various generations like TPU v5 and v6. [Learn more about TPUs](https://cloud.google.com/tpu/docs/intro-to-tpu).
 * **GPUs via XLA**: PyTorch/XLA also supports running on NVIDIA GPUs through the OpenXLA PJRT plugin, providing an alternative execution path. [Learn more about GPUs on Google Cloud](https://cloud.google.com/compute/docs/gpus).
 
+## TPU Setup
+
+Create a TPU VM either with the base image, to use nightly wheels, or from a
+stable release, by specifying the `RUNTIME_VERSION`.
+
+``` bash
+export ZONE=us-central2-b
+export PROJECT_ID=your-project-id
+export ACCELERATOR_TYPE=v4-8 # v4-16, v4-32, …
+export RUNTIME_VERSION=tpu-vm-v4-pt-2.0 # or tpu-vm-v4-base
+export TPU_NAME=your_tpu_name
+
+gcloud compute tpus tpu-vm create ${TPU_NAME} \
+  --zone=${ZONE} \
+  --accelerator-type=${ACCELERATOR_TYPE} \
+  --version=${RUNTIME_VERSION} \
+  --subnetwork=tpusubnet
+```
+
+If you have a single-host VM (e.g. v4-8), you can SSH into the VM and run the
+following commands there directly. Otherwise, for TPU Pods, run the commands
+on all workers with `--worker=all --command=""`, similar to:
+
+``` bash
+gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
+  --zone=us-central2-b \
+  --worker=all \
+  --command="pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-nightly-cp38-cp38-linux_x86_64.whl"
+```
+
+Next, if you are using the base image, install the nightly packages and the
+required libraries:
+
+``` bash
+pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-nightly-cp38-cp38-linux_x86_64.whl
+pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp38-cp38-linux_x86_64.whl
+sudo apt-get install libopenblas-dev -y
+
+sudo apt-get update && sudo apt-get install libgl1 -y # diffusion specific
+```
+
 ## Next Steps
 
-- [Quickstart Guide](./xla-quickstart.md): Get started with PyTorch/XLA on Google Cloud TPUs.
 - [Examples](./xla-examples.md): Explore example code for training and inference on TPUs.
 - [Profiling and Performance](./xla-profiling.md): Learn how to profile and optimize your PyTorch/XLA applications.
 - [Advanced Topics](./xla-advanced.md): Dive deeper into advanced concepts like graph optimization, data loading, and distributed training with PyTorch/XLA.
-
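
As a quick, hypothetical sanity check after the setup commands added above (it
assumes the `xm.get_xla_supported_devices()` helper), you can confirm that
torch_xla sees the TPU devices:

``` python
# Print the XLA devices visible to this VM/worker; on a healthy TPU VM this
# should list one or more 'xla:N' devices rather than raising an import error.
import torch_xla.core.xla_model as xm

print(xm.get_xla_supported_devices())
```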

docs/source/learn/xla-quickstart.md

Lines changed: 0 additions & 47 deletions
This file was deleted.
