Flame: A Distributed Engine for AI


Flame is a distributed engine for AI, providing a comprehensive suite of mechanisms commonly required by various classes of AI workloads, including agents, reinforcement learning, and more. Built upon over a decade and a half of experience running diverse high-performance workloads at scale across multiple systems and platforms, Flame incorporates best-of-breed ideas and practices from the open source community.

Motivation

As AI workloads are increasingly adopted, a common runtime is essential to accelerate these elastic workloads along the following key dimensions:

  • Scale: Unlike applications confined to a single node, Flame scales workloads across multiple nodes to maximize performance while ensuring fair resource sharing across tenants and sessions.
  • Performance: Elastic workloads typically involve tens of thousands of short tasks. Flame leverages cutting-edge features to improve round-trip time and throughput in large-scale environments, and intelligently shares runtimes within a session to minimize startup time.
  • Security: Flame uses microVMs as runtimes for enhanced security, dedicating each runtime environment (executor) to a single session to prevent data leakage. All Flame components communicate over mTLS.
  • Flexibility: Flame defines a comprehensive set of general APIs to support multiple user scenarios, and supports applications in multiple programming languages through gRPC, including Rust, Go, and Python.

Performance

Flame is designed for high-throughput task execution. Here's a benchmark running 30,000 tasks on a single-node deployment:

root@06383dd94875:/# flmping -p -t 30000
Session <flmping-1N1sIX> was created in <1 ms>, start to run <30,000> tasks in the session:

============================================================
BENCHMARK RESULTS
============================================================
Duration:        3.29s
Succeeded:       30000/30000
Failed:          0
Throughput:      9124.09 tasks/sec
============================================================

root@06383dd94875:/# flmctl list -s
 ID              State   App      Slots  Priority  Pending  Running  Succeed  Failed  Created
 flmping-1N1sIX  Closed  flmping  1      0         0        0        30000    0       18:11:26
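The reported throughput follows directly from the run: succeeded tasks divided by wall-clock duration. A quick back-of-the-envelope check in Python, using the rounded 3.29 s shown above:

```python
# Sanity-check the benchmark arithmetic: throughput = succeeded tasks / duration.
tasks = 30_000
duration_s = 3.29  # rounded duration from the benchmark output

throughput = tasks / duration_s
print(f"{throughput:.2f} tasks/sec")
```

This prints roughly 9118.54 tasks/sec; the reported 9124.09 simply reflects the unrounded duration (about 3.288 s) before it was displayed as 3.29 s.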

Architecture Overview

Flame Architecture

Core Concepts

Session: A Session represents a group of related tasks. Clients can continuously create tasks until the session is closed.

Task: A task within a Session contains the main algorithm defined by the task's metadata and input/output information (e.g., volume paths).

Executor: The Executor manages the lifecycle of Applications/Services, which contain the user's code for executing tasks. Applications are typically not reused between sessions, though images may be reused to avoid repeated downloads.

Shim: The protocol implementation used by the Executor to manage applications, supporting various protocols such as gRPC, RESTful APIs, stdio, and more.
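The relationships among these concepts can be sketched as plain data types. This is only a conceptual illustration; the names and fields below are hypothetical and are not the Flame API:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEED = "Succeed"
    FAILED = "Failed"


@dataclass
class Task:
    """A unit of work within a Session, carrying input for the app's algorithm."""
    id: int
    input: bytes = b""
    state: TaskState = TaskState.PENDING


@dataclass
class Session:
    """A group of related tasks, bound to one application (e.g. "flmping")."""
    id: str
    application: str
    closed: bool = False
    tasks: list[Task] = field(default_factory=list)

    def create_task(self, input: bytes = b"") -> Task:
        # Clients may keep creating tasks until the session is closed.
        assert not self.closed, "cannot create tasks in a closed session"
        task = Task(id=len(self.tasks) + 1, input=input)
        self.tasks.append(task)
        return task
```

An Executor would then be bound to exactly one Session at a time and pull from its `tasks`, which is where the single-session isolation described above comes from.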

How It Works

Flame accepts connections from user clients and creates Sessions for jobs. Clients can continuously submit tasks to a session until it's closed, with no predefined replica requirements.

The Session Scheduler allocates resources to each session based on scheduling configurations by requesting the resource manager (e.g., Kubernetes) to launch executors.

Executors connect back to Flame via gRPC and pull tasks from their bound session, reusing the same executor across tasks. An executor is released and deleted when no tasks remain in its session.

Services receive notifications when they are bound to or unbound from a session, allowing them to take appropriate actions (e.g., connecting to a database). Services can then pull tasks from the session and reuse data to accelerate execution.

Future enhancements to the Session Scheduler will include features to improve performance and usage, such as proportional allocation, delayed release, and min/max constraints.
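The pull-based flow above can be sketched as a single in-process loop. This is a conceptual sketch, not the actual gRPC protocol; `run_executor` and its signature are made up for illustration:

```python
from collections import deque


def run_executor(session_tasks: deque) -> list:
    """Simulate one executor bound to a session: pull tasks until the
    session drains, then release the executor."""
    results = []
    # Bind hook: a real service might open database connections here.
    while session_tasks:
        task = session_tasks.popleft()    # pull the next task; the executor stays warm
        results.append(f"done:{task}")    # execute the task in the reused runtime
    # Unbind hook: close connections; the executor is then released/deleted.
    return results


print(run_executor(deque(["t1", "t2", "t3"])))
```

Because the executor pulls rather than being pushed to, there is no predefined replica count: clients keep adding tasks, and executors simply drain the queue until the session is empty.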

Quick Start Guide

Option 1: Docker Compose (Recommended for First-Time Users)

This option uses Docker Compose to start a local Flame cluster. After installing Docker Compose, run:

$ docker compose up -d

After the Flame cluster is launched, log into the flame-console container, which serves as a debug tool for both developers and SREs:

$ docker compose exec flame-console /bin/bash

Option 2: Local Installation with flmadm (Faster for Development)

For development and testing, you can install Flame directly on your machine using flmadm (requires Rust and uv):

# Build and install flmadm
$ cargo build --release -p flmadm
$ sudo install -m 755 target/release/flmadm /usr/local/bin/

# Install all components from local source and start services
$ sudo flmadm install --all --src-dir . --enable

# Add Flame binaries to PATH
$ source /usr/local/flame/sbin/flmenv.sh

For more details, see the flmadm README.

Verify the Installation

After starting Flame (via either option), verify the installation with flmping:

$ flmping
Session <flmping-Sf4R2o> was created in <1 ms>, start to run <10> tasks in the session:

 Session         Task  State    Output
 flmping-Sf4R2o  1     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  2     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  3     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  4     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  5     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  6     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  7     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  8     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  9     Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory
 flmping-Sf4R2o  10    Succeed  Completed on <396003ae48dd> in <0> milliseconds with <0> memory


<10> tasks was completed in <153 ms>.

You can check session status using flmctl; more examples are available in the repository:

$ flmctl list -s
 ID              State   App      Slots  Priority  Pending  Running  Succeed  Failed  Created
 flmping-Sf4R2o  Closed  flmping  1      0         0        0        10       0       13:33:30

CLI Tools

Flame provides two separate command-line tools:

  • flmctl: User-facing CLI for submitting jobs, managing sessions, and querying cluster state
  • flmadm: Administrator CLI for installing, configuring, and managing Flame clusters

Installing Flame with flmadm

For multi-node bare-metal or VM deployments, use flmadm to install Flame components on each node:

# On control plane node
sudo flmadm install --control-plane --enable

# On worker nodes
sudo flmadm install --worker --enable

# On cache nodes (optional, can be co-located with workers)
sudo flmadm install --cache --enable

# Or deploy all components on a single node
sudo flmadm install --all --enable

For more details, see the flmadm README.

Documentation

API Reference

Contributing

We welcome contributions through GitHub issues and pull requests.

License

This project is licensed under the terms specified in the LICENSE file.

