
00·Inference

Built to think in real time.

Cinder is a serverless runtime for any model that fits on a GPU. Deploy with a single primitive. Scale down to zero between requests. Pay only when the model thinks.

Runtime: Python 3.10 to 3.13
Backends: vLLM · SGLang · Triton
Regions: us-east-1 · us-west-2 · eu-central-1 · ap-northeast-1
P50 latency: 28 ms
Cold start: 0.4 s
Throughput: 12k rps

Measured on Llama-3.1 8B across 28 days of production traffic. Hardware: H100 SXM5, 80 GB. Cinder telemetry, January 2026.

01·Runtime

One primitive that scales down to zero.

A Cinder function is a Python file with a decorator and a model definition. The runtime handles bin-packing, autoscaling, and cold starts. Nothing else.

a · snapshot

Snapshot cold starts

Model weights and Python state are restored from a CUDA memory snapshot. Cold starts take four-tenths of a second, regardless of model size, for models up to 70 B parameters.

b · batching

Continuous batching

Requests join an in-flight batch on arrival. P50 latency stays flat under load, then degrades gracefully past the GPU saturation point.
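To make "join an in-flight batch on arrival" concrete, here is a toy decode-loop simulation. It is a sketch for intuition only, not Cinder's scheduler or vLLM's: the `Request` type, the `simulate` function, and the fixed number of batch slots are all hypothetical, and every request is assumed to need at least one token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    arrival: int  # decode step at which the request arrives
    tokens: int   # tokens it needs to generate (assumed >= 1)


def simulate(requests, slots, continuous=True):
    """Return {rid: finish_step} for a toy decode loop with `slots` batch slots."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    waiting = deque()
    active = {}    # rid -> tokens still to generate
    finished = {}
    step = 0
    while pending or waiting or active:
        # Requests that have arrived by this step enter the wait queue.
        while pending and pending[0].arrival <= step:
            waiting.append(pending.popleft())
        # Continuous batching admits whenever a slot is free;
        # static batching admits only once the whole batch has drained.
        can_admit = continuous or not active
        while can_admit and waiting and len(active) < slots:
            r = waiting.popleft()
            active[r.rid] = r.tokens
        # One decode iteration: every active request emits one token.
        step += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished[rid] = step
                del active[rid]
    return finished


reqs = [Request(0, 0, 8), Request(1, 0, 2), Request(2, 1, 2)]
print(simulate(reqs, slots=2, continuous=True))
print(simulate(reqs, slots=2, continuous=False))
```

In this example the short request 2 finishes at step 4 when it can take the slot request 1 frees up, and at step 10 when it has to wait for the long request 0 to drain the batch.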

c · billing

Bin-packed billing

You pay for the milliseconds a GPU is reserved for your function. Idle replicas are reclaimed inside three seconds. No minimum spend.
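For readers unfamiliar with bin-packing, the sketch below shows a classic heuristic, first-fit decreasing, applied to GPU memory. It illustrates the general idea only: how Cinder actually places functions is not described here, and the function names and memory demands are made up for the example.

```python
def first_fit_decreasing(demands_gb, capacity_gb=80):
    """Pack memory demands onto GPUs using the first-fit-decreasing heuristic."""
    gpus = []  # names placed on each GPU
    free = []  # remaining GB on each GPU
    # Place the largest demands first; each goes on the first GPU with room.
    for name, gb in sorted(demands_gb.items(), key=lambda kv: kv[1], reverse=True):
        if gb > capacity_gb:
            raise ValueError(f"{name} needs {gb} GB, more than one GPU holds")
        for i, room in enumerate(free):
            if gb <= room:
                gpus[i].append(name)
                free[i] -= gb
                break
        else:
            # No existing GPU has room: open a new one.
            gpus.append([name])
            free.append(capacity_gb - gb)
    return gpus


# Hypothetical functions and their memory footprints in GB.
demands = {"chat": 40, "embed": 10, "rerank": 30, "vision": 50, "asr": 20}
print(first_fit_decreasing(demands))
# [['vision', 'rerank'], ['chat', 'asr', 'embed']]
```

Five functions totalling 150 GB fit on two 80 GB GPUs here; packed one function per GPU, the same workload would reserve five.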

02·Code

Three files. One deploy.

A quickstart that runs against a real model in production. Copy it, paste it, run cinder deploy.

~/reactor/main.py

build d3f1c8

import cinder
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-3.1-8B"  # Hugging Face repo id

app = cinder.App("reactor")
gpu = cinder.GPU("H100", memory=80)

@app.cls(gpu=gpu, scaledown=3)
class Reactor:
    @cinder.enter()
    def load(self):
        # Load the tokenizer and weights, then move the model onto the GPU.
        self.tok = AutoTokenizer.from_pretrained(MODEL_ID)
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto").to("cuda")

    @cinder.method()
    def think(self, prompt: str) -> str:
        # Input ids must live on the same device as the model.
        ids = self.tok.encode(prompt, return_tensors="pt").to("cuda")
        out = self.model.generate(ids, max_new_tokens=512)
        # out[0] holds the prompt tokens followed by the generated tokens.
        return self.tok.decode(out[0])

03·Pricing

Per-second, per-GPU. Stop at any time.

Three plans. No reserved-capacity contracts. The free tier runs on shared A10G; paid tiers run on dedicated H100. Bring your own VPC on Enterprise.

Free

$0/mo

$0.00012 / sec on shared A10G after 30 GPU-hours.

  • 30 GPU-hours / month
  • A10G shared pool
  • Community support
Start free

Reactor

$0.0004/sec

Dedicated H100. Bin-packed across your fleet.

  • H100 SXM5, 80 GB
  • Sub-second cold starts
  • Snapshot resume
  • Email support, four-hour response
Start a project

Enterprise

Talk · custom

Reserved capacity, BYO VPC, SLA on P99 latency.

  • Reserved H100 / H200 clusters
  • BYO VPC, BYO networking
  • Custom SLA on P99
  • Solutions engineer assigned
Talk to us
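As a rough guide to what the listed rates add up to, here is a small cost calculator. It is a sketch under two assumptions: that the Free plan's $0.00012 / sec applies only to usage beyond the 30 GPU-hour monthly allowance, and that usage is metered in GPU-seconds. The function names are illustrative, not part of any Cinder SDK.

```python
from decimal import Decimal

REACTOR_RATE = Decimal("0.0004")        # $ per GPU-second, dedicated H100
FREE_OVERAGE_RATE = Decimal("0.00012")  # $ per GPU-second, shared A10G
FREE_HOURS = 30                         # monthly allowance on the Free plan


def reactor_cost(gpu_seconds):
    """Cost of `gpu_seconds` of reserved H100 time on the Reactor plan."""
    return REACTOR_RATE * Decimal(gpu_seconds)


def free_tier_cost(gpu_seconds):
    """Cost on the Free plan, assuming only usage past the allowance is billed."""
    billable = max(Decimal(0), Decimal(gpu_seconds) - FREE_HOURS * 3600)
    return FREE_OVERAGE_RATE * billable


print(reactor_cost(3600))         # one H100-hour: 1.4400
print(free_tier_cost(10 * 3600))  # inside the allowance: 0.00000
print(free_tier_cost(40 * 3600))  # 10 hours past the allowance: 4.32000
```

At these rates a dedicated H100 works out to $1.44 per reserved hour, and shared A10G overage to about $0.43 per hour.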

04·Changelog

Shipped this quarter.

  1. Snapshot resume. Cold starts on Llama-3.1 8B down to 380 ms — a 64 % drop vs. cold-load. Available on all paid tiers from today.

  2. Continuous batching for vLLM. Throughput on H100 sustains 12 k rps for Llama-3.1 8B under saturating load.

  3. H200 in preview. 141 GB memory per GPU. Enterprise preview opens via solutions engineering.

  4. Region: Frankfurt. Cinder is now in eu-central-1 alongside us-east-1, us-west-2, and ap-northeast-1.