Modal Labs
Serverless cloud compute for AI inference, training, and agent workloads
Modal Labs is a serverless compute platform built for Python developers running AI workloads. You write Python functions, decorate them with Modal decorators, and Modal handles packaging, deployment, scaling, and GPU provisioning. It's used for model inference endpoints, fine-tuning jobs, batch processing pipelines, and any compute-heavy AI task where managing cloud infrastructure would be the bottleneck.
Modal Labs was founded in 2021 by Erik Bernhardsson and a team with backgrounds in infrastructure and machine learning. The product launched publicly in 2022 with a clear thesis: Python developers building AI applications should be able to run cloud compute without learning cloud infrastructure.
The bet was that the emerging class of AI developers writing Python wasn't the same audience as the infrastructure engineers who built their careers on AWS and Kubernetes. Writing YAML for container orchestration, managing instance reservations, and wiring up auto-scaling groups are skills that take years to develop. Modal's approach was to make all of that irrelevant by letting you express compute requirements in Python code.
How Modal works
The core concept is simple. You write a Python function. You decorate it with @app.function() and specify what compute it needs. You run modal run from your terminal. Modal packages your function and its dependencies into a container, runs it in the cloud, streams the output back to your terminal, and tears down the infrastructure when it's done.
There's no YAML. No Dockerfile is required (container images are defined in Python, though Modal also accepts a Dockerfile if you prefer). No SSH into instances. No managing clusters. The developer experience is meant to feel like running a local Python script that happens to have access to a lot of compute.
For AI workloads specifically, this matters a lot. Fine-tuning a model on 8 H100s for three hours is not something most developers want to set up manually in AWS. On Modal, it's a function call with gpu="H100" in the decorator.
Container images defined in Python
One of Modal's more distinctive design choices is how container environments are defined. Instead of a Dockerfile, you define your container image in Python code using Modal's image builder API.
This lets you specify pip packages, system packages, and custom build steps in code that lives alongside your application logic. The image definition is checked into your repository like any other code. Rebuilds happen automatically when the definition changes.
For Python developers unfamiliar with Docker, this is genuinely easier. For teams that do know Docker, Modal also accepts Dockerfiles if you prefer that workflow. The two approaches can coexist.
GPU access
GPU availability is one of Modal's clearer practical advantages over general-purpose serverless platforms. AWS Lambda doesn't support GPU instances. Modal does.
Available hardware includes A10G for smaller inference tasks, A100 in both 40GB and 80GB variants for larger models, L40S, and H100 for the most demanding workloads. You request the hardware you need in the function decorator. Modal handles provisioning and teardown.
The per-second billing model matters for GPU compute. An H100 costs about $6.67 per hour on Modal. For a fine-tuning job that runs for three hours on a single H100, that's a predictable $20 or so; the cost scales linearly with the number of GPUs. There's no minimum reservation and no idle cost if the job finishes early or if you don't run anything for a week. For research workloads that are sporadic by nature, this can be significantly cheaper than reserved cloud GPU instances.
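The arithmetic behind that estimate is simple enough to write down. The sketch below takes the approximate hourly rate quoted above as an assumption and ignores any CPU, memory, or storage charges; `job_cost` is a hypothetical helper, not part of any Modal API.

```python
# Approximate H100 rate quoted in the text; treat it as an assumption.
H100_USD_PER_HOUR = 6.67


def job_cost(seconds: float, gpus: int = 1,
             usd_per_gpu_hour: float = H100_USD_PER_HOUR) -> float:
    """GPU cost of a job billed per second, per GPU."""
    return seconds / 3600 * gpus * usd_per_gpu_hour


three_hours = 3 * 3600
print(f"1 x H100 for 3 h: ${job_cost(three_hours):.2f}")          # $20.01
print(f"8 x H100 for 3 h: ${job_cost(three_hours, gpus=8):.2f}")  # $160.08
# A job that finishes early is only billed for the seconds it used.
print(f"1 x H100 for 2 h 10 min: ${job_cost(2 * 3600 + 600):.2f}")
```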
Persistent volumes
Persistent volumes solve a recurring problem in AI infrastructure: model weights are large, loading them from scratch on every cold start is slow, but ephemeral function instances don't have anywhere to cache them.
Modal's persistent volumes are attached cloud storage that survives across function invocations. You download a model to a persistent volume once. Subsequent invocations in containers that have the volume attached start with the weights already on disk. This brings cold start times for inference workloads from tens of seconds (re-downloading weights every time) down to a few seconds (loading from the volume).
For open-weight models in the 7B-70B parameter range, this is a practical necessity. Downloading a 14GB model from Hugging Face on every cold start is not viable. With persistent volumes, the download happens once per volume, and the weights are available immediately on subsequent starts.
Autoscaling and concurrency
Modal endpoints autoscale based on traffic. During a traffic spike, Modal spins up additional containers to handle concurrent requests. During quiet periods, the replica count can drop to zero. When traffic resumes, Modal starts new containers on demand.
The autoscaling behavior is configurable. You can set minimum and maximum replica counts, specify concurrency per container, and control scaledown behavior. For latency-sensitive endpoints, keeping a minimum of one warm container eliminates cold starts for users.
The scale-to-zero behavior is what makes Modal cost-effective for applications with uneven traffic. A model inference endpoint that serves users during business hours and sits idle overnight doesn't accumulate GPU costs during the idle hours. The contrast with reserved compute, which you pay for whether you use it or not, is significant for applications at the smaller end of the production scale.
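To make the comparison concrete, here is a back-of-the-envelope sketch. It assumes a single A10G container at roughly $1.10 per hour (the approximate rate quoted elsewhere in this profile), eight active hours a day, and a 30-day month; it ignores cold starts, the scaledown delay, and non-GPU charges, and `monthly_cost` is a hypothetical helper.

```python
# Approximate A10G rate; an assumption for illustration.
A10G_USD_PER_HOUR = 1.10


def monthly_cost(active_hours_per_day: float, days: int = 30,
                 usd_per_hour: float = A10G_USD_PER_HOUR) -> float:
    """GPU cost for one container that is billed only while active."""
    return active_hours_per_day * days * usd_per_hour


always_on = monthly_cost(24)      # reserved-style: paid around the clock
business_hours = monthly_cost(8)  # scale-to-zero: paid only while serving

print(f"always on:       ${always_on:.2f}")
print(f"scale to zero:   ${business_hours:.2f}")
print(f"monthly savings: ${always_on - business_hours:.2f}")
```

Under these assumptions the idle sixteen hours a day account for two thirds of the always-on bill.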
Web endpoints and ASGI
Functions decorated with Modal's web endpoint decorator become HTTP APIs. You send a POST request to the endpoint URL, Modal routes it to a running container, and the function returns a JSON response.
Modal also supports ASGI apps, which means you can deploy a full FastAPI or Starlette application on Modal as if it were a normal web deployment. This is useful for teams that want the autoscaling and GPU access of Modal but also want to build a proper API layer with authentication, routing, and middleware.
The combination of ASGI support and GPU access makes Modal usable for building inference APIs without a separate hosting layer. A FastAPI app that calls a locally-loaded model, deployed on Modal, gives you a full production inference endpoint without a separate GPU server.
Cron jobs and batch workflows
Cron scheduling is built in. You decorate a function with a schedule, and Modal runs it on that schedule. This is the natural fit for batch processing jobs, periodic data pipelines, model retraining on a schedule, and any other task that needs to run regularly without a server sitting idle between runs.
For AI applications, common patterns include nightly batch inference runs, daily data preprocessing pipelines, scheduled evaluation jobs that run new model versions against test sets, and periodic cache warming for expensive computations.
Comparing Modal to managed AI platforms
Modal sits in a different category than platforms like AWS SageMaker or Google Vertex AI. Those platforms provide managed pipelines for ML workflows, model registries, experiment tracking, and productionization features. Modal is lower-level infrastructure.
The trade-off is flexibility. Modal lets you run any Python code with any dependencies on any hardware you specify. SageMaker and Vertex have more structure and more managed services, which is useful if your workflows fit their patterns and constraining if they don't.
For teams that want to move fast and control their own code without learning cloud-specific ML platform APIs, Modal's minimal-overhead model is often the right choice. For teams at larger scale who need managed pipelines, experiment tracking, and enterprise features, the managed platforms have capabilities Modal doesn't try to provide.
Open-source client
The Modal client library is open source at github.com/modal-labs/modal-client. The client handles communication with Modal's cloud infrastructure, container packaging, and the CLI. The server-side infrastructure that runs your code is not open source.
This is a different open-source posture than E2B, where the sandbox runtime is also open source. Modal's open client means you can audit and contribute to the developer tooling, but you're depending on Modal's hosted infrastructure to run workloads.
Getting started
The quickest path to running on Modal is installing the Python client, setting up an API token, and running a decorated function. The documentation covers a progression from simple CPU functions to GPU inference to persistent volume usage.
The free tier of $30/month is enough to run meaningful experiments. An A10G inference endpoint, which is sufficient for most 7B-13B parameter models, costs about $1.10 per hour. Thirty dollars buys about 27 hours of A10G time, more than enough for significant development work.
For production use, the main operational considerations are cold start latency for latency-sensitive endpoints (managed with minimum replica counts), persistent volume capacity planning (priced at $0.10/GB/month), and rate limits for scaling during traffic spikes.
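The budget figures above can be checked with a few lines of arithmetic. All three rates are the approximate numbers quoted in this section and should be treated as assumptions; the 14 GB figure is the example model size used earlier.

```python
# Approximate figures quoted in the text; assumptions, not a price list.
FREE_CREDIT_USD = 30.00
A10G_USD_PER_HOUR = 1.10
VOLUME_USD_PER_GB_MONTH = 0.10

# How many A10G hours the monthly free credit covers.
gpu_hours = FREE_CREDIT_USD / A10G_USD_PER_HOUR

# Monthly storage cost of keeping a 14 GB model on a persistent volume.
model_size_gb = 14
storage_cost = model_size_gb * VOLUME_USD_PER_GB_MONTH

print(f"A10G hours per ${FREE_CREDIT_USD:.0f} credit: {gpu_hours:.1f}")
print(f"Storing {model_size_gb} GB for a month: ${storage_cost:.2f}")
```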
Key features
- Run any Python function serverlessly without managing infrastructure
- GPU access including A10G, A100, H100, and L40S for inference and training
- Container images defined in code with pip packages and system dependencies
- Autoscaling from zero: no idle costs when your application isn't receiving traffic
- Persistent volumes for storing model weights and data across runs
- Cron scheduling for batch jobs and periodic tasks
- Web endpoints for exposing functions as HTTP APIs
- Secrets management for API keys and credentials
Pros and cons
Pros
- Zero infrastructure management: deploy a function in a single command
- GPU access without reserving instances or managing cluster config
- Autoscaling to zero means no costs when your endpoint is idle
- Container environments are defined in Python code, not YAML or Dockerfiles
- Cold starts are faster than on most serverless platforms for Python workloads
- Persistent volumes solve the model weight caching problem cleanly
Cons
- Vendor lock-in is real. Modal's decorator API is not portable to other platforms
- Very large GPU jobs with strict latency requirements may need reserved hardware
- Less mature than AWS/GCP equivalents for non-Python workloads
Who is Modal Labs for?
- Teams running model inference without DevOps overhead
- Researchers who need GPU bursts for experiments without reserved instances
- AI applications needing autoscaling endpoints with zero idle cost
- Batch processing pipelines that run on schedule or event trigger
Alternatives to Modal Labs
If Modal Labs isn't quite the right fit, the closest alternatives are E2B and Replit Agent. See our full Modal Labs alternatives page for side-by-side comparisons.