Modal Labs

Serverless cloud compute for AI inference, training, and agent workloads

Modal Labs is a serverless compute platform built for Python developers running AI workloads. You write Python functions, decorate them with Modal decorators, and Modal handles packaging, deployment, scaling, and GPU provisioning. It's used for model inference endpoints, fine-tuning jobs, batch processing pipelines, and any compute-heavy AI task where managing cloud infrastructure would be the bottleneck.

Modal Labs was founded in 2021 by Erik Bernhardsson and a team with backgrounds in infrastructure and machine learning. The product launched publicly in 2022 with a clear thesis: Python developers building AI applications should be able to run cloud compute without learning cloud infrastructure.

The bet was that the emerging class of AI developers writing Python wasn't the same audience as the infrastructure engineers who built their careers on AWS and Kubernetes. Writing YAML for container orchestration, managing instance reservations, and wiring up auto-scaling groups are skills that take years to develop. Modal's approach was to make all of that irrelevant by letting you express compute requirements in Python code.

The core concept is simple. You write a Python function. You decorate it with @app.function() and specify what compute it needs. You invoke the modal run command from your terminal. Modal packages your function and its dependencies into a container, runs it in the cloud, streams the output back to your terminal, and tears down the infrastructure when it's done.

There's no YAML. No Dockerfile is required: container images are defined in Python, though you can supply a Dockerfile if you prefer. No SSH into instances. No managing clusters. The developer experience is meant to feel like running a local Python script that happens to have access to a lot of compute.

For AI workloads specifically, this removes most of the setup work. Fine-tuning a model on 8 H100s for three hours is not something most developers want to set up manually in AWS. On Modal, it's a function call with a GPU request such as gpu="H100" in the decorator.

Container images defined in Python

One of Modal's more distinctive design choices is how container environments are defined. Instead of a Dockerfile, you define your container image in Python code using Modal's image builder API.

This lets you specify pip packages, system packages, and custom build steps in code that lives alongside your application logic. The image definition is checked into your repository like any other code. Rebuilds happen automatically when the definition changes.

For Python developers unfamiliar with Docker, this is genuinely easier. For teams that do know Docker, Modal also accepts Dockerfiles if you prefer that workflow. The two approaches can coexist.

GPU access

GPU availability is one of Modal's clearer practical advantages over general-purpose serverless platforms. AWS Lambda doesn't support GPU instances. Modal does.

Available hardware includes A10G for smaller inference tasks, A100 in both 40GB and 80GB variants for larger models, L40S, and H100 for the most demanding workloads. You request the hardware you need in the function decorator. Modal handles provisioning and teardown.

The per-second billing model matters for GPU compute. An H100 costs about $6.67 per hour on Modal. For a fine-tuning job that runs on a single H100 for three hours, that comes to roughly $20. There's no minimum reservation and no idle cost if the job finishes early or if you don't run anything for a week. For research workloads that are sporadic by nature, this is significantly cheaper than reserved cloud GPU instances.
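The arithmetic behind that figure is easy to check. The sketch below assumes the per-second H100 rate quoted in the pricing FAQ later in this article ($0.00185); job_cost is a hypothetical helper written for illustration, not part of Modal's client.

```python
# USD per second for one H100, as quoted in this article's pricing FAQ.
H100_PER_SECOND = 0.00185


def job_cost(rate_per_second: float, seconds: float, gpus: int = 1) -> float:
    """Cost of a job billed per second, with no minimum reservation."""
    return rate_per_second * seconds * gpus


hourly = job_cost(H100_PER_SECOND, 3600)                   # one GPU for one hour
three_hours = job_cost(H100_PER_SECOND, 3 * 3600)          # the three-hour job
eight_gpus = job_cost(H100_PER_SECOND, 3 * 3600, gpus=8)   # same job on 8 GPUs

print(f"1 x H100, 1 h: ${hourly:.2f}")
print(f"1 x H100, 3 h: ${three_hours:.2f}")
print(f"8 x H100, 3 h: ${eight_gpus:.2f}")
```

Because billing is per second, a job that finishes in two and a half hours is charged for two and a half hours; passing the actual runtime in seconds gives the real figure.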

Persistent volumes

Persistent volumes solve a recurring problem in AI infrastructure: model weights are large, loading them from scratch on every cold start is slow, but ephemeral function instances don't have anywhere to cache them.

Modal's persistent volumes are attached cloud storage that persists across function invocations. You download a model to a persistent volume once. Subsequent invocations on machines that have the volume attached start with the model already local. This brings cold start times for inference workloads from tens of seconds (re-downloading weights every time) down to a few seconds (loading from local volume).

For open-weight models in the 7B-70B parameter range, this is a practical necessity. Loading a 14GB model from Hugging Face on every cold start is not viable. With persistent volumes, the download happens once per volume, and the weights are available immediately on subsequent starts.
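The download-once pattern can be sketched without any Modal-specific code. In the example below a temporary directory stands in for the volume's mount point, and ensure_weights, fake_download, and the file name model.safetensors are all hypothetical names chosen for illustration. How a volume is created and attached to a function is left to Modal's documentation.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(volume_dir: Path, name: str,
                   download: Callable[[Path], None]) -> Path:
    """Return the cached weights path, downloading only if the file is missing."""
    target = volume_dir / name
    if not target.exists():
        volume_dir.mkdir(parents=True, exist_ok=True)
        partial = target.with_name(target.name + ".partial")
        download(partial)       # slow path: runs once per volume
        partial.rename(target)  # publish only after the download completed
    return target


calls = []


def fake_download(dest: Path) -> None:
    # Stand-in for a real download from a model hub.
    calls.append(dest)
    dest.write_bytes(b"fake weights")


with tempfile.TemporaryDirectory() as tmp:
    volume = Path(tmp) / "models"  # stands in for the mounted volume
    first = ensure_weights(volume, "model.safetensors", fake_download)   # cold
    second = ensure_weights(volume, "model.safetensors", fake_download)  # warm
    print(first == second, len(calls))
```

Writing to a .partial file and renaming it afterwards means an interrupted download is never mistaken for a complete set of weights on the next start.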

Autoscaling and concurrency

Modal endpoints autoscale based on traffic. During a traffic spike, Modal spins up additional containers to handle concurrent requests. During quiet periods, the replica count can drop to zero. When traffic resumes, Modal starts new containers on demand.

The autoscaling behavior is configurable. You can set minimum and maximum replica counts, specify concurrency per container, and control scaledown behavior. For latency-sensitive endpoints, keeping a minimum of one warm container eliminates cold starts for users.
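One way to picture how these settings interact is a toy model. The sketch below is not Modal's autoscaler and calls no Modal API; desired_containers and its parameter names are hypothetical. It only shows how a per-container concurrency limit and minimum and maximum bounds combine into a container count.

```python
import math


def desired_containers(in_flight: int, per_container: int,
                       min_containers: int = 0,
                       max_containers: int = 10) -> int:
    """Containers needed for the current load, clamped to the configured bounds."""
    needed = math.ceil(in_flight / per_container)
    return max(min_containers, min(max_containers, needed))


print(desired_containers(0, per_container=4))                    # idle: scale to zero
print(desired_containers(0, per_container=4, min_containers=1))  # idle: one kept warm
print(desired_containers(9, per_container=4))                    # 9 requests, 4 per container
print(desired_containers(1000, per_container=4))                 # spike: capped at the maximum
```

A minimum of one trades a small constant cost for the absence of cold starts; a maximum caps spend during a spike at the price of queued requests.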

The scale-to-zero behavior is what makes Modal cost-effective for applications with uneven traffic. A model inference endpoint that serves users during business hours and sits idle overnight doesn't accumulate GPU costs during the idle hours. The contrast with reserved compute, which you pay for whether you use it or not, is significant for applications at the smaller end of the production scale.
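A rough comparison makes the size of the gap concrete. The sketch below assumes the A10G per-second rate quoted in this article, a single container that is busy 8 hours a day on 22 working days, and an always-on alternative billed at the same rate for a 30-day month. Real reserved pricing will differ, so treat the output as an illustration of the idle share rather than a quote.

```python
A10G_PER_SECOND = 0.000306  # USD per second, as quoted in this article


def monthly_gpu_cost(rate_per_second: float, hours_per_day: float,
                     days: int) -> float:
    """Cost of keeping one GPU container running for the given hours."""
    return rate_per_second * 3600 * hours_per_day * days


# Assumed traffic pattern: busy 8 hours a day on 22 working days.
scale_to_zero = monthly_gpu_cost(A10G_PER_SECOND, hours_per_day=8, days=22)
# Assumed alternative: one GPU kept up around the clock for 30 days.
always_on = monthly_gpu_cost(A10G_PER_SECOND, hours_per_day=24, days=30)

print(f"scale to zero: ${scale_to_zero:.2f}")
print(f"always on:     ${always_on:.2f}")
print(f"idle share avoided: {1 - scale_to_zero / always_on:.0%}")
```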

Web endpoints and ASGI

Functions decorated with Modal's web endpoint decorator become HTTP APIs. You send a POST request to the endpoint URL, Modal routes it to a running container, and the function returns a JSON response.

Modal also supports ASGI apps, which means you can deploy a full FastAPI or Starlette application on Modal as if it were a normal web deployment. This is useful for teams that want the autoscaling and GPU access of Modal but also want to build a proper API layer with authentication, routing, and middleware.

The combination of ASGI support and GPU access makes Modal usable for building inference APIs without a separate hosting layer. A FastAPI app that calls a locally loaded model, deployed on Modal, gives you a full production inference endpoint on the same machine that holds the GPU.
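For readers who have not met ASGI directly, it is the async calling convention that FastAPI and Starlette implement. The sketch below is a bare ASGI application written with the standard library only, plus a small driver that plays the role of the server. The /predict path and the call helper are hypothetical, and how an ASGI app is registered with Modal is not shown here; take that from Modal's documentation.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Minimal ASGI application: answers every HTTP request with JSON."""
    assert scope["type"] == "http"
    body = json.dumps({"path": scope["path"], "ok": True}).encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})


async def call(path: str) -> list:
    """Drive the app the way an ASGI server would, collecting what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "POST", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(call("/predict"))
print(messages[0]["status"], messages[1]["body"].decode())
```

A framework such as FastAPI produces an object with this same three-argument signature, which is why a platform that accepts ASGI apps can host it unchanged.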

Cron jobs and batch workflows

Cron scheduling is built in. You decorate a function with a schedule, and Modal runs it on that schedule. This is the natural fit for batch processing jobs, periodic data pipelines, model retraining on a schedule, and any other task that needs to run regularly without a server sitting idle between runs.

For AI applications, common patterns include nightly batch inference runs, daily data preprocessing pipelines, scheduled evaluation jobs that run new model versions against test sets, and periodic cache warming for expensive computations.

Modal sits in a different category than platforms like AWS SageMaker or Google Vertex AI. Those platforms provide managed pipelines for ML workflows, model registries, experiment tracking, and productionization features. Modal is lower-level infrastructure.

The trade-off is flexibility. Modal lets you run any Python code with any dependencies on any hardware you specify. SageMaker and Vertex have more structure and more managed services, which is useful if your workflows fit their patterns and constraining if they don't.

For teams that want to move fast and control their own code without learning cloud-specific ML platform APIs, Modal's minimal-overhead model is often the right choice. For teams at larger scale who need managed pipelines, experiment tracking, and enterprise features, the managed platforms have capabilities Modal doesn't try to provide.

Open-source client

The Modal client library is open source at github.com/modal-labs/modal-client. The client handles communication with Modal's cloud infrastructure, container packaging, and the CLI. The server-side infrastructure that runs your code is not open source.

This is a different open-source posture than E2B, where the sandbox runtime is also open source. Modal's open client means you can audit and contribute to the developer tooling, but you're depending on Modal's hosted infrastructure to run workloads.

Getting started

The quickest path to running on Modal is installing the Python client, setting up an API token, and running a decorated function. The documentation covers a progression from simple CPU functions to GPU inference to persistent volume usage.

The free tier of $30/month is enough to run meaningful experiments. An A10G inference endpoint, which is sufficient for most 7B-13B parameter models, costs about $1.10 per hour. Thirty dollars buys about 27 hours of A10G time, more than enough for significant development work.

For production use, the main operational considerations are cold start latency for latency-sensitive endpoints (managed with minimum replica counts), persistent volume capacity planning (priced at $0.10/GB/month), and rate limits for scaling during traffic spikes.

Key features

Run any Python function serverlessly without managing infrastructure
GPU access including A10G, A100, H100, and L40S for inference and training
Container images defined in code with pip packages and system dependencies
Autoscaling from zero: no idle costs when your application isn't receiving traffic
Persistent volumes for storing model weights and data across runs
Cron scheduling for batch jobs and periodic tasks
Web endpoints for exposing functions as HTTP APIs
Secrets management for API keys and credentials

Pros and cons

Pros

+ Zero infrastructure management: deploy a function in a single command
+ GPU access without reserving instances or managing cluster config
+ Autoscaling to zero means no costs when your endpoint is idle
+ Container environments are defined in Python code, not YAML or Dockerfiles
+ Cold start times are faster than on most serverless platforms for Python workloads
+ Persistent volumes solve the model weight caching problem cleanly

Cons

− Vendor lock-in is real. Modal's decorator API is not portable to other platforms
− Very large GPU jobs with strict latency requirements may need reserved hardware
− Less mature than AWS/GCP equivalents for non-Python workloads

Who is Modal Labs for?

Teams running model inference without DevOps overhead
Researchers who need GPU bursts for experiments without reserved instances
AI applications needing autoscaling endpoints with zero idle cost
Batch processing pipelines that run on a schedule or an event trigger

Alternatives to Modal Labs

If Modal Labs isn't quite the right fit, the closest alternatives are e2b and replit-agent. See our full Modal Labs alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is Modal Labs?

Modal Labs is a serverless cloud compute platform designed for Python AI workloads. You write Python functions, use Modal's decorators to specify compute requirements (CPU, memory, GPU type), and run them with a single command. Modal handles container packaging, cloud provisioning, scaling, and teardown. It's used for model inference, fine-tuning jobs, data pipelines, and any task where you need cloud compute without managing infrastructure.

How does Modal pricing work?

Modal charges only for compute used. There are no platform fees or seat charges. CPU compute costs $0.000194 per vCPU-second. GPU pricing varies by hardware: A10G is $0.000306 per second, A100 40GB is $0.00064 per second, H100 is $0.00185 per second. Persistent volume storage is $0.10 per GB per month. The free tier includes $30 of compute each month, which is enough for development and moderate testing. Costs scale linearly with usage and drop to zero when idle.
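Per-second rates are hard to compare at a glance, so it can help to convert them. The sketch below takes the three GPU rates and the $30 free tier exactly as quoted above and derives an hourly price and the number of hours the monthly credit would cover. The helper names are hypothetical, and the rates should be checked against Modal's current pricing page before budgeting.

```python
FREE_TIER_USD = 30.0

# USD per second, as quoted in this article.
RATES = {
    "A10G": 0.000306,
    "A100 40GB": 0.00064,
    "H100": 0.00185,
}


def hourly(rate_per_second: float) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return rate_per_second * 3600


def free_tier_hours(rate_per_second: float,
                    credit: float = FREE_TIER_USD) -> float:
    """Hours of continuous use that the monthly credit pays for."""
    return credit / hourly(rate_per_second)


for gpu, rate in RATES.items():
    print(f"{gpu:<10} ${hourly(rate):.2f}/h  "
          f"{free_tier_hours(rate):.1f} h on the free tier")
```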

How does Modal compare to AWS Lambda?

AWS Lambda is a general-purpose serverless function runtime. Modal is specifically optimized for Python AI workloads. Key differences: Modal supports GPU instances natively, which Lambda does not. Modal's container images are defined in Python code, which is more natural for Python developers than Lambda's packaging model. Modal handles model weight caching via persistent volumes, which has no clean equivalent in Lambda. Cold start performance for Python with large dependencies is generally better on Modal. Lambda has a larger ecosystem of integrations and is more appropriate for event-driven applications in the AWS ecosystem.

Can Modal host model inference endpoints?

Yes. Modal's web endpoint feature lets you expose any Python function as an HTTP API. A common pattern is loading a model from a persistent volume or Hugging Face Hub into a Modal function, decorating it as a web endpoint, and deploying it. The endpoint autoscales from zero to handle load spikes and scales back down during quiet periods. For production inference endpoints that need to handle variable traffic without paying for idle GPU time, this is a cost-effective pattern.

Does Modal support fine-tuning?

Yes. Fine-tuning jobs are one of the core use cases. You write the training script in Python, specify a GPU type (A100 or H100 for large models), and run it on Modal. Persistent volumes let you store checkpoints and resume training without downloading weights repeatedly. The autoscaling model works well for batch training jobs that run once and terminate. Modal isn't a managed fine-tuning service with a UI. It's infrastructure for running your own training code.
