Production ML API with 3× throughput and 50% lower latency
How we took machine learning models out of Jupyter notebooks and into a scalable, production-ready API service with measurable performance improvements.
The situation
A data analytics product had trained several ML models for predictive use cases — churn prediction, demand forecasting, and anomaly detection — but all lived in Jupyter notebooks run manually by a data scientist. When the product team wanted to surface predictions in the customer-facing dashboard, there was no API to call. When a model was retrained, the deployment process was "copy the pickle file to a folder and restart the script."
Inference latency on the ad-hoc script approach averaged 800ms per request, too slow for real-time dashboard widgets. Infrastructure was a single EC2 instance with no autoscaling. The team had no visibility into model performance drift over time.
What we built
FastAPI prediction service
A typed, documented REST API exposing one endpoint per model. Request validation, response schemas, and OpenAPI docs are generated automatically from the type definitions. Each model is loaded once at service startup and held in memory, so no request pays model-loading overhead.
Redis feature caching layer
The most expensive part of prediction is often feature lookup: querying a database for the latest customer behaviour data before running inference. We implemented a Redis cache with a 5-minute TTL on feature vectors, so repeated predictions for the same entity hit the cache rather than the database. In production, this eliminated the database round trip for ~70% of requests.
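A cache-aside lookup with a TTL, as described above, might look like the sketch below. The `features:<entity_id>` key format, the JSON encoding, and the `load_from_db` callable are assumptions for illustration. `InMemoryStore` is a small stand-in that mimics the two redis-py methods used here (`get` and `setex`), so the example runs without a Redis server; a real `redis.Redis` client could be passed in its place.

```python
import json
import time
from typing import Callable, Optional

FEATURE_TTL_SECONDS = 300  # the 5-minute TTL mentioned in the text


class InMemoryStore:
    """Stand-in exposing the same get/setex calls as a redis-py client."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[float, str]] = {}

    def get(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[name]  # expired entries behave like a miss
            return None
        return value

    def setex(self, name: str, ttl_seconds: int, value: str) -> None:
        self._data[name] = (time.monotonic() + ttl_seconds, value)


def get_features(
    store,
    entity_id: str,
    load_from_db: Callable[[str], list[float]],
) -> list[float]:
    """Return the feature vector for an entity, using the cache when possible."""
    key = f"features:{entity_id}"  # hypothetical key format
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database round trip
    features = load_from_db(entity_id)  # cache miss: query the database
    store.setex(key, FEATURE_TTL_SECONDS, json.dumps(features))
    return features
```

The TTL bounds how stale a cached feature vector can be: a longer TTL raises the hit rate, but predictions may then be based on older behaviour data.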
Batch inference endpoint
A secondary endpoint accepting arrays of prediction requests, processed in parallel across worker processes using Python's ProcessPoolExecutor. Clients that need predictions for many entities (e.g., nightly churn scoring for the full customer base) use this endpoint and get results in a single round trip, at significantly higher throughput than issuing one request per entity.
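The parallel scoring step behind a batch endpoint could be sketched like this. `score_one` is a hypothetical stand-in for a single inference call, and the `chunksize` default and worker count are illustrative assumptions, not values from the production service.

```python
from concurrent.futures import Executor, ProcessPoolExecutor


def score_one(features: list[float]) -> float:
    """Hypothetical stand-in for one model inference call.

    It is defined at module level so worker processes can import it by name.
    """
    return sum(features) / len(features) if features else 0.0


def predict_batch(
    rows: list[list[float]],
    executor: Executor,
    chunksize: int = 64,
) -> list[float]:
    """Score every row using the given executor; results keep input order."""
    # chunksize groups rows per task, which reduces inter-process overhead
    # when the executor is a ProcessPoolExecutor.
    return list(executor.map(score_one, rows, chunksize=chunksize))


def make_process_pool(workers: int) -> ProcessPoolExecutor:
    """Create the pool once (for example at service startup) and reuse it."""
    return ProcessPoolExecutor(max_workers=workers)
```

In a service, the pool from `make_process_pool` would typically be created once and passed to `predict_batch` for each batch request. On platforms that start workers by spawning, pool creation belongs under an `if __name__ == "__main__":` guard or in a startup hook.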
Autoscaling deployment on AWS ECS
Containerised with Docker and deployed to ECS Fargate with a target-tracking autoscaling policy based on CPU utilisation. The service scales from 1 to 8 containers in under 90 seconds, with no EC2 management required.
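A target-tracking policy of this kind is usually expressed as two Application Auto Scaling requests: one registering the ECS service as a scalable target, and one attaching the policy. The sketch below only builds the request parameters. The 1 and 8 capacity bounds come from the text, while the cluster and service names, the policy name, and the 60% CPU target are assumptions for illustration.

```python
def scalable_target(cluster: str, service: str) -> dict:
    """Parameters for registering an ECS service as a scalable target."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": 1,  # lower bound from the text
        "MaxCapacity": 8,  # upper bound from the text
    }


def cpu_target_tracking_policy(
    cluster: str,
    service: str,
    target_cpu_percent: float = 60.0,  # assumed target, not stated in the text
) -> dict:
    """Parameters for a target-tracking policy on average service CPU."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
        },
    }


# With boto3 installed and AWS credentials configured, these dicts could be
# applied roughly as follows (cluster and service names are hypothetical):
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target("ml-cluster", "prediction-api"))
#   client.put_scaling_policy(**cpu_target_tracking_policy("ml-cluster", "prediction-api"))
```

Target tracking then adds or removes tasks to keep average CPU near the target value, within the registered minimum and maximum.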
MLflow model registry
All model versions are tracked in MLflow. The API service loads the model marked as "Production" in the registry. Promoting a new model version to Production is a registry tag change — no redeployment required, and the running service picks up the new model within 60 seconds.
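One way a running service can pick up a newly promoted model without redeployment is to poll the registry and swap the in-memory model when the Production version changes. The sketch below assumes that approach. The `ProductionModelHolder` class, the two injected callables, and the polling interval are hypothetical; with MLflow, the callables might wrap `MlflowClient().get_latest_versions(name, stages=["Production"])` and `mlflow.pyfunc.load_model(f"models:/{name}/Production")`.

```python
import threading
from typing import Any, Callable, Optional


class ProductionModelHolder:
    """Holds the current Production model and swaps it when the version changes."""

    def __init__(
        self,
        get_production_version: Callable[[], str],
        load_model: Callable[[str], Any],
    ) -> None:
        self._get_version = get_production_version
        self._load_model = load_model
        self._lock = threading.Lock()
        self._version: Optional[str] = None
        self._model: Any = None

    def refresh(self) -> bool:
        """Reload if the registry's Production version changed; True on swap."""
        latest = self._get_version()
        with self._lock:
            if latest == self._version:
                return False
        # Load outside the lock so in-flight requests are not blocked.
        model = self._load_model(latest)
        with self._lock:
            self._model, self._version = model, latest
        return True

    def current(self) -> tuple[Optional[str], Any]:
        """Return (version, model) as one consistent pair."""
        with self._lock:
            return self._version, self._model


def start_polling(
    holder: ProductionModelHolder,
    interval_seconds: float,
    stop: threading.Event,
) -> threading.Thread:
    """Call holder.refresh() every interval_seconds until stop is set."""

    def loop() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            holder.refresh()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

With a polling interval below 60 seconds, a promotion in the registry would be reflected in the service within roughly one interval plus the model load time.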
Results
- ~50% reduction in model response latency (800ms average → ~380ms p95)
- 3× improvement in request handling throughput
- ~30% lower infrastructure overhead through autoscaling vs. always-on EC2
- Model deployment time reduced from hours of manual effort to a registry tag change
- Full observability — latency, error rate, and model version in production are now dashboarded