Built a fully automated pipeline that takes data files from multiple sources, validates and transforms them, and loads them into a database — replacing hours of manual work with a system that handles thousands of records in minutes, with zero human intervention.
Client: Enterprise Client
The challenge
The client received data exports from multiple external systems as CSV files — sometimes several times a day, sometimes in bulk batches of hundreds of files at once. Each source had its own format, column names, and quirks. The operations team was spending hours manually importing these into their database, fixing formatting issues, chasing missing fields, and dealing with errors that only surfaced days later.
When a large batch arrived (sometimes 200+ files in one go), the team would fall behind. Records got missed, duplicates crept in, and there was no clear audit trail of what had been processed and what hadn't. They needed a system that could handle any volume, from any source, without anyone having to touch it.
Architecture diagram: event-driven pipeline from file upload to database
Screenshot coming soon
What we built
We designed and built a cloud-native data pipeline that automatically detects new files the moment they arrive, validates the data, transforms it into a consistent format, and loads it into the database — all without human intervention.
The system is smart about scale. Small files are processed immediately and directly. Large files (over 50MB) are automatically split into chunks and processed in parallel across multiple workers, so even the biggest uploads complete in minutes rather than hours.
Each data source has its own dedicated processing path with custom field mappings, validation rules, and error handling. When something goes wrong — a malformed row, a missing required field, a file in an unexpected format — the system catches it, logs it, and alerts the team with a clear explanation of what failed and why. Failed records are held in a retry queue and automatically reprocessed once the issue is resolved.
The pipeline includes a concurrency controller that monitors system capacity in real time. During large batch uploads, it automatically throttles processing to prevent overload, then ramps back up when capacity is available. This means the system never crashes under load — it just queues work intelligently.
Every processing run generates a summary report showing exactly what was processed, what succeeded, and what needs attention. The operations team went from spending hours on manual imports to simply reviewing a daily summary email.
CloudWatch dashboard: real-time processing metrics and system health
Screenshot coming soon
Step Functions execution: parallel file processing with Distributed Map
Screenshot coming soon
Technical detail
This section is for readers with a technical background who want to understand the architecture and implementation choices.
The pipeline is built entirely on AWS serverless services, meaning there are no servers to manage and it scales automatically with demand.
Architecture: Files land in Amazon S3, which emits events to Amazon EventBridge. EventBridge routes each file to the correct SQS queue based on its source path. Each queue triggers a dedicated AWS Lambda function for that data source.
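To make the routing step concrete, here is a minimal sketch of registering an EventBridge rule that matches S3 "Object Created" events under one source's prefix and targets that source's SQS queue. The bucket name, prefix, queue ARN, and rule name are placeholders invented for illustration, not the client's actual configuration.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names for illustration only.
RULE_NAME = "route-vendor-a-files"
BUCKET = "example-ingest-bucket"
QUEUE_ARN = "arn:aws:sqs:eu-west-2:123456789012:vendor-a-ingest-queue"

# Match S3 "Object Created" events whose key sits under this source's prefix.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": [BUCKET]},
        "object": {"key": [{"prefix": "sources/vendor-a/"}]},
    },
}

events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Send matching events to the per-source SQS queue, which in turn triggers its Lambda.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "vendor-a-queue", "Arn": QUEUE_ARN}],
)
```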
Parallel Processing: For large files, AWS Step Functions orchestrates a Distributed Map workflow. The state machine reads the CSV directly from S3 (no intermediate parsing step), batches rows into groups of 1,000, and distributes them across up to 8 concurrent Lambda workers. Each worker uses 10 concurrent DynamoDB write threads for high throughput.
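A minimal sketch of what one Distributed Map worker might look like, assuming the state machine's item batcher delivers each group of rows under an "Items" key and the target table name arrives in a TARGET_TABLE environment variable; both are assumptions for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

TABLE_NAME = os.environ["TARGET_TABLE"]  # hypothetical env var
WRITE_THREADS = 10  # concurrent writer threads per worker


def write_chunk(rows):
    # boto3 resources aren't thread-safe, so each thread gets its own Table handle.
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    # batch_writer handles the 25-item BatchWriteItem limit and retries internally.
    with table.batch_writer() as writer:
        for row in rows:
            writer.put_item(Item=row)


def handler(event, context):
    # Assumption: the Distributed Map item batcher hands this worker its rows under "Items".
    items = event.get("Items", [])
    if not items:
        return {"written": 0}

    # Split the batch roughly evenly across the writer threads.
    chunk_size = -(-len(items) // WRITE_THREADS)  # ceiling division
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

    with ThreadPoolExecutor(max_workers=WRITE_THREADS) as pool:
        list(pool.map(write_chunk, chunks))

    return {"written": len(items)}
```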
Data Transformation: A column mapping engine transforms 50+ source fields to the target schema. Type conversion handles strings, numbers, booleans, dates, and decimal precision. Date normalisation converts multiple input formats to a consistent output. Scientific notation recovery fixes phone numbers corrupted by spreadsheet software.
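To show the kind of row-level work this involves, here is a simplified sketch of the mapping and clean-up logic. The column names, mapping table, and accepted date formats are invented for illustration; the real engine covers 50+ fields and more formats.

```python
from datetime import datetime
from decimal import Decimal

# Illustrative mapping of a few source columns to target fields (the real engine maps 50+).
COLUMN_MAP = {
    "Cust Ref": ("customer_id", "string"),
    "Order Total": ("order_total", "decimal"),
    "Order Date": ("order_date", "date"),
    "Contact No": ("phone", "phone"),
    "Active?": ("is_active", "bool"),
}

# Example input date formats, all normalised to ISO 8601 output.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d-%b-%y")


def normalise_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")


def recover_phone(value):
    # Spreadsheets often mangle "4412340000" into "4.41234E+9"; expand it back.
    value = value.strip()
    if "e" in value.lower():
        return str(int(Decimal(value)))
    return value


def transform_row(row):
    out = {}
    for source_col, (target_field, kind) in COLUMN_MAP.items():
        raw = row.get(source_col, "").strip()
        if not raw:
            continue
        if kind == "decimal":
            out[target_field] = Decimal(raw)
        elif kind == "date":
            out[target_field] = normalise_date(raw)
        elif kind == "phone":
            out[target_field] = recover_phone(raw)
        elif kind == "bool":
            out[target_field] = raw.lower() in ("y", "yes", "true", "1")
        else:
            out[target_field] = raw
    return out
```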
Concurrency Control: A Global Concurrency Controller Lambda monitors account-level concurrent executions via the Lambda API. When available capacity drops below a configurable threshold (default: 20 slots), it delays batch processing with a 60-second backoff. Custom CloudWatch metrics track system utilisation in real time.
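As a rough sketch of the throttling idea: read the account concurrency limit from the Lambda API, compare it against recent usage, and back off while free capacity sits below the threshold. How current usage is sampled here (the AWS/Lambda ConcurrentExecutions metric) is an assumption for this sketch, and the constants simply mirror the defaults mentioned above.

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

MIN_FREE_SLOTS = 20      # configurable threshold (default from the case study)
BACKOFF_SECONDS = 60     # delay applied when capacity is tight


def free_slots():
    # Account-wide concurrency limit from the Lambda API.
    limit = lambda_client.get_account_settings()["AccountLimit"]["ConcurrentExecutions"]

    # Recent peak concurrent executions (assumption: sampled from the standard
    # AWS/Lambda ConcurrentExecutions metric for this sketch).
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="ConcurrentExecutions",
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    in_use = max((p["Maximum"] for p in stats["Datapoints"]), default=0)
    return limit - int(in_use)


def wait_for_capacity():
    # Hold back batch processing until enough concurrency slots are free.
    while free_slots() < MIN_FREE_SLOTS:
        time.sleep(BACKOFF_SECONDS)
```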
Error Handling: Each queue has a dead-letter queue (DLQ) with a max receive count of 3. A DLQ reprocessor Lambda automatically retries failed messages with exponential backoff (5 min, 10 min, 20 min). After 3 retries, a permanent failure notification is sent via SNS to an error aggregation Lambda that produces consolidated human-readable summaries.
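A cut-down sketch of the reprocessor's retry logic, assuming the attempt count travels with each message as an SQS message attribute; queue URLs and attribute names are placeholders. Note that SQS caps per-message delay at 900 seconds, so the 20-minute step would need to be scheduled some other way in practice.

```python
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/vendor-a-dlq"       # placeholder
SOURCE_QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/vendor-a"  # placeholder
BACKOFF_MINUTES = [5, 10, 20]  # schedule from the case study
MAX_RETRIES = 3


def handler(event, context):
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=10,
        MessageAttributeNames=["All"],
    )
    for msg in resp.get("Messages", []):
        attrs = msg.get("MessageAttributes", {})
        attempt = int(attrs.get("retryAttempt", {}).get("StringValue", "0"))

        if attempt >= MAX_RETRIES:
            # Permanent failure: in the real pipeline an SNS notification is published here.
            continue

        # SQS caps DelaySeconds at 900, so longer waits are truncated in this sketch.
        delay = min(BACKOFF_MINUTES[attempt] * 60, 900)
        sqs.send_message(
            QueueUrl=SOURCE_QUEUE_URL,
            MessageBody=msg["Body"],
            DelaySeconds=delay,
            MessageAttributes={
                "retryAttempt": {"DataType": "Number", "StringValue": str(attempt + 1)},
            },
        )
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```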
Monitoring: CloudWatch alarms cover Lambda errors, queue depth, DLQ accumulation, DynamoDB throttling, and Step Functions failures. A processing dashboard shows real-time metrics including files processed, records imported, tables created, and error rates. Custom metric filters extract processing statistics from Lambda logs.
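As one example of the alarm coverage, the sketch below creates a CloudWatch alarm that fires whenever a dead-letter queue accumulates any messages. The queue name, alarm name, and SNS topic are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="vendor-a-dlq-not-empty",  # placeholder name
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "vendor-a-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-2:123456789012:pipeline-alerts"],  # placeholder topic
)
```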
Encryption: All data is encrypted at rest using AWS KMS customer-managed keys — separate keys for S3, SQS, DynamoDB, and CloudWatch Logs. All keys rotate automatically on an annual schedule.
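Creating one of these customer-managed keys and switching on automatic annual rotation is a short sequence of KMS calls; the description and alias below are placeholders.

```python
import boto3

kms = boto3.client("kms")

# Create a customer-managed key for one service's data (placeholder description/alias).
key = kms.create_key(Description="Pipeline S3 data key")
key_id = key["KeyMetadata"]["KeyId"]

kms.create_alias(AliasName="alias/pipeline-s3", TargetKeyId=key_id)

# Annual automatic rotation, as used for all keys in the pipeline.
kms.enable_key_rotation(KeyId=key_id)
```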
The results
Processing summary report: batch results and error breakdown
Screenshot coming soon
Interested in something similar?
Book a free 30-minute discovery call. We'll listen to what you need, tell you what's realistic, and give you a straight answer on whether we can help.