Aditya Chaudhary

Hiii, I'm Aditya

I build and break Data Pipelines at Prod.

About

Data Engineer, building distributed systems and large-scale data infrastructure. I design multi-node Airflow clusters, architect HA databases with Patroni/Keepalived, and build parallel execution frameworks with Celery + RabbitMQ. Previously at Decentro (YC S20), where I engineered data archival pipelines at a huge scale (multiple TBs). I love solving distributed systems challenges and building cloud-native solutions that scale.

Work Experience

Cerebralzip Technologies

Jul 2025 - Present

Data Engineer

Designed and deployed a multi-node Apache Airflow cluster with distributed schedulers, metadata DB replication, monitoring, and self-healing.
Architected and deployed highly available APIs, databases, and Airflow services using Keepalived (VIPs), Patroni (PostgreSQL HA), and shared storage via GlusterFS/NFS.
Built a Celery + RabbitMQ distributed execution framework enabling parallel task execution.
Automated VM-level parallel workloads on AWS, significantly reducing execution time for compute-heavy pipelines.
Led API and infrastructure restructuring with Nginx load balancing and Prometheus–Grafana observability, improving reliability and latency.
Developed and optimized Airflow DAGs, AWS Lambda functions, and SQS workflows, achieving 40% performance improvement.
Executed cloud migrations with zero data loss and minimal downtime.

Decentro (YC S20)
YC S20

Apr 2025 - Jul 2025

Data Engineer Intern

Engineered robust archival and deletion pipelines for 21TB multi-product database using Polars and Apache Airflow, ensuring data governance compliance.
Successfully migrated archived data to S3 in Hive format and configured AWS Athena for efficient querying, reducing query costs by 60%.
Built analytical pipelines delivering actionable business insights through real-time visualization dashboards for stakeholder decision-making.

Cerebralzip Technologies

Jun 2024 - Jul 2025

Data Engineer Intern

Built and configured multi-node Cassandra cluster across EC2 instances enabling high availability, fault tolerance, and horizontal scalability.
Containerized entire technology stack using Docker, enabling consistent deployment environments and improved development workflow.
Deployed distributed services with Nginx reverse proxy, Route53 DNS management, and Azure DNS for global load distribution.
Orchestrated seamless cloud migrations with zero data loss and minimal service interruption.

Education

Amity University Haryana

2023 - 2025

MSc, Data Science

University of Delhi, Ramjas College

2020 - 2023

BSc, PCM

Check out my latest work

From distributed data pipelines to cloud infrastructure and open-source tools. Here are some highlights.

Data File Viewer

VS Code extension to view and explore binary data files directly in the editor. Supports 11 formats including pkl, h5, parquet, feather, joblib, npy, npz, msgpack, arrow, avro, nc, and mat files. Implemented a Python backend with isolated virtual environments for safe, on-demand data parsing. Optimized file loading to handle large datasets without editor freezes.

TypeScript

Python

VS Code API

Webpack

Source

AWS Terraform Multi-Environment Template

Production-ready Terraform template supporting dev, staging, and prod environments. Modular IaC architecture with reusable components for VPC, ECS, RDS, ALB, ECR, Route53, and remote state management. Implements multi-environment patterns using for_each loops and environment conditionals.

Terraform

AWS

VPC

ECS

RDS

ALB

ECR

Source

Parallelization Engine

Distributed parallelization engine using Docker, Celery, and RabbitMQ for scalable task execution. Enables dynamic worker scaling across multiple nodes for compute-intensive workloads. Focused on fault tolerance, task retries, and throughput optimization for real-world data pipelines.

Python

Celery

RabbitMQ

Docker

Redis

Source

Motor Vehicle Collision Analysis Pipeline

End-to-end ETL pipeline that processes traffic accident data to identify patterns and insights. Built with Apache Airflow for orchestration and Spark for large-scale data processing. Includes data visualization dashboards for exploring collision trends.

Python

Apache Airflow

Spark

Data Visualization

Source

Real Estate Analysis Pipeline

Data pipeline that aggregates property listing data to generate market insights. Uses DBT for data transformation and Snowflake for cloud data warehousing. Implements dimensional modeling for analytics queries.

Python

DBT

Snowflake

Data Modeling

Source

LinkedIn Network Analyzer

Tool that extracts and processes professional network data to uncover industry trends and connection patterns. Built with Selenium for web automation and MongoDB for storing extracted data.

Python

Selenium

BeautifulSoup

MongoDB

Source

Multi-Node Airflow Cluster

🚧 Under Development

Multi-node Apache Airflow cluster with distributed schedulers, metadata DB replication using Patroni, self-healing capabilities, and Prometheus-Grafana monitoring. Designed for high availability and fault tolerance. (Not publicly available)

Airflow

PostgreSQL

Patroni

Keepalived

GlusterFS

Prometheus

Source

Data Archival/Deletion Pipeline

🚧 Under Development

Large-scale archival and deletion pipelines for multi-product Cassandra database. Migrated archived data to Amazon S3 in Hive format, configured AWS Athena reducing query costs by 60%. Ensured data governance compliance throughout the archival process. (Not publicly available)

Python

Polars

Airflow

Athena

Cassandra

Source

High Availability Infrastructure

🚧 Under Development

Highly available APIs, databases, and Airflow services using Keepalived (VIPs), Patroni (PostgreSQL HA), and shared storage via GlusterFS/NFS. Nginx load balancing with Route53 and Azure DNS for global distribution. (Not publicly available)

Nginx

Patroni

Keepalived

PostgreSQL

Prometheus

Route53

Source

Skills

Python

SQL

Golang

AWS

GCP

Azure

Docker

Kubernetes

Terraform

Apache Airflow

Apache Spark

Apache Kafka

Hadoop

DBT

Snowflake

Databricks

PostgreSQL

MySQL

MongoDB

Cassandra

Elasticsearch

FastAPI

Flask

Celery

RabbitMQ

Nginx

GitHub Actions

Prometheus

Grafana

Tableau

Looker Studio

Contact

Get in Touch

Want to chat? Just shoot me a DM with a direct question on twitter and I'll respond whenever I can. I will ignore all soliciting.

About

Work Experience

Cerebralzip Technologies

Decentro (YC S20)YC S20

Cerebralzip Technologies

Education

Amity University Haryana

University of Delhi, Ramjas College

Check out my latest work

Data File Viewer

AWS Terraform Multi-Environment Template

Parallelization Engine

Motor Vehicle Collision Analysis Pipeline

Real Estate Analysis Pipeline

LinkedIn Network Analyzer

Multi-Node Airflow Cluster

Data Archival/Deletion Pipeline

High Availability Infrastructure

Skills

Get in Touch

Decentro (YC S20)
YC S20