Job Description
What You'll Build
Core Responsibilities
Data Architecture & Infrastructure (40%)
● Design and implement a multi-database architecture (MongoDB, Redis, Milvus, Neo4j, BigQuery)
● Build scalable data pipelines for real-time conversation processing and personalization
● Architect ETL/ELT workflows for data migration from legacy systems
● Implement data partitioning, sharding, and optimization strategies for high-throughput systems
● Create data governance frameworks ensuring quality, security, and compliance
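Sharding strategies like those above generally start from a deterministic routing function. A minimal sketch (function name and key format are illustrative, not from any specific system) that maps a record key to one of N shards via a stable hash:

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Route a record to a shard using a stable hash of its key.

    Python's built-in hash() is salted per process, so a digest is used
    instead to keep routing stable across restarts and machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same key always routes to the same shard:
assert shard_for_key("customer:42", 16) == shard_for_key("customer:42", 16)
```

Note that changing `num_shards` remaps most keys, which is why production systems often layer consistent hashing or a shard map on top of this basic scheme.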
Vector & Graph Database Systems (25%)
● Design and optimize Milvus vector collections for semantic search (1024-dim embeddings)
● Build graph schemas in Neo4j for customer journey mapping and persona relationships
● Implement HNSW indexing strategies and similarity search optimization
● Create hybrid search systems combining vector, full-text, and graph queries
● Monitor and tune database performance (query latency, throughput, resource utilization)
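Semantic search ultimately reduces to nearest-neighbor lookup over embedding vectors; an HNSW index approximates this at scale, trading exactness for speed. A brute-force sketch of the underlying operation (pure Python, illustrative only):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Exact nearest-neighbor search -- what HNSW approximates in sub-linear time."""
    scored = ((cosine_similarity(query, v), key) for key, v in vectors.items())
    return [key for _, key in sorted(scored, reverse=True)[:k]]
```

In Milvus the same query runs against a 1024-dimensional collection with an HNSW index, where parameters like `M` and `efConstruction` control the recall/latency trade-off.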
ML Data Infrastructure (20%)
● Build data collection pipelines for LLM fine-tuning (conversation logs, tool executions)
● Create feature stores for GNN training (customer interactions, engagement signals)
● Implement data versioning and lineage tracking for ML experiments
● Design A/B testing data infrastructure with CUPED variance reduction
● Build real-time feature computation pipelines for contextual bandits
Analytics & Monitoring (15%)
● Design BigQuery schemas for marketing analytics and performance tracking
● Create materialized views and aggregation pipelines for real-time dashboards
● Implement data quality monitoring and anomaly detection
● Build observability infrastructure (Prometheus metrics, Grafana dashboards)
● Develop cost optimization strategies for cloud data warehousing
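For data quality monitoring, a common first tripwire is a rolling z-score check on pipeline metrics (row counts, null rates, latencies). A minimal sketch, with window and threshold values chosen for illustration:

```python
def zscore_anomalies(values: list[float], window: int = 30,
                     threshold: float = 3.0) -> list[int]:
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(values)):
        hist = values[i - window:i]
        m = sum(hist) / window
        sd = (sum((x - m) ** 2 for x in hist) / window) ** 0.5
        if sd and abs(values[i] - m) / sd > threshold:
            anomalies.append(i)
    return anomalies
```

In practice this runs as a scheduled check over BigQuery aggregates, with flagged indices emitted as Prometheus metrics so Grafana can alert on them.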
Technical Stack You'll Work With
Databases & Storage
● MongoDB (conversation state, active sessions)
● Redis (caching, rate limiting, real-time data)
● Milvus (vector embeddings, semantic search)
● Neo4j (customer journey graphs, persona networks)
● BigQuery (analytics warehouse, historical data)
Data Processing & Orchestration
● Apache Airflow or Prefect (workflow orchestration)
● Pandas, Polars (data transformation)
● Apache Spark (optional, for large-scale processing)
● dbt (data transformation and modeling)
ML/AI Data Pipeline
● vLLM (LLM inference serving)
● MLflow (model registry, experiment tracking)
● Sentence Transformers (embedding generation)
● PyTorch, TensorFlow (ML model training)
Cloud & Infrastructure
● Google Cloud Platform (BigQuery, Cloud Storage, Compute)
● Docker & Kubernetes (containerization, orchestration)
● Terraform (infrastructure as code)
● GitHub Actions or GitLab CI (CI/CD pipelines)
Programming & Tools
● Python 3.10+ (primary language)
● SQL (complex queries, query optimization)
● Shell scripting (Bash/Zsh)
● Git (version control)
Requirements
Must-Have Skills
● 5+ years of data engineering experience with production systems
● Expert-level SQL and database design skills
● Strong Python programming (async/await, type hints, testing)
● Experience with at least 3 different database technologies (SQL, NoSQL, Vector, Graph)
● Proven track record building high-scale data pipelines (>1M records/day)
● Deep understanding of data modeling (dimensional, normalized, denormalized)
● Experience with cloud data warehouses (BigQuery, Redshift, or Snowflake)
● Strong knowledge of data quality, validation, and governance
● Excellent debugging and optimization skills
Highly Desirable
● Experience with vector databases (Milvus, Pinecone, Weaviate, Qdrant)
● Experience with graph databases (Neo4j, ArangoDB, Neptune)
● Knowledge of embedding models and semantic search
● Experience with ML data pipelines (feature stores, model training data)
● Understanding of A/B testing and experimental design
● Experience with real-time streaming (Kafka, Pub/Sub, Kinesis)
● Knowledge of LLMs and conversational AI systems
● Experience with data migration projects (especially large-scale)
● Background in marketing technology or customer data platforms
Nice-to-Have
● Experience with PyTorch Geometric or graph neural networks
● Knowledge of marketing analytics (attribution, segmentation, personalization)
● Familiarity with LangChain, LangGraph, or agent frameworks
● Experience with cost optimization in cloud environments
● Contributions to open-source data engineering projects
● Experience with data compliance (GDPR, CCPA)
Key Projects You'll Own
Phase 1: Foundation
● Migrate 10M+ conversation vectors from Pinecone to Milvus
● Design and implement MongoDB schemas for real-time agent state
● Set up Neo4j graph database with customer journey models
● Create BigQuery data warehouse with partitioned tables
Phase 2: Optimization
● Build automated data quality monitoring system
● Implement caching strategies (Redis) for 10x latency reduction
● Optimize vector search query latency
● Create real-time analytics dashboards (Grafana)
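The Redis caching work in Phase 2 follows the cache-aside pattern: check the cache, fall through to the database on a miss, and store the result with a TTL. A minimal in-process sketch of the pattern (a plain dict stands in for Redis so the example runs standalone):

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache-aside with TTL -- the role Redis plays in production,
    modeled as an in-process dict for illustration."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        entry = self.store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]                       # cache hit
        self.misses += 1
        value = compute()                         # fall through to the database
        self.store[key] = (time.monotonic(), value)
        return value
```

The latency win comes from the hit path skipping the expensive `compute()` call entirely; in Redis the TTL is set atomically with the write via `SET key value EX seconds`.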
Phase 3: ML Infrastructure
● Build LLM fine-tuning data pipeline
● Implement feature store for GNN training
● Create A/B testing data infrastructure
● Design multi-armed bandit state management
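The bandit state management in Phase 3 boils down to persisting per-arm pull counts and running mean rewards. A minimal epsilon-greedy sketch (the policy choice is illustrative; contextual bandits add per-context state on top of this):

```python
import random


class EpsilonGreedyBandit:
    """Minimal multi-armed bandit state: the counts and running means
    a Redis or MongoDB store would persist between decisions."""

    def __init__(self, arms: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}

    def select(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))    # explore
        return max(self.values, key=self.values.get)   # exploit best known arm

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        # Incremental mean: avoids storing the full reward history.
        self.values[arm] += (reward - self.values[arm]) / n
```

Because the update is an incremental mean, state stays O(arms) regardless of traffic volume, which is what makes real-time serving from a key-value store practical.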
Work Environment
● Collaborative team: Work with ML engineers, backend developers, and data scientists
● Modern stack: Latest technologies and tools
● Impact: Your work directly affects millions of marketing interactions
● Autonomy: Own your projects end-to-end
● Growth: Clear path to Senior/Lead/Principal roles