Senior Data Engineer
Madfish
We are looking for a Senior Data Engineer to strengthen our data platform capabilities and support major financial institutions in preventing money laundering and fraud. In this role, you will help build and evolve resilient, governed, and high-quality data ecosystems that power advanced analytics and financial crime detection. You will shape and scale our Databricks + AWS lakehouse environment so that investigators, analysts, and product teams can identify suspicious activity and act with confidence.
You will play a key role in the engineering team, focusing on designing and delivering data solutions built on Databricks or Snowflake while ensuring reliability, performance, and compliance.
Requirements
• Deep expertise in SQL, with practical experience using Databricks, Snowflake, Python, and PySpark to design and deliver complex data engineering solutions.
• Strong background in building scalable and reusable data models, pipelines, and frameworks using Hadoop, Apache NiFi, and modern data lake architectures.
• Hands-on experience with orchestration tools such as Airflow (DAG design, task groups, sensors), Databricks Workflows, and AWS Step Functions (see the Airflow sketch after this list).
• Solid knowledge of the AWS data ecosystem: S3 layout patterns, IAM (least privilege), Glue Data Catalog, Lake Formation, networking (VPC, endpoints), encryption/KMS.
• Proficiency in CI/CD practices, including Git workflows, PR reviews, automated deployments for Databricks and AWS, and IaC (Terraform or CloudFormation).
• Familiarity with data governance and lineage tools (Unity Catalog, OpenLineage, Atlas, etc.), along with audit and compliance considerations (PII/PCI management, data retention).
• Demonstrated experience building, tuning, and running large-scale Spark/PySpark pipelines on Databricks (clusters, jobs, Delta Lake, Photon).
• Nice-to-have: experience with Python packaging, observability via OpenTelemetry, or exposure to Financial Crime / AML domains.
• Strong collaboration skills and prior experience working with stakeholders across engineering, analytics, and product to translate business needs into robust data platform solutions on AWS, Azure, and modern cloud warehouses.
• Proven ability to mentor and lead engineering teams, facilitating communication between technical and non-technical groups.
• Practical knowledge of cost optimization strategies (storage policies, right-sizing compute, caching, spot usage).
• Comfortable leading architectural discussions, guiding junior engineers, and working closely with data science, security, compliance, and product teams.
• Pragmatic, outcome-oriented mindset: emphasize resilience, automation, and clear documentation.
• Experience handling incidents, improving observability, and reducing mean time to recovery (MTTR).
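To give a flavor of the orchestration work mentioned above, here is a minimal Airflow sketch of a resilient daily DAG with retries, a sensor, and a task group. It assumes Airflow 2.4+ with the Amazon provider installed; the bucket, table, and `load_transactions` callable are illustrative placeholders, not part of our actual stack.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.utils.task_group import TaskGroup

default_args = {
    "owner": "data-platform",
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),
}

def load_transactions(ds: str, **_):
    """Idempotent load: (re)writes only the partition for the run date `ds`."""
    print(f"loading transactions for partition dt={ds}")

with DAG(
    dag_id="transactions_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Wait for the upstream file drop before starting the load.
    wait_for_file = S3KeySensor(
        task_id="wait_for_source_file",
        bucket_name="example-landing-bucket",          # placeholder bucket
        bucket_key="transactions/{{ ds }}/*.csv",
        wildcard_match=True,
        poke_interval=300,
        timeout=60 * 60,
    )

    with TaskGroup(group_id="bronze_ingest") as bronze_ingest:
        ingest = PythonOperator(
            task_id="load_transactions",
            python_callable=load_transactions,
        )

    wait_for_file >> bronze_ingest
```

Because each run writes only its own date partition, retries and backfills can safely re-execute without duplicating data, which is the idempotency pattern we care about.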
Responsibilities
• Own the architecture, build, optimization, and ongoing support of Spark / PySpark pipelines on Databricks (batch and streaming).
• Define and enforce lakehouse and Medallion architecture standards (bronze/silver/gold), including schema governance, lineage, quality SLAs, and cost controls (see the bronze-to-silver sketch after this list).
• Implement and maintain ingestion processes using Apache NiFi, APIs, and SFTP/FTPS, ensuring secure and repeatable onboarding of diverse datasets.
• Architect secure, compliant AWS data infrastructure (S3, IAM/KMS, Glue, Lake Formation, EC2/EKS, Lambda, Step Functions, CloudWatch, Secrets Manager).
• Develop orchestration workflows using Airflow, Databricks Workflows, and Step Functions, ensuring resilient DAG patterns (idempotency, retries, observability).
• Champion data quality and reliability using expectations, anomaly detection, reconciliation, and contract tests, and establish SLIs/SLOs with proper alerting (see the reconciliation sketch after this list).
• Embed and maintain lineage and metadata through Unity Catalog, Glue, or OpenLineage to support audits, impact assessments, and regulatory transparency.
• Drive CI/CD for data engineering assets: IaC deployment, test automation, versioning, environment promotion strategies.
• Mentor engineers on distributed data performance, partitioning, file organization, Delta Lake optimization, caching, and cost-performance trade-offs.
• Partner with data science, product, and compliance teams to translate analytical and detection requirements into production-ready data models and serving layers.
• Conduct thorough code reviews (SQL, PySpark, IaC templates) and lead architectural reviews and technical decision-making.
• Implement and maintain secure access controls, secret management, data masking/tokenization, and granular permissions.
• Support continuous improvement: backlog refinement, effort estimation, delivery tracking, and stakeholder demonstrations.
• Participate in incident response efforts, including root cause analysis (RCA), postmortems, and preventive engineering improvements.
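As a minimal sketch of the Medallion responsibilities above, the following PySpark snippet promotes raw bronze records into a cleaned, partitioned silver Delta table. Table names, paths, and columns are illustrative assumptions, not our actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_to_silver_example").getOrCreate()

# Bronze: raw transactions landed as-is (illustrative table name).
bronze = spark.read.table("bronze.transactions_raw")

# Silver: deduplicated, typed, and quality-filtered records.
silver = (
    bronze
    .dropDuplicates(["transaction_id"])                         # safe re-runs
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("event_date", F.to_date("event_ts"))
    .filter(F.col("transaction_id").isNotNull())
)

(
    silver.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("event_date")                                  # partition pruning
    .saveAsTable("silver.transactions")
)
```

In production this step would typically be incremental (for example a MERGE keyed on `transaction_id`) rather than a full overwrite; the sketch only illustrates the bronze-to-silver contract.

And as one way of expressing the data-quality expectations mentioned above, a simple reconciliation check can gate promotion by asserting key uniqueness and row-count drift against the source. Thresholds and table names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality_checks_example").getOrCreate()

silver = spark.read.table("silver.transactions")        # placeholder table

# Expectation 1: primary key must be unique and non-null.
total = silver.count()
distinct_ids = (
    silver.select("transaction_id")
    .where(F.col("transaction_id").isNotNull())
    .distinct()
    .count()
)
assert distinct_ids == total, f"duplicate or null transaction_id: {total - distinct_ids} rows"

# Expectation 2: reconcile row counts against the bronze source within a tolerance.
bronze_total = (
    spark.read.table("bronze.transactions_raw")
    .dropDuplicates(["transaction_id"])
    .count()
)
drift = abs(total - bronze_total) / max(bronze_total, 1)
assert drift <= 0.01, f"silver/bronze row-count drift {drift:.2%} exceeds 1% threshold"
```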
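In practice these checks would feed the SLIs/SLOs and alerting described above (for example via Delta Live Tables expectations or a framework such as Great Expectations) rather than bare asserts; the sketch only shows the reconciliation logic itself.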