ML Engineer
Madfish
We are looking for a passionate ML Engineer to build AI solutions that advance our business goals. Our product, CLAi, automates a wide range of business operations, simple and complex alike, including document reading, data entry, CRM management, calendar scheduling, and automated appointment booking.
This role offers the opportunity to work on cutting-edge projects and collaborate with a team of talented researchers and engineers in a stimulating, dynamic environment.
Minimum qualifications (must have hands-on experience with all of the below):
- LLMs (generation, alignment, extraction)
- Supervised fine-tuning (SFT) on ≥7B models with PEFT (LoRA/QLoRA) and full fine-tunes (see the LoRA sketch after this list).
- Preference optimization (DPO/ORPO/PPO/RLAIF), rejection sampling, reward-model training.
- Constrained/structured decoding (regex/CFG/JSON-Schema), logit biasing, n-gram blocking, speculative decoding.
- Distillation/compression (teacher-student, pruning, quantization-aware training).
- RAG, modeling side: dual encoders, cross-encoders/rerankers (e.g., ColBERT/SPLADE), hard-negative mining.
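To give a concrete flavor of the SFT work above, here is a minimal LoRA setup sketch using Hugging Face `transformers` and `peft`; the base model name, target modules, and hyperparameters are illustrative assumptions, not our production configuration.

```python
# Minimal LoRA SFT setup sketch (model name and hyperparameters are
# illustrative assumptions, not a production config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)

# LoRA: train small low-rank adapters instead of all ~7B weights.
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attach to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically <1% of total params
```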
Speech (ASR/diarization/VAD; optional TTS):
- Fine-tuning Conformer/Transducer/CTC or Whisper/wav2vec2/HuBERT models on domain audio, including streaming/chunked inference.
- Robust segmentation (VAD), speaker diarization (x-vectors/ECAPA-TDNN), punctuation restoration, and inverse text normalization.
- Data augmentation (SpecAugment, speed/tempo perturbation, noise/reverb; see the sketch after this list), forced/CTC alignment, and lexicon handling.
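As a small illustration of the augmentation item above, a SpecAugment-style masking sketch with `torchaudio`; the mask sizes are illustrative assumptions that real pipelines tune per corpus.

```python
# SpecAugment-style frequency/time masking sketch with torchaudio
# (mask parameters are illustrative assumptions).
import torch
import torchaudio.transforms as T

mel = T.MelSpectrogram(sample_rate=16000, n_mels=80)
freq_mask = T.FrequencyMasking(freq_mask_param=27)  # mask up to 27 mel bins
time_mask = T.TimeMasking(time_mask_param=100)      # mask up to 100 frames

waveform = torch.randn(1, 16000)        # stand-in for 1 s of domain audio
spec = mel(waveform)                    # shape: (1, 80, frames)
augmented = time_mask(freq_mask(spec))  # apply both masks on the fly
```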
Data & evaluation:
- Large-scale corpus building: language ID, dedup/near-dedup (LSH/MinHash; see the sketch after this list), toxicity/PII filters, perplexity/quality filters.
- Golden sets & adversarial suites; metrics for WER/CER, entity F1, extraction validity, factuality/hallucination, and helpfulness/harmlessness.
- Reproducible experiments: seeds, checkpoints, ablations, learning-curve analysis, compute budgeting; crisp experiment reports.
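For the dedup item above, a minimal near-duplicate detection sketch using MinHash + LSH via the `datasketch` library; the similarity threshold and whitespace tokenization are illustrative assumptions.

```python
# Near-duplicate detection sketch with MinHash + LSH via `datasketch`
# (threshold and tokenization are illustrative assumptions).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # shingle/tokenize as needed
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% Jaccard similarity
docs = {"a": "the quick brown fox", "b": "the quick brown foxes"}
for key, text in docs.items():
    lsh.insert(key, minhash(text))

# Query candidate near-duplicates of a new document:
print(lsh.query(minhash("the quick brown fox jumps")))
```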
Systems for training (efficiency):
- Distributed/memory-efficient training with FSDP/DeepSpeed ZeRO, gradient checkpointing, mixed precision (see the sketch after this list), packing/bucketing, and sequence-length curricula.
- Dataset pipelines: HF Datasets, WebDataset, streaming Parquet/TFRecords; tokenizer optimization and dataset QA.
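A minimal sketch of two of the memory-saving techniques above, activation checkpointing and bf16 autocast, in plain PyTorch; the model and loss are placeholders, and FSDP/ZeRO wrapping is omitted for brevity.

```python
# Memory-efficient training loop sketch: activation checkpointing plus
# bf16 autocast (toy model and placeholder loss; FSDP/ZeRO omitted).
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Recompute activations during backward instead of storing them.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = torch.nn.Sequential(*[Block() for _ in range(4)])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024)
# CPU bf16 autocast so the sketch runs anywhere; on GPU use
# device_type="cuda" (and a GradScaler if training in fp16).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()  # placeholder loss
loss.backward()
opt.step()
```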
Core foundations:
- Solid math (linear algebra, probability, optimization) and ability to reason about loss design and bias/variance.
- Expert-level PyTorch (or JAX): custom modules/losses, profiling (cProfile/torch.profiler; see the sketch after this list), multi-GPU runs.
- Habit of rigorous evals with automated harnesses and regression gates.
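A minimal `torch.profiler` sketch for the profiling item above; the model and workload are stand-ins.

```python
# Profiling sketch with torch.profiler (CPU-only here so it runs
# anywhere; add ProfilerActivity.CUDA for GPU runs).
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024)  # stand-in workload
x = torch.randn(64, 1024)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Summarize the hottest ops by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```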
Nice to have:
- Preference optimization at scale (PPO/DPO), safety classifiers.
- Quantization (GPTQ/AWQ/INT8) with minimal quality loss (see the sketch after this list); ONNX/TensorRT; Triton/CUDA kernels.
- Multilingual modeling, phonetic/lexicon work for low-resource accents.
- Active learning & data programming (cleanlab/Snorkel).
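For the quantization item, a minimal sketch of loading a model in INT8 via bitsandbytes through `transformers` (one common route; GPTQ/AWQ use different tooling); the model name is an illustrative assumption and a CUDA GPU is required.

```python
# INT8 loading sketch via bitsandbytes through transformers
# (requires a CUDA GPU; model name is an illustrative assumption).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                           # hypothetical model
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",                                    # shard across GPUs
)
# Compare perplexity/task metrics against the fp16 baseline to verify
# "minimal quality loss" before shipping.
```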