Senior Software Engineer — ML Data Platform

DuckDuckGoose

Software Engineering, Data Science
South Holland, Netherlands
Posted on Aug 16, 2025

Location: Delft (hybrid)

Type: Full-time

Start: ASAP

We protect citizens, enterprises, and governments from synthetic media fraud. Everything you see and hear online can now be manipulated; our job is to make sure people can trust what they see and hear. As part of our forensics platform team, you’ll work on the data backbone that makes large-scale detection possible, from ingestion and versioning to training, evaluation, and production.

You’ll join a small, senior team where your work will have immediate impact, and you’ll have ownership over the systems you build.

What You’ll Drive
  • Data platform architecture: Define unified schemas, lineage, and dataset versioning for large image/video + context data.
  • Ingestion at scale: Build reliable pipelines from research repos, APIs, and internal generators; automate connectors and jobs.
  • Quality & governance: Implement deduplication, validation, health dashboards, and drift/coverage checks with auditable lineage.
  • Curation & access: Deliver one-command dataset builds, deterministic splits, and fast sampling tools for training/eval (a deterministic-split sketch follows this list).
  • Performance & cost: Tune S3/object storage layouts, partitioning, and lifecycle policies for speed and spend.
  • Orchestration & ops: Productionize pipelines with CI/CD, containerization, scheduling/monitoring, and safe rollbacks.
  • Reliability & operations: Build for simplicity and observability; participate in a planned, compensated support rotation.
  • Engineering productivity: Create internal tools/CLIs, docs, and templates that make everyone faster.
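
To give a concrete flavor of what “deterministic splits” means here (an illustrative sketch only, not our internal code): one common approach is to assign each item to train/val/test by hashing a stable item ID, so every rerun and every machine produces the same split. All names below are hypothetical.

```python
import hashlib

def assign_split(item_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically assign an item to a train/val/test split.

    Hashing a stable ID (instead of random sampling) means the same
    item lands in the same split on every machine and every rerun.
    """
    # Map the ID to a roughly uniform number in [0, 1) via SHA-256.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64

    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

# Same input, same split, every time (hypothetical item ID):
assert assign_split("video_000123.mp4") == assign_split("video_000123.mp4")
```
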
Must-haves
  • Strong software engineering foundation: Master’s in Computer Science, Data Engineering, or a related field.
  • Production experience: 5–8+ years building and operating data platforms for large unstructured datasets (images/video).
  • Data lifecycle ownership: Ingest → validate → catalog → version → sample/serve → monitor.
  • Pipelines & orchestration: Experience with modern schedulers (e.g., Airflow/Prefect) and containerized jobs.
  • Storage & formats: Hands-on with object storage (e.g., S3), columnar formats/partitioning, and performance tuning.
  • Versioning & lineage: Experience with dataset versioning and reproducibility (e.g., DVC/lakeFS/Delta or equivalents).
  • Quality at scale: Deduplication, schema/label checks, and automated QC gates in CI (a QC-gate sketch follows this list).
  • Security & privacy: IAM, access controls, and privacy-aware workflows suitable for regulated customers.
  • Domain awareness: Familiarity with digital forensics, misinformation threats, or synthetic media — and willingness to deepen expertise.
  • Flexibility: Comfortable moving between data engineering, infra, and tooling tasks when needed.
  • Mindset & delivery: Thrive in a fast-moving environment; proactive problem-solver; ship, measure, simplify.
  • Communication: Excellent written and verbal skills; explain complex ideas clearly.
  • Independence: Deliver quality work on time without constant oversight.
  • Language: Fluent in English.
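
As an illustrative sketch of an “automated QC gate in CI” (hypothetical file layout and manifest schema, not our actual pipeline): such a gate is typically a script that exits nonzero when a check fails, which is enough to fail the CI job. Here, exact-duplicate media files are caught by content hash and records are validated against required fields.

```python
import hashlib
import json
import sys
from pathlib import Path

REQUIRED_FIELDS = {"item_id", "path", "label"}  # hypothetical manifest schema

def file_sha256(path: Path) -> str:
    """Content hash used to catch exact-duplicate media files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main(manifest_path: str) -> int:
    lines = Path(manifest_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    errors = []

    # Schema/label check: every record must carry the required fields.
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            errors.append(f"record {i}: missing fields {sorted(missing)}")

    # Dedup check: flag records whose file content is byte-identical.
    seen = {}
    for rec in records:
        path = Path(rec.get("path", ""))
        if not path.is_file():
            continue
        digest = file_sha256(path)
        if digest in seen:
            errors.append(f"{path} duplicates {seen[digest]}")
        else:
            seen[digest] = path

    for err in errors:
        print(f"QC FAIL: {err}", file=sys.stderr)
    return 1 if errors else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```
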
Nice-to-haves
  • Streaming & events: Kafka/Kinesis or similar for near-real-time ingestion.
  • Vector search: Experience with embedding stores or similarity search at scale.
  • Synthetic data: Building pipelines to generate/stress-test rare scenarios.
  • Cloud & on-prem: Terraform/CDK, Kubernetes, and hybrid/on-prem data deployments.
  • FinOps: Cost monitoring and optimization for data workloads.
  • Technical track record: Strong GitHub profile, open-source contributions, publications, patents, or public talks.
  • Leadership: Mentoring and guiding technical direction.
  • Dutch language: Fluency is a plus.
Key Deliverables (First 90 Days)
  • A unified schema + catalog with key datasets onboarded, versioned, and reproducibly built via one command (a minimal build sketch follows this list).
  • Automated QC gates (dedup/validation) with a red/amber/green dataset health dashboard and clear lineage.
  • Fast sampling/curation tools for the ML team, plus cost controls (storage layouts, lifecycle policies) in place.
  • Data migration: Inventory and migrate existing/legacy datasets into the new platform; reformat to the new schema, backfill metadata, validate checksums/lineage, and deprecate legacy paths with a rollback plan.
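
To make “reproducibly built via one command” concrete (again an illustrative sketch; the tool, names, and flags below are hypothetical): the deliverable usually reduces to a single entry point that pins a dataset version and writes a manifest recording exactly what was built.

```python
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path

def build_dataset(name: str, version: str, out_dir: Path) -> None:
    """Materialize a pinned dataset version and record its lineage.

    A real platform would pull versioned data from object storage here;
    this sketch only shows the shape of a reproducible, pinned build.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "dataset": name,
        "version": version,  # pinned, so reruns describe identical inputs
        "built_at": datetime.now(timezone.utc).isoformat(),
    }
    (out_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))

if __name__ == "__main__":
    # One command, e.g.: python build.py faces-v3 --version 2025.08.1
    parser = argparse.ArgumentParser()
    parser.add_argument("name")
    parser.add_argument("--version", required=True)
    parser.add_argument("--out", default="build")
    args = parser.parse_args()
    build_dataset(args.name, args.version, Path(args.out) / args.name / args.version)
```
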
Compensation & benefits
  • Own the backbone: Define schemas, lineage, and dataset versioning used across research and production.
  • Company participation: Meaningful equity/virtual shares aligned with company growth.
  • Flexible work: Hybrid (Delft), flexible hours, minimal ceremony, async-first collaboration.
  • Data platform mandate: Real say in stack choices (orchestration, catalog, storage/layout) and time to implement them right.
  • Repro & auditability: Space to enforce deterministic builds, splits, and traceable lineage—no heroics needed.
  • Quality culture: Backing to implement dedup, drift/coverage checks, and dataset health dashboards org-wide.
  • FinOps mindset: Budget and support to balance speed, reliability, and total cost.
  • Pragmatic on-call: Planned, compensated rotation with automation-first recovery and rollback plans.
  • Growth path: IC track to Staff/Principal; opportunities to mentor and codify data standards.
  • Learning budget: Annual budget for courses/books + two data/ML-infra conferences per year.
  • Home office: Modest stipend for an ergonomic setup; commuting support (public transport or mileage).
  • Relocation + visa: Visa sponsorship and relocation support for internationals.

Join us and be part of a company committed to creating a more secure and trustworthy digital future. Apply today to become part of our mission-driven team!