AI Engineer for LLM Ops & Evaluation (m/f/d)

Munich

About this opportunity

Company: Auxilius.ai
Location: Munich
Tags: IT

<p>You'll join an early-stage, AI-native startup with a product that has already proven market fit. We build cutting-edge AI solutions for Governance, Risk and Compliance (GRC) for enterprises around the world.</p>
<p>Our customers are auditors, risk managers, and compliance teams, which means evaluation rigor, auditability, and EU AI Act readiness aren't afterthoughts for us. They're product requirements.</p>
Tasks
<p>As our AI Engineer for LLMOps & Evaluation, you'll own the LLMOps pipeline end-to-end and work directly alongside our founding team.</p>
<p>You will:</p>
<ul>
<li>Own the LLMOps pipeline: Evaluation infrastructure, the prompt optimization loop, and the production integration that turns experiments into reliable customer-facing features</li>
<li>Design evaluation strategy per output type: Decide when to use deterministic evals (exact match, schema validation, embeddings) vs. LLM-as-judge, and build the rubrics, test datasets, and human-review loops that make the system trustworthy</li>
<li>Drive prompt engineering and optimization across all LLM operations in the product: Move from hand-tuned prompts to a measurable, iterative process</li>
<li>Pick the right tool for each problem: Some things are LLM problems, some are embedding + classical NLP problems, some are deterministic logic</li>
<li>Run the production side of AI features: Observability (Langfuse / LangSmith / similar), cost and latency engineering, incident response when an LLM feature degrades</li>
<li>Build human-in-the-loop workflows: Review queues, feedback ingestion, and labeling, so that production signals feed back into evals and prompt iteration</li>
<li>Mentor our AI & Analytics Intern and contribute to how we build the AI team over time</li>
</ul>
Requirements
<ul>
<li>3+ years of hands-on experience building and shipping ML/AI systems in production (we care more about what you've shipped than years on a CV)</li>
<li>You have shipped an LLM evaluation or prompt optimization pipeline: not just used LLMs in a project, but owned the loop</li>
<li>Strong hands-on experience with LLM-as-judge, including its variance problems and concrete techniques for controlling them</li>
<li>Solid foundation in classical NLP and ML ops: Embeddings, semantic similarity, entity matching, classification, fuzzy matching</li>
<li>Informed opinions on deterministic vs. LLM-based evals, from experience</li>
<li>Production judgment: You've owned cost and latency tradeoffs, observability, and incident response for an LLM-powered feature. You're familiar with prompt regression and have strategies for managing it</li>
<li>Strong Python</li>
<li>Excellent English communication, written and verbal: We discuss nuanced technical tradeoffs daily with the founding team and customers</li>
<li>Comfort with ambiguity: You can run experiments on real data, build intuition for this domain, and know when to stop iterating</li>
</ul>
<p><strong>Nice to have</strong></p>
<ul>
<li>Hands-on experience with LLM observability and eval tooling (Langfuse, LangSmith, Phoenix/Arize, Helicone, Braintrust, W&B)</li>
<li>Experience with DSPy or similar prompt optimization frameworks, and opinions on where they do and don't work</li>
<li>Experience with Azure OpenAI in EU regions, or with EU-sovereign providers (Mistral, Aleph Alpha)</li>
<li>Exposure to guardrails, content safety, or AI governance</li>
<li>Exposure to enterprise software, ideally GRC, compliance, audit, or regulated industries</li>
<li>Familiarity with Java/Spring Boot or Kubernetes on Azure, enough to integrate cleanly</li>
<li>German</li>
</ul>
Benefits
<ul>
<li>Hands-on ownership of a real AI product used by enterprise customers</li>
<li>Work directly alongside the founding team from day one</li>
<li>Hybrid work model: Munich North, minimum one day per week in the office, otherwise flexible (open to strong candidates elsewhere in the EU for the right fit); onboarding will take place in the office</li>
<li>A steep learning curve at the intersection of LLM engineering, enterprise GRC, and startup operations</li>
<li>The chance to shape the AI team as we grow</li>
</ul>
<p>Auxilius.ai is building AI-powered GRC solutions for enterprises. We're early-stage, fast-growing, and backed by real customers. Our tech stack includes Java & Spring Boot, Angular, Kubernetes on Azure, and OpenAI & Anthropic LLMs.</p>