Posted May 29, 2026

AI Model Evaluator (LLM & Agent Systems)

Job Title: AI Model Evaluator (LLM & Agent Systems)
Job Type: Contract (Minimum 2 weeks, with potential extension)
Location: Remote

Job Summary:
Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll help benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive improvements. This position is ideal for analytical professionals passionate about AI quality and real-world impact.

Key Responsibilities:
• Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics
• Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
• Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
• Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
• Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
• Work collaboratively to adapt to evolving scenarios and ambiguous evaluation situations
• Document findings and communicate insights clearly both in writing and verbally to relevant stakeholders

Required Skills and Qualifications:
• Demonstrated experience with LLM evaluation, AI output analysis, QA/testing, UX research, or similar analytical roles
• Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
• Exceptional attention to detail and sound judgement in ambiguous or edge-case scenarios
• Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
• Ability to adapt quickly to evolving guidelines and work independently
• Comfort with remote work and a commitment of at least 20 hours per week for the initial term
• Analytical mindset with a focus on actionable, qualitative feedback

Preferred Qualifications:
• Experience with RLHF, annotation workflows, or AI benchmarking frameworks
• Familiarity with autonomous agent systems or workflow automation tools
• Background in mobile apps or digital product evaluation processes

Required Skills
• LLMs
• Generative AI
• AI Model Evaluation
• AI Benchmarking
• AI Quality Assessment
• Model Performance Evaluation
• Prompt Response Evaluation
• AI Output Analysis
• Rubric-Based Scoring