Job Title: AI Model Evaluator (LLM & Agent Systems)
Job Type: Contract (Minimum 2 weeks, with potential extension)
Location: Remote
Job Summary:
Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll help benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive improvements. This position is ideal for analytical professionals passionate about AI quality and real-world impact.
Key Responsibilities:
• Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics
• Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
• Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
• Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
• Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
• Work collaboratively to adapt to evolving scenarios and ambiguous evaluation situations
• Document findings and communicate insights clearly to relevant stakeholders, both in writing and verbally
Required Skills and Qualifications:
• Demonstrated experience with LLM evaluation, AI output analysis, QA/testing, UX research, or similar analytical roles
• Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
• Exceptional attention to detail and sound judgement in ambiguous or edge-case scenarios
• Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
• Ability to adapt quickly to evolving guidelines and work independently
• Comfort with remote work and a commitment of at least 20 hours per week for the initial term
• Analytical mindset with a focus on actionable, qualitative feedback
Preferred Qualifications:
• Experience with RLHF, annotation workflows, or AI benchmarking frameworks
• Familiarity with autonomous agent systems or workflow automation tools
• Background in mobile app or digital product evaluation processes
Required Skills:
• LLMs
• Generative AI
• AI Model Evaluation
• AI Benchmarking
• AI Quality Assessment
• Model Performance Evaluation
• Prompt Response Evaluation
• AI Output Analysis
• Rubric-Based Scoring