[Evaluation] Normalize evaluator validation errors to EvaluationException with USER_ERROR blame #47735
m7md7sien wants to merge 4 commits into
…R_ERROR blame Convert raw ValueError/TypeError input and configuration validation failures in ContentSafety, QA, Rouge, DocumentRetrieval and TaskNavigationEfficiency evaluators to EvaluationException, and ensure user-validation errors across the evaluator base consistently set blame=ErrorBlame.USER_ERROR with appropriate category/target. Adds QA/Rouge/DocumentRetrieval ErrorTarget enum members and updates the task navigation test.
…ument_retrieval evaluator
- Added QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR to ErrorTarget enum
- Improved EvaluationException calls with blame/category/target in _base_eval.py, _content_safety.py, _qa.py, _rouge.py, _task_navigation_efficiency.py, _document_retrieval.py
- Fixed ordering: isinstance type checks now run BEFORE comparison in DocumentRetrievalEvaluator.__init__ (azureml-assets c45cc1a fix)
- Replaced bare ValueError/TypeError with EvaluationException in task_navigation_efficiency and document_retrieval evaluators
- Updated test to expect EvaluationException instead of ValueError
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Pull request overview
This PR normalizes input/configuration validation error handling in the azure-ai-evaluation package so that user-caused validation failures are consistently raised as EvaluationException with blame=ErrorBlame.USER_ERROR (plus an appropriate category and target). Previously several evaluators raised bare ValueError/TypeError, and a few existing EvaluationException raises omitted blame, causing them to display as (InternalError) even when caused by user input (since EvaluationException.__str__ maps any non-USER_ERROR blame to "InternalError"). This change improves error classification surfaced to users and aligns with the package-wide validation convention already used in the common validators.
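The display behavior described above can be sketched as follows. This is a simplified stand-in, not the package's actual code: the classes below are hypothetical reimplementations, the default blame of UNKNOWN is an assumption, and the "UserError" label for the USER_ERROR case is assumed (the overview only states that any other blame is shown as "InternalError").

```python
from enum import Enum


class ErrorBlame(Enum):
    # Hypothetical stand-in for the package's blame enum.
    USER_ERROR = "UserError"
    SYSTEM_ERROR = "SystemError"
    UNKNOWN = "Unknown"


class EvaluationException(Exception):
    # Simplified stand-in: only the blame-dependent prefix is modelled.
    def __init__(self, message, blame=ErrorBlame.UNKNOWN):
        super().__init__(message)
        self.blame = blame

    def __str__(self):
        # Any blame other than USER_ERROR is displayed as InternalError.
        label = "UserError" if self.blame == ErrorBlame.USER_ERROR else "InternalError"
        return f"({label}) {super().__str__()}"


without_blame = EvaluationException("threshold must be a number")
with_blame = EvaluationException("threshold must be a number", blame=ErrorBlame.USER_ERROR)

print(without_blame)  # (InternalError) threshold must be a number
print(with_blame)     # (UserError) threshold must be a number
```

Under these assumptions, a raise that omits blame reads as an internal failure even when the message describes bad user input, which is the misclassification this PR addresses.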
Changes:
- Converted raw ValueError/TypeError validation raises to EvaluationException(USER_ERROR) in ContentSafetyEvaluator, QAEvaluator, RougeScoreEvaluator, DocumentRetrievalEvaluator, and the task navigation efficiency evaluator.
- Added explicit blame=USER_ERROR (and corrected one category=UNKNOWN → INVALID_VALUE) to existing EvaluationException raises in the evaluator base (_base_eval.py), and added three new ErrorTarget members (QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR).
- Updated the task navigation test to expect EvaluationException, plus a CHANGELOG entry under 1.17.1 (Unreleased).
I verified the new ErrorTarget values match their class names, no duplicate imports were introduced, ErrorBlame was already imported where used in _base_eval.py, the document-retrieval reordering (type checks now precede the >= comparison) is a correctness improvement and keeps existing test messages intact, and no existing tests still expect ValueError/TypeError for the changed evaluators. One note: changing these raises from ValueError/TypeError to EvaluationException (which is not a subclass of either) is a behavioral change for any downstream code catching those specific exception types — this is intentional and documented in the CHANGELOG.
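The behavioral change for downstream callers can be illustrated with a small sketch. Everything here is hypothetical: the EvaluationException class is a stand-in that, like the real one as described above, subclasses neither ValueError nor TypeError, and validate_threshold is an illustrative check rather than the package's validation code.

```python
class EvaluationException(Exception):
    """Hypothetical stand-in; it does not subclass ValueError or TypeError."""


def validate_threshold(threshold):
    # Illustrative check only, not the package's actual validation logic.
    if not isinstance(threshold, (int, float)):
        raise EvaluationException(
            f"threshold must be a number, got {type(threshold).__name__}"
        )
    return threshold


def legacy_caller(threshold):
    # Written against the old behavior: this handler no longer matches.
    try:
        return validate_threshold(threshold)
    except (ValueError, TypeError):
        return "handled"


def updated_caller(threshold):
    # Catches the normalized exception type instead.
    try:
        return validate_threshold(threshold)
    except EvaluationException:
        return "handled"
```

With this stand-in, updated_caller("high") returns "handled", while legacy_caller("high") lets the EvaluationException propagate, so callers that relied on the old exception types would need a similar update.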
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| _exceptions.py | Adds QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR to ErrorTarget. |
| _evaluators/_common/_base_eval.py | Adds USER_ERROR blame to conversation/tool-call/threshold validation raises; one category corrected to INVALID_VALUE. |
| _evaluators/_content_safety/_content_safety.py | Threshold type check now raises EvaluationException(USER_ERROR) instead of TypeError. |
| _evaluators/_qa/_qa.py | Threshold type check now raises EvaluationException(USER_ERROR). |
| _evaluators/_rouge/_rouge.py | Threshold type check now raises EvaluationException(USER_ERROR). |
| _evaluators/_document_retrieval/_document_retrieval.py | Label-bound/input-record validation normalized to EvaluationException; type checks reordered before the bound comparison; internal missing-threshold kept as SYSTEM_ERROR. |
| _evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py | matching_mode and ground_truth validation normalized to EvaluationException(USER_ERROR). |
| tests/unittests/test_task_navigation_efficiency_evaluators.py | Updated to expect EvaluationException for invalid matching_mode. |
| CHANGELOG.md | Adds a Bugs Fixed entry under 1.17.1 (Unreleased). |
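Why the reordering in the document retrieval evaluator matters can be shown with a minimal sketch. The function name, parameter names, messages, and the integer requirement below are assumptions for illustration, and EvaluationException is a hypothetical stand-in; only the ordering (type checks first, then the >= comparison) comes from the review above.

```python
class EvaluationException(Exception):
    """Hypothetical stand-in for the package's exception type."""


def validate_label_bounds(label_min, label_max):
    # Step 1: type checks run first, so bad types yield a classified error.
    for name, value in (("label_min", label_min), ("label_max", label_max)):
        if not isinstance(value, int):
            raise EvaluationException(
                f"{name} must be an integer, got {type(value).__name__}"
            )
    # Step 2: only now is the comparison safe to evaluate.
    if label_min >= label_max:
        raise EvaluationException("label_min must be strictly less than label_max")
    return label_min, label_max


# If the comparison ran first, mixed types would surface as a raw TypeError:
try:
    "0" >= 4
except TypeError as exc:
    raw_error = exc
```

In this sketch, validate_label_bounds("0", 4) raises EvaluationException from the type check, whereas comparing before checking would leak Python's own TypeError to the user.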
…in task navigation evaluator (#1)
* Fix black formatting and ground_truth empty validation category
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Delete accidentally committed log file
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Summary
Normalizes evaluation/validation error handling in azure-ai-evaluation so that user input and configuration errors are consistently raised as EvaluationException with blame=ErrorBlame.USER_ERROR (plus an appropriate category and target).
Previously several evaluators raised bare ValueError/TypeError for input/threshold validation, and a few existing EvaluationException raises did not set blame, so they defaulted to Unknown/InternalError even though they were caused by user input.
Changes
Raw ValueError/TypeError → EvaluationException(USER_ERROR)
- ContentSafetyEvaluator: threshold type check
- QAEvaluator: threshold type check
- RougeScoreEvaluator: threshold type check
- DocumentRetrievalEvaluator: ground-truth label and input-record validation
- Task navigation efficiency evaluator: matching_mode and ground_truth validation
Existing EvaluationException missing USER_ERROR
- Evaluator base (_base_eval.py): conversation message mismatch, malformed tool-call parsing, and threshold-not-a-number checks now set blame=USER_ERROR (one category=UNKNOWN corrected to INVALID_VALUE)
Supporting
- Added QA_EVALUATOR, ROUGE_EVALUATOR, and DOCUMENT_RETRIEVAL_EVALUATOR members to ErrorTarget
- Updated the task navigation test to expect EvaluationException for an invalid matching_mode
Intentionally left unchanged (not user errors)
- … SYSTEM_ERROR (malformed LLM output, not user input).
- … (_conversation_aggregators.py UNKNOWN, _base_rai_svc_eval.py "Not implemented") are unchanged.
Validation
- black (pinned 24.4.0, repo config) passes on all modified files.