[Evaluation] Normalize evaluator validation errors to EvaluationException with USER_ERROR blame #47735

Open
m7md7sien wants to merge 4 commits into Azure:main from m7md7sien:mohessie/normalize-evaluator-exceptions

@m7md7sien

Contributor

Summary

Normalizes evaluation/validation error handling in azure-ai-evaluation so that user input and configuration errors are consistently raised as EvaluationException with blame=ErrorBlame.USER_ERROR (plus an appropriate category and target).

Previously several evaluators raised bare ValueError/TypeError for input/threshold validation, and a few existing EvaluationException raises did not set blame, so they defaulted to Unknown/InternalError even though they were caused by user input.

Changes

Raw ValueError/TypeError → EvaluationException(USER_ERROR)

  • ContentSafetyEvaluator — threshold type check
  • QAEvaluator — threshold type check
  • RougeScoreEvaluator — threshold type check
  • DocumentRetrievalEvaluator — ground-truth label and input-record validation
  • Task navigation efficiency evaluator — matching_mode and ground_truth validation
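
The threshold conversions listed above can be pictured with a minimal sketch. The classes below are hypothetical stand-ins so the example runs without the package installed; the real ErrorBlame, ErrorCategory, and EvaluationException in azure-ai-evaluation may differ in members and signature. The helper name `validate_threshold`, the message text, and the explicit rejection of `bool` are assumptions of this sketch and are not taken from the PR.

```python
from enum import Enum


# Hypothetical stand-ins for the package's error types (assumption: the real
# definitions may have different members and constructor arguments).
class ErrorBlame(Enum):
    USER_ERROR = "UserError"
    SYSTEM_ERROR = "SystemError"
    UNKNOWN = "Unknown"


class ErrorCategory(Enum):
    INVALID_VALUE = "INVALID VALUE"
    UNKNOWN = "UNKNOWN"


class EvaluationException(Exception):
    def __init__(self, message, *, blame=ErrorBlame.UNKNOWN,
                 category=ErrorCategory.UNKNOWN, target=None):
        super().__init__(message)
        self.blame = blame
        self.category = category
        self.target = target


def validate_threshold(threshold, target="ROUGE_EVALUATOR"):
    """Illustrative helper: reject a non-numeric threshold as a user error."""
    # bool is a subclass of int; this sketch chooses to reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise EvaluationException(
            f"threshold must be a number, got {type(threshold).__name__}.",
            blame=ErrorBlame.USER_ERROR,
            category=ErrorCategory.INVALID_VALUE,
            target=target,
        )
    return threshold
```

The point of the pattern is that the caller receives one exception type carrying blame, category, and target, rather than a bare TypeError with no classification.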

Existing EvaluationException missing USER_ERROR

  • Evaluator base (_base_eval.py) — conversation message mismatch, malformed tool-call parsing, and threshold-not-a-number checks now set blame=USER_ERROR (one category=UNKNOWN corrected to INVALID_VALUE)

Supporting

  • Added QA_EVALUATOR, ROUGE_EVALUATOR, and DOCUMENT_RETRIEVAL_EVALUATOR members to ErrorTarget
  • Updated the task navigation test to expect EvaluationException for an invalid matching_mode
  • CHANGELOG entry under 1.17.1 (Unreleased)

Intentionally left unchanged (not user errors)

  • "Evaluator returned invalid output" / "Invalid score value" across the prompty and tool evaluators remain SYSTEM_ERROR (malformed LLM output, not user input).
  • Internal/defensive checks (_conversation_aggregators.py UNKNOWN, _base_rai_svc_eval.py "Not implemented") are unchanged.

Validation

  • All affected unit tests pass (document retrieval, task navigation, threshold behavior, common validators, built-in & agent evaluators).
  • black (pinned 24.4.0, repo config) passes on all modified files.

…R_ERROR blame

Convert raw ValueError/TypeError input and configuration validation failures in ContentSafety, QA, Rouge, DocumentRetrieval and TaskNavigationEfficiency evaluators to EvaluationException, and ensure user-validation errors across the evaluator base consistently set blame=ErrorBlame.USER_ERROR with appropriate category/target. Adds QA/Rouge/DocumentRetrieval ErrorTarget enum members and updates the task navigation test.
github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Jun 28, 2026
Copilot AI added a commit that referenced this pull request Jun 29, 2026
…ument_retrieval evaluator

- Added QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR to ErrorTarget enum
- Improved EvaluationException calls with blame/category/target in _base_eval.py, _content_safety.py, _qa.py, _rouge.py, _task_navigation_efficiency.py, _document_retrieval.py
- Fixed ordering: isinstance type checks now run BEFORE comparison in DocumentRetrievalEvaluator.__init__ (azureml-assets c45cc1a fix)
- Replaced bare ValueError/TypeError with EvaluationException in task_navigation_efficiency and document_retrieval evaluators
- Updated test to expect EvaluationException instead of ValueError

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien requested review from aprilk-ms and Copilot June 29, 2026 15:35
m7md7sien marked this pull request as ready for review June 29, 2026 15:36
m7md7sien requested a review from a team as a code owner June 29, 2026 15:36
m7md7sien enabled auto-merge (squash) June 29, 2026 15:36

Copilot AI left a comment

Contributor

Pull request overview

This PR normalizes input/configuration validation error handling in the azure-ai-evaluation package so that user-caused validation failures are consistently raised as EvaluationException with blame=ErrorBlame.USER_ERROR (plus an appropriate category and target). Previously several evaluators raised bare ValueError/TypeError, and a few existing EvaluationException raises omitted blame, causing them to display as (InternalError) even when caused by user input (since EvaluationException.__str__ maps any non-USER_ERROR blame to "InternalError"). This change improves error classification surfaced to users and aligns with the package-wide validation convention already used in the common validators.
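
To illustrate why an omitted blame shows up as an internal error, here is a small sketch of the string mapping described above. It uses hypothetical stand-in classes, not the package's own code; the "UserError" label and the exact output format are assumptions, and only the rule that any non-USER_ERROR blame is shown as "InternalError" comes from this review.

```python
from enum import Enum


# Hypothetical stand-ins; the real classes live in azure-ai-evaluation.
class ErrorBlame(Enum):
    USER_ERROR = "UserError"
    SYSTEM_ERROR = "SystemError"
    UNKNOWN = "Unknown"


class EvaluationException(Exception):
    def __init__(self, message, blame=ErrorBlame.UNKNOWN):
        super().__init__(message)
        self.blame = blame

    def __str__(self):
        # Anything other than USER_ERROR is displayed as an internal error.
        label = "UserError" if self.blame == ErrorBlame.USER_ERROR else "InternalError"
        return f"({label}) {super().__str__()}"


# Same message, with and without an explicit blame.
without_blame = str(EvaluationException("threshold must be a number"))
with_blame = str(
    EvaluationException("threshold must be a number", blame=ErrorBlame.USER_ERROR)
)
print(without_blame)
print(with_blame)
```

Under these assumptions, the only difference between the two printed lines is the parenthesized label, which is the classification that setting blame is meant to correct.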

Changes:

  • Converted raw ValueError/TypeError validation raises to EvaluationException(USER_ERROR) in ContentSafetyEvaluator, QAEvaluator, RougeScoreEvaluator, DocumentRetrievalEvaluator, and the task navigation efficiency evaluator.
  • Added explicit blame=USER_ERROR (and corrected one category=UNKNOWN → INVALID_VALUE) to existing EvaluationException raises in the evaluator base (_base_eval.py), and added three new ErrorTarget members (QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR).
  • Updated the task navigation test to expect EvaluationException, plus a CHANGELOG entry under 1.17.1 (Unreleased).

I verified the new ErrorTarget values match their class names, no duplicate imports were introduced, ErrorBlame was already imported where used in _base_eval.py, the document-retrieval reordering (type checks now precede the >= comparison) is a correctness improvement and keeps existing test messages intact, and no existing tests still expect ValueError/TypeError for the changed evaluators. One note: changing these raises from ValueError/TypeError to EvaluationException (which is not a subclass of either) is a behavioral change for any downstream code catching those specific exception types — this is intentional and documented in the CHANGELOG.
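
The behavioral change for downstream callers can be sketched as follows. `EvaluationException` here is a hypothetical stand-in that, as noted above, subclasses neither ValueError nor TypeError; `configure_evaluator`, the placeholder mode names, and both caller functions are invented for illustration and are not part of the package.

```python
class EvaluationException(Exception):
    """Hypothetical stand-in: deliberately not a ValueError or TypeError subclass."""


def configure_evaluator(matching_mode):
    """Illustrative validator with placeholder mode names (assumption)."""
    allowed = {"mode_a", "mode_b"}
    if matching_mode not in allowed:
        raise EvaluationException(f"Invalid matching_mode: {matching_mode!r}")
    return matching_mode


def old_style_caller(mode):
    # Written against the previous behavior: this no longer catches the
    # error, so the EvaluationException propagates to the caller's caller.
    try:
        return configure_evaluator(mode)
    except (ValueError, TypeError):
        return "fallback"


def new_style_caller(mode):
    # Updated handler that catches the normalized exception type.
    try:
        return configure_evaluator(mode)
    except EvaluationException:
        return "fallback"
```

Code that needs to work across both package versions could catch all three types in a single `except` clause during the transition.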

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| _exceptions.py | Adds QA_EVALUATOR, ROUGE_EVALUATOR, DOCUMENT_RETRIEVAL_EVALUATOR to ErrorTarget. |
| _evaluators/_common/_base_eval.py | Adds USER_ERROR blame to conversation/tool-call/threshold validation raises; one category corrected to INVALID_VALUE. |
| _evaluators/_content_safety/_content_safety.py | Threshold type check now raises EvaluationException(USER_ERROR) instead of TypeError. |
| _evaluators/_qa/_qa.py | Threshold type check now raises EvaluationException(USER_ERROR). |
| _evaluators/_rouge/_rouge.py | Threshold type check now raises EvaluationException(USER_ERROR). |
| _evaluators/_document_retrieval/_document_retrieval.py | Label-bound/input-record validation normalized to EvaluationException; type checks reordered before the bound comparison; internal missing-threshold kept as SYSTEM_ERROR. |
| _evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py | matching_mode and ground_truth validation normalized to EvaluationException(USER_ERROR). |
| tests/unittests/test_task_navigation_efficiency_evaluators.py | Updated to expect EvaluationException for invalid matching_mode. |
| CHANGELOG.md | Adds a Bugs Fixed entry under 1.17.1 (Unreleased). |
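
The reordering noted for _document_retrieval.py (type checks before the bound comparison) matters because comparing mismatched types in Python 3 raises a raw TypeError before any validation message can be produced. The sketch below shows the idea with a hypothetical stand-in exception; the function name, parameter names, and messages are assumptions and are not copied from the evaluator.

```python
class EvaluationException(Exception):
    """Hypothetical stand-in for the package's exception type."""


def validate_label_bounds(label_min, label_max):
    """Illustrative check: validate types first, then compare the bounds."""
    for name, value in (("label_min", label_min), ("label_max", label_max)):
        # bool is a subclass of int; this sketch rejects it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise EvaluationException(
                f"{name} must be an integer, got {type(value).__name__}."
            )
    # Safe to compare only once both values are known to be integers.
    if label_min >= label_max:
        raise EvaluationException("label_min must be strictly less than label_max.")
    return label_min, label_max
```

With the comparison placed first, a call such as `validate_label_bounds("0", 4)` would fail inside the `>=` operator with a TypeError instead of the intended validation error.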

m7md7sien self-assigned this Jun 29, 2026
…in task navigation evaluator (#1)

* Fix black formatting and ground_truth empty validation category

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Delete accidentally committed log file

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien removed the request for review from aprilk-ms July 1, 2026 03:49
Labels

Evaluation Issues related to the client library for Azure AI Evaluation

4 participants