BFCL V4 Release #1019
Merged
Fanjia-Yan reviewed Aug 13, 2025
Fanjia-Yan reviewed Aug 13, 2025
Fanjia-Yan requested changes Aug 20, 2025
Fanjia-Yan approved these changes Aug 20, 2025
Fanjia-Yan (Collaborator) left a comment:
Added some comments. Address if you see fit.
Fanjia-Yan approved these changes Aug 20, 2025
Fanjia-Yan (Collaborator) left a comment:
Added some comments here. Address if you see fit.
CharlieJCJ reviewed Aug 21, 2025
CharlieJCJ reviewed Aug 21, 2025
CharlieJCJ reviewed Aug 21, 2025
CharlieJCJ approved these changes Aug 21, 2025
CharlieJCJ (Collaborator) left a comment:
Overall, nothing blocking, mostly nits. Looks great.
CharlieJCJ reviewed Aug 22, 2025
…oc in format sensitivity
This was linked to issues Aug 25, 2025
Hi @HuanzhiMao, thanks for adding the v4 content. Do you mind also updating the data README here: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/README.md?
iamskeole added a commit to iamskeole/gorilla that referenced this pull request May 9, 2026
> ❗️**Important**: This PR introduces breaking changes and is **NOT** backward-compatible.

# BFCL V4

💥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings, including:

- 🔍 Web search with multi-hop reasoning and error recovery
- 🧠 Evaluating Tool-Calling for Memory
- ⚠️ Evaluating Format Sensitivity

## Change Log

1. **New agentic domain**
   - Introduces the agentic domain with two categories: Web Search and Memory Management.
   - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html).

2. **Revised overall-accuracy formula**
   - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.

   | Segment     | Old % | New %  |
   | ----------- | ----: | -----: |
   | Live        |    33 | **10** |
   | Non-Live    |    33 | **10** |
   | Irrelevance |     0 | **10** |
   | Multi-Turn  |    33 | **30** |
   | Agentic     |     0 | **40** |

3. **Leaderboard / model cleanup**
   - Retires several deprecated models from the leaderboard.
   - Removes unused model handlers to improve maintainability.

4. **Address ShishirPatil#602** ([BFCL] How should we calculate the overall accuracy for V2-Live dataset?)
   - The `Non-Live Acc` and `Live Acc` score calculations now exclude the Irrelevance/Relevance category scores.

5. **Resolve ShishirPatil#1094** ([BFCL] Verification Needed for Live-relevance Data and Ground Truth)

6. **Codebase refactor**
   - Reorganizes the response-generation pipeline and related modules for easier maintenance.
   - Simplifies the response-generation pipeline logic for locally-hosted models.
   - Introduces `enums.py`.

7. **Test category rename**

   The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
   - `simple` --> `simple_python`
   - `java` --> `simple_java`
   - `javascript` --> `simple_javascript`

8. **Directory layout overhaul**

   Results and scores now use a _two-level_ hierarchy:

   ```text
   result/<model>/<general_category>/<category>.json
   score/<model>/<general_category>/<category>.json
   ```

   `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** }

   - For _agentic-memory_ tasks, an extra level distinguishes the memory backend:

   ```text
   result/<model>/agentic/<memory_backend>/<category>.json
   ```

   Migrate existing outputs to this structure before upgrading; otherwise the evaluation pipeline will fail to locate files.

9. **New model support**

   Adds support for the following models:
   - `claude-opus-4-1-20250805`
   - `gpt-5-2025-08-07`
   - `gpt-5-mini-2025-08-07`
   - `gpt-5-nano-2025-08-07`
   - `Qwen/Qwen3-30B-A3B-Instruct-2507`
   - `Qwen/Qwen3-235B-A22B-Instruct-2507`
   - `Qwen/Qwen3-4B-Instruct-2507`
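To make the revised weighting concrete, here is a minimal sketch that combines per-segment accuracies using the New % values from the change log. It assumes the overall score is a plain weighted sum of segment accuracies that have already been aggregated to a single number each; the helper name `overall_accuracy`, the dictionary keys, and the sample scores are hypothetical and are not taken from the BFCL codebase.

```python
# Weights from the "New %" column of the change log, expressed as fractions.
NEW_WEIGHTS = {
    "live": 0.10,
    "non_live": 0.10,
    "irrelevance": 0.10,
    "multi_turn": 0.30,
    "agentic": 0.40,
}


def overall_accuracy(segment_scores: dict) -> float:
    """Weighted sum of per-segment accuracies (hypothetical helper)."""
    missing = NEW_WEIGHTS.keys() - segment_scores.keys()
    if missing:
        raise ValueError(f"missing segment scores: {sorted(missing)}")
    return sum(NEW_WEIGHTS[name] * segment_scores[name] for name in NEW_WEIGHTS)


# Made-up segment accuracies (in percent), for illustration only.
example = {
    "live": 80.0,
    "non_live": 90.0,
    "irrelevance": 70.0,
    "multi_turn": 50.0,
    "agentic": 30.0,
}
print(round(overall_accuracy(example), 2))  # 51.0
```

With these made-up numbers, the Multi-Turn and Agentic segments contribute 27 of the 51 points, which shows how strongly the new weights favor multi-step tasks over the single-turn segments.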
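The migration to the two-level layout could be scripted along the following lines. This is a sketch under stated assumptions, not a tool shipped with the PR: it assumes the old layout was flat (`result/<model>/<category>.json`) and that each file stem equals the category name. The `GENERAL_CATEGORY` mapping and the helpers `plan_move` and `migrate` are hypothetical; only the three renames and the target layout come from the change log.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Tuple

# Renames listed in the change log.
RENAMES = {
    "simple": "simple_python",
    "java": "simple_java",
    "javascript": "simple_javascript",
}

# Hypothetical category -> general_category mapping; extend it for your own files.
GENERAL_CATEGORY = {
    "simple_python": "non_live",
    "simple_java": "non_live",
    "simple_javascript": "non_live",
}


def plan_move(old_file: Path) -> Optional[Path]:
    """Return the new path for a file in the assumed flat layout, or None if unmapped."""
    category = RENAMES.get(old_file.stem, old_file.stem)
    general = GENERAL_CATEGORY.get(category)
    if general is None:
        return None
    return old_file.parent / general / f"{category}.json"


def migrate(root: Path, dry_run: bool = True) -> List[Tuple[Path, Path]]:
    """Plan (and optionally perform) moves of <root>/<model>/<category>.json files."""
    moves = []
    for old_file in sorted(root.glob("*/*.json")):
        new_file = plan_move(old_file)
        if new_file is None:
            continue  # unknown category: leave the file untouched
        moves.append((old_file, new_file))
        if not dry_run:
            new_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(old_file), str(new_file))
    return moves
```

Calling `migrate(Path("result"))` only returns the planned moves; pass `dry_run=False` to apply them, and repeat for `score`. Files whose category is not in the mapping are skipped rather than guessed at, and agentic-memory outputs, which need the extra `<memory_backend>` level, are not handled here.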