BFCL V4 Release by HuanzhiMao · Pull Request #1019 · ShishirPatil/gorilla

HuanzhiMao · 2025-05-11T00:00:24Z

❗️Important: This PR introduces breaking changes and is NOT backward-compatible.

BFCL V4

💥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings — including:

🔍 Web search with multi-hop reasoning and error recovery
🧠 Evaluating Tool-Calling for Memory
⚠️ Evaluating Format Sensitivity

Change Log

New agentic domain
- Introduces the agentic domain with two categories: Web Search and Memory Management.
- For more information, please see our accompanying blog posts.
Revised overall-accuracy formula
- As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.
| Segment     | Old % | New % |
| ----------- | ----: | ----: |
| Live        |    33 |    10 |
| Non-Live    |    33 |    10 |
| Irrelevance |     0 |    10 |
| Multi-Turn  |    33 |    30 |
| Agentic     |     0 |    40 |
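As a rough illustration of the new weighting, the sketch below combines per-segment accuracies using the New % values listed above. It assumes the overall score is a plain weighted sum of the five segment accuracies; the function name, dictionary keys, and sample accuracies are hypothetical and are not taken from the BFCL codebase.

```python
# Weights from the "New %" values above, expressed as fractions.
NEW_WEIGHTS = {
    "live": 0.10,
    "non_live": 0.10,
    "irrelevance": 0.10,
    "multi_turn": 0.30,
    "agentic": 0.40,
}


def overall_accuracy(segment_acc: dict[str, float],
                     weights: dict[str, float] = NEW_WEIGHTS) -> float:
    """Weighted sum of per-segment accuracies (each in the range 0..1)."""
    missing = set(weights) - set(segment_acc)
    if missing:
        raise ValueError(f"missing segment accuracies: {sorted(missing)}")
    return sum(weights[name] * segment_acc[name] for name in weights)


# Hypothetical accuracies, for illustration only.
example = {
    "live": 0.80,
    "non_live": 0.90,
    "irrelevance": 0.85,
    "multi_turn": 0.50,
    "agentic": 0.30,
}
print(round(overall_accuracy(example), 3))
```

Under these weights, the agentic and multi-turn segments together account for 70% of the overall score, so a model with strong single-turn results can still score low overall.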
Leaderboard / model cleanup
- Retires several deprecated models from the leaderboard.
- Removes unused model handlers to improve maintainability.
Address [BFCL] How should we calculate the overall accuracy for V2-Live dataset? #602
- Non-Live Acc and Live Acc score calculation now excludes the Irrelevance/Relevance category scores.
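The exclusion can be pictured with the sketch below. It assumes a group score is an unweighted mean over its category scores, which the description does not confirm; the category names, the substring matching, and the function name are illustrative assumptions rather than the project's actual code.

```python
from statistics import mean

# Category names are illustrative; matching on these substrings is an assumption.
EXCLUDED_MARKERS = ("irrelevance", "relevance")


def group_accuracy(category_scores: dict[str, float]) -> float:
    """Unweighted mean over categories, skipping irrelevance/relevance ones."""
    kept = [
        score
        for name, score in category_scores.items()
        if not any(marker in name for marker in EXCLUDED_MARKERS)
    ]
    if not kept:
        raise ValueError("no categories left after exclusion")
    return mean(kept)


# Hypothetical scores, for illustration only.
live_scores = {
    "live_simple": 0.80,
    "live_multiple": 0.70,
    "live_irrelevance": 0.95,  # excluded from Live Acc
    "live_relevance": 0.90,    # excluded from Live Acc
}
print(group_accuracy(live_scores))
```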
Resolve [BFCL] Verification Needed for Live-relevance Data and Ground Truth #1094.
Codebase refactor
- Reorganizes the response-generation pipeline and related modules for easier maintenance.
- Simplifies the response-generation pipeline logic for locally hosted models.
- Introduces enums.py.
Test category rename
The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
- simple --> simple_python
- java --> simple_java
- javascript --> simple_javascript
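Scripts that still refer to the old category names could translate them with a small lookup like the one below. The mapping mirrors the three renames listed above; the helper name `to_v4_category` is hypothetical and is not part of the BFCL package.

```python
# Old category name -> new category name, as listed in the rename above.
RENAMED_CATEGORIES = {
    "simple": "simple_python",
    "java": "simple_java",
    "javascript": "simple_javascript",
}


def to_v4_category(name: str) -> str:
    """Return the new name for a renamed category; other names pass through."""
    return RENAMED_CATEGORIES.get(name, name)


print([to_v4_category(n) for n in ("simple", "java", "javascript", "multi_turn")])
```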
Directory layout overhaul
Results and scores now use a two-level hierarchy:
```
result/<model>/<general_category>/<category>.json
score/<model>/<general_category>/<category>.json
```
general_category ∈ { non_live, live, multi_turn, agentic, format_sensitivity }

• For agentic-memory tasks, an extra level distinguishes the memory backend:
```
result/<model>/agentic/<memory_backend>/<category>.json
```
Migrate existing outputs to this structure before upgrading; otherwise, the evaluation pipeline will fail to locate the files.
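A migration along these lines might look like the sketch below. It rests on several assumptions not stated in this description: that existing files sit directly under `<model>/`, that each file stem equals the category name, and that the caller supplies the category-to-general_category mapping. The function name `migrate_outputs` is hypothetical, and the extra memory-backend level for agentic-memory tasks is not handled.

```python
import shutil
from pathlib import Path


def migrate_outputs(root: Path, general_category_of: dict[str, str]) -> list[Path]:
    """Move root/<model>/<category>.json to root/<model>/<general_category>/<category>.json.

    Files whose category is not in the mapping are left in place.
    Returns the new paths of the files that were moved.
    """
    moved = []
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only consider JSON files sitting directly under the model directory.
        for old_path in sorted(model_dir.glob("*.json")):
            general = general_category_of.get(old_path.stem)
            if general is None:
                continue
            new_path = model_dir / general / old_path.name
            new_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(old_path), str(new_path))
            moved.append(new_path)
    return moved
```

A call such as `migrate_outputs(Path("result"), {"simple_python": "non_live"})` would then be repeated for the `score` directory; the mapping shown here is only an example, and the real assignment of categories to general categories should be checked against the repository.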
New model support
Adds support for the following models:
- claude-opus-4-1-20250805
- gpt-5-2025-08-07
- gpt-5-mini-2025-08-07
- gpt-5-nano-2025-08-07
- Qwen/Qwen3-30B-A3B-Instruct-2507
- Qwen/Qwen3-235B-A22B-Instruct-2507
- Qwen/Qwen3-4B-Instruct-2507

Fanjia-Yan

Added some comments. Address as you see fit.

Fanjia-Yan

Added some comments here. Address as you see fit.

CharlieJCJ

Overall, nothing blocking; mostly nits. Looks great.

D-X-Y · 2025-09-02T23:32:07Z

Hi @HuanzhiMao , thanks for adding the v4 content. Do you mind also updating the data README here: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/README.md ?


HuanzhiMao added the BFCL-General (General BFCL Issue) and BFCL-Dataset (BFCL Dataset-Related Issue) labels May 11, 2025

HuanzhiMao added 17 commits June 21, 2025 16:55

update pyproject.toml

94e1961

update compile script

2e157ad

update default system prompt

4b9667c

update eval config, category mapping

6a34c9e

add agentic checker

b8ce6fb

update generation pipeline logic

31595af

update helper function in utils

2f1905b

update evaluation pipeline

0dc53e3

add web search backend implementation

253a21f

add memory backend implementation

9ddb404

upload backend function doc

04180cb

upload dataset and ground truth

f42b535

update generation pipeline logic and utils

ca794fd

update api_inference handlers

19b6bb2

Merge remote-tracking branch 'upstream/main' into agentic

bf7f60e

retire deprecated models

8c469fa

update local_inference handlers

afc0505

HuanzhiMao force-pushed the BFCL-V4 branch from a15f4fc to afc0505 Compare July 17, 2025 07:08

HuanzhiMao added 6 commits July 17, 2025 12:50

update web_search categories

5c7e6ca

update overall score calculation formula

cdf5927

fix live-relevance_13, live-relevance_15

f39a7d2

exclude irrelevance from the overall_live and overall_non_live calculation

45b6bf8

Merge remote-tracking branch 'upstream/main' into BFCL-V4

a565a38

update change log

73de363

HuanzhiMao force-pushed the BFCL-V4 branch from f81cb6c to 73de363 Compare July 17, 2025 21:38

update to bfcl_v4 prefix

57665f5

HuanzhiMao marked this pull request as ready for review July 17, 2025 22:08

update csv column headers

e137ad0

HuanzhiMao added 8 commits August 8, 2025 17:33

fix typo

e43f035

add auto-retry logic to mistral models

e21bdf2

update Nova model configuration to use AWS SSO for authentication

2f08811

fix potential race-condition in result file writing

4b43364

clean up

3a44eaf

remove already-generated entries from dependency lists

31e12e9

add gpt-5 series support

d949302

update changelog

b0b3d19

Fanjia-Yan reviewed Aug 13, 2025

Comment thread berkeley-function-call-leaderboard/bfcl_eval/constants/category_mapping.py

Fanjia-Yan reviewed Aug 13, 2025

Comment thread berkeley-function-call-leaderboard/bfcl_eval/_llm_response_generation.py Outdated

Fanjia-Yan requested changes Aug 20, 2025

Fanjia-Yan approved these changes Aug 20, 2025

HuanzhiMao added 2 commits August 20, 2025 17:38

improve retry logic for web search api; avoid thundering-herd wake-ups

1fb1b68

remove duplicate code; clean up

7b773b9

CharlieJCJ reviewed Aug 21, 2025

Comment thread berkeley-function-call-leaderboard/README.md

Comment thread berkeley-function-call-leaderboard/bfcl_eval/data/BFCL_v4_irrelevance.json

Comment thread berkeley-function-call-leaderboard/bfcl_eval/data/BFCL_v4_simple_java.json

CharlieJCJ reviewed Aug 21, 2025

Comment thread berkeley-function-call-leaderboard/pyproject.toml

Comment thread berkeley-function-call-leaderboard/README.md

CharlieJCJ reviewed Aug 21, 2025

Comment thread berkeley-function-call-leaderboard/bfcl_eval/eval_checker/agentic_eval/agentic_checker.py Outdated

Comment thread berkeley-function-call-leaderboard/bfcl_eval/eval_checker/agentic_eval/agentic_checker.py

CharlieJCJ reviewed Aug 21, 2025

Comment thread berkeley-function-call-leaderboard/bfcl_eval/utils.py

Comment thread ...ction-call-leaderboard/bfcl_eval/eval_checker/multi_turn_eval/func_source_code/web_search.py

CharlieJCJ approved these changes Aug 21, 2025

CharlieJCJ reviewed Aug 22, 2025

Comment thread berkeley-function-call-leaderboard/README.md

HuanzhiMao added 2 commits August 25, 2025 00:05

improve nested parameter handling for xml and python style function doc in format sensitivity

bc94a1b

clean up

c02c048

HuanzhiMao merged commit 58f57e9 into ShishirPatil:main Aug 25, 2025

This was linked to issues Aug 25, 2025

[BFCL] When is the V4 benchmark scheduled for release? #1153

Closed

[BFCL] Fix: OpenAI API BadRequestError for non-reasoning models with encrypted content param #1148

Closed

HuanzhiMao mentioned this pull request Aug 25, 2025

[BFCL] Fix: OpenAI API BadRequestError for non-reasoning models with encrypted content param #1148

Closed

HuanzhiMao mentioned this pull request Oct 27, 2025

Fix bfcl-generate: Error for OAI non-reasoning models #1150

Closed
