BFCL V4 Release #1019

Merged: HuanzhiMao merged 61 commits into ShishirPatil:main from HuanzhiMao:BFCL-V4 on Aug 25, 2025
HuanzhiMao (Collaborator) commented May 11, 2025

❗️Important: This PR introduces breaking changes and is NOT backward-compatible.

BFCL V4

💥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings — including:

🔍 Web search with multi-hop reasoning and error recovery
🧠 Evaluating Tool-Calling for Memory
⚠️ Evaluating Format Sensitivity

Change Log

  1. New agentic domain

    • Introduces the agentic domain with two categories: Web Search and Memory Management.
    • For more information, please see our accompanying blog posts (https://gorilla.cs.berkeley.edu/blog.html).
  2. Revised overall-accuracy formula

    • As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.
    | Segment     | Old % | New % |
    | ----------- | ----: | ----: |
    | Live        |    33 |    10 |
    | Non-Live    |    33 |    10 |
    | Irrelevance |     0 |    10 |
    | Multi-Turn  |    33 |    30 |
    | Agentic     |     0 |    40 |
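
The new weighting can be read as a weighted average of per-segment accuracies. The following is a minimal sketch under that assumption, with each segment score given as a percentage between 0 and 100. The names `V4_WEIGHTS` and `overall_accuracy` and the dictionary keys are illustrative and are not taken from the BFCL codebase.

```python
# Hypothetical helper: names and keys are illustrative, not the BFCL implementation.
# Weights are the "New %" column of the table above.
V4_WEIGHTS = {
    "live": 10,
    "non_live": 10,
    "irrelevance": 10,
    "multi_turn": 30,
    "agentic": 40,
}


def overall_accuracy(segment_scores, weights=V4_WEIGHTS):
    """Return the weighted average of per-segment accuracies (percentages)."""
    missing = set(weights) - set(segment_scores)
    if missing:
        raise KeyError(f"missing segment scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(segment_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Made-up scores, only to show the arithmetic:
# (80*10 + 90*10 + 70*10 + 50*30 + 40*40) / 100 = 55.0
example = {"live": 80, "non_live": 90, "irrelevance": 70, "multi_turn": 50, "agentic": 40}
print(overall_accuracy(example))  # 55.0
```

With these weights, a model that is strong on single-turn segments but weak on agentic tasks is pulled down far more than under the old scheme, since agentic tasks alone carry 40% of the total.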
  3. Leaderboard / model cleanup

    • Retires several deprecated models from the leaderboard.
    • Removes unused model handlers to improve maintainability.
  4. Address issue #602: "[BFCL] How should we calculate the overall accuracy for V2-Live dataset?"

    • Non-Live Acc and Live Acc score calculation now excludes the Irrelevance/Relevance category scores.
  5. Resolve issue #1094: "[BFCL] Verification Needed for Live-relevance Data and Ground Truth"

  6. Codebase refactor

    • Reorganizes the response-generation pipeline and related modules for easier maintenance.
    • Simplifies the response-generation pipeline logic for locally hosted models.
    • Introduces enums.py.
  7. Test category rename
    The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.

    • simple --> simple_python
    • java --> simple_java
    • javascript --> simple_javascript
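
For scripts that still refer to the old category names, the renames above amount to a small lookup table. This is a sketch only: `LEGACY_TO_V4` and `to_v4_category` are hypothetical names, not part of the BFCL codebase.

```python
# Hypothetical mapping built from the rename list above.
LEGACY_TO_V4 = {
    "simple": "simple_python",
    "java": "simple_java",
    "javascript": "simple_javascript",
}


def to_v4_category(name):
    """Translate a pre-V4 category name; names that were not renamed pass through."""
    return LEGACY_TO_V4.get(name, name)


print(to_v4_category("java"))  # simple_java
```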
  8. Directory layout overhaul
    Results and scores now use a two-level hierarchy:

    result/<model>/<general_category>/<category>.json
    score/<model>/<general_category>/<category>.json
    

    general_category ∈ { non_live, live, multi_turn, agentic, format_sensitivity }

    • For agentic-memory tasks, an extra level distinguishes the memory backend:

    result/<model>/agentic/<memory_backend>/<category>.json
    

    Migrate existing outputs to this structure before upgrading; otherwise, the evaluation pipeline will fail to locate the files.
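
As a rough illustration of the layout described above, the sketch below composes paths with `pathlib`. The function name `result_path` and its parameters are assumptions made for this example, and the model, category, and memory-backend values in the usage lines are placeholders rather than real BFCL names.

```python
from pathlib import PurePosixPath

# The general categories listed in the PR description.
GENERAL_CATEGORIES = {"non_live", "live", "multi_turn", "agentic", "format_sensitivity"}


def result_path(model, general_category, category, memory_backend=None, root="result"):
    """Build <root>/<model>/<general_category>[/<memory_backend>]/<category>.json.

    Hypothetical helper: it mirrors the described layout and is not BFCL code.
    """
    if general_category not in GENERAL_CATEGORIES:
        raise ValueError(f"unknown general category: {general_category!r}")
    if memory_backend is not None and general_category != "agentic":
        raise ValueError("memory_backend only applies to the agentic category")
    parts = [root, model, general_category]
    if memory_backend is not None:
        # Extra level used for agentic-memory tasks.
        parts.append(memory_backend)
    parts.append(f"{category}.json")
    return PurePosixPath(*parts)


# Placeholder values for illustration only.
print(result_path("my-model", "non_live", "simple_python"))
# result/my-model/non_live/simple_python.json
print(result_path("my-model", "agentic", "my_category", memory_backend="my_backend"))
# result/my-model/agentic/my_backend/my_category.json
```

Passing `root="score"` gives the corresponding score path for the two-level case. The description only shows the extra memory-backend level under `result/`, so this sketch makes no claim about how `score/` handles it.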

  9. New model support
    Adds support for the following models:

    • claude-opus-4-1-20250805
    • gpt-5-2025-08-07
    • gpt-5-mini-2025-08-07
    • gpt-5-nano-2025-08-07
    • Qwen/Qwen3-30B-A3B-Instruct-2507
    • Qwen/Qwen3-235B-A22B-Instruct-2507
    • Qwen/Qwen3-4B-Instruct-2507

HuanzhiMao added the BFCL-General (General BFCL Issue) and BFCL-Dataset (BFCL Dataset-Related Issue) labels May 11, 2025
HuanzhiMao marked this pull request as ready for review July 17, 2025 22:08
Fanjia-Yan (Collaborator) left a comment

Added some comments. Address if you see fit.

Fanjia-Yan (Collaborator) left a comment

Added some comments here. Address if you see fit.

CharlieJCJ (Collaborator) left a comment

Overall, nothing blocking, just nits in general. Looks great.


D-X-Y commented Sep 2, 2025

Hi @HuanzhiMao, thanks for adding the V4 content. Would you mind also updating the data README here: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/README.md ?

iamskeole added a commit to iamskeole/gorilla that referenced this pull request May 9, 2026