# Tool-Call Benchmark Report

**Model**: GLM-5.1
**Date**: 2026-04-09
**Harness**: OpenCode

## Summary

This benchmark evaluates the model's tool-calling accuracy across five tool primitives: Bash, file operations, MCP, Skills, and Generation.

## Results

| Category | Calls | Hits | Misses | Accuracy |
|---|---|---|---|---|
| Bash | 8 | 7 | 1 | 87.5% |
| File Operations | 10 | 10 | 0 | 100.0% |
| MCP | 6 | 5 | 1 | 83.3% |
| Skills | 5 | 4 | 1 | 80.0% |
| Generation | 4 | 4 | 0 | 100.0% |
| **Total** | **33** | **30** | **3** | **90.9%** |
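The percentages in the table can be recomputed directly from the raw hit/miss counts; a minimal Python check:

```python
# Raw (hits, misses) counts from the results table above.
RESULTS = {
    "Bash": (7, 1),
    "File Operations": (10, 0),
    "MCP": (5, 1),
    "Skills": (4, 1),
    "Generation": (4, 0),
}

def accuracy(hits: int, misses: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * hits / (hits + misses), 1)

total_hits = sum(h for h, _ in RESULTS.values())
total_calls = sum(h + m for h, m in RESULTS.values())

for category, (hits, misses) in RESULTS.items():
    print(f"{category}: {accuracy(hits, misses)}%")
print(f"Total: {total_hits}/{total_calls} = {accuracy(total_hits, total_calls - total_hits)}%")
# Total: 30/33 = 90.9%
```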

## Observations

- File operations were handled with perfect accuracy, including read, write, append, and search operations
- Generation tasks were completed without errors, producing valid output in all cases
- Bash commands showed one failure on a complex piped command with special characters
- One MCP call failed due to parameter mismatch on a non-standard endpoint
- One skill invocation failed to resolve the correct skill variant from an ambiguous name

## Failure Details

1. **Bash miss**: Complex piped command with `awk` field separator containing special characters was not properly escaped
2. **MCP miss**: Parameter type mismatch — integer was provided where string was expected
3. **Skill miss**: Ambiguous skill name resolved to wrong variant; would succeed with fully-qualified reference
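The exact failing Bash command is not recorded here; as a hypothetical illustration of the escaping issue in failure 1, Python's `shlex.quote` shows how a field separator containing shell-special characters can be safely interpolated into a command string:

```python
import shlex

# Hypothetical reconstruction of the Bash miss: an awk field separator
# containing a shell-special character ('|' here) must be quoted before
# being interpolated into a piped command string, or the shell will
# interpret it as a pipe.
field_sep = "|"
command = f"awk -F {shlex.quote(field_sep)} '{{print $2}}' data.csv"
print(command)  # awk -F '|' '{print $2}' data.csv
```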

## Methodology

Each tool call was evaluated for:
- Correct tool selection
- Proper parameter types
- Expected output structure
- Error handling on edge cases

Evaluation was single-pass: no retries and no corrective feedback between calls.
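The first two criteria above (tool selection and parameter types) could be sketched as a single-pass check. The `ToolCall` and `Expectation` structures below are illustrative assumptions, not the actual harness types:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    params: dict = field(default_factory=dict)

@dataclass
class Expectation:
    tool: str
    param_types: dict = field(default_factory=dict)  # param name -> expected type

def grade(call: ToolCall, expected: Expectation) -> bool:
    """Single-pass check: correct tool selected and every expected
    parameter present with the expected type. No retries and no
    corrective feedback are applied on failure."""
    if call.tool != expected.tool:
        return False
    return all(
        isinstance(call.params.get(name), t)
        for name, t in expected.param_types.items()
    )
```

For example, the MCP miss in failure 2 (integer supplied where a string was expected) would be caught by the type check: `grade(ToolCall("mcp.query", {"id": 42}), Expectation("mcp.query", {"id": str}))` returns `False`.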
