How LLM Function Calling Actually Works — From Tokens to Tool Orchestration
How LLMs return structured data through function calling, how constrained decoding works under the hood, and what happens when the model needs to call multiple tools in a single turn.
When you ask an LLM "Compare the weather in Tokyo and Berlin," what actually happens? The model can't browse the internet — but it can decide to call a weather API. Twice. In the same turn.
This article covers how function calling works, how the LLM returns structured data despite generating tokens one by one, and what happens when the model needs to orchestrate multiple tool calls to answer a single question.
Part 1: What Is Function Calling?
"Function calling" is one of several ways to get output from an LLM API. Here are the three main methods:
Method A: Plain Text Completion (Simplest)
```python
response = client.chat.completions.create(
    model="gemini-2.5-flash-lite",
    messages=[{"role": "user", "content": "Classify this email: ..."}],
)
text = response.choices[0].message.content
# "This is a job_search email because..."
```
You get back free-form text. Then you'd have to parse it yourself — maybe with regex, or hoping the LLM follows instructions like "respond with JSON only". This is fragile because the LLM might say:
"I think this email is in the job_search category because..."
...and now your regex breaks.
Method B: JSON Mode
```python
response = client.chat.completions.create(
    model="...",
    messages=[...],
    response_format={"type": "json_object"},  # force JSON output
)
data = json.loads(response.choices[0].message.content)
```
The LLM is forced to output valid JSON, but you still have no guarantee of the schema — it might return {"cat": "job"} instead of {"category": "job_search"}.
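With JSON mode, the schema check is on you. A minimal guard (the key set and `parse_classification` are illustrative):

```python
import json

EXPECTED_KEYS = {"category", "confidence"}

def parse_classification(raw: str) -> dict:
    """Parse JSON-mode output and verify it has the keys we expect."""
    data = json.loads(raw)  # JSON mode guarantees this parses...
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        # ...but nothing stops the model from inventing its own key names
        raise ValueError(f"missing expected keys: {missing}")
    return data

parse_classification('{"category": "job_search", "confidence": 0.95}')  # ok

try:
    parse_classification('{"cat": "job"}')  # valid JSON, wrong keys
except ValueError as e:
    print("rejected:", e)
```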
Method C: Function Calling (What We Use)
```python
response = client.chat.completions.create(
    model="...",
    messages=[...],
    tools=[{
        "type": "function",
        "function": {
            "name": "classify_email",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["job_search", "spam", "newsletter", ...]
                    },
                    "confidence": {"type": "number"},
                    "summary": {"type": "string"},
                    "reasoning": {"type": "string"}
                },
                "required": ["category", "confidence"]
            }
        }
    }],
    tool_choice={
        "type": "function",
        "function": {"name": "classify_email"}
    },
)
```
You define the exact schema you want — field names, types, enums, required fields. The API forces the LLM to fill in that schema. The result comes back as:
```json
{
  "category": "job_search",
  "confidence": 0.95,
  "summary": "LinkedIn job alert for Senior Python Developer",
  "reasoning": "Sender is LinkedIn, contains job listings..."
}
```
This is the most reliable way to get structured output. The LLM cannot deviate from the schema.
Comparison
| Method | Output | Schema Guarantee | Reliability |
|---|---|---|---|
| Plain text | Free-form string | None | Low — requires manual parsing |
| JSON mode | Valid JSON | No schema enforcement | Medium — valid JSON but unpredictable keys |
| Function calling | Schema-constrained JSON | Full schema + types + enums | High — enforced at token generation level |
Part 2: How Does an LLM Return a Dict If It Generates Tokens?
The LLM still generates tokens one by one. It doesn't "natively" return a Python dictionary. Here's what actually happens under the hood.
What the LLM Actually Generates
```text
Tokens:  {  "  category  "  :  "  job  _search  "  ,  "  confidence  "  :  0  .  95  }
         └───────────────── still just text tokens, generated one at a time ─────────┘
```
The LLM is still producing text — it's just text that happens to be valid JSON. The API layer does the magic.
The Constrained Decoding Process
When you use function calling, the API applies constrained decoding (also called "guided generation"):
- The API receives your `tools` schema definition
- It constrains the LLM's token generation — at each step, only tokens that would produce valid JSON matching your schema are allowed
- For `"enum": ["job_search", "spam", ...]`, the LLM can literally only pick from those values
- For `"type": "number"`, only numeric tokens are valid at that position
This is fundamentally different from just asking "please reply in JSON" in the prompt. The constraints are enforced at the token-generation level, not via prompt instructions.
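To build intuition, here is a toy, single-step sketch of the idea. Real constrained decoders work on subword tokens and compile the schema into a grammar or state machine; `constrained_pick` and the scores below are purely illustrative:

```python
def constrained_pick(scores: dict[str, float], allowed: set[str]) -> str:
    """Greedy-decode one value, but only among grammar-allowed candidates."""
    valid = {tok: s for tok, s in scores.items() if tok in allowed}
    return max(valid, key=valid.get)

# Unconstrained, the model's top choice would be "job" — not in our enum.
scores = {"job": 0.9, "job_search": 0.7, "spam": 0.1, "urgent": 0.4}
allowed = {"job_search", "spam", "newsletter"}  # from the schema's enum

print(constrained_pick(scores, allowed))  # job_search
```

The key point: the constraint is applied to the candidate set *before* the pick, so invalid output is impossible rather than merely discouraged.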
The Full Flow
```text
LLM brain
   │
   ▼  (generates tokens, constrained by schema)
'{"category":"job_search","confidence":0.95,"summary":"..."}'
   │
   ▼  (API parses & validates)
response.choices[0].message.tool_calls[0].function.arguments
   │  (this is still a STRING)
   ▼
json.loads(arguments)
   │
   ▼  (now it's a Python dict)
{"category": "job_search", "confidence": 0.95, "summary": "..."}
```
In Code
```python
import json

# The API returns tool_calls as part of the response
tool_call = response.choices[0].message.tool_calls[0]

# .arguments is a STRING containing JSON
raw = tool_call.function.arguments
# '{"category":"job_search","confidence":0.95,...}'

# We parse it into a Python dict
data = json.loads(raw)
# {"category": "job_search", "confidence": 0.95, ...}
```
TL;DR: The LLM is still generating text/tokens. Function calling constrains which tokens it can generate (must match your schema), and the API wraps the result in a structured format. We then json.loads() that string into a Python dict.
Part 3: Multiple Tools — "Compare the Weather in Tokyo and Berlin"
So far we've seen one function called once. But what happens when the user's question requires multiple tool calls?
The Setup: Defining a Weather Tool
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units"
                }
            },
            "required": ["city"]
        }
    }
}]
```
We have one tool definition — get_weather. Now watch what happens when the user asks a question that requires it twice.
The Request
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Compare the weather in Tokyo and Berlin right now"
    }],
    tools=tools,
)
```
What the LLM Returns: Two Tool Calls
The LLM doesn't return a text answer. Instead, it returns two tool calls in a single response:
```python
message = response.choices[0].message
# message.content is None — no text response
# message.tool_calls has TWO entries:

for tc in message.tool_calls:
    print(f"ID: {tc.id}")
    print(f"Function: {tc.function.name}")
    print(f"Args: {tc.function.arguments}")
    print()
```
Output:
```text
ID: call_abc123
Function: get_weather
Args: {"city": "Tokyo", "units": "celsius"}

ID: call_def456
Function: get_weather
Args: {"city": "Berlin", "units": "celsius"}
```
The LLM decided on its own to:
- Call the same function twice with different arguments
- Pick "celsius" as the unit (reasonable default for these cities)
- Return both calls in the same turn (parallel tool calls)
Your Code Executes the Tools
Now you run both calls and feed the results back:
```python
import json

# Execute each tool call
tool_results = []
for tc in message.tool_calls:
    args = json.loads(tc.function.arguments)
    # Call your actual weather API
    weather = get_weather_from_api(args["city"], args.get("units", "celsius"))
    tool_results.append({
        "role": "tool",
        "tool_call_id": tc.id,  # must match the ID from the LLM
        "content": json.dumps(weather)
    })

# Send results back to the LLM
messages = [
    {"role": "user", "content": "Compare the weather in Tokyo and Berlin right now"},
    message,        # the assistant's tool_calls response
    *tool_results,  # both tool results
]
final = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```
The Final Answer
The LLM now has both weather results and generates a natural comparison:
"Right now, Tokyo is 22°C with partly cloudy skies, while Berlin is 8°C and raining. Tokyo is 14 degrees warmer. If you're choosing between the two today, Tokyo has the better weather."
The Complete Flow
```text
User: "Compare weather in Tokyo and Berlin"
  │
  ▼
LLM (Turn 1): I need weather for both cities
  │
  ├─→ tool_call: get_weather(city="Tokyo")  ──→ Your code calls API ──→ {"temp": 22, ...}
  │
  └─→ tool_call: get_weather(city="Berlin") ──→ Your code calls API ──→ {"temp": 8, ...}
  │
  ▼
LLM (Turn 2): Now I have both results
  │
  └─→ "Tokyo is 22°C, Berlin is 8°C. Tokyo is 14 degrees warmer..."
```
Parallel vs Sequential Tool Calls
Parallel (what happened above): The LLM returns multiple tool_calls in a single response. Both calls are independent — your code can execute them concurrently:
```python
import asyncio

# execute_single_tool is your own async function that runs one tool call
# (e.g. an awaited HTTP request to the weather API) and returns its result
async def execute_tools_parallel(tool_calls):
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    return await asyncio.gather(*tasks)
```
Sequential: Sometimes the LLM needs the result of one call before making the next. For example: "What's the weather in the capital of France?"
```text
Turn 1: LLM calls get_capital(country="France")
        → Your code returns "Paris"
Turn 2: LLM calls get_weather(city="Paris")
        → Your code returns weather data
Turn 3: LLM generates final answer
```
The LLM decides which pattern to use based on whether the calls depend on each other.
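Both patterns fall out of the same loop: keep feeding tool results back until the model answers in text. A minimal sketch using plain dicts for messages (`call_model`, `run_agent`, and the stub handlers are placeholders, not part of any real SDK):

```python
import json

# Stub handlers standing in for real APIs
TOOL_HANDLERS = {
    "get_capital": lambda country: {"capital": "Paris"},
    "get_weather": lambda city: {"city": city, "temp": 15},
}

def run_agent(messages: list, call_model, max_turns: int = 5) -> str:
    """Loop until the model stops asking for tools and returns text."""
    for _ in range(max_turns):
        message = call_model(messages)  # stands in for client.chat.completions.create
        if not message.get("tool_calls"):
            return message["content"]   # plain text answer — done
        messages.append(message)
        for tc in message["tool_calls"]:  # handles 1..N calls per turn
            args = json.loads(tc["arguments"])
            result = TOOL_HANDLERS[tc["name"]](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("agent did not finish within max_turns")
```

Parallel calls just mean `tool_calls` has several entries in one iteration; sequential calls mean the loop runs more than once.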
Part 4: Different Tools in One Turn
The LLM can also call different tools in the same turn. Suppose you define two tools:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_exchange_rate",
            "description": "Get currency exchange rate",
            "parameters": {
                "type": "object",
                "properties": {
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["from_currency", "to_currency"]
            }
        }
    }
]
```
If the user asks: "I'm traveling from NYC to Tokyo next week. What's the weather like and how much is 1 USD in Yen?"
The LLM returns two different tool calls in one turn:
```python
# tool_calls[0]
{"name": "get_weather", "arguments": '{"city": "Tokyo"}'}

# tool_calls[1]
{"name": "get_exchange_rate", "arguments": '{"from_currency": "USD", "to_currency": "JPY"}'}
```
Your code routes each call to the right function:
```python
tool_handlers = {
    "get_weather": handle_weather,
    "get_exchange_rate": handle_exchange_rate,
}

for tc in message.tool_calls:
    handler = tool_handlers[tc.function.name]
    args = json.loads(tc.function.arguments)
    result = handler(**args)
    # ... send result back
```
This is essentially the registry pattern — a dictionary maps function names to handlers. No if/else chains needed.
What Controls This Behavior?
The tool_choice parameter controls whether and how the LLM uses tools:
| `tool_choice` | Behavior | Use Case |
|---|---|---|
| `"auto"` | LLM decides whether to call tools or respond with text | General-purpose agents |
| `"required"` | LLM must call at least one tool | When you always need structured output |
| `{"type": "function", "function": {"name": "..."}}` | LLM must call this specific function | Email classification (always classify) |
| `"none"` | LLM cannot call any tools | Force a text-only response |
For the weather comparison, we use "auto" — the LLM decides on its own that it needs to call get_weather twice.
Key Takeaways
- Function calling > JSON mode > plain text for getting structured data from LLMs. Function calling enforces your schema at the token-generation level, not just via prompt instructions.
- LLMs still generate tokens — they don't natively return dicts. The API layer applies constrained decoding to ensure the token output matches your schema, then you `json.loads()` the resulting string.
- One question can trigger multiple tool calls. The LLM decides whether to call the same tool with different arguments (Tokyo + Berlin) or different tools entirely (weather + exchange rate) — all in a single turn.
- Parallel vs sequential is decided by the LLM. Independent calls (two cities) come back in one turn. Dependent calls (get capital → get weather) happen across multiple turns.
- Route tool calls with a registry, not if/else. A dictionary mapping function names to handlers keeps your code clean and extensible.