Prompt Engineering Techniques: Chain of Thought and Tool Use
Chain of Thought (CoT) and tool use are advanced prompt engineering techniques that dramatically improve reasoning and reduce hallucinations on complex tasks. Chain of Thought instructs the model to show its reasoning steps before arriving at a final answer. Instead of asking directly "What is the result of this calculation?", you prompt "Let's think step by step," and the model generates intermediate reasoning before the answer. This technique improves accuracy on multi-step reasoning tasks by 15 to 35 percent compared to direct prompting, but it comes with trade-offs.
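As a minimal sketch of the difference, in Python, with call_model as a hypothetical placeholder for whatever LLM client you use (not a real library call): the only change between direct and Chain of Thought prompting is the instruction appended to the question.

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client (OpenAI, Anthropic, Gemini, etc.).
    raise NotImplementedError("wire this to your model provider")

question = ("A store sells pens in packs of 12. "
            "If I buy 7 packs and give away 23 pens, how many pens remain?")

# Direct prompt: a short answer, no visible reasoning.
direct_prompt = f"{question}\nAnswer with just the final number."

# Chain of Thought prompt: several times more output tokens, but markedly
# better on multi-step problems because the model works through the steps.
cot_prompt = f"{question}\nLet's think step by step, then give the final answer on the last line."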
The primary cost of Chain of Thought is token inflation. A simple question that would take 50 tokens to answer directly might require 200 to 400 tokens once intermediate reasoning is included, which increases both latency and cost proportionally. In production, teams often use hidden scratchpads, where the reasoning tokens are generated but not returned to the user, or self-consistency sampling, where the model generates 2 to 3 reasoning paths and the system selects the most common answer. Self-consistency improves accuracy by another 5 to 12 percent but multiplies compute cost by the number of samples.
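A hedged sketch of self-consistency sampling, assuming the hypothetical call_model helper from the previous sketch now accepts a temperature argument: sample several reasoning paths, pull out each final answer, and keep the majority answer. Because only the extracted answer is returned, this also doubles as a hidden scratchpad.

from collections import Counter

def extract_final_answer(reasoning_text: str) -> str:
    # Naive extraction: take the last non-empty line, where the prompt asked
    # the model to place its final answer.
    lines = [line.strip() for line in reasoning_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def self_consistent_answer(question: str, n_samples: int = 3) -> str:
    # Sample n reasoning paths at temperature > 0 so the paths differ,
    # then return the answer the most paths agree on (majority vote).
    prompt = f"{question}\nLet's think step by step, then give the final answer on the last line."
    answers = [
        extract_final_answer(call_model(prompt, temperature=0.8))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]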
Tool use, also called function calling, directs the model to invoke external functions with structured arguments instead of generating answers from parametric knowledge alone. When a user asks "What is the weather in Seattle?", the model generates a function call like get_weather(location="Seattle") rather than hallucinating an answer. The system executes the function, retrieves real data, and the model incorporates the result into its response. OpenAI and Google expose tool use patterns where you define function schemas in the prompt and the model returns JSON with function names and arguments.
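The sketch below shows this schema-plus-JSON pattern without tying it to any particular vendor SDK: the schema mirrors the JSON-Schema style commonly used for function calling, get_weather is a stand-in for a real weather API, and the model's reply is assumed to arrive as a JSON object with "name" and "arguments" fields.

import json

# Function schema included in the prompt (or tool list) so the model knows what it may call.
weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. Seattle"},
        },
        "required": ["location"],
    },
}

def get_weather(location: str) -> dict:
    # Stand-in for a real weather API call.
    return {"location": location, "temperature_c": 14, "conditions": "rain"}

# Suppose the model replied with a function call instead of a free-text answer.
model_reply = '{"name": "get_weather", "arguments": {"location": "Seattle"}}'

call = json.loads(model_reply)
if call["name"] == "get_weather":
    result = get_weather(**call["arguments"])
    # result is passed back to the model, which writes the final user-facing answer.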
Tool use is particularly powerful for structured tasks and reduces hallucinations by grounding responses in real data. For tasks like database queries, API calls, or calculator operations, tool use can improve task success rates from 60 to 70 percent with pure prompting to over 90 percent. The trade-off is added complexity. You must define schemas, validate arguments, handle execution failures, and secure function calls to prevent abuse. Anthropic's Claude and Meta's Llama models support similar patterns, and production systems typically constrain which tools are available based on user permissions and task context.
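An illustrative sketch of those production concerns, building on the weather example above; TOOL_REGISTRY and dispatch_tool_call are invented names for this example, not part of any vendor API. It gates tools by user role, validates arguments against the schema, and wraps execution so failures surface as structured errors rather than crashes.

# Hypothetical registry mapping each tool to its handler, schema, and allowed roles.
TOOL_REGISTRY = {
    "get_weather": {"handler": get_weather, "schema": weather_schema, "roles": {"user", "admin"}},
}

def dispatch_tool_call(model_reply: str, user_role: str) -> dict:
    # Never trust a model-issued call blindly: parse, authorize, validate, then execute.
    try:
        call = json.loads(model_reply)
    except json.JSONDecodeError:
        return {"error": "model did not return valid JSON"}

    entry = TOOL_REGISTRY.get(call.get("name"))
    if entry is None or user_role not in entry["roles"]:
        return {"error": "tool not available for this user"}

    params = entry["schema"]["parameters"]
    args = call.get("arguments", {})
    unknown = set(args) - set(params["properties"])
    missing = set(params.get("required", [])) - set(args)
    if unknown or missing:
        return {"error": "arguments do not match the tool schema"}

    try:
        return {"result": entry["handler"](**args)}
    except Exception as exc:  # execution failure handling
        return {"error": f"tool execution failed: {exc}"}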
💡 Key Takeaways
•Chain of Thought improves multi-step reasoning accuracy by 15 to 35 percent but inflates a roughly 50-token direct answer to 200 to 400 tokens of reasoning, proportionally raising latency and cost
•Self-consistency sampling generates 2 to 3 reasoning paths and selects the most common answer, improving accuracy by another 5 to 12 percent at the cost of multiplying compute by the sample count
•Tool use allows models to invoke external functions with structured arguments, improving task success rates from 60 to 70 percent to over 90 percent for structured tasks like database queries
•Hidden scratchpads generate reasoning tokens internally without returning them to users, reducing perceived latency and preventing leakage of internal model rationales
•Production tool use requires schema definitions, argument validation, execution failure handling, and permission-based access controls to prevent abuse and secure function calls
📌 Examples
OpenAI GPT-4 function calling defines a get_current_weather function schema with parameters like location and unit, and the model returns JSON {"name": "get_current_weather", "arguments": {"location": "Boston", "unit": "celsius"}}
Google's Gemini API uses tool use to call a database query function when asked about inventory, returning SELECT * FROM products WHERE stock > 0 as a structured query that the system executes safely
Anthropic Claude with Chain of Thought generates 280 tokens of reasoning for a complex logic puzzle that would be answered incorrectly in 45 tokens with direct prompting, improving accuracy from 55 to 82 percent