This tutorial demonstrates how to evaluate agent trajectories: the sequences of tool calls, reasoning steps, and decisions that agents make to complete tasks. You will learn to evaluate not just what agents produce, but how they reason and execute.
- Understanding agent trajectories (tool calls, reasoning steps, sequences)
- Visualizing execution trajectories (Session → Trace → Spans)
- Using TrajectoryEvaluator with custom rubrics
- Evaluating optimal, suboptimal, and incorrect trajectories
- Implementing trajectory scoring functions (exact_match, in_order, any_order)
- Analyzing HOW agents think, not just WHAT they produce
- Python 3.11 or higher
- AWS account with Amazon Bedrock access
- Anthropic Claude Sonnet 4 enabled on Amazon Bedrock
- IAM permissions for Amazon Bedrock API access
- Basic understanding of AI agents and multi-agent systems
- Familiarity with Jupyter notebooks
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
Ensure your AWS credentials are configured with access to Amazon Bedrock:
# Option 1: Configure AWS CLI
aws configure
# Option 2: Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
# Option 3: Use AWS credentials file (~/.aws/credentials)
Ensure you have access to the required model:
- Model: us.anthropic.claude-sonnet-4-20250514-v1:0
- Service: Amazon Bedrock
- Required permissions: bedrock:InvokeModel
# Ensure virtual environment is activated
source venv/bin/activate # macOS/Linux
# or
venv\Scripts\activate # Windows
# Start Jupyter
jupyter notebook
Navigate to 04-trajectory-evaluation.ipynb in the Jupyter interface.
Run cells sequentially from top to bottom:
- Use Shift + Enter to execute each cell
- Review outputs and visualizations
- Experiment with modifying parameters
- What is a trajectory?
- Why evaluate trajectories?
- Trajectory hierarchy (Session → Trace → Spans)
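The Session → Trace → Spans hierarchy can be pictured as nested containers. The sketch below is only an illustration of that nesting: the class names, the `kind` labels, and the `tool_trajectory` helper are assumptions made for this example, not types from the Strands libraries.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work inside a trace, such as a tool call or a model inference."""
    name: str
    kind: str  # hypothetical labels for this sketch: "tool", "llm", or "agent"


@dataclass
class Trace:
    """One end-to-end request, made up of ordered spans."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    """A conversation or run that groups related traces."""
    session_id: str
    traces: list[Trace] = field(default_factory=list)


def tool_trajectory(session: Session) -> list[str]:
    """Flatten a session into the ordered list of tool names that were called."""
    return [
        span.name
        for trace in session.traces
        for span in trace.spans
        if span.kind == "tool"
    ]


# Illustrative data only; the tool names are made up for this example.
session = Session(
    session_id="demo-session",
    traces=[
        Trace(
            trace_id="trace-1",
            spans=[
                Span("plan_request", "llm"),
                Span("get_market_data", "tool"),
                Span("assess_risk", "tool"),
            ],
        )
    ],
)
print(tool_trajectory(session))
```

Flattening the spans into an ordered list of tool names gives the kind of sequence that trajectory scoring functions compare against an expected path.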
- Financial advisor agent
- Technical architect agent
- Market researcher agent
- Risk analyst agent
- Graph-based orchestration
- exact_match_scorer: Strict sequence matching
- in_order_match_scorer: Allows extra steps
- any_order_match_scorer: Ignores order
Optimal: Correct tools in correct order
- Expected execution path
- Maximum efficiency
- Perfect scores
Suboptimal: Extra or redundant steps
- Redundant tool calls
- Correct output but inefficient
- Increased costs and latency
Incorrect: Wrong tools or wrong order
- Missing critical steps
- Wrong execution sequence
- Incomplete analysis
- Custom rubric definition
- LLM-based trajectory assessment
- Automated evaluation workflow
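To make the idea of a custom rubric concrete, here is a minimal sketch of how a rubric and a recorded trajectory might be combined into a prompt for an LLM judge. It does not call TrajectoryEvaluator or any Strands API; `RUBRIC`, `build_judge_prompt`, and the tool names are hypothetical and exist only for this illustration.

```python
# Hypothetical rubric text; adapt the criteria to your own agent.
RUBRIC = """Score the trajectory from 0.0 to 1.0:
- 1.0: all required tools were called in a sensible order with no redundant calls
- 0.5: the required tools were called, but with redundant or out-of-order steps
- 0.0: one or more required tools are missing"""


def build_judge_prompt(rubric: str, task: str, trajectory: list[str]) -> str:
    """Assemble a plain-text prompt that asks a judge model to grade a trajectory."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(trajectory, start=1))
    return (
        f"Task given to the agent:\n{task}\n\n"
        f"Tool calls made, in order:\n{steps}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Return a score and a one-sentence explanation."
    )


prompt = build_judge_prompt(
    RUBRIC,
    task="Assess the risk of a new product launch.",
    trajectory=["get_market_data", "assess_risk", "generate_report"],
)
print(prompt)
```

Keeping the rubric as explicit score bands makes the judge's explanations easier to compare across scenarios.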
- When to evaluate trajectories
- Choosing evaluation criteria
- Combining trajectory and output evaluation
- Production monitoring strategies
- Trajectory sequences for each scenario
- Scoring results (exact, in-order, any-order)
- Efficiency calculations
- Performance metrics
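The exact efficiency formula used in the notebook is not reproduced here. One plausible definition, assumed for this sketch, is the ratio of expected steps to actual steps, capped at 1.0. The function name `trajectory_efficiency` is hypothetical.

```python
def trajectory_efficiency(expected: list[str], actual: list[str]) -> float:
    """Ratio of expected step count to actual step count, capped at 1.0.

    Assumed definition for illustration: extra or repeated steps lower the
    score, while a trajectory no longer than expected scores 1.0.
    """
    if not actual:
        return 0.0  # nothing was executed, so there is no efficiency to report
    return min(1.0, len(expected) / len(actual))


expected = ["get_market_data", "assess_risk", "generate_report"]
redundant = [
    "get_market_data",
    "get_market_data",
    "assess_risk",
    "assess_risk",
    "generate_report",
]
print(trajectory_efficiency(expected, redundant))
```

Because this measure only counts steps, a trajectory that skips required tools can still score 1.0. It is best read alongside a correctness score rather than on its own.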
- TrajectoryEvaluator scores with explanations
- Comparative analysis across scenarios
- Detailed trajectory visualizations
- Actionable improvement recommendations
- Understanding optimal execution paths
- Identifying inefficiencies and redundancies
- Detecting missing or incorrect steps
- Optimization opportunities
Error: NoCredentialsError or Unable to locate credentials
Solution:
# Configure AWS credentials
aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1
Error: AccessDeniedException or Model not found
Solution:
- Verify model access in the Amazon Bedrock console
- Enable the Claude Sonnet 4 model in your region
- Check that IAM permissions include bedrock:InvokeModel
- Confirm you're using the correct model ID
Error: ModuleNotFoundError: No module named 'strands'
Solution:
# Ensure virtual environment is activated
source venv/bin/activate # macOS/Linux
# Reinstall dependencies
pip install -r requirements.txt
# Verify installations
pip list | grep strands
Error: Kernel not available or not connecting
Solution:
# Install ipykernel in virtual environment
pip install ipykernel
# Register kernel
python -m ipykernel install --user --name=venv
# Select this kernel in Jupyter
Error: AttributeError: 'GraphResult' object has no attribute 'execution_order'
Solution:
- Ensure you're using the latest version of strands-agents
- Check that the graph is properly built and executed
- Verify the result object is from a graph execution
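A defensive check can turn this AttributeError into a clearer message. The sketch below assumes the result exposes `execution_order` as a list of nodes that each carry a `node_id`. The helper name `get_execution_order` is hypothetical, and the stand-in object exists only so the example runs without calling a model.

```python
from types import SimpleNamespace


def get_execution_order(result) -> list[str]:
    """Return node identifiers from a graph result, or fail with a clear message."""
    order = getattr(result, "execution_order", None)
    if order is None:
        raise TypeError(
            f"{type(result).__name__} has no 'execution_order'; "
            "check that it came from a graph execution and that "
            "strands-agents is up to date."
        )
    # Assumption: each node exposes `node_id`; fall back to str() otherwise.
    return [getattr(node, "node_id", str(node)) for node in order]


# Stand-in for a real graph result, used only for illustration.
fake_result = SimpleNamespace(
    execution_order=[
        SimpleNamespace(node_id="market_researcher"),
        SimpleNamespace(node_id="risk_analyst"),
    ]
)
print(get_execution_order(fake_result))
```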
Problem: Cells take a long time to execute
Solution:
- Graph agent executions involve multiple LLM calls (expected behavior)
- Each agent in the graph makes its own inference
- Consider using a smaller model for testing
- Reduce the number of test cases if experimenting
The complete sequence of actions (tool calls, reasoning steps, decisions) an agent takes to solve a task.
Assessing the quality, efficiency, and correctness of an agent's execution path, not just its final output.
Mathematical functions that compare expected vs actual trajectories:
- Exact Match: Binary score (1.0 or 0.0) for exact sequence match
- In-Order Match: Proportion of expected steps present in correct order
- Any-Order Match: Proportion of expected steps present regardless of order
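The three definitions above can be sketched in a few lines of Python. These are not the notebook's implementations: the function names are chosen for this example, in-order matching is interpreted here as longest common subsequence length divided by the number of expected steps, and any-order matching as multiset overlap.

```python
from collections import Counter


def exact_match(expected: list[str], actual: list[str]) -> float:
    """1.0 only when both sequences are identical, otherwise 0.0."""
    return 1.0 if list(expected) == list(actual) else 0.0


def in_order_match(expected: list[str], actual: list[str]) -> float:
    """Share of expected steps that appear in the same relative order in actual."""
    if not expected:
        return 1.0
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(actual) + 1)
    for exp_step in expected:
        curr = [0]
        for j, act_step in enumerate(actual, start=1):
            if exp_step == act_step:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(expected)


def any_order_match(expected: list[str], actual: list[str]) -> float:
    """Share of expected steps that appear anywhere in actual, ignoring order."""
    if not expected:
        return 1.0
    overlap = Counter(expected) & Counter(actual)  # multiset intersection
    return sum(overlap.values()) / len(expected)


# Illustrative tool names; not taken from the notebook.
expected = ["get_market_data", "assess_risk", "generate_report"]
scenarios = {
    "optimal": ["get_market_data", "assess_risk", "generate_report"],
    "suboptimal": ["get_market_data", "get_market_data", "assess_risk", "generate_report"],
    "incorrect": ["generate_report", "get_market_data"],
}
for name, actual in scenarios.items():
    scores = (
        exact_match(expected, actual),
        in_order_match(expected, actual),
        any_order_match(expected, actual),
    )
    print(name, [round(score, 2) for score in scores])
```

With these inputs, the optimal trajectory scores 1.0 on all three measures. The suboptimal one fails exact match but scores 1.0 on the other two. The incorrect one scores 0.0, about 0.33, and about 0.67 respectively.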
In multi-agent systems, trajectories show the coordination and handoffs between agents, revealing collaboration patterns.
- Tutorial 01: Built-in Evaluators - Foundation concepts
- Tutorial 02: Custom Evaluators - Creating domain-specific metrics
- Tutorial 03: Dataset Generation - Automated test case creation
- Tutorial 05: Multi-turn Evaluation - Actor simulation
- Tutorial 06: Multi-Agent Evaluation - Evaluating agent collaboration
- Combine trajectory and output evaluation for complete assessment
- Use exact_match for compliance-critical applications
- Use in_order_match for general correctness with flexibility
- Use any_order_match for independent parallel operations
- Monitor trajectory metrics in production
- Set efficiency benchmarks and track improvements
For issues or questions:
- Check the troubleshooting section above
- Review Strands documentation
- Verify AWS and Bedrock configuration
- Check GitHub issues for similar problems
- Contact AWS support for Bedrock-specific issues
This tutorial is provided as-is for educational purposes.
- Tutorial Version: 1.0
- Last Updated: 2025-11-25
- Compatible with: strands-agents >= 0.1.0, strands-evals >= 0.1.0