trajectory-evaluation

Tutorial 04: Trajectory Evaluation

Overview

This tutorial demonstrates how to evaluate agent trajectories - the sequences of tool calls, reasoning steps, and decisions agents make to complete tasks. Learn to evaluate not just what agents produce, but how they think and execute.

What You'll Learn

Understanding agent trajectories (tool calls, reasoning steps, sequences)
Visualizing execution trajectories (Session → Trace → Spans)
Using TrajectoryEvaluator with custom rubrics
Evaluating optimal, suboptimal, and incorrect trajectories
Implementing trajectory scoring functions (exact_match, in_order, any_order)
Analyzing HOW agents think, not just WHAT they produce

Prerequisites

Python 3.11 or higher
AWS account with Amazon Bedrock access
Anthropic Claude 3.7 Sonnet enabled on Amazon Bedrock
IAM permissions for Amazon Bedrock API access
Basic understanding of AI agents and multi-agent systems
Familiarity with Jupyter notebooks

Installation

1. Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

2. Install Dependencies

# Install required packages
pip install -r requirements.txt

3. Configure AWS Credentials

Ensure your AWS credentials are configured with access to Amazon Bedrock:

# Option 1: Configure AWS CLI
aws configure

# Option 2: Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

# Option 3: Use AWS credentials file (~/.aws/credentials)

4. Verify Amazon Bedrock Access

Ensure you have access to the required model:

Model: us.anthropic.claude-sonnet-4-0-20250514-v1:0
Service: Amazon Bedrock
Required permissions: bedrock:InvokeModel

Running the Tutorial

1. Start Jupyter Notebook

# Ensure virtual environment is activated
source venv/bin/activate  # macOS/Linux
# or
venv\Scripts\activate  # Windows

# Start Jupyter
jupyter notebook

2. Open the Notebook

Navigate to 04-trajectory-evaluation.ipynb in the Jupyter interface.

3. Execute Cells

Run cells sequentially from top to bottom:

Use Shift + Enter to execute each cell
Review outputs and visualizations
Experiment with modifying parameters

Tutorial Structure

1. Introduction to Trajectories

What is a trajectory?
Why evaluate trajectories?
Trajectory hierarchy (Session → Trace → Spans)

2. Multi-Agent System Setup

Financial advisor agent
Technical architect agent
Market researcher agent
Risk analyst agent
Graph-based orchestration

3. Trajectory Scoring Functions

exact_match_scorer: Strict sequence matching
in_order_match_scorer: Allows extra steps
any_order_match_scorer: Ignores order

4. Three Trajectory Scenarios

Optimal: Correct tools in correct order

Expected execution path
Maximum efficiency
Perfect scores

Suboptimal: Extra or redundant steps

Redundant tool calls
Correct output but inefficient
Increased costs and latency

Incorrect: Wrong tools or wrong order

Missing critical steps
Wrong execution sequence
Incomplete analysis

5. TrajectoryEvaluator

Custom rubric definition
LLM-based trajectory assessment
Automated evaluation workflow

6. Best Practices

When to evaluate trajectories
Choosing evaluation criteria
Combining trajectory and output evaluation
Production monitoring strategies

Expected Outputs

Execution Metrics

Trajectory sequences for each scenario
Scoring results (exact, in-order, any-order)
Efficiency calculations
Performance metrics

Evaluation Results

TrajectoryEvaluator scores with explanations
Comparative analysis across scenarios
Detailed trajectory visualizations
Actionable improvement recommendations

Key Insights

Understanding optimal execution paths
Identifying inefficiencies and redundancies
Detecting missing or incorrect steps
Optimization opportunities

Troubleshooting

Issue: AWS Credentials Not Found

Error: NoCredentialsError or Unable to locate credentials

Solution:

# Configure AWS credentials
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1

Issue: Model Access Denied

Error: AccessDeniedException or Model not found

Solution:

Verify model access in Amazon Bedrock console
Enable Claude Sonnet 4.0 model in your region
Check IAM permissions include bedrock:InvokeModel
Confirm you're using the correct model ID

Issue: Import Errors

Error: ModuleNotFoundError: No module named 'strands'

Solution:

# Ensure virtual environment is activated
source venv/bin/activate  # macOS/Linux

# Reinstall dependencies
pip install -r requirements.txt

# Verify installations
pip list | grep strands

Issue: Jupyter Kernel Not Found

Error: Kernel not available or not connecting

Solution:

# Install ipykernel in virtual environment
pip install ipykernel

# Register kernel
python -m ipykernel install --user --name=venv

# Select this kernel in Jupyter

Issue: Trajectory Extraction Fails

Error: AttributeError: 'GraphResult' object has no attribute 'execution_order'

Solution:

Ensure you're using the latest version of strands-agents
Check that the graph is properly built and executed
Verify the result object is from a graph execution

Issue: Slow Execution

Problem: Cells take a long time to execute

Solution:

Graph agent executions involve multiple LLM calls (expected behavior)
Each agent in the graph makes its own inference
Consider using a smaller model for testing
Reduce the number of test cases if experimenting

Key Concepts

Trajectory

The complete sequence of actions (tool calls, reasoning steps, decisions) an agent takes to solve a task.