
Embedder Lambda

This service computes embeddings for images uploaded to the Grid and writes those embeddings to the S3 Vector Store.

Documentation

Running Locally with Localstack

The ImageEmbedder lambda can be run locally: the localRun script polls the localstack SQS queue and executes the lambda code for each message it receives.

What Actually Runs

The local runner:

  • Polls the localstack SQS queue for messages
  • Executes the lambda handler code
  • Fetches images from localstack S3
  • ⚠️ Calls AWS Bedrock for embeddings (requires AWS credentials)
  • ⚠️ Stores vectors in AWS S3 Vectors (requires AWS credentials)

Important: Bedrock and S3 Vectors are not available in localstack, so the lambda will connect to AWS services. Make sure you have AWS credentials configured.

Prerequisites

  1. Core CloudFormation stack deployed to localstack (run dev/script/setup.sh)
  2. Localstack running via dev/script/start.sh
  3. Dependencies installed: npm install
  4. AWS credentials configured for Bedrock and S3 Vectors access

Start the Local Runner

npm run local

This will:

  • Build the lambda code
  • Start polling the localstack SQS queue (image-embedder-DEV)
  • Execute the lambda handler when messages arrive
  • Delete messages from the queue on success
  • Allow failed messages to retry or go to the DLQ
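
The delete-on-success behaviour above can be sketched roughly as follows. This is a minimal illustration, not the actual localRun script: `QueueClient`, `QueueMessage`, `Handler` and `pollOnce` are hypothetical names, and the queue is injected as an interface so the sketch does not depend on any particular SDK.

```typescript
// Hypothetical shapes: the real localRun script and handler signature may differ.
interface QueueMessage {
  receiptHandle: string;
  body: string;
}

interface QueueClient {
  receive(): Promise<QueueMessage[]>;
  remove(receiptHandle: string): Promise<void>;
}

type Handler = (body: string) => Promise<void>;

// Processes one batch of messages. A message is deleted only after the handler
// succeeds; a failed message is left on the queue so that SQS makes it visible
// again and, after enough receives, moves it to the DLQ (if one is configured).
async function pollOnce(
  queue: QueueClient,
  handler: Handler
): Promise<{ succeeded: number; failed: number }> {
  const messages = await queue.receive();
  let succeeded = 0;
  let failed = 0;
  for (const message of messages) {
    try {
      await handler(message.body);
      await queue.remove(message.receiptHandle);
      succeeded += 1;
    } catch {
      // Handler or delete failed: the message is not removed and will be retried.
      failed += 1;
    }
  }
  return { succeeded, failed };
}
```

A runner would call something like `pollOnce` repeatedly, waiting the poll interval between calls.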

Configuration

Environment variables you can set:

  • LOCALSTACK_ENDPOINT - Localstack endpoint (default: http://localhost:4566)
  • QUEUE_URL - Full SQS queue URL (default: http://localhost:4566/000000000000/image-embedder-DEV)
  • POLL_INTERVAL_MS - How often to poll for messages (default: 5000)
  • AWS_PROFILE - AWS credentials profile to use for Bedrock/S3 Vectors

Example:

AWS_PROFILE=media-service POLL_INTERVAL_MS=2000 npm run local
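
For illustration, the documented defaults could be resolved along these lines. This is a sketch under assumptions: `RunnerConfig` and `loadConfig` are hypothetical names, and the fallback for an unparseable POLL_INTERVAL_MS is a choice made here, not documented behaviour. AWS_PROFILE is not handled because it is assumed to be picked up by the AWS SDK's own credential resolution.

```typescript
// Hypothetical config shape; the real localRun script may organise this differently.
interface RunnerConfig {
  localstackEndpoint: string;
  queueUrl: string;
  pollIntervalMs: number;
}

const DEFAULT_ENDPOINT = "http://localhost:4566";
const DEFAULT_QUEUE_URL = "http://localhost:4566/000000000000/image-embedder-DEV";
const DEFAULT_POLL_INTERVAL_MS = 5000;

// Reads the documented environment variables, falling back to the documented defaults.
function loadConfig(env: Record<string, string | undefined>): RunnerConfig {
  const rawInterval = env["POLL_INTERVAL_MS"];
  const parsedInterval = rawInterval === undefined ? NaN : Number.parseInt(rawInterval, 10);
  return {
    localstackEndpoint: env["LOCALSTACK_ENDPOINT"] ?? DEFAULT_ENDPOINT,
    queueUrl: env["QUEUE_URL"] ?? DEFAULT_QUEUE_URL,
    // Assumption: a missing, non-numeric or non-positive value falls back to the default.
    pollIntervalMs:
      Number.isNaN(parsedInterval) || parsedInterval <= 0
        ? DEFAULT_POLL_INTERVAL_MS
        : parsedInterval,
  };
}
```

In a Node process this would typically be called as `loadConfig(process.env)`.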

Testing

To test the lambda locally, first get the name of the image bucket within localstack (its final part is autogenerated when creating the stack):

LOCALSTACK_IMAGE_BUCKET=$(aws cloudformation describe-stack-resources \
  --stack-name grid-dev-core \
  --endpoint-url http://localhost:4566 \
  --profile media-service \
  --region eu-west-1 \
  | jq -r '.StackResources[] | select(.LogicalResourceId == "ImageBucket") | .PhysicalResourceId')

TODO: document how to get an image into the bucket. The easiest way is to run the app and upload an image, but that should trigger the lambda anyway, so it is unclear what manually putting a message on the queue adds.

Then send a message to the queue:

aws sqs send-message \
  --queue-url http://localhost:4566/000000000000/image-embedder-DEV \
  --message-body "{\"imageId\":\"test-123\",\"s3Bucket\":\"$LOCALSTACK_IMAGE_BUCKET\",\"s3Key\":\"test/image.jpg\",\"fileType\":\"image/jpeg\"}" \
  --endpoint-url http://localhost:4566 \
  --profile media-service \
  --region eu-west-1

The lambda will:

  1. Receive the message from localstack
  2. Try to fetch the image from the S3 bucket within localstack
  3. Send the image to AWS Bedrock (the real thing, not localstack) for embedding
  4. Store the embedding in AWS S3 Vectors (the real thing, not localstack)
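
The message body used above has four string fields. As a rough sketch, it could be typed and validated like this. The field names come from the send-message example; `EmbedderMessage`, `buildMessageBody` and `parseMessageBody` are hypothetical names, and whether the real handler validates its input this way is an assumption.

```typescript
// Field names taken from the send-message example; the type name is hypothetical.
interface EmbedderMessage {
  imageId: string;
  s3Bucket: string;
  s3Key: string;
  fileType: string;
}

// Serialises a message to the JSON string passed as --message-body.
function buildMessageBody(message: EmbedderMessage): string {
  return JSON.stringify(message);
}

// Parses a message body, rejecting anything that lacks a required string field.
function parseMessageBody(body: string): EmbedderMessage {
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Message body must be a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const read = (field: string): string => {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`Missing or invalid field: ${field}`);
    }
    return value;
  };
  return {
    imageId: read("imageId"),
    s3Bucket: read("s3Bucket"),
    s3Key: read("s3Key"),
    fileType: read("fileType"),
  };
}
```

Building the body with `JSON.stringify` avoids the hand-escaped quotes needed when writing the JSON inline in a shell command.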

Exercising the real lambda in AWS TEST

scripts/embedder-deploy/send-message.sh \
  ce05aa50f210261bb2b830daf6c5ce6f11a16f84 \
  image/jpeg \
  media-service-test-imagebucket-1qt2lbcwnpgl0 \
  c/e/0/5/a/a/ce05aa50f210261bb2b830daf6c5ce6f11a16f84

scripts/embedder-deploy/send-message-batch.sh \
  ce05aa50f210261bb2b830daf6c5ce6f11a16f84 \
  image/jpeg \
  media-service-test-imagebucket-1qt2lbcwnpgl0 \
  c/e/0/5/a/a/ce05aa50f210261bb2b830daf6c5ce6f11a16f84
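
In the commands above, the S3 key appears to be the first six characters of the image ID, one per path segment, followed by the full ID. A small helper for building such keys might look like the following. The pattern is inferred from this single example only, and `s3KeyForImageId` is a hypothetical name, so check it against the real bucket layout before relying on it.

```typescript
// Builds a key such as c/e/0/5/a/a/<imageId> from an image ID.
// Assumption: the key layout is inferred from the example in this README.
function s3KeyForImageId(imageId: string, prefixLength = 6): string {
  if (imageId.length < prefixLength) {
    throw new Error(`Image ID must have at least ${prefixLength} characters`);
  }
  const prefix = imageId.slice(0, prefixLength).split("").join("/");
  return `${prefix}/${imageId}`;
}
```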

Integration Tests

Run manually (not in CI) with npm run test:integration. Requires AWS credentials with S3 and Bedrock access.

  • Input images are downloaded from S3 once and cached locally in test-data/input/
  • Output images (e.g. downscaled) are written to test-data/output/
  • Both directories are gitignored; having files locally allows quick visual inspection to check images haven't been mangled

Backfilling lambda

To process all the documents that existed before the embedding lambda was introduced, we need a system that picks documents and sends them for processing. This lambda shares the same CDK stack as the main embedding lambda.

To identify images that need processing, the lambda queries Elasticsearch for documents that:

  • have no embeddings
  • have not been soft-deleted

Candidates are selected at random to limit the risk of processing the same document twice (though that can still happen) while keeping the lambda stateless.
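
The selection could be expressed as an Elasticsearch query along these lines. This is a sketch only: the field names `embedding` and `softDeletedMetadata` are assumptions about the index mapping, `buildBackfillQuery` is a hypothetical name, and the real lambda may build its query differently.

```typescript
// Hypothetical field names; the real index mapping may use different ones.
const EMBEDDING_FIELD = "embedding";
const SOFT_DELETED_FIELD = "softDeletedMetadata";

// Builds a search body that matches documents with no embedding and no
// soft-delete marker, ordered by a random score so that each run picks a
// different sample of candidates without storing any state.
function buildBackfillQuery(batchSize: number) {
  return {
    size: batchSize,
    query: {
      function_score: {
        query: {
          bool: {
            must_not: [
              { exists: { field: EMBEDDING_FIELD } },
              { exists: { field: SOFT_DELETED_FIELD } },
            ],
          },
        },
        // An empty random_score assigns each matching document a random score.
        random_score: {},
        // Replace the base query score so ordering depends only on the random score.
        boost_mode: "replace",
      },
    },
  };
}
```

Because nothing records which documents a previous run picked, two runs can select the same document before its embedding has been written.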

It is scheduled to run regularly (frequency TBD, depending on token limits and the amount per image).