Beetroot

Photo Idempotency in DynamoDB

Create a stable photoId from S3 uploads and write exactly one record per photo into DynamoDB.

Goal

When a photo is uploaded to s3://beetroot-raw/photos-raw/, the ingestion Lambda should:

  1. Extract bucket and key from the S3 event
  2. Ignore any uploads not under photos-raw/
  3. Compute a stable photoId
  4. Write one item into the Photos table
  5. Avoid double-processing if the same upload triggers more than once

Why idempotency matters

S3 events can sometimes be delivered more than once. If we write to DynamoDB without protection, we may create duplicates and corrupt our data.

We’ll solve this by using:

  • A deterministic photoId (same file path → same ID)
  • A conditional DynamoDB write (attribute_not_exists(photoId))

How we avoid duplicates (idempotency)

We make uploads safe to retry by doing two things:

  • Deterministic photoId: the ID is generated from bucket + key, so the same S3 path always produces the same photoId.

  • Conditional write: DynamoDB writes the item only if it doesn’t already exist, enforced with attribute_not_exists(photoId). If the same upload triggers again, we skip it instead of creating a duplicate.

Add Environment Variables (Lambda Console)

Go to Lambda → beetroot-ingest → Configuration → Environment variables and add:

  • PHOTOS_TABLE = Photos
  • RAW_PREFIX = photos-raw/

Why add environment variables?

Hard-coding resource names makes code harder to reuse. Environment variables keep it clean and configurable.
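As a quick illustration of the fallback behavior (using hypothetical variable names here so the demo can't collide with real configuration):

```python
import os

# os.environ.get returns the default when the variable is not set,
# so the code still runs with sensible values in a fresh environment.
# PHOTOS_TABLE_DEMO and RAW_PREFIX_DEMO are made-up names for this demo.
table = os.environ.get("PHOTOS_TABLE_DEMO", "Photos")
prefix = os.environ.get("RAW_PREFIX_DEMO", "photos-raw/")
print(table, prefix)  # Photos photos-raw/
```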

Lambda Code

What this code does

  • It reads the S3 event
  • Generates a stable photoId
  • Inserts a record into DynamoDB only if it doesn’t already exist

Part 1: Imports

This section brings in everything we need:

  • standard Python utilities (JSON, hashing, timestamps)
  • URL decoding for S3 keys
  • AWS SDK (boto3) + error type for clean handling

import json
import os
import hashlib
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3
from botocore.exceptions import ClientError

Why unquote_plus?

S3 object keys in event payloads can be URL-encoded (for example, spaces may appear as +). We decode them so we always hash and store the real path.
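For example, here is how unquote_plus restores an encoded key (the filename is made up for illustration):

```python
from urllib.parse import unquote_plus

# "+" becomes a space and %XX escapes are decoded,
# giving back the real object path.
encoded = "photos-raw/family+photo%281%29.jpg"
decoded = unquote_plus(encoded)
print(decoded)  # photos-raw/family photo(1).jpg
```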

Part 2: AWS clients + env vars

We create the DynamoDB resource once (outside the handler) so it can be reused across invocations.

We also read configuration from environment variables (with safe defaults).

ddb = boto3.resource("dynamodb")

# Use env var if present, otherwise default to "Photos"
PHOTOS_TABLE = os.environ.get("PHOTOS_TABLE", "Photos")
RAW_PREFIX = os.environ.get("RAW_PREFIX", "photos-raw/")

photos_table = ddb.Table(PHOTOS_TABLE)

Part 3: photoId (Idempotency key)

We generate a stable ID based on the S3 object path. Same bucket + key will always produce the same ID.

def make_photo_id(bucket: str, key: str) -> str:
    """
    Deterministic, stable photoId.
    We use SHA-256(bucket/key) but keep only first 20 hex chars (short + stable).
    """
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]

Why this matters

S3 can sometimes trigger the same upload more than once. A deterministic photoId lets us detect “this is the same file again”.

raw = f"{bucket}/{key}".encode("utf-8")
photo_id = hashlib.sha256(raw).hexdigest()[:20]

Example:

  • bucket = "beetroot-raw"
  • key = "photos-raw/group1.jpg"

Example output:

  • photoId = "a1f09c2b7e3d4a91b6c2"

It will be the same every time for the same bucket/key.
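You can verify the determinism locally; the helper is reproduced here so the snippet stands alone:

```python
import hashlib


def make_photo_id(bucket: str, key: str) -> str:
    # Same helper as in the Lambda: SHA-256 of "bucket/key",
    # truncated to the first 20 hex characters.
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]


a = make_photo_id("beetroot-raw", "photos-raw/group1.jpg")
b = make_photo_id("beetroot-raw", "photos-raw/group1.jpg")
print(a == b)   # True: same input, same ID
print(len(a))   # 20
```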

Part 4: Handler entry + read event records

The Lambda handler receives the event payload. For S3 triggers, it includes a list called Records.

def lambda_handler(event, context):
    records = event.get("Records", [])
    if not records:
        print("No Records found; nothing to do.")
        return {"statusCode": 200, "body": "no records"}

Why we check Records

  • If the function is triggered manually (Test button), there may be no S3 records.
  • This avoids errors and keeps the handler safe.
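A stripped-down stand-in for that guard (handle is a hypothetical name for this demo; the real handler performs the same check):

```python
def handle(event):
    # Bail out cleanly when the payload has no S3 records,
    # e.g. when the function is invoked from the console Test button.
    records = event.get("Records", [])
    if not records:
        return {"statusCode": 200, "body": "no records"}
    return {"statusCode": 200, "body": f"{len(records)} record(s)"}


print(handle({}))  # {'statusCode': 200, 'body': 'no records'}
```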

Part 5: Parse bucket + key

We loop through each record (sometimes there can be more than one). Then we extract the bucket and object key.

for r in records:
    s3 = r.get("s3", {})
    bucket = s3.get("bucket", {}).get("name")
    key = s3.get("object", {}).get("key")

    if not bucket or not key:
        print("Skipping record: missing bucket/key")
        continue

    # S3 keys in events are URL-encoded sometimes
    key = unquote_plus(key)

    # Only process uploads under photos-raw/
    if not key.startswith(RAW_PREFIX):
        print(f"Skipping key not under RAW_PREFIX: {key}")
        continue
bucket = s3.get("bucket", {}).get("name")
key = s3.get("object", {}).get("key")
key = unquote_plus(key)

if not key.startswith(RAW_PREFIX):
    continue

Example event record:

  • bucket name: beetroot-raw
  • key: photos-raw/family+photo.jpg

After decoding:

  • key becomes: photos-raw/family photo.jpg

Prefix check:

  • passes because it starts with photos-raw/
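The same walkthrough as runnable code:

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "photos-raw/"

# The example key from above: "+" decodes to a space.
key = unquote_plus("photos-raw/family+photo.jpg")
print(key)                           # photos-raw/family photo.jpg
print(key.startswith(RAW_PREFIX))    # True: gets processed

# A key under a different prefix (made up for illustration) is skipped.
other = unquote_plus("thumbnails/family+photo.jpg")
print(other.startswith(RAW_PREFIX))  # False: gets skipped
```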

Part 6: Build the DynamoDB item

We create a record to store metadata about the uploaded photo.

photo_id = make_photo_id(bucket, key) 
uploaded_at = datetime.now(timezone.utc).isoformat()

item = {
    "photoId": photo_id,
    "s3Bucket": bucket,
    "s3Key": key,
    "uploadedAt": uploaded_at,
}

What gets stored

  • photoId: stable ID (for idempotency)
  • s3Bucket + s3Key: where the photo lives
  • uploadedAt: timestamp for debugging and ordering
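The resulting item serializes to something like the following (the photoId value is the illustrative one from the earlier example, and the timestamp changes on each run):

```python
import json
from datetime import datetime, timezone

item = {
    "photoId": "a1f09c2b7e3d4a91b6c2",  # illustrative value
    "s3Bucket": "beetroot-raw",
    "s3Key": "photos-raw/group1.jpg",
    "uploadedAt": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(item, indent=2, sort_keys=True))
```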

Part 7: Conditional write

This is where we prevent duplicates.

We write the item only if it does not already exist.

try:
    photos_table.put_item(
        Item=item,
        ConditionExpression="attribute_not_exists(photoId)", 
    )
    print(f"Photos: inserted photoId={photo_id} key={key}")
except ClientError as e:
    code = e.response.get("Error", {}).get("Code", "Unknown")
    if code == "ConditionalCheckFailedException":
        print(f"Photos: already exists, skipping photoId={photo_id} key={key}")
        continue
    print("DynamoDB put_item failed:", str(e))
    raise
  1. First upload (record does not exist yet)
photos_table.put_item(
    Item=item,
    ConditionExpression="attribute_not_exists(photoId)",
)
  • bucket = "beetroot-raw"
  • key = "photos-raw/group1.jpg"
  • DynamoDB write succeeds
  • Logs show: inserted
  2. Duplicate upload (same photo triggers again)
if code == "ConditionalCheckFailedException":
    print("already exists, skipping")
  • bucket = "beetroot-raw"
  • key = "photos-raw/group1.jpg"

The same S3 key triggers again, so we attempt to write an item with the same photoId, which already exists in DynamoDB.

  • DynamoDB throws ConditionalCheckFailedException
  • We treat it as a normal "skip" (not a crash)
  • Logs show: already exists
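The two scenarios can be simulated locally with a plain dict standing in for the table; this only illustrates the attribute_not_exists semantics, it is not the DynamoDB API:

```python
table = {}  # stands in for the Photos table, keyed by photoId


def conditional_put(item):
    # Mimic ConditionExpression="attribute_not_exists(photoId)":
    # insert only when the key is absent, otherwise skip.
    if item["photoId"] in table:
        return "already exists, skipping"
    table[item["photoId"]] = item
    return "inserted"


item = {"photoId": "a1f09c2b7e3d4a91b6c2", "s3Key": "photos-raw/group1.jpg"}
print(conditional_put(item))  # inserted
print(conditional_put(item))  # already exists, skipping
```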

Part 8: Return response

Finally, we return a normal success response.

return {"statusCode": 200, "body": "ingest lambda with s3 trigger ok"}

Why always return 200?

For S3 triggers, Lambda retries on errors. Returning success after handling duplicates avoids unnecessary retries.

Full Lambda Code

Paste this code into your Lambda function, then click Deploy to save it.

beetroot-ingest/lambda_function.py
import json
import os
import hashlib
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3
from botocore.exceptions import ClientError


ddb = boto3.resource("dynamodb")

# Use env var if present, otherwise default to "Photos"
PHOTOS_TABLE = os.environ.get("PHOTOS_TABLE", "Photos")
RAW_PREFIX = os.environ.get("RAW_PREFIX", "photos-raw/")

photos_table = ddb.Table(PHOTOS_TABLE)


def make_photo_id(bucket: str, key: str) -> str:
    """
    Deterministic, stable photoId.
    We use SHA-256(bucket/key) but keep only first 20 hex chars (short + stable).
    """
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]


def lambda_handler(event, context):
    records = event.get("Records", [])
    if not records:
        print("No Records found; nothing to do.")
        return {"statusCode": 200, "body": "no records"}

    for r in records:
        s3 = r.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")

        if not bucket or not key:
            print("Skipping record: missing bucket/key")
            continue

        # S3 keys in events are URL-encoded sometimes
        key = unquote_plus(key)

        # Only process uploads under photos-raw/
        if not key.startswith(RAW_PREFIX):
            print(f"Skipping key not under RAW_PREFIX: {key}")
            continue

        photo_id = make_photo_id(bucket, key)
        uploaded_at = datetime.now(timezone.utc).isoformat()

        item = {
            "photoId": photo_id,
            "s3Bucket": bucket,
            "s3Key": key,
            "uploadedAt": uploaded_at,
        }

        try:
            photos_table.put_item(
                Item=item,
                ConditionExpression="attribute_not_exists(photoId)",
            )
            print(f"Photos: inserted photoId={photo_id} key={key}")
        except ClientError as e:
            code = e.response.get("Error", {}).get("Code", "Unknown")
            if code == "ConditionalCheckFailedException":
                print(f"Photos: already exists, skipping photoId={photo_id} key={key}")
                continue
            print("DynamoDB put_item failed:", str(e))
            raise

    return {"statusCode": 200, "body": "ingest lambda with s3 trigger ok"}

Step 3: Test (two quick runs)

Upload a new file (insert)

Upload one new photo:

aws s3 cp ./v2-test-photos/group2.jpg s3://beetroot-raw/photos-raw/.jpg --region us-east-1

In CloudWatch logs, you should see:

  • Photos: inserted photoId=...

Upload same key again (skip)

Upload the same key again (same destination path):

aws s3 cp ./v2-test-photos/group2.jpg s3://beetroot-raw/photos-raw/.jpg --region us-east-1

In logs, you should see:

  • Photos: already exists, skipping ...

Where to confirm the record was written

Go to DynamoDB → Tables → Photos → Explore items and confirm a new item exists with:

  • photoId
  • s3Bucket
  • s3Key
  • uploadedAt

Common mistakes

If you see errors such as a missing table name, ensure:

  • PHOTOS_TABLE exists in the Lambda environment variables
  • Your code uses a safe fallback: os.environ.get("PHOTOS_TABLE", "Photos")

If the Lambda logs show “Skipping key not under RAW_PREFIX”, confirm:

  • The upload path starts with photos-raw/
  • RAW_PREFIX is set correctly (or left as default)

A full SHA-256 hex is 64 chars. We intentionally shorten it using:

  • hashlib.sha256(...).hexdigest()[:20]

This keeps the ID deterministic while remaining easy to read in DynamoDB and logs.
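A quick check of the lengths:

```python
import hashlib

# Full SHA-256 digest vs. the shortened photoId.
full = hashlib.sha256(b"beetroot-raw/photos-raw/group1.jpg").hexdigest()
print(len(full))       # 64 hex chars for the full digest
print(len(full[:20]))  # 20 hex chars after shortening
```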
