Photo Idempotency in DynamoDB
Create a stable photoId from S3 uploads and write exactly one record per photo into DynamoDB.
Goal
When a photo is uploaded to s3://beetroot-raw/photos-raw/, the ingestion Lambda should:
- Extract `bucket` and `key` from the S3 event
- Ignore any uploads not under `photos-raw/`
- Compute a stable `photoId`
- Write one item into the `Photos` table
- Avoid double-processing if the same upload triggers more than once
Why idempotency matters
S3 events can sometimes be delivered more than once. If we write to DynamoDB without protection, we may create duplicates and corrupt our data.
We’ll solve this by using:
- A deterministic `photoId` (same file path → same ID)
- A conditional DynamoDB write (`attribute_not_exists(photoId)`)
How we avoid duplicates (idempotency)
We make uploads safe to retry by doing two things:
- Deterministic `photoId`: the ID is generated from `bucket + key`, so the same S3 path always produces the same `photoId`.
- Conditional write: DynamoDB writes the item only if it doesn't already exist, using `attribute_not_exists(photoId)`. If the same upload triggers again, we skip it instead of creating a duplicate.
Add Environment Variables (Lambda Console)
Go to Lambda → beetroot-ingest → Configuration → Environment variables and add:
```
PHOTOS_TABLE=Photos
RAW_PREFIX=photos-raw/
```
Why add environment variables?
Hard-coding resource names makes code harder to reuse. Using environment variables keeps it clean and configurable.
Lambda Code
What this code does
- Reads the S3 event
- Generates a stable `photoId`
- Inserts a record into DynamoDB only if it doesn't already exist
Part 1: Imports
This section brings in everything we need:
- standard Python utilities (JSON, hashing, timestamps)
- URL decoding for S3 keys
- AWS SDK (`boto3`) + error type for clean error handling
```python
import json
import os
import hashlib
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3
from botocore.exceptions import ClientError
```

Why `unquote_plus`?
S3 object keys in event payloads can be URL-encoded (for example, spaces may
appear as +). We decode them so we always hash and store the real
path.
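A quick local check of the decoding (no AWS needed; the second key is made up for illustration):

```python
from urllib.parse import unquote_plus

# "+" in an event key stands for a space in the real object key
print(unquote_plus("photos-raw/family+photo.jpg"))
# -> photos-raw/family photo.jpg

# percent-encoding is decoded too (hypothetical key for illustration)
print(unquote_plus("photos-raw/my+photo%281%29.jpg"))
# -> photos-raw/my photo(1).jpg
```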
Part 2: AWS clients + env vars
We create the DynamoDB resource once (outside the handler) so it can be reused across invocations.
We also read configuration from environment variables (with safe defaults).
```python
ddb = boto3.resource("dynamodb")

# Use env var if present, otherwise default to "Photos"
PHOTOS_TABLE = os.environ.get("PHOTOS_TABLE", "Photos")
RAW_PREFIX = os.environ.get("RAW_PREFIX", "photos-raw/")

photos_table = ddb.Table(PHOTOS_TABLE)
```

Part 3: photoId (Idempotency key)
We generate a stable ID based on the S3 object path.
Same bucket + key will always produce the same ID.
```python
def make_photo_id(bucket: str, key: str) -> str:
    """
    Deterministic, stable photoId.
    We use SHA-256(bucket/key) but keep only the first 20 hex chars (short + stable).
    """
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]
```

Why this matters
S3 can sometimes trigger the same upload more than once.
A deterministic photoId lets us detect “this is the same file again”.
```python
raw = f"{bucket}/{key}".encode("utf-8")
photo_id = hashlib.sha256(raw).hexdigest()[:20]
```

Example:
- `bucket = "beetroot-raw"`
- `key = "photos-raw/group1.jpg"`

Example output:
- `photoId = "a1f09c2b7e3d4a91b6c2"`
It will be the same every time for the same bucket/key.
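You can verify the determinism locally; this snippet re-creates `make_photo_id` exactly as defined above and runs it twice on the same path:

```python
import hashlib

def make_photo_id(bucket: str, key: str) -> str:
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]

a = make_photo_id("beetroot-raw", "photos-raw/group1.jpg")
b = make_photo_id("beetroot-raw", "photos-raw/group1.jpg")

assert a == b          # same path -> same ID, every time
assert len(a) == 20    # shortened SHA-256 hex
print(a)
```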
Part 4: Handler entry + read event records
The Lambda handler receives the event payload.
For S3 triggers, it includes a list called Records.
```python
def lambda_handler(event, context):
    records = event.get("Records", [])
    if not records:
        print("No Records found; nothing to do.")
        return {"statusCode": 200, "body": "no records"}
```

Why we check Records
- If the function is triggered manually (Test button), there may be no S3 records.
- This avoids errors and keeps the handler safe.
Part 5: Parse bucket + key
We loop through each record (sometimes there can be more than one). Then we extract the bucket and object key.
```python
    for r in records:
        s3 = r.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")

        if not bucket or not key:
            print("Skipping record: missing bucket/key")
            continue

        # S3 keys in events are URL-encoded sometimes
        key = unquote_plus(key)

        # Only process uploads under photos-raw/
        if not key.startswith(RAW_PREFIX):
            print(f"Skipping key not under RAW_PREFIX: {key}")
            continue
```

The core extraction logic:

```python
bucket = s3.get("bucket", {}).get("name")
key = s3.get("object", {}).get("key")
key = unquote_plus(key)
if not key.startswith(RAW_PREFIX):
    continue
```

Example event record:
- bucket name: `beetroot-raw`
- key: `photos-raw/family+photo.jpg`

After decoding:
- key becomes: `photos-raw/family photo.jpg`

Prefix check:
- passes because it starts with `photos-raw/`
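The same steps can be traced locally on a hand-built record; this dict mirrors the shape of the `s3` section of a real S3 event, using the example values above:

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "photos-raw/"

record = {
    "s3": {
        "bucket": {"name": "beetroot-raw"},
        "object": {"key": "photos-raw/family+photo.jpg"},
    }
}

s3 = record.get("s3", {})
bucket = s3.get("bucket", {}).get("name")
key = unquote_plus(s3.get("object", {}).get("key"))

print(bucket)                      # beetroot-raw
print(key)                         # photos-raw/family photo.jpg
print(key.startswith(RAW_PREFIX))  # True
```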
Part 6: Build the DynamoDB item
We create a record to store metadata about the uploaded photo.
```python
photo_id = make_photo_id(bucket, key)
uploaded_at = datetime.now(timezone.utc).isoformat()

item = {
    "photoId": photo_id,
    "s3Bucket": bucket,
    "s3Key": key,
    "uploadedAt": uploaded_at,
}
```

What gets stored
- `photoId`: stable ID (for idempotency)
- `s3Bucket` + `s3Key`: where the photo lives
- `uploadedAt`: timestamp for debugging and ordering
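Built locally with the example values from above, the item looks like this (the timestamp will differ on every run):

```python
from datetime import datetime, timezone

item = {
    "photoId": "a1f09c2b7e3d4a91b6c2",   # from the Part 3 example
    "s3Bucket": "beetroot-raw",
    "s3Key": "photos-raw/group1.jpg",
    "uploadedAt": datetime.now(timezone.utc).isoformat(),
}

# an aware UTC datetime serializes with an explicit +00:00 offset
assert item["uploadedAt"].endswith("+00:00")
print(item)
```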
Part 7: Conditional write
This is where we prevent duplicates.
We write the item only if it does not already exist.
```python
try:
    photos_table.put_item(
        Item=item,
        ConditionExpression="attribute_not_exists(photoId)",
    )
    print(f"Photos: inserted photoId={photo_id} key={key}")
except ClientError as e:
    code = e.response.get("Error", {}).get("Code", "Unknown")
    if code == "ConditionalCheckFailedException":
        print(f"Photos: already exists, skipping photoId={photo_id} key={key}")
        continue
    print("DynamoDB put_item failed:", str(e))
    raise
```

- First upload (record does not exist yet)
```python
photos_table.put_item(
    Item=item,
    ConditionExpression="attribute_not_exists(photoId)",
)
```

- `bucket = "beetroot-raw"`
- `key = "photos-raw/group1.jpg"`
- DynamoDB write succeeds
- Logs show: `inserted`
- Duplicate upload (same photo triggers again)

```python
if code == "ConditionalCheckFailedException":
    print("already exists, skipping")
```

- `bucket = "beetroot-raw"`
- `key = "photos-raw/group1.jpg"`
- The same S3 key triggers again, so we compute the same `photoId`, which already exists in DynamoDB
- DynamoDB throws `ConditionalCheckFailedException`
- We treat it as a normal "skip" (not a crash)
- Logs show: `already exists`
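The insert-then-skip flow can be sketched without AWS using a toy in-memory table that mimics the `attribute_not_exists(photoId)` condition (this is an illustration, not the real DynamoDB API):

```python
class ToyConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

store = {}

def conditional_put(item):
    # Write only if no item with this photoId exists yet,
    # mirroring ConditionExpression="attribute_not_exists(photoId)"
    if item["photoId"] in store:
        raise ToyConditionalCheckFailed(item["photoId"])
    store[item["photoId"]] = item

item = {"photoId": "a1f09c2b7e3d4a91b6c2", "s3Key": "photos-raw/group1.jpg"}

conditional_put(item)      # first upload: insert succeeds
try:
    conditional_put(item)  # duplicate upload: condition fails
except ToyConditionalCheckFailed:
    print("already exists, skipping")
```

However many times the "upload" repeats, the store still holds exactly one item per `photoId`, which is the whole point of the conditional write.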
Part 8: Return response
Finally, we return a normal success response.
```python
return {"statusCode": 200, "body": "ingest lambda with s3 trigger ok"}
```

Why always return 200?
For S3 triggers, Lambda retries on errors. Returning success after handling duplicates avoids unnecessary retries.
Full Lambda code
Paste the complete code below into your Lambda function, then click Deploy to save it.
```python
import json
import os
import hashlib
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3
from botocore.exceptions import ClientError

ddb = boto3.resource("dynamodb")

# Use env var if present, otherwise default to "Photos"
PHOTOS_TABLE = os.environ.get("PHOTOS_TABLE", "Photos")
RAW_PREFIX = os.environ.get("RAW_PREFIX", "photos-raw/")

photos_table = ddb.Table(PHOTOS_TABLE)


def make_photo_id(bucket: str, key: str) -> str:
    """
    Deterministic, stable photoId.
    We use SHA-256(bucket/key) but keep only the first 20 hex chars (short + stable).
    """
    raw = f"{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]


def lambda_handler(event, context):
    records = event.get("Records", [])
    if not records:
        print("No Records found; nothing to do.")
        return {"statusCode": 200, "body": "no records"}

    for r in records:
        s3 = r.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")

        if not bucket or not key:
            print("Skipping record: missing bucket/key")
            continue

        # S3 keys in events are URL-encoded sometimes
        key = unquote_plus(key)

        # Only process uploads under photos-raw/
        if not key.startswith(RAW_PREFIX):
            print(f"Skipping key not under RAW_PREFIX: {key}")
            continue

        photo_id = make_photo_id(bucket, key)
        uploaded_at = datetime.now(timezone.utc).isoformat()

        item = {
            "photoId": photo_id,
            "s3Bucket": bucket,
            "s3Key": key,
            "uploadedAt": uploaded_at,
        }

        try:
            photos_table.put_item(
                Item=item,
                ConditionExpression="attribute_not_exists(photoId)",
            )
            print(f"Photos: inserted photoId={photo_id} key={key}")
        except ClientError as e:
            code = e.response.get("Error", {}).get("Code", "Unknown")
            if code == "ConditionalCheckFailedException":
                print(f"Photos: already exists, skipping photoId={photo_id} key={key}")
                continue
            print("DynamoDB put_item failed:", str(e))
            raise

    return {"statusCode": 200, "body": "ingest lambda with s3 trigger ok"}
```

Step 3: Test (two quick runs)
Upload a new file (insert)
Upload one new photo:
```shell
aws s3 cp ./v2-test-photos/group2.jpg s3://beetroot-raw/photos-raw/.jpg --region us-east-1
```

In CloudWatch logs, you should see:
- `Photos: inserted photoId=...`
Upload same key again (skip)
Upload the same key again (same destination path):
```shell
aws s3 cp ./v2-test-photos/group2.jpg s3://beetroot-raw/photos-raw/.jpg --region us-east-1
```

In logs, you should see:
- `Photos: already exists, skipping ...`
Where to confirm the record was written
Go to DynamoDB → Tables → Photos → Explore items and confirm a new item exists with:
- `photoId`
- `s3Bucket`
- `s3Key`
- `uploadedAt`
Common mistakes
If you see errors such as a missing table name, ensure:
- `PHOTOS_TABLE` exists in the Lambda environment variables
- Your code uses a safe fallback: `os.environ.get("PHOTOS_TABLE", "Photos")`

If the Lambda logs show "Skipping key not under RAW_PREFIX", confirm:
- The upload path starts with `photos-raw/`
- `RAW_PREFIX` is set correctly (or left as default)
A full SHA-256 hex is 64 chars. We intentionally shorten it using:
- `hashlib.sha256(...).hexdigest()[:20]`

This keeps the ID deterministic but easier to read in DynamoDB and logs.
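To see the shortening concretely:

```python
import hashlib

full = hashlib.sha256(b"beetroot-raw/photos-raw/group1.jpg").hexdigest()
short = full[:20]

assert len(full) == 64   # a full SHA-256 hex digest
assert len(short) == 20  # what we store as photoId
assert full.startswith(short)
print(short)
```

20 hex chars is still 80 bits of hash, so accidental collisions between different S3 paths are effectively impossible at the scale of a photo library.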