📘 AWS SQS: Designing Correct & Scalable Systems with Queues
What SQS guarantees, what it doesn’t, and how systems actually fail
SQS is not “just a queue”.
It is a distributed coordination primitive with sharp edges.
1️⃣ The First Myth: “SQS Decouples Systems”
Truth
SQS decouples in time, not in correctness.
It does not:
- Prevent overload
- Guarantee ordering (standard queues)
- Guarantee exactly-once delivery
- Apply backpressure automatically
If you design assuming it does, your system will be wrong.
2️⃣ Core SQS Mental Model (Important)
What SQS actually guarantees
| Property | Guarantee |
|---|---|
| Delivery | At-least-once |
| Ordering | ❌ (Standard), ✅ (FIFO) |
| Durability | ✅ |
| Availability | ✅ |
| Latency | Best-effort |
Everything else is your responsibility.
3️⃣ Standard Queue vs FIFO Queue (Design, Not Feature)
Standard Queue
- Massive throughput
- Best-effort ordering
- Message duplication possible
FIFO Queue
- Strict ordering per group
- Lower throughput
- Higher coordination cost
Choosing FIFO is a correctness decision, not a performance one.
4️⃣ At-Least-Once Delivery → Duplicate Processing
❌ Naive Consumer (Broken)
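// Assumes `sqs` is the AWS SDK v3 `SQS` client and QUEUE_URL is the queue's URL.
while (true) {
  const { Messages = [] } = await sqs.receiveMessage({ QueueUrl: QUEUE_URL, WaitTimeSeconds: 20 });
  for (const msg of Messages) {
    await process(msg);
    // A crash here leaves the message undeleted, so it will be delivered again.
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle });
  }
}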
Failure Scenario
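- Consumer receives and processes the message
- Consumer crashes before the delete call
- The visibility timeout expires and the message becomes visible again
- Another consumer receives it, so it is processed twice ❌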
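The failure scenario can be reproduced without AWS. The sketch below is a toy in-memory queue driven by a logical clock; it is not the SQS API, and `ToyQueue`, `receive`, `remove`, and `advance` are hypothetical names used only for illustration.

```javascript
// Toy model of visibility timeout; not the SQS API.
class ToyQueue {
  constructor(visibilityTimeout) {
    this.visibilityTimeout = visibilityTimeout;
    this.now = 0;       // logical clock, in seconds
    this.messages = []; // { id, body, visibleAt, receiveCount }
    this.nextId = 1;
  }
  send(body) {
    this.messages.push({ id: this.nextId++, body, visibleAt: 0, receiveCount: 0 });
  }
  // Returns the first visible message and hides it for visibilityTimeout seconds.
  receive() {
    const msg = this.messages.find((m) => m.visibleAt <= this.now);
    if (!msg) return null;
    msg.visibleAt = this.now + this.visibilityTimeout;
    msg.receiveCount += 1;
    return { id: msg.id, body: msg.body, receiveCount: msg.receiveCount };
  }
  // Deleting is the only way a message leaves this queue.
  remove(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
  advance(seconds) {
    this.now += seconds;
  }
}

function simulateCrashBeforeDelete() {
  const queue = new ToyQueue(30);
  const sideEffects = [];
  queue.send("charge order 42");

  // Worker A receives and processes, then crashes before calling remove().
  const first = queue.receive();
  sideEffects.push(first.body);

  // While the message is hidden, other workers see nothing.
  const hidden = queue.receive();

  // After the visibility timeout, worker B receives the same message.
  queue.advance(31);
  const second = queue.receive();
  sideEffects.push(second.body);
  queue.remove(second.id);

  return { hidden, sideEffects, receiveCount: second.receiveCount };
}

console.log(simulateCrashBeforeDelete());
```

The side effect is recorded twice even though the message was sent once, which is exactly what at-least-once delivery permits.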
5️⃣ Correctness Rule #1 — Consumers Must Be Idempotent
✅ Idempotent Consumer
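const key = msg.MessageId; // a business-level idempotency key from the body is often safer
if (await alreadyProcessed(key)) return;
await process(msg);
await markProcessed(key); // must be atomic, ideally in the same transaction as process()
Where markProcessed is an atomic write, for example:
- DB unique key
- Redis SETNX
- DynamoDB conditional write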
SQS pushes correctness into your application.
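A check followed by a separate write can race when two workers hold the same message, so the claim itself needs to be atomic. The following is a minimal sketch under that assumption: `ProcessedStore` is a hypothetical in-memory stand-in for a store with an "insert if absent" operation, and `handleOnce` is an illustrative name, not part of any SDK.

```javascript
// In-memory stand-in for a store with an atomic "insert if absent" operation
// (a unique-key INSERT, Redis SET with NX, or a DynamoDB conditional put).
class ProcessedStore {
  constructor() {
    this.keys = new Set();
  }
  // Returns true only for the first caller with a given key.
  putIfAbsent(key) {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
  remove(key) {
    this.keys.delete(key);
  }
}

// Skips the handler when the key has already been claimed.
async function handleOnce(store, msg, handler) {
  // Assumption: MessageId is used as the key; a business-level key is often safer.
  const key = msg.MessageId;
  if (!store.putIfAbsent(key)) return "skipped";
  try {
    await handler(msg);
    return "processed";
  } catch (err) {
    store.remove(key); // release the claim so a redelivery can retry
    throw err;
  }
}
```

This sketch does not cover a worker that crashes between claiming and finishing; a real store would typically record a status or an expiry on the claim so that a redelivery can take over.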
6️⃣ Visibility Timeout Is Not a Timeout
Common Mistake ❌
“Visibility timeout is how long processing takes.”
Wrong.
Visibility timeout is:
- How long a received message stays hidden from other consumers
- NOT a limit on how long processing is allowed to run
If processing exceeds visibility timeout:
- Message reappears
- Another worker processes it
- Duplicate work happens
✅ Correct Pattern — Extend Visibility
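await sqs.changeMessageVisibility({
  QueueUrl,
  ReceiptHandle,
  VisibilityTimeout: 300 // seconds, counted from this call
});
Long-running tasks must heartbeat: repeat this call periodically while work is still in progress.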
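A heartbeat can be sketched as a timer that keeps extending visibility until the work finishes. The helper below is an illustration, not an SDK feature: `withHeartbeat`, `extendBySeconds`, and `everyMs` are hypothetical names, and `sqs` is assumed to be any object whose `changeMessageVisibility(params)` returns a promise, such as the aggregated `SQS` client in AWS SDK for JavaScript v3.

```javascript
// Keeps a message hidden while `work` runs by extending its visibility on a timer.
async function withHeartbeat(sqs, { QueueUrl, ReceiptHandle }, work, opts = {}) {
  const extendBySeconds = opts.extendBySeconds ?? 60;
  // Default: renew at half the extension window so one missed beat is survivable.
  const everyMs = opts.everyMs ?? (extendBySeconds * 1000) / 2;

  const timer = setInterval(() => {
    sqs
      .changeMessageVisibility({ QueueUrl, ReceiptHandle, VisibilityTimeout: extendBySeconds })
      .catch((err) => console.error("heartbeat failed", err));
  }, everyMs);

  try {
    return await work();
  } finally {
    clearInterval(timer); // stop extending once work finishes or throws
  }
}
```

If the worker process dies, the timer dies with it, so the message reappears after at most `extendBySeconds` rather than after a long fixed timeout.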
7️⃣ Visibility Timeout & Tail Latency (Hidden Trap)
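If visibility timeout is too long:
- Failed workers stall progress: their in-flight messages stay hidden until the timeout expires
- Effective throughput collapses while those messages wait
If too short:
- Duplicate processing
- Thundering herds of redelivered messages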
Visibility timeout is a capacity & correctness knob.
8️⃣ SQS Is Not Backpressure
❌ Common Assumption
“SQS will buffer spikes safely.”
Reality:
- Producers keep sending
- Queue depth grows
- Consumers fall behind
- Processing latency grows to hours, and messages older than the retention period (4 days by default) are deleted
9️⃣ Correct Pattern — Backpressure at Producer
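// queueDepth: for example, the queue's ApproximateNumberOfMessages attribute
if (queueDepth > MAX_DEPTH) {
  rejectOrDelay(); // shed load or slow the caller instead of enqueueing
}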
Backpressure must exist before SQS, not after.
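One way to obtain the depth is the queue's `ApproximateNumberOfMessages` attribute. The sketch below assumes `sqs` exposes promise-returning `getQueueAttributes` and `sendMessage` methods, as the AWS SDK for JavaScript v3 `SQS` client does; `getQueueDepth` and `sendWithBackpressure` are hypothetical helper names.

```javascript
// Reads the approximate number of messages waiting in the queue.
async function getQueueDepth(sqs, QueueUrl) {
  const { Attributes = {} } = await sqs.getQueueAttributes({
    QueueUrl,
    AttributeNames: ["ApproximateNumberOfMessages"],
  });
  return Number(Attributes.ApproximateNumberOfMessages ?? 0);
}

// Sends only while the backlog is at or under maxDepth; otherwise tells the caller to back off.
async function sendWithBackpressure(sqs, QueueUrl, MessageBody, maxDepth) {
  const depth = await getQueueDepth(sqs, QueueUrl);
  if (depth > maxDepth) {
    return { accepted: false, depth }; // caller can reject, delay, or shed load
  }
  await sqs.sendMessage({ QueueUrl, MessageBody });
  return { accepted: true, depth };
}
```

The attribute is approximate, and reading it on every send adds a request per message, so a producer would likely cache the value for a few seconds in practice.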
🔟 SQS + Autoscaling = Delay Amplification
Failure Pattern
- Queue depth increases
- Autoscaling reacts (slow)
- New consumers cold-start
- Visibility timeouts expire on messages held by overloaded consumers
- Duplicate work explodes
Autoscaling reacts too late to save correctness.
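One way to make scaling less reactive is to size the fleet from backlog per consumer rather than from raw queue depth. This is a rough sketch of that arithmetic, not a scaling policy; `desiredConsumers` and its parameter names are hypothetical, and it assumes each consumer handles one message at a time.

```javascript
// Consumers needed so that the current backlog drains within the latency target.
function desiredConsumers({ queueDepth, avgProcessingSeconds, targetLatencySeconds, maxConsumers }) {
  // How many messages one consumer can work through within the latency target.
  const backlogPerConsumer = Math.max(1, Math.floor(targetLatencySeconds / avgProcessingSeconds));
  const needed = Math.ceil(queueDepth / backlogPerConsumer);
  return Math.min(maxConsumers, Math.max(1, needed));
}

console.log(
  desiredConsumers({
    queueDepth: 1500,
    avgProcessingSeconds: 0.5,
    targetLatencySeconds: 10,
    maxConsumers: 100,
  })
); // prints 75
```

With a 10-second target and 0.5 seconds per message, one consumer can absorb 20 queued messages, so 1500 messages call for 75 consumers. The cap matters: scaling consumers past what downstream systems can absorb only moves the overload.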
1️⃣1️⃣ FIFO Queues & Message Groups (Subtle Bug)
FIFO ordering is per MessageGroupId, not global.
❌ Buggy Design
MessageGroupId = the same constant for all messages
Result:
- One slow message blocks entire queue
✅ Correct Design
MessageGroupId = entityId
Parallelism across entities, ordering within each entity.
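In code, the design choice comes down to what the producer puts in `MessageGroupId`. The sketch below builds `SendMessage` parameters for a FIFO queue; `toFifoSendParams` is a hypothetical helper, and `event.entityId` and `event.eventId` are assumed fields on the payload.

```javascript
// Builds SendMessage parameters for a FIFO queue so that ordering is scoped to one entity.
function toFifoSendParams(QueueUrl, event) {
  return {
    QueueUrl,
    MessageBody: JSON.stringify(event),
    // Ordering applies only among messages that share this group ID.
    MessageGroupId: String(event.entityId),
    // A send repeating this ID within the 5-minute deduplication interval is not delivered again.
    MessageDeduplicationId: String(event.eventId),
  };
}

const a = toFifoSendParams("https://example.invalid/orders.fifo", { entityId: "order-1", eventId: "e-1" });
const b = toFifoSendParams("https://example.invalid/orders.fifo", { entityId: "order-2", eventId: "e-2" });
console.log(a.MessageGroupId, b.MessageGroupId); // different groups can be consumed in parallel
```

The result would be passed to a send call such as `sqs.sendMessage(params)`. A slow message for `order-1` then delays only later messages for `order-1`.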
1️⃣2️⃣ Dead-Letter Queues (DLQ) Are Not Optional
What DLQs really do
- Isolate poison messages
- Preserve forward progress
- Limit blast radius
maxReceiveCount = 5
After 5 receives without a successful delete → DLQ.
DLQs are correctness boundaries.
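A DLQ is attached through the source queue's `RedrivePolicy` attribute, which is a JSON string. The helper below is a small sketch of building that attribute; `redriveAttributes` is a hypothetical name and the ARN is a placeholder.

```javascript
// Builds the attributes that attach a dead-letter queue to a source queue.
// The result can be passed as `Attributes` to SetQueueAttributes or CreateQueue.
function redriveAttributes(deadLetterQueueArn, maxReceiveCount = 5) {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: deadLetterQueueArn,
      maxReceiveCount,
    }),
  };
}

// Placeholder ARN for illustration only.
console.log(redriveAttributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq"));
```

For a FIFO source queue the dead-letter queue must also be a FIFO queue. A DLQ only helps if something watches it, so an alarm on its depth is a natural companion.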
1️⃣3️⃣ Ordering vs Throughput Tradeoff (Design Choice)
| Requirement | Choice |
|---|---|
| Max throughput | Standard |
| Global exact order | FIFO, single message group |
| Entity-level order | FIFO, one message group per entity |
| Simplicity | Standard |
1️⃣4️⃣ Multi-Tenancy + SQS (Noisy Neighbor Again)
❌ One Queue for All Tenants
- One tenant floods queue
- Other tenants wait behind that backlog indefinitely
✅ Isolation Strategies
- Queue per tenant
- MessageGroupId per tenant (FIFO queues)
- Per-tenant consumer pools
- Per-tenant rate limits
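Per-tenant rate limits can be enforced at the producer before anything is enqueued. The following is a minimal token-bucket sketch, assuming a single producer process; `TenantRateLimiter` and `tryAcquire` are hypothetical names, and a multi-instance producer would need shared state instead of an in-process `Map`.

```javascript
// Token bucket per tenant: each tenant may enqueue `ratePerSecond` messages on average,
// with bursts of up to `burst`. Time can be passed in to keep the example deterministic.
class TenantRateLimiter {
  constructor(ratePerSecond, burst) {
    this.ratePerSecond = ratePerSecond;
    this.burst = burst;
    this.buckets = new Map(); // tenantId -> { tokens, lastRefillMs }
  }

  // Returns true if the tenant may enqueue one message now.
  tryAcquire(tenantId, nowMs = Date.now()) {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = { tokens: this.burst, lastRefillMs: nowMs };
      this.buckets.set(tenantId, bucket);
    }
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSeconds = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSeconds * this.ratePerSecond);
    bucket.lastRefillMs = nowMs;
    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}
```

A producer would call `tryAcquire(tenantId)` before each send and reject or delay the request when it returns `false`, so one tenant's flood never reaches the shared queue.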
1️⃣5️⃣ Exactly-Once Is a Lie (Again)
Even FIFO SQS:
- Can redeliver
- Can duplicate on visibility timeout
- Can duplicate when a producer retries a send after the 5-minute deduplication interval
SQS gives at-least-once, forever.
