
📘 AWS SQS: Designing Correct & Scalable Systems with Queues

30 Mar 2022 · 3 min read

What SQS guarantees, what it doesn’t, and how systems actually fail

SQS is not “just a queue”.
It is a distributed coordination primitive with sharp edges.

1️⃣ The First Myth: “SQS Decouples Systems”

Truth

SQS decouples in time, not in correctness.

It does not:

  • Prevent overload

  • Guarantee ordering (standard)

  • Guarantee exactly-once delivery

  • Apply backpressure automatically

If you design assuming it does any of these, your system will be wrong.

2️⃣ Core SQS Mental Model (Important)

What SQS actually guarantees

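Property       Guarantee
Delivery       At-least-once
Ordering       ❌ (Standard), ✅ per message group (FIFO)
Durability     High (messages are stored redundantly across multiple Availability Zones)
Availability   High (managed regional service)
Latency        Best-effort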

Everything else is your responsibility.

3️⃣ Standard Queue vs FIFO Queue (Design, Not Feature)

Standard Queue

  • Massive throughput

  • Best-effort ordering

  • Message duplication possible

FIFO Queue

  • Strict ordering per group

  • Lower throughput

  • Higher coordination cost

Choosing FIFO is a correctness decision, not a performance one.

4️⃣ At-Least-Once Delivery → Duplicate Processing

❌ Naive Consumer (Broken)

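// Assumes an AWS SDK v3 aggregated client (const sqs = new SQS({})) and your queue's QueueUrl.
while (true) {
  const { Messages = [] } = await sqs.receiveMessage({ QueueUrl, WaitTimeSeconds: 20 });
  for (const msg of Messages) {
    await process(msg); // a crash here leaves the message undeleted
    await sqs.deleteMessage({ QueueUrl, ReceiptHandle: msg.ReceiptHandle });
  }
}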

Failure Scenario

  • Consumer processes message

  • Crashes before delete

  • Visibility timeout expires and the message becomes visible again

  • Processed twice

5️⃣ Correctness Rule #1 — Consumers Must Be Idempotent

✅ Idempotent Consumer

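// Inside the message handler. MessageId is assigned by SQS; a business-level key is safer if producers can resend.
if (await alreadyProcessed(msg.MessageId)) return;

await process(msg);
await markProcessed(msg.MessageId); // must be an atomic insert-if-absent

Where markProcessed is an atomic insert-if-absent, such as: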

  • DB unique key

  • Redis SETNX

  • DynamoDB conditional write

SQS pushes correctness into your application.
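
A check followed by a separate mark still leaves a window in which two workers can both pass the check. A minimal sketch of one way to narrow that window is to claim the message ID atomically before processing and release the claim if processing fails. The in-memory Map below only stands in for a unique-key table, Redis SETNX, or a DynamoDB conditional write, and `createDedupeStore`, `claim`, `complete`, `release`, and `handleOnce` are hypothetical names, not part of any SDK.

```javascript
// In-memory stand-in for a store with an atomic insert-if-absent operation.
function createDedupeStore() {
  const seen = new Map();
  return {
    // Returns true only for the first caller to claim this key.
    claim(key) {
      if (seen.has(key)) return false;
      seen.set(key, "in-progress");
      return true;
    },
    complete(key) { seen.set(key, "done"); },
    // Drop the claim so a later redelivery can retry the work.
    release(key) { seen.delete(key); },
  };
}

// Runs the handler only if this call wins the claim for the message's key.
function handleOnce(store, msg, handler) {
  const key = msg.MessageId;
  if (!store.claim(key)) return "skipped"; // duplicate delivery
  try {
    handler(msg);
    store.complete(key);
    return "processed";
  } catch (err) {
    store.release(key);
    throw err;
  }
}
```

With a real shared store, an "in-progress" claim left behind by a crashed worker would also need an expiry so that the redelivered message is not skipped forever.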

6️⃣ Visibility Timeout Is Not a Timeout

Common Mistake ❌

“Visibility timeout is how long processing takes.”

Wrong.

Visibility timeout is:

  • How long the message is hidden

  • NOT how long processing is allowed

If processing exceeds visibility timeout:

  • Message reappears

  • Another worker processes it

  • Duplicate work happens

✅ Correct Pattern — Extend Visibility

// Resets the timeout to 300 seconds from now; it does not add to the time remaining.
await sqs.changeMessageVisibility({
  QueueUrl,
  ReceiptHandle,
  VisibilityTimeout: 300
});

Long-running tasks must heartbeat.
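
A heartbeat can be sketched as a timer that keeps extending the visibility timeout while the handler runs. This is an illustration under assumptions, not a definitive implementation: `sqs` is assumed to expose promise-returning `changeMessageVisibility` and `deleteMessage` methods (as the aggregated SQS client in AWS SDK for JavaScript v3 does), and `startHeartbeat`, `processWithHeartbeat`, `extendSeconds`, and `everyMs` are hypothetical names.

```javascript
// Extends the visibility timeout on a fixed interval.
// Returns a function that stops the heartbeat.
function startHeartbeat(sqs, QueueUrl, ReceiptHandle, { extendSeconds = 60, everyMs = 30000 } = {}) {
  const timer = setInterval(() => {
    sqs
      .changeMessageVisibility({ QueueUrl, ReceiptHandle, VisibilityTimeout: extendSeconds })
      .catch((err) => console.error("visibility heartbeat failed:", err));
  }, everyMs);
  return () => clearInterval(timer);
}

// Processes one message while heartbeating, and deletes it only on success.
async function processWithHeartbeat(sqs, QueueUrl, msg, handler, options) {
  const stop = startHeartbeat(sqs, QueueUrl, msg.ReceiptHandle, options);
  try {
    await handler(msg);
    await sqs.deleteMessage({ QueueUrl, ReceiptHandle: msg.ReceiptHandle });
  } finally {
    // Always stop, so a failed message becomes visible again and can be retried.
    stop();
  }
}
```

Keeping the interval well below the extension (here 30 seconds against 60) leaves room for one missed heartbeat before the message reappears.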

7️⃣ Visibility Timeout & Tail Latency (Hidden Trap)

If visibility timeout is too long:

  • Failed workers stall progress

  • Throughput collapses

If too short:

  • Duplicate processing

  • Thundering herds

Visibility timeout is a capacity & correctness knob.
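
To make the tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes a worker that dies right after receiving a message and never releases it, and it uses a redrive policy's maxReceiveCount (covered in the DLQ section). `retryDelays` is a hypothetical helper and the numbers are estimates, not guarantees.

```javascript
// Rough estimate of how long a crashed attempt delays a message, and how long an
// always-failing message lingers before a redrive policy moves it to a DLQ.
function retryDelays({ visibilityTimeoutSeconds, maxReceiveCount }) {
  return {
    // A worker that dies right after receiving hides the message for the full timeout.
    delayAfterCrashSeconds: visibilityTimeoutSeconds,
    // Each of the maxReceiveCount attempts can hide the message for a full timeout.
    approxSecondsBeforeDlq: visibilityTimeoutSeconds * maxReceiveCount,
  };
}

const estimate = retryDelays({ visibilityTimeoutSeconds: 300, maxReceiveCount: 5 });
console.log(estimate.delayAfterCrashSeconds, estimate.approxSecondsBeforeDlq); // 300 1500
```

Under these assumptions a 5-minute timeout means a single crash costs up to 5 minutes of delay for that message, and a poison message can occupy retries for about 25 minutes.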

8️⃣ SQS Is Not Backpressure

❌ Common Assumption

“SQS will buffer spikes safely.”

Reality:

  • Producers keep sending

  • Queue depth grows

  • Consumers fall behind

  • Processing latency grows to hours, and messages older than the retention period (4 days by default, 14 days maximum) are deleted

9️⃣ Correct Pattern — Backpressure at Producer

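// queueDepth: for example the queue's ApproximateNumberOfMessages attribute (approximate, not exact)
if (queueDepth > MAX_DEPTH) {
  rejectOrDelay(); // shed load or slow the producer before sending
}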

Backpressure must exist before SQS, not after.
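
One way the depth check might look in practice is sketched below. It assumes `sqs` exposes promise-returning `getQueueAttributes` and `sendMessage` methods (as the aggregated SQS client in AWS SDK for JavaScript v3 does); `getQueueDepth`, `sendWithBackpressure`, and `maxDepth` are hypothetical names.

```javascript
// Reads the approximate number of visible messages in the queue.
async function getQueueDepth(sqs, QueueUrl) {
  const { Attributes = {} } = await sqs.getQueueAttributes({
    QueueUrl,
    AttributeNames: ["ApproximateNumberOfMessages"],
  });
  // Attribute values come back as strings.
  return Number(Attributes.ApproximateNumberOfMessages ?? 0);
}

// Sends only while the backlog is under maxDepth; otherwise tells the caller to back off.
async function sendWithBackpressure(sqs, QueueUrl, body, maxDepth) {
  const depth = await getQueueDepth(sqs, QueueUrl);
  if (depth > maxDepth) {
    return { accepted: false, depth };
  }
  await sqs.sendMessage({ QueueUrl, MessageBody: body });
  return { accepted: true, depth };
}
```

Checking the depth on every send costs an extra request per message, so caching the value for a few seconds is a reasonable compromise. The count is approximate, which makes this a soft limit rather than a hard one.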

🔟 SQS + Autoscaling = Delay Amplification

Failure Pattern

  1. Queue depth increases

  2. Autoscaling reacts (slow)

  3. New consumers cold-start

  4. Visibility timeouts expire

  5. Duplicate work explodes

Autoscaling reacts too late to save correctness.
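
A common mitigation is to scale on backlog per consumer instead of raw queue depth, so the target reflects how much delay is acceptable. The calculation can be sketched as follows; `desiredConsumers` and its parameter names are hypothetical, and the inputs are values you measure or choose yourself.

```javascript
// How many consumers are needed to drain the backlog within an acceptable delay.
function desiredConsumers({
  queueDepth,
  avgProcessingSeconds,
  acceptableDelaySeconds,
  minConsumers = 1,
  maxConsumers = 100,
}) {
  // Messages one consumer can work through within the acceptable delay.
  const backlogPerConsumer = Math.max(1, Math.floor(acceptableDelaySeconds / avgProcessingSeconds));
  const needed = Math.ceil(queueDepth / backlogPerConsumer);
  // Clamp to the configured fleet bounds.
  return Math.min(maxConsumers, Math.max(minConsumers, needed));
}

// 1500 queued messages, 2 s each, 60 s acceptable delay: 30 messages per consumer.
console.log(desiredConsumers({ queueDepth: 1500, avgProcessingSeconds: 2, acceptableDelaySeconds: 60 })); // 50
```

This does not remove the cold-start lag described above. It only makes the scaling signal proportional to the delay you care about, so keeping a warm minimum of consumers still matters.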

1️⃣1️⃣ FIFO Queues & Message Groups (Subtle Bug)

FIFO ordering is per MessageGroupId, not global.

❌ Buggy Design

All messages → same group

Result:

  • One slow message blocks entire queue

✅ Correct Design

MessageGroupId = entityId

Parallelism with correctness.
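
As a sketch, the send parameters for per-entity grouping might be built like this. `buildFifoMessage` is a hypothetical helper, and the `entityId` and `eventId` fields on the event are assumptions about your payload; `MessageGroupId` and `MessageDeduplicationId` are the real SendMessage parameters for FIFO queues.

```javascript
// Builds SendMessage parameters for a FIFO queue, grouping by entity so that
// ordering is kept per entity while different entities can be processed in parallel.
function buildFifoMessage(QueueUrl, event) {
  return {
    QueueUrl, // FIFO queue names (and therefore URLs) end in ".fifo"
    MessageBody: JSON.stringify(event),
    MessageGroupId: String(event.entityId),
    // A resend with the same ID inside the 5-minute deduplication window is
    // accepted but not delivered a second time.
    MessageDeduplicationId: String(event.eventId),
  };
}

const params = buildFifoMessage("https://example.invalid/orders.fifo", {
  entityId: "order-42",
  eventId: "evt-1001",
  type: "OrderPaid",
});
console.log(params.MessageGroupId); // order-42
```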

1️⃣2️⃣ Dead-Letter Queues (DLQ) Are Not Optional

What DLQs really do

  • Isolate poison messages so they stop being retried forever

  • Preserve forward progress

  • Limit blast radius

maxReceiveCount = 5

After 5 receives without a successful delete → DLQ.

DLQs are correctness boundaries.
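
The redrive policy is set as a queue attribute on the source queue. A minimal sketch of building that attribute follows; `buildRedriveAttributes` is a hypothetical helper and the ARN is a placeholder, while `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the real attribute and field names.

```javascript
// Builds the Attributes object for SetQueueAttributes (or CreateQueue) on the source queue.
function buildRedriveAttributes(deadLetterQueueArn, maxReceiveCount = 5) {
  return {
    // RedrivePolicy is passed as a JSON string, not as a nested object.
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: deadLetterQueueArn,
      maxReceiveCount,
    }),
  };
}

// Placeholder ARN for illustration only.
const attributes = buildRedriveAttributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq");
// With the AWS SDK v3 aggregated client this would be applied roughly as:
//   await sqs.setQueueAttributes({ QueueUrl, Attributes: attributes });
```

A DLQ only helps if someone watches it, so an alarm on its depth is usually worth adding alongside the policy.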

1️⃣3️⃣ Ordering vs Throughput Tradeoff (Design Choice)

Requirement          Choice
Max throughput       Standard
Exact order          FIFO
Entity-level order   FIFO + group
Simplicity           Standard

1️⃣4️⃣ Multi-Tenancy + SQS (Noisy Neighbor Again)

❌ One Queue for All Tenants

  • One tenant floods queue

  • Others wait indefinitely

✅ Isolation Strategies

  • Queue per tenant

  • MessageGroupId per tenant (FIFO queues only)

  • Per-tenant consumer pools

  • Per-tenant rate limits
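
Of the strategies above, per-tenant rate limiting is the easiest to sketch in isolation. The example below is one possible approach, a token bucket per tenant applied on the producer side before sending; `createTenantLimiter`, `ratePerSecond`, `burst`, and `tryAcquire` are hypothetical names, and a multi-process deployment would need a shared store rather than an in-memory Map.

```javascript
// One token bucket per tenant; `now` is injectable so the limiter is easy to test.
function createTenantLimiter({ ratePerSecond, burst, now = () => Date.now() }) {
  const buckets = new Map(); // tenantId -> { tokens, last }
  return function tryAcquire(tenantId) {
    const t = now();
    const bucket = buckets.get(tenantId) ?? { tokens: burst, last: t };
    // Refill in proportion to elapsed time, capped at the burst size.
    bucket.tokens = Math.min(burst, bucket.tokens + ((t - bucket.last) / 1000) * ratePerSecond);
    bucket.last = t;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    buckets.set(tenantId, bucket);
    return allowed;
  };
}

// Usage: a noisy tenant is throttled without affecting a quiet one.
const tryAcquire = createTenantLimiter({ ratePerSecond: 5, burst: 10 });
if (!tryAcquire("tenant-a")) {
  // reject or delay this tenant's message instead of enqueueing it
}
```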

1️⃣5️⃣ Exactly-Once Is a Lie (Again)

Even FIFO SQS:

  • Can redeliver

  • Can duplicate on visibility timeout

  • Can duplicate on producer retries outside the 5-minute deduplication window

SQS gives at-least-once, forever.