SQS vs Kafka: Correctness Tradeoffs
Choosing between “event delivery” and “event history”
SQS and Kafka solve different correctness problems.
Throughput is secondary.
1️⃣ The Wrong Question People Ask
❌ Wrong
“Which is faster? SQS or Kafka?”
✅ Right
“What correctness guarantees does my system actually need?”
Because once you pick:
SQS → you accept lossy history
Kafka → you accept coordination cost
You cannot escape the tradeoff.
2️⃣ Core Mental Model (This Matters)
SQS Mental Model
A distributed work queue
Message exists until processed
After deletion ā gone forever
Queue represents pending work
Kafka Mental Model
An append-only distributed log
Messages are immutable
Consumers track position
Log represents event history
This difference explains every correctness tradeoff.
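A minimal sketch of the two mental models in plain Python, with no AWS or Kafka client involved. `WorkQueue` and `EventLog` are hypothetical toy classes invented for illustration; they model only the lifetime of a message, not durability or distribution.

```python
from collections import deque


class WorkQueue:
    """Toy work queue: a message exists only until a consumer deletes it."""

    def __init__(self):
        self._pending = deque()

    def send(self, body):
        self._pending.append(body)

    def receive_and_delete(self):
        # Once this returns, the queue holds no record of the message.
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


class EventLog:
    """Toy append-only log: records stay put, readers track their own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset):
        return self._records[offset:]


queue, log = WorkQueue(), EventLog()
for event in ["created", "paid", "shipped"]:
    queue.send(event)
    log.append(event)

# Drain the queue: every message is handed out once and then forgotten.
while queue.receive_and_delete() is not None:
    pass

remaining_in_queue = len(queue)  # pending work left after draining
history_in_log = log.read(0)     # the log can still be read from the start
```

After the drain, the queue is empty and has no memory of what passed through it, while the log still returns all three events from offset 0.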
3️⃣ Delivery Semantics (The Foundation)
| Property | SQS | Kafka |
|---|---|---|
| Delivery | At-least-once | At-least-once |
| Exactly-once | ❌ | ⚠️ (within constraints) |
| Ordering | FIFO queues only (per message group) | Partition-level |
| Message retention | Until deleted, or until the retention period expires (max 14 days) | Time/size-based |
| Replay | ❌ | ✅ |
Kafka remembers.
SQS forgets.
4️⃣ Duplicate Processing (Both Have It, Differently)
SQS Duplicates
Visibility timeout expires
Consumer crashes
Message reappears
Process → crash → process again
Kafka Duplicates
Consumer crashes before commit
Offset not committed
Message replayed
Process → crash → replay from offset
Both are at-least-once.
But Kafka lets you replay intentionally.
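The SQS duplicate path can be sketched with an in-memory stand-in. `ToyQueue` is a hypothetical class, not the SQS API, and time is passed in as a plain number (`now`) so the example runs without sleeping.

```python
class ToyQueue:
    """In-memory stand-in for an SQS-style queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, invisible_until]

    def send(self, msg_id, body):
        self._messages[msg_id] = [body, 0]

    def receive(self, now):
        for msg_id, entry in self._messages.items():
            body, invisible_until = entry
            if now >= invisible_until:
                entry[1] = now + self.visibility_timeout  # hide while in flight
                return msg_id, body
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("m1", "charge order 42")

deliveries = []

# t=0: a consumer receives the message, then crashes before calling delete().
deliveries.append(queue.receive(now=0))

# t=10: the message is still in flight, so nothing is delivered.
deliveries.append(queue.receive(now=10))

# t=31: the visibility timeout has expired and the message reappears.
msg_id, body = queue.receive(now=31)
deliveries.append((msg_id, body))
queue.delete(msg_id)  # this time the consumer finishes and deletes

# t=100: the message was deleted, so nothing comes back.
deliveries.append(queue.receive(now=100))
```

The same body is delivered twice, at t=0 and at t=31, which is the "process → crash → process again" path. Once `delete()` succeeds there is nothing left to redeliver, and nothing left to replay either.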
5️⃣ Correctness Implication #1: Idempotency
In SQS
Idempotency is mandatory.
No replay = no recovery
If you mess up:
- Data is wrong forever
In Kafka
Idempotency is strongly recommended.
But you have:
Rewind
Reprocess
Fix-forward
Kafka tolerates mistakes.
SQS does not.
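One common way to make a consumer idempotent is to remember which message ids have already been applied. The sketch below assumes each message carries a unique `id`; the function name and message shape are hypothetical.

```python
def apply_payment(balances, processed_ids, message):
    """Apply a payment at most once per message id, however often it is delivered."""
    if message["id"] in processed_ids:
        return False  # duplicate delivery: skip the side effect
    account = message["account"]
    balances[account] = balances.get(account, 0) + message["amount"]
    processed_ids.add(message["id"])
    return True


balances, processed_ids = {}, set()
message = {"id": "pay-1", "account": "alice", "amount": 50}

first = apply_payment(balances, processed_ids, message)
second = apply_payment(balances, processed_ids, message)  # redelivery after a crash
```

The second delivery is recognised and skipped, so the balance is credited once. In a real system the processed-id record and the balance update would need to be written atomically (for example in one database transaction), otherwise a crash between the two writes reopens the gap.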
6️⃣ Ordering Guarantees (Subtle but Huge)
SQS
Standard: ❌ no ordering guarantee (best-effort only)
FIFO: ordering per MessageGroupId
If one message is slow:
- Entire group blocks
Kafka
Ordering per partition
You control partitioning key
orderId → same partition
Parallelism + ordering = possible.
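A sketch of why keying by `orderId` preserves per-order ordering while still spreading load. CRC32 stands in for the partitioner's hash purely for illustration (Kafka's Java client uses a murmur2 hash for keyed records), and `partition_for` and `events_for` are hypothetical helpers.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 here, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]

num_partitions = 4
partitions = {p: [] for p in range(num_partitions)}
for order_id, event in events:
    # Every event for a given order lands in the same partition, in send order.
    partitions[partition_for(order_id, num_partitions)].append((order_id, event))


def events_for(order_id):
    """Events for one order, read back from the single partition that owns it."""
    owner = partitions[partition_for(order_id, num_partitions)]
    return [event for key, event in owner if key == order_id]
```

Each order's events come back in the order they were produced, because one key always maps to one partition. Different orders may land on different partitions, and those can be consumed in parallel.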
7️⃣ Correctness Implication #2: Causality
SQS
Once message is deleted:
Causality is lost
History is gone
You cannot answer:
“What happened before this?”
Kafka
Log preserves causality.
You can:
Reconstruct timelines
Debug bugs
Validate invariants
Kafka is debuggable.
SQS is operational.
8️⃣ Failure Recovery (Huge Difference)
SQS Failure Recovery
| Failure | Outcome |
|---|---|
| Buggy consumer | Data corrupted |
| Bad deploy | Messages gone |
| Wrong logic | No rewind |
Only option:
Manual repair
Re-run upstream jobs
Kafka Failure Recovery
| Failure | Outcome |
|---|---|
| Buggy consumer | Reset offset |
| Bad deploy | Replay |
| New logic | Reprocess |
Kafka gives you time travel.
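The "buggy consumer → reset offset" row can be sketched with a list standing in for a partition. Everything here is a toy: `consume`, the handlers, and the record shape are hypothetical.

```python
log = [("order-1", 100), ("order-2", 250), ("order-3", 75)]  # (order id, amount in cents)


def consume(log, start_offset, handler, totals):
    """Run handler over the log from start_offset; return the next offset to commit."""
    for order_id, amount in log[start_offset:]:
        totals[order_id] = handler(amount)  # overwrite by key, so reruns are safe
    return len(log)


def buggy_handler(amount):
    return amount * 100  # bug: treats cents as if they were dollars


def fixed_handler(amount):
    return amount


totals = {}
committed = consume(log, 0, buggy_handler, totals)
corrupted = dict(totals)  # snapshot of the damage done by the bad deploy

# Recovery: deploy the fix, reset the committed offset to 0, and reprocess.
committed = consume(log, 0, fixed_handler, totals)
```

The replay repairs the totals because the log still holds the original records and the write is an overwrite by key. In a real cluster the reset is typically done with the `kafka-consumer-groups.sh --reset-offsets` tool while the consumer group is stopped, and it only reaches as far back as the topic's retention allows.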
9️⃣ DLQ vs Replay (Philosophical Difference)
SQS DLQ
“This message is broken”
Isolate and move on
Correctness boundary
Kafka Replay
“Processing was wrong”
Fix logic
Re-run history
Correctness recovery
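The DLQ side of this contrast can be sketched as a retry loop with a receive limit, loosely modelled on the `maxReceiveCount` setting of an SQS redrive policy. `drain` and `parse_amount` are hypothetical names, and the lists stand in for real queues.

```python
def drain(messages, handler, max_receive_count):
    """Deliver each message until handled; park it in the DLQ after too many failures."""
    processed, dead_letters = [], []
    for message in messages:
        for _attempt in range(max_receive_count):
            try:
                processed.append(handler(message))
                break
            except ValueError:
                continue  # leave the message to be received again
        else:
            dead_letters.append(message)  # receive count exhausted
    return processed, dead_letters


def parse_amount(message):
    return int(message)  # raises ValueError on a malformed payload


processed, dead_letters = drain(["10", "oops", "25"], parse_amount, max_receive_count=3)
```

The malformed message is isolated after three attempts and the healthy ones proceed. Note what the DLQ does not give you: if `parse_amount` itself were wrong, the messages it had already "successfully" processed would be gone.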
1️⃣0️⃣ Backpressure & Load (Correctness Angle)
SQS
Producers keep producing
Queue depth grows
Processing delay grows silently unless you alarm on queue depth or oldest-message age
Correctness risk:
- Time-sensitive events become meaningless
Kafka
Consumers lag
Lag is observable
Replay window bounded by retention
Correctness risk:
Lag-based staleness
But visible and measurable
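Lag is measurable because it is simple arithmetic: the partition's latest offset minus the group's committed offset. A sketch with made-up numbers (`consumer_lag` is a hypothetical helper, and the offsets are illustrative, not taken from a real cluster):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: records written but not yet committed by the group."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


end_offsets = {0: 1200, 1: 980, 2: 1500}       # latest offset per partition
committed_offsets = {0: 1200, 1: 900, 2: 700}  # the group's committed position

lag = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())
```

Here partition 0 is caught up, partition 1 is 80 records behind, and partition 2 is 800 behind. Per-partition numbers matter more than the total: one stuck partition can hide behind a healthy-looking average.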
1️⃣1️⃣ Exactly-Once Semantics (Reality Check)
SQS
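❌ Not end to end: FIFO queues deduplicate sends within a 5-minute window, but a message can still be redelivered after its visibility timeout expires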
Kafka
⚠️ Possible only if:
Idempotent producers
Transactional writes
Single Kafka cluster
Controlled sinks
Even then:
Exactly-once is contextual, not absolute.
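One reading of "controlled sinks" is a sink that stores the consumer's offset in the same transaction as the result, so a redelivered record can be detected and skipped. The sketch below uses SQLite from the standard library to show that idea. It is not Kafka's transactions API, and the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
conn.execute("INSERT INTO offsets VALUES (0, 0)")
conn.commit()


def process(conn, partition, offset, account, amount):
    """Apply one record and advance the stored offset in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        (next_offset,) = conn.execute(
            "SELECT next_offset FROM offsets WHERE part = ?", (partition,)
        ).fetchone()
        if offset < next_offset:
            return False  # already applied: a redelivery after a crash
        conn.execute("INSERT OR IGNORE INTO totals VALUES (?, 0)", (account,))
        conn.execute(
            "UPDATE totals SET amount = amount + ? WHERE account = ?", (amount, account)
        )
        conn.execute(
            "UPDATE offsets SET next_offset = ? WHERE part = ?", (offset + 1, partition)
        )
        return True


results = [
    process(conn, 0, 0, "alice", 50),
    process(conn, 0, 1, "alice", 25),
    process(conn, 0, 1, "alice", 25),  # the same record delivered again
]
alice_total = conn.execute(
    "SELECT amount FROM totals WHERE account = 'alice'"
).fetchone()[0]
```

The redelivered record at offset 1 is rejected, so the total reflects each record once. This is what "contextual" means in practice: the guarantee holds only because the sink and the offset share one transaction. An email send or an HTTP call has no such boundary.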
1️⃣2️⃣ Multi-Consumer Correctness
SQS
Each message → one consumer
Bad for:
Fan-out
Independent consumers
You must duplicate queues: one queue per consumer, typically fed by an SNS topic.
Kafka
Many consumers can read same log.
Good for:
Analytics
Auditing
Side effects
Kafka separates event storage from event consumption.
1️⃣3️⃣ D2: Mental Model Comparison
[Diagram: SQS vs Kafka mental models]
1️⃣4️⃣ When SQS Is the Correct Choice
Use SQS when:
You want work distribution
You don’t care about history
You want minimal ops
You can tolerate duplicates
You want managed simplicity
Examples:
Email sending
Image processing
Background jobs
Async side effects
1️⃣5️⃣ When Kafka Is the Correct Choice
Use Kafka when:
You need event history
Replay matters
Debuggability matters
Multiple consumers exist
Correctness > simplicity
Examples:
Payments
Order lifecycle
Analytics
CDC
Audit logs
1️⃣6️⃣ The Hidden Cost Tradeoff
| Dimension | SQS | Kafka |
|---|---|---|
| Ops cost | Low | High |
| Correctness recovery | ❌ | ✅ |
| Debuggability | ❌ | ✅ |
| Time travel | ❌ | ✅ |
| Simplicity | ✅ | ❌ |
Kafka charges operational complexity
in exchange for correctness leverage.
1️⃣7️⃣ Design Rule (Hard-Won)
If you cannot afford to lose history, do not use SQS.
If you cannot afford operational complexity, do not use Kafka.
