Back-of-the-Envelope Calculations in System Design: The Engineer's Guide to Smart Estimation


Ever been handed a system to deploy and asked, "How many servers do you need?" — only to freeze? Most junior engineers answer with a random guess. Senior engineers reach for a back-of-the-envelope calculation. That one skill separates systems that scale confidently from systems that crash unpredictably.

Key Takeaways

  • Back-of-the-envelope (BOTE) calculations are rough approximations used to estimate server count, storage, and QPS before designing a system.
  • Writes in distributed systems are approximately 40x more expensive than reads (DEV Community, 2025).
  • Always round aggressively — there are approximately 100,000 seconds in a day, not 86,400.
  • A Twitter-scale system handling 150M daily active users requires ~3,500 writes/second and ~55 petabytes of storage over five years.

What Are Back-of-the-Envelope Calculations?

Back-of-the-envelope estimation is a quick way to roughly calculate system requirements like traffic, storage, bandwidth, and servers using simple math, so engineers can plan a scalable and cost-effective system design.

The name itself tells you everything. Picture an old-fashioned envelope, the kind you'd mail a letter in. The back of that envelope is your calculation space: informal, quick, approximate. The goal is to reach a ballpark figure faster than a formal analysis would allow. That doesn't mean a wild guess: the result should be more accurate than a guess, but less precise than a formal calculation.

The 'back of the envelope' — rough calculations done fast, before any formal design begins.

This concept was made famous by Google scientist Jeff Dean, whose presentation on "Numbers Everyone Should Know" became foundational reading for systems engineers. These numbers are useful in designing efficient systems, making decisions like "is it better for all stateless servers to keep this data on local disk, or to send an RPC to an in-cluster server that has it in RAM?"

Why does this matter in real life?

When your DevOps team asks what resources you need to deploy a service, you can't say "give me 8 CPUs" without knowing why. Too many CPUs wastes money. Too few causes crashes at peak load. Back-of-the-envelope calculations help you estimate, reason, and catch potential issues long before they become expensive mistakes.


The Two Numbers Every System Designer Must Memorize

How Many Seconds Are in a Day?

24 hours × 60 minutes × 60 seconds = 86,400 seconds

Round it. For back-of-the-envelope purposes, there are ~100,000 seconds in a day. That's the magic number you use in every QPS calculation.
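In code form, the rounding step is trivial; this throwaway sketch just shows how small the error you accept actually is:

```python
# Exact seconds in a day vs. the estimation-friendly round number.
exact_seconds = 24 * 60 * 60    # 86,400
rounded_seconds = 100_000       # the "magic number" for QPS math

# The rounding error is about 16% -- negligible for order-of-magnitude work.
error = (rounded_seconds - exact_seconds) / exact_seconds
print(exact_seconds, rounded_seconds, round(error, 2))  # 86400 100000 0.16
```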

Data Size Units

Unit                Value                          Human-Readable
Byte                8 bits
Kilobyte (KB)       1,000 bytes                    A short text message
Megabyte (MB)       1,000,000 bytes                A high-res photo
Gigabyte (GB)       1,000,000,000 bytes            A movie file
Terabyte (TB)       1,000,000,000,000 bytes        A small data center rack
Petabyte (PB)       1,000,000,000,000,000 bytes    What large tech companies store

Think of it as a chain: multiply by 1,000 at each step. KB → MB → GB → TB → PB.
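That ×1,000 chain is easy to encode. The helper below is a sketch of my own (the function name and structure are not from the article); it walks a byte count up the chain using decimal units, matching the table above:

```python
# Walk the B -> KB -> MB -> GB -> TB -> PB chain, multiplying by 1,000 each step.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]

def human_bytes(n: float) -> str:
    """Express a byte count in the largest sensible unit (decimal, not binary)."""
    for unit in UNITS:
        if n < 1000:
            return f"{n:g} {unit}"
        n /= 1000
    return f"{n:g} PB"  # anything past petabytes stays in petabytes

print(human_bytes(254))          # "254 B"  -- one text-only tweet
print(human_bytes(30 * 10**12))  # "30 TB"  -- one day of media uploads
```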


Latency Numbers You Should Know (and Why They Matter)

The latency numbers compiled by Peter Norvig and popularized by Jeff Dean have entered the industry's canon, and they only grow more important as the typical developer works at an ever greater distance from the L1 cache.

Here's a simplified version of those numbers:

Operation                      Approximate Latency
L1 cache reference             0.5 nanoseconds
L2 cache reference             7 nanoseconds
Main memory (RAM) reference    100 nanoseconds
SSD random read                100 microseconds
Disk seek                      10 milliseconds
Cross-datacenter round trip    150 milliseconds

The core insight? Memory is fast. Disk is slow.

That's why Redis (an in-memory store) responds faster than MySQL (disk-based). That's why you cache hot data rather than re-querying your database on every request. If your estimates rest on unreasonable numbers, the design built on them inherits the flaw. So when designing a system, make rough back-of-the-envelope estimates first, then optimize and expand from there.

Two practical rules fall out of these latency numbers:

  1. Avoid disk seeks when possible. Every disk hit costs roughly 10 ms, which is about 100,000× slower than a main-memory reference and 20 million times slower than an L1 cache hit.
  2. Compress before sending over the network. Bandwidth between data centers is finite and cross-region latency is real. A user in Kathmandu hitting a US server adds ~200ms of round-trip time before your app even processes the request.
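Keeping these numbers in a common unit makes the ratios obvious. This sketch stores the table above in nanoseconds (the dictionary and its key names are mine, not a standard):

```python
# Jeff Dean's latency numbers, expressed in nanoseconds so ratios are one division.
LATENCY_NS = {
    "l1_cache": 0.5,
    "l2_cache": 7,
    "ram": 100,
    "ssd_random_read": 100_000,          # 100 microseconds
    "disk_seek": 10_000_000,             # 10 milliseconds
    "cross_dc_round_trip": 150_000_000,  # 150 milliseconds
}

# How much slower is a disk seek than a RAM reference?
print(LATENCY_NS["disk_seek"] / LATENCY_NS["ram"])  # 100000.0 -- five orders of magnitude
```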

Real-World Example: Estimating Twitter's QPS and Storage

Let's work through a complete back-of-the-envelope calculation. Suppose your team is building a Twitter-like system. Your DevOps lead asks: "What resources do we need?"

Here's how you answer it methodically.

Step 1: Write Down Your Assumptions

Never skip this. Write your assumptions explicitly — it keeps the math honest and gives others something to challenge if your estimates are off.

Assumptions:
- Monthly Active Users (MAU): 300 million
- Daily Active Users (DAU): 50% of MAU = 150 million
- Average tweets per user per day: 2
- Tweets containing media: 10%
- Average media size per tweet: 1 MB
- Data retention: 5 years

The 50% daily engagement ratio is a common industry approximation. Real platforms vary between 30–65% depending on user notification strategies and content quality.

Step 2: Calculate Queries Per Second (QPS)

Daily tweets = 150M users × 2 tweets = 300M tweets/day

QPS = 300M tweets / 100,000 seconds ≈ 3,000 tweets/second

Round up → QPS ≈ 3,500 tweets/second

That means your database must handle at least 3,500 write operations per second under normal load. This narrows your technology choices significantly — not every database handles that write throughput without sharding.

Step 3: Estimate Peak QPS

Normal load isn't the whole story. What happens during a major global event — an election result, a sports championship, a viral moment?

Peak QPS = Normal QPS × 2 = 3,500 × 2 ≈ 7,000 tweets/second

Make reasonable assumptions, ask clarifying questions, and communicate your thought process; back-of-the-envelope analysis is supposed to be approximate and assumption-driven.

Your system must survive 7,000 writes/second without degrading. This tells you to plan for auto-scaling, read replicas, and write queues (like Kafka) to smooth out traffic bursts.
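Steps 2 and 3 condense into a few lines of arithmetic. This sketch simply restates the article's numbers in Python:

```python
# Back-of-the-envelope write QPS for the Twitter-like example.
dau = 150_000_000          # daily active users (50% of 300M MAU)
tweets_per_user = 2
seconds_per_day = 100_000  # rounded from 86,400

daily_tweets = dau * tweets_per_user  # 300M tweets/day
qps = daily_tweets / seconds_per_day  # 3,000 writes/sec
qps_rounded = 3_500                   # rounded up for a safety margin
peak_qps = qps_rounded * 2            # 7,000 writes/sec at peak

print(qps, peak_qps)  # 3000.0 7000
```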

Step 4: Calculate Storage Requirements

Each tweet stores:

Field                                  Size
Tweet ID (UUID-style)                  64 bytes
Tweet text (average)                   140 bytes
Metadata (timestamp, user ID, etc.)    ~50 bytes
Total per tweet (no media)             ~254 bytes

Text-only storage:

300M tweets/day × 254 bytes ≈ 75 GB/day → ~137 TB over 5 years

Media storage:

10% of 300M tweets = 30M media tweets/day
30M × 1 MB = 30 TB/day
30 TB × 365 × 5 ≈ 55 Petabytes over 5 years

Total estimated storage: ~55 PB for media alone, plus ~137 TB for structured tweet data.

That's a concrete number to hand your infrastructure team. Now they can source the right storage tier, negotiate cloud contracts, and plan replication across data centers.
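The Step 4 arithmetic, restated as a sketch in Python. Note that skipping the intermediate 75 GB/day rounding lands the text total at ~139 TB rather than ~137 TB, the same ballpark, which is exactly the point:

```python
# Five-year storage estimate for the Twitter-like example.
daily_tweets = 300_000_000
bytes_per_tweet = 254     # ID + text + metadata, no media
media_fraction = 0.10     # 10% of tweets carry media
media_bytes = 1_000_000   # 1 MB average per media tweet
days = 365 * 5

text_total = daily_tweets * bytes_per_tweet * days                 # ~139 TB
media_total = daily_tweets * media_fraction * media_bytes * days   # ~55 PB

print(round(text_total / 10**12), "TB of tweet text")   # 139 TB of tweet text
print(round(media_total / 10**15), "PB of media")       # 55 PB of media
```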


The Architecture Implications of Your Estimates

What do these numbers actually tell you about system design?

How a Twitter-scale system handles 150M daily active users — from load balancing to 55 PB of media storage.

A 3,500 write/sec requirement tells you:

  • Single-instance databases won't work. You need horizontal sharding.
  • You need a write queue. Direct synchronous writes at that volume will create contention.
  • Reads will be even heavier. Twitter's read-to-write ratio is estimated at 100:1, meaning your read infrastructure must handle ~350,000 reads/second.
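Turning the read-to-write ratio into a read-side number is a single multiplication (the ratio follows the article's 100:1 estimate):

```python
# Derive read QPS from write QPS and the read:write ratio.
write_qps = 3_500
read_write_ratio = 100  # reads per write, per the article's estimate

read_qps = write_qps * read_write_ratio
print(read_qps)  # 350000 -- this is what your caching layer must absorb
```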

Writes are approximately 40x more expensive than reads in distributed systems. Frequent writes create high contention, and to scale writes, you need to partition — which makes maintaining shared state like counters difficult.


How Engineers Actually Use This in Practice

In real deployments, the back-of-the-envelope process doesn't happen in a vacuum. It happens in conversation with DevOps teams who need concrete numbers before provisioning resources.

Here's a practical workflow that works well for smaller systems where production traffic data doesn't exist yet:

1. Start with bare minimum resources. Deploy on a 2-CPU, 4GB RAM machine. This isn't where you'll stay — it's your measurement baseline.

2. Simulate realistic load. Even 8–10 internal users generating real traffic reveal CPU utilization patterns, memory pressure, and query-per-second rates.

3. Extrapolate to expected production scale. If 10 users consume X% of a 2-CPU machine, you can project what 100, 1,000, or 10,000 users require — adjusted for the assumption that only 30% of signups become daily active users.

4. Build in a headroom buffer. Never size your infrastructure at exactly 100% of your estimate. Plan for 1.5–2× estimated peak. Unexpected traffic spikes are not exceptional events — they're normal operating conditions.
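The four steps above can be sketched as one extrapolation. Every input here is illustrative, not a measured value from the article; the point is the shape of the calculation, baseline measurement scaled by expected users and padded with headroom:

```python
# Extrapolate from a small measured baseline to a production sizing estimate.
baseline_users = 10
baseline_cpu_fraction = 0.20  # 10 test users used 20% of the baseline box
cpus_in_baseline = 2          # the 2-CPU measurement machine

signups = 10_000              # hypothetical expected signups
dau_ratio = 0.30              # assume 30% of signups become daily actives
headroom = 2.0                # plan for 2x the estimated peak

expected_dau = signups * dau_ratio
cpu_per_user = baseline_cpu_fraction * cpus_in_baseline / baseline_users
estimated_cpus = expected_dau * cpu_per_user * headroom

print(estimated_cpus)  # 240.0 CPUs -- so plan for a fleet, not one machine
```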

Back-of-the-envelope calculations are critical in system design interviews, especially in the early stages. Candidates make intelligent approximations about the different types of resources required. Resources can include the number of servers, bandwidth, system memory, compute machines, cost, and more.


Three Rules That Make Your Estimates More Reliable

Rule 1: Round Aggressively

You're not doing accounting. Round numbers aggressively so you can arrive at rough estimates quickly. For instance, assuming 1,000 users instead of 1,024 simplifies the math and still gives a reasonable approximation.

If your calculation produces 99,987, call it 100,000. If it produces 86,400, call it 100,000. Precision is your enemy here — it slows you down and gives false confidence.

Rule 2: Always Label Your Units

"I need 5 of storage" is meaningless. "I need 5 TB of SSD storage" tells your infrastructure team exactly what to order. Write units at every step. KB, MB, GB, TB, PB — the difference between them is 1,000x each time. A unit error at scale means billions of dollars in wasted infrastructure or, worse, a system that can't store its data.

Rule 3: Write Down Your Assumptions Before the Math

Your assumptions are the most important output of this process — more than the numbers themselves. If your estimate turns out wrong, the assumption is where the error lives. You are expected to make informed decisions and discuss trade-offs based on the back-of-the-envelope calculations. The analysis is about evaluating problem-solving skills, not computing exact answers.

In a team setting, written assumptions also serve as documentation. When your system gets three times more traffic than expected six months later, you can look back and see why your original estimate was low — was it the DAU ratio? The media upload rate? The storage calculation? That retrospective loop is how engineers get better at estimation over time.


Frequently Asked Questions

What is a back-of-the-envelope calculation in system design?

It's a rough, quick estimate of the compute, storage, and bandwidth a system needs — done before detailed design begins. Engineers use approximations and simple arithmetic to arrive at ballpark numbers for decisions like server count, database configuration, and storage tier. The goal is directional accuracy, not precision.

How do I calculate QPS for a system design interview?

Start with Daily Active Users (DAU), multiply by the average number of actions per user per day, then divide by 100,000 (approximate seconds in a day). For peak QPS, multiply your result by 2. For a read-heavy system, estimate read QPS separately using your read-to-write ratio, which is often 10:1 to 100:1 for social platforms.

What's the difference between QPS and peak QPS?

QPS (queries per second) is your average load under normal traffic. Peak QPS is the maximum load during traffic spikes: breaking news, viral content, product launches, or scheduled events. Peak estimates drive your bandwidth, server-count, and load-balancing decisions. Always design for peak, not average.

How accurate do back-of-the-envelope calculations need to be?

They don't need to be exact — they need to be in the right order of magnitude. Being off by 2× is fine. Being off by 100× means you missed an assumption. The goal is to avoid wildly wrong decisions (provisioning 10 servers when you need 1,000) rather than to achieve precision.

When should I do a back-of-the-envelope calculation?

Before any significant system design decision: choosing a database type, planning a deployment, estimating infrastructure costs, or answering a system design interview question. This method helps validate your system design early and catches potential issues long before they become expensive mistakes.


Conclusion

Back-of-the-envelope calculations aren't a formal discipline — they're an engineering habit. The engineers who do this well aren't doing complex math. They're applying a handful of memorized numbers (seconds in a day, data unit conversions, latency figures) to structured assumptions about user behavior.

The Twitter example shows the full loop: 300M MAU → 150M DAU → 3,500 writes/second normal, 7,000 peak → 55 PB of media storage over five years. Those numbers didn't come from simulation software. They came from multiplication and rounding.

Start small. Deploy on minimal resources. Measure real usage. Extrapolate. Adjust. That's how estimation skills improve — not through memorizing formulas, but through closing the loop between prediction and reality.


Want to go deeper? Check out ByteByteGo's system design resources and Jeff Dean's original latency numbers presentation for the foundational numbers behind these calculations.
