How Video Streaming Actually Works
Most people have noticed the tiny gear icon on YouTube that lets you switch between 360p and 4K. Most people have also clicked it exactly once, out of curiosity, and never thought about it again. But what's actually happening under the hood is more interesting than it looks. Getting video to play smoothly across a smartwatch, a 4K TV, a 3G phone in rural Nepal, and a fiber-connected laptop — all at the same time — is a genuinely hard problem. Here's how it evolved.
A video is just a lot of images
Before anything else: a video is a sequence of still images (frames) played fast enough that your brain reads them as motion. That's it.
- 60 FPS feels smooth and almost hyper-real
- 24 FPS is the cinema standard — still fluid
- 1 FPS is basically a slideshow
Common containers like MP4, MOV, MKV, and AVI all store these frames in compressed form. A short 4K clip can still hit 4–5 GB. That size is what makes streaming hard.
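To see why compression is non-negotiable, it helps to run the numbers on uncompressed footage. A quick back-of-the-envelope sketch (assuming 24-bit RGB and no compression at all):

```python
# Why raw video is unstreamable: size of one minute of uncompressed 4K.
width, height = 3840, 2160   # 4K UHD resolution
bytes_per_pixel = 3          # 24-bit RGB, no compression
fps = 24                     # cinema frame rate
duration_s = 60              # one minute of footage

raw_bytes = width * height * bytes_per_pixel * fps * duration_s
print(f"{raw_bytes / 1e9:.1f} GB per minute, uncompressed")  # → 35.8 GB
```

Codecs inside those containers shrink this by two orders of magnitude or more, which is how a 4K clip ends up at gigabytes instead of terabytes.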
The first attempt: download everything first
In the early 2000s, the approach was simple. The client requests a video, the server sends the whole file, and playback starts when the download finishes.
This is called progressive download, and it was genuinely terrible.
You'd sit there watching a progress bar fill up before you could watch anything. If you bailed halfway through, you'd already downloaded the full file. On a slow connection, it was basically unusable. This is why buffering became a cultural shorthand for frustration — people would walk away from their computers mid-download and come back hoping it had finished.
Then came specialized streaming protocols
To escape progressive download's limitations, engineers built dedicated streaming protocols in the mid-2000s.
| Protocol | Full Name | Developed By |
|---|---|---|
| RTMP | Real-Time Messaging Protocol | Adobe |
| RTSP | Real-Time Streaming Protocol | RealNetworks |
These were a real improvement. Instead of downloading the full file first, video was sent in chunks — you could start watching almost immediately. Live streaming became possible. Bandwidth usage got more efficient because you only downloaded what you actually watched.
One problem remained: quality was fixed. If you were on a slow connection, you got the same 4K chunk as everyone else — 200 MB at a time — and it buffered constantly. There was no adaptation. The protocol had no concept of "maybe send me something lighter."
The modern solution: let the client pick its quality
Adaptive Bitrate Streaming (ABR) is the approach everything uses now — YouTube, Netflix, Twitch, all of it. The core idea is that the client, not the server, decides what quality to request based on its current network speed and screen. Here's how it works in practice.

Step 1: encoding
Before a video is served to anyone, it gets transcoded into multiple quality versions — typically something like 240p, 360p, 480p, 720p, 1080p, and 4K. Each version gets split into short segments, usually 2–10 seconds long.
This is why YouTube videos take 20–30 minutes to process after upload. The encoding isn't instant — it's CPU-intensive transcoding happening in parallel across multiple machines.
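The storage cost of this step is easy to underestimate. A rough sketch of the math, using an illustrative bitrate ladder (real ladders vary by codec and content):

```python
import math

# Illustrative per-rendition bitrates in kbit/s -- not any platform's
# actual encoding ladder.
LADDER_KBPS = {"240p": 400, "360p": 800, "480p": 1400,
               "720p": 2800, "1080p": 5000, "2160p": 16000}

def rendition_sizes_mb(duration_s: float) -> dict:
    """Approximate size in MB of each rendition for a given duration."""
    return {q: round(kbps * duration_s / 8 / 1000, 1)  # kbit/s -> MB
            for q, kbps in LADDER_KBPS.items()}

def segment_count(duration_s: float, segment_s: float = 6.0) -> int:
    """Segments per rendition at a given segment length."""
    return math.ceil(duration_s / segment_s)

print(rendition_sizes_mb(600))  # a 10-minute video, per quality level
print(segment_count(600))       # → 100 segments per rendition
```

For that 10-minute video, the 240p rendition is about 30 MB while the 4K one is about 1.2 GB, and every rendition is chopped into a hundred segments each. Multiply by millions of uploads and the scale of the problem becomes clear.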
Step 2: the manifest file
All those segments get listed in an index file called a manifest. Think of it as a table of contents. It tells the client: here are the quality levels available, here are the URLs for each segment at each quality level, and here's what order to play them.
Two main formats exist:
- HLS (HTTP Live Streaming) — Apple's format, uses `.m3u8` manifest files
- MPEG-DASH (Dynamic Adaptive Streaming over HTTP) — the open standard, uses `.mpd` files
They work identically. The file formats differ, the concept doesn't.
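To make the "table of contents" idea concrete, here is a stripped-down sketch of an HLS master playlist. The bandwidth values, resolutions, and URLs are illustrative, not from any real service:

```
#EXTM3U
#EXT-X-VERSION:3

#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

Each entry advertises a quality level and points to a per-rendition playlist, which in turn lists the URLs of that rendition's individual segments.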
Step 3: the player adapts in real time
The video player downloads the manifest first, checks the current network speed and screen resolution, picks an appropriate quality, and starts fetching segments. Every few seconds, it re-evaluates. If your connection degrades, it drops to 480p. When bandwidth recovers, it climbs back to 1080p.
The user usually doesn't notice any of this happening.
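The selection logic above can be sketched in a few lines. This is a minimal illustration, not any real player's algorithm (production players also weigh buffer level, screen size, and switch history); the names and the safety margin are assumptions:

```python
# Minimal ABR quality selection: pick the highest rendition whose bitrate
# fits within a safety margin of the measured network throughput.
RENDITIONS_KBPS = [400, 800, 1400, 2800, 5000, 16000]
SAFETY = 0.8  # only budget 80% of measured bandwidth, to absorb jitter

def pick_bitrate(measured_kbps: float) -> int:
    budget = measured_kbps * SAFETY
    candidates = [r for r in RENDITIONS_KBPS if r <= budget]
    # Fall back to the lowest rendition rather than stalling entirely.
    return max(candidates) if candidates else min(RENDITIONS_KBPS)

print(pick_bitrate(6500))  # → 5000 (budget 5200 kbps)
print(pick_bitrate(900))   # → 400  (budget 720 kbps)
```

The player re-runs something like this every few segments, which is what produces the invisible quality shifts described above.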

Why you'd use a managed service instead of building it yourself
If you're building something, you probably don't want to set this up from scratch. A full ABR pipeline means:
- Storing 5–6 encoded versions of every video
- Running transcoding infrastructure (CPU/GPU intensive)
- Integrating with a CDN to serve segments fast globally
- Maintaining all of it at scale
Managed services like Mux, Cloudinary, and ImageKit package all of this behind an upload API, which is why most teams use one instead of building the pipeline themselves.
How it all fits together
| Approach | How it works | Main problem |
|---|---|---|
| Progressive download | Download entire file first | You wait forever; wastes bandwidth |
| RTMP / RTSP | Chunks, but single quality | No adaptation for slow networks |
| ABR (HLS / MPEG-DASH) | Chunks at multiple qualities, client adapts | Encoding complexity (mostly solved by managed services) |
The evolution is pretty clean: each generation fixed the worst problem of the previous one. Progressive download kept the server's job simple but made viewers wait. RTMP fixed the wait time. ABR fixed the one-size-fits-all quality problem.
The result is that a 3G phone in a rural area and a 4K TV on fiber can hit the same video URL and both get a watchable experience. That's not magic — it's just the client requesting the right segments for its situation.
Key takeaways
- A video is a sequence of compressed frames, often gigabytes in size
- Progressive download (download first, watch later) was the original approach — and it was bad
- RTMP and RTSP improved things with chunked delivery, but still sent one fixed quality
- ABR is the current standard: the server pre-encodes multiple quality levels, the client picks which one to fetch based on its network speed, and it adjusts in real time
- HLS (`.m3u8`) and MPEG-DASH (`.mpd`) are the two ABR formats in common use — same concept, different file formats
- If you're building with video, tools like Mux, Cloudinary, and ImageKit handle the encoding pipeline so you don't have to