How Video Streaming Actually Works
Most people have noticed the tiny gear icon on YouTube that lets you switch between 360p and 4K. Most people have also clicked it exactly once, out of curiosity, and never thought about it again. But what's actually happening under the hood is more interesting than it looks. Getting video to play smoothly across a smartwatch, a 4K TV, a 3G phone in rural Nepal, and a fiber-connected laptop — all at the same time — is a genuinely hard problem. Here's how it evolved.
A video is just a lot of images
Before anything else: a video is a sequence of still images (frames) played fast enough that your brain reads them as motion. That's it.
- 60 FPS feels smooth and almost hyper-real
- 24 FPS is the cinema standard — still fluid
- 1 FPS is basically a slideshow
Common containers like MP4, MOV, MKV, and AVI all store these frames in compressed form. A short 4K clip can still hit 4–5 GB. That size is what makes streaming hard.
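To see why compression is non-negotiable, it helps to run the numbers on uncompressed footage. A quick back-of-the-envelope sketch (assuming 24-bit RGB and no compression at all):

```python
# Why raw video is unstreamable: size of one minute of uncompressed 4K.
width, height = 3840, 2160   # 4K UHD resolution
bytes_per_pixel = 3          # 24-bit RGB, no compression
fps = 24                     # cinema frame rate
duration_s = 60              # one minute of footage

raw_bytes = width * height * bytes_per_pixel * fps * duration_s
print(f"{raw_bytes / 1e9:.1f} GB per minute, uncompressed")  # → 35.8 GB
```

Codecs inside those containers shrink this by two orders of magnitude or more, which is how a 4K clip ends up at gigabytes instead of terabytes.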
The first attempt: download everything first
In the early 2000s, the approach was simple. The client requests a video, the server sends the whole file, and playback starts when the download finishes.
This is called progressive download, and it was genuinely terrible.
You'd sit there watching a progress bar fill up before you could watch anything. If you bailed halfway through, you'd already downloaded the full file. On a slow connection, it was basically unusable. This is why buffering became a cultural shorthand for frustration — people would walk away from their computers mid-download and come back hoping it had finished.
Then came specialized streaming protocols
To escape progressive download's limitations, engineers built dedicated streaming protocols in the mid-2000s.
| Protocol | Full Name | Developed By |
|---|---|---|
| RTMP | Real-Time Messaging Protocol | Adobe |
| RTSP | Real-Time Streaming Protocol | RealNetworks |
These were a real improvement. Instead of downloading the full file first, video was sent in chunks — you could start watching almost immediately. Live streaming became possible. Bandwidth usage got more efficient because you only downloaded what you actually watched.
One problem remained: quality was fixed. If you were on a slow connection, you got the same 4K chunk as everyone else — 200 MB at a time — and it buffered constantly. There was no adaptation. The protocol had no concept of "maybe send me something lighter."
The modern solution: let the client pick its quality
Adaptive Bitrate Streaming (ABR) is the approach everything uses now — YouTube, Netflix, Twitch, all of it. The core idea is that the client, not the server, decides what quality to request based on its current network speed and screen. Here's how it works in practice.

Step 1: encoding
Before a video is served to anyone, it gets transcoded into multiple quality versions — typically something like 240p, 360p, 480p, 720p, 1080p, and 4K. Each version gets split into short segments, usually 2–10 seconds long.
This is why YouTube videos take 20–30 minutes to process after upload. The encoding isn't instant — it's CPU-intensive transcoding happening in parallel across multiple machines.
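The storage cost of this step is easy to underestimate. A rough sketch of the math, using an illustrative bitrate ladder (real ladders vary by codec and content):

```python
import math

# Illustrative per-rendition bitrates in kbit/s -- not any platform's
# actual encoding ladder.
LADDER_KBPS = {"240p": 400, "360p": 800, "480p": 1400,
               "720p": 2800, "1080p": 5000, "2160p": 16000}

def rendition_sizes_mb(duration_s: float) -> dict:
    """Approximate size in MB of each rendition for a given duration."""
    return {q: round(kbps * duration_s / 8 / 1000, 1)  # kbit/s -> MB
            for q, kbps in LADDER_KBPS.items()}

def segment_count(duration_s: float, segment_s: float = 6.0) -> int:
    """Segments per rendition at a given segment length."""
    return math.ceil(duration_s / segment_s)

print(rendition_sizes_mb(600))  # a 10-minute video, per quality level
print(segment_count(600))       # → 100 segments per rendition
```

For that 10-minute video, the 240p rendition is about 30 MB while the 4K one is about 1.2 GB, and every rendition is chopped into a hundred segments each. Multiply by millions of uploads and the scale of the problem becomes clear.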
Step 2: the manifest file
All those segments get listed in an index file called a manifest. Think of it as a table of contents. It tells the client: here are the quality levels available, here are the URLs for each segment at each quality level, and here's what order to play them.
Two main formats exist:
- HLS (HTTP Live Streaming) — Apple's format, uses `.m3u8` manifest files
- MPEG-DASH (Dynamic Adaptive Streaming over HTTP) — the open standard, uses `.mpd` files
They work identically. The file formats differ, the concept doesn't.
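To make the "table of contents" idea concrete, here is a stripped-down sketch of an HLS master playlist. The bandwidth values, resolutions, and URLs are illustrative, not from any real service:

```
#EXTM3U
#EXT-X-VERSION:3

#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

Each entry advertises a quality level and points to a per-rendition playlist, which in turn lists the URLs of that rendition's individual segments.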
Step 3: the player adapts in real time
The video player downloads the manifest first, checks the current network speed and screen resolution, picks an appropriate quality, and starts fetching segments. Every few seconds, it re-evaluates. If your connection degrades, it drops to 480p. When bandwidth recovers, it climbs back to 1080p.
The user usually doesn't notice any of this happening.
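The selection logic above can be sketched in a few lines. This is a minimal illustration, not any real player's algorithm (production players also weigh buffer level, screen size, and switch history); the names and the safety margin are assumptions:

```python
# Minimal ABR quality selection: pick the highest rendition whose bitrate
# fits within a safety margin of the measured network throughput.
RENDITIONS_KBPS = [400, 800, 1400, 2800, 5000, 16000]
SAFETY = 0.8  # only budget 80% of measured bandwidth, to absorb jitter

def pick_bitrate(measured_kbps: float) -> int:
    budget = measured_kbps * SAFETY
    candidates = [r for r in RENDITIONS_KBPS if r <= budget]
    # Fall back to the lowest rendition rather than stalling entirely.
    return max(candidates) if candidates else min(RENDITIONS_KBPS)

print(pick_bitrate(6500))  # → 5000 (budget 5200 kbps)
print(pick_bitrate(900))   # → 400  (budget 720 kbps)
```

The player re-runs something like this every few segments, which is what produces the invisible quality shifts described above.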

Why you'd use a managed service instead of building it yourself
If you're building something, you probably don't want to set this up from scratch. A full ABR pipeline means:
- Storing 5–6 encoded versions of every video
- Running transcoding infrastructure (CPU/GPU intensive)
- Integrating with a CDN to serve segments fast globally
- Maintaining all of it at scale
Managed services like Mux, Cloudinary, and ImageKit package all of this behind an upload API, which is why most teams use one instead of building the pipeline themselves.
How it all fits together
| Approach | How it works | Main problem |
|---|---|---|
| Progressive download | Download entire file first | You wait forever; wastes bandwidth |
| RTMP / RTSP | Chunks, but single quality | No adaptation for slow networks |
| ABR (HLS / MPEG-DASH) | Chunks at multiple qualities, client adapts | Encoding complexity (mostly solved by managed services) |
The evolution is pretty clean: each generation fixed the worst problem of the previous one. Progressive download kept the server's job simple but made viewers wait. RTMP fixed the wait time. ABR fixed the one-size-fits-all quality problem.
The result is that a 3G phone in a rural area and a 4K TV on fiber can hit the same video URL and both get a watchable experience. That's not magic — it's just the client requesting the right segments for its situation.
Key takeaways
- A video is a sequence of compressed frames, often gigabytes in size
- Progressive download (download first, watch later) was the original approach — and it was bad
- RTMP and RTSP improved things with chunked delivery, but still sent one fixed quality
- ABR is the current standard: the server pre-encodes multiple quality levels, the client picks which one to fetch based on its network speed, and it adjusts in real time
- HLS (`.m3u8`) and MPEG-DASH (`.mpd`) are the two ABR formats in common use — same concept, different file formats
- If you're building with video, tools like Mux, Cloudinary, and ImageKit handle the encoding pipeline so you don't have to