Packets, handshakes, congestion control — every byte explained, with animations
What TCP/IP actually is
Quick reality check: there's no such thing as "the internet". There's no central server, no master cable, no big computer in a vault somewhere. There's only a billion devices agreeing on a set of rules for shouting packets at each other.
That set of rules is TCP/IP. Two protocols (well, more like a family of them) that turned a Pentagon experiment in the 1970s into the most successful piece of software ever written. Everything you do online — every video, login, Tinder swipe, encrypted bank wire — is just packets moving through this stack.
Why "TCP/IP" is technically a lie
You'll hear "TCP/IP" as if it's one protocol. It isn't. IP (Internet Protocol) handles addressing and routing. TCP (Transmission Control Protocol) handles reliability on top of IP. They're different layers solving different problems. Most apps speak TCP, but plenty speak UDP, ICMP, SCTP, QUIC, or weirder things — all over IP.
"TCP/IP" persists because in 1983 ARPANET migrated from NCP to two new protocols at the same time. They were always shipped together. The name stuck.
The mental model: layers
Your application (HTTP, SSH, DNS, anything…)
↓ sends bytes
TCP / UDP ← reliable streams vs. fire-and-forget datagrams
↓ wraps in a segment
IP ← addresses + routes between any two machines
↓ wraps in a packet
Ethernet / Wi-Fi / cellular ← actual signal on a single physical link
↓ photons / electrons
THE WIRE
Each layer adds its own header (envelope), trusts the layer below to deliver something, and hands work to the layer above. Every network problem you ever debug is "which layer is lying to me?"
OSI vs TCP/IP — fight!
| OSI 7-layer (theory) | TCP/IP 4-layer (reality) | Examples |
|---|---|---|
| Application | Application | HTTP, SSH, DNS, SMTP, the apps you write |
| Presentation | (folded into app) | Encoding, encryption (TLS lives roughly here) |
| Session | (folded into app) | Sessions / state (cookies, JWTs) |
| Transport | Transport | TCP, UDP, QUIC, SCTP |
| Network | Internet | IP, ICMP, IPsec |
| Data Link | Link | Ethernet, Wi-Fi, PPP — single-hop framing |
| Physical | Link | Wires, fibres, radio waves |
Networking textbooks teach the OSI 7-layer model. Real networks use the 4-layer TCP/IP model. The OSI model exists mostly so we have words for "presentation" and "session" — useful in conversation, not actually distinct on the wire.
IP — Addressing the Internet
IP is the postman of the internet. It accepts a chunk of data (called a packet), looks at the destination address, and figures out how to forward it toward that address — hop by hop, router by router. It makes no guarantees: packets can be lost, duplicated, delayed, or arrive out of order. Reliability is somebody else's problem (usually TCP's).
IPv4 — the address space we ran out of
IPv4 addresses are 32 bits: 192.0.2.1 is 4 bytes (192, 0, 2, 1). That caps the address space at 2^32, about 4.3 billion addresses. We allocated them faster than expected: IANA handed out its last free blocks in 2011, and the regional registries followed over the next decade. The patches: NAT (sharing) and IPv6 (more bits).
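To make "32 bits" concrete, here is a minimal Python sketch that packs the four octets into one integer. The helper name `ipv4_to_int` is made up for illustration; the cross-check uses the standard `ipaddress` module.

```python
import ipaddress

def ipv4_to_int(dotted: str) -> int:
    """Pack the four decimal octets into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

addr = "192.0.2.1"
as_int = ipv4_to_int(addr)
print(as_int)             # the same address as a plain number
print(f"{as_int:032b}")   # all 32 bits, most significant first
print(2 ** 32)            # size of the whole IPv4 address space

# Cross-check against the standard library's own parser
assert as_int == int(ipaddress.IPv4Address(addr))
```

The dotted form is only a human-friendly rendering. Routers compare and mask the underlying 32-bit value.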
The 20-byte IP header, byte by byte
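As a rough walk through those 20 bytes, the sketch below unpacks the fixed part of an IPv4 header with Python's `struct` module. The field layout follows the standard header format; the function name and the hand-built sample header (with its checksum simply left at 0) are assumptions for illustration, not captured traffic.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,                    # byte 0, high nibble
        "header_len_bytes": (ver_ihl & 0x0F) * 4,   # byte 0, low nibble, in 32-bit words
        "tos": tos,                                 # byte 1
        "total_length": total_len,                  # bytes 2-3
        "identification": ident,                    # bytes 4-5
        "dont_fragment": bool(flags_frag & 0x4000), # bytes 6-7: flags + fragment offset
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,                                 # byte 8
        "protocol": proto,                          # byte 9: 1 = ICMP, 6 = TCP, 17 = UDP
        "checksum": checksum,                       # bytes 10-11
        "src": socket.inet_ntoa(src),               # bytes 12-15
        "dst": socket.inet_ntoa(dst),               # bytes 16-19
    }

# Hand-built sample header: version 4, IHL 5, DF set, TTL 64, protocol TCP
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0x1234, 0x4000, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"),
)
header = parse_ipv4_header(sample)
for field, value in header.items():
    print(f"{field:18} {value}")
```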
IPv6 — the 128-bit fix
# IPv4 (32 bits, dotted decimal)
192.0.2.1
# IPv6 (128 bits, colon-hex)
2001:0db8:85a3:0000:0000:8a2e:0370:7334
# Same address compressed (skip leading zeros, replace longest run of 0s with ::)
2001:db8:85a3::8a2e:370:7334
128 bits is 3.4 × 10^38 addresses, roughly an address per atom on the surface of the earth. We will not run out. IPv6 also fixed plenty of warts in IPv4: no fragmentation in routers, no header checksum (link and transport layers handle it), built-in auto-configuration (SLAAC), and IPsec designed in from the start (originally mandatory to implement, later relaxed to a recommendation).
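You don't have to compress IPv6 addresses by hand. A small sketch with Python's standard `ipaddress` module applies the same rules to the example address used above:

```python
import ipaddress

full = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
addr = ipaddress.IPv6Address(full)

print(addr.compressed)   # shortest legal spelling
print(addr.exploded)     # all eight groups, leading zeros kept
print(2 ** 128)          # size of the IPv6 address space
```

Both spellings name the same 128-bit value, so comparing parsed addresses is safer than comparing strings.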
Special address ranges you should recognise
| Range | Used for |
|---|---|
| 10.0.0.0/8 | Private (RFC 1918): corporate LANs |
| 172.16.0.0/12 | Private (RFC 1918): Docker default, smaller LANs |
| 192.168.0.0/16 | Private (RFC 1918): most home routers |
| 127.0.0.0/8 | Loopback, usually 127.0.0.1 (localhost) |
| 169.254.0.0/16 | Link-local, auto-assigned when no DHCP |
| 100.64.0.0/10 | Carrier-grade NAT (CGNAT): ISPs share among customers |
| 224.0.0.0/4 | Multicast |
| 0.0.0.0 | All addresses (binding) / unspecified |
| 255.255.255.255 | Limited broadcast |
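If you'd rather let a script do the recognising, the table translates into a simple lookup. This is a sketch: the `SPECIAL_RANGES` list and the `classify` helper are illustrative names, and the labels are just shortened versions of the table rows.

```python
import ipaddress

# (CIDR block, label), mirroring the table above
SPECIAL_RANGES = [
    ("10.0.0.0/8", "private (RFC 1918)"),
    ("172.16.0.0/12", "private (RFC 1918)"),
    ("192.168.0.0/16", "private (RFC 1918)"),
    ("127.0.0.0/8", "loopback"),
    ("169.254.0.0/16", "link-local"),
    ("100.64.0.0/10", "carrier-grade NAT"),
    ("224.0.0.0/4", "multicast"),
    ("0.0.0.0/32", "unspecified"),
    ("255.255.255.255/32", "limited broadcast"),
]

def classify(address: str) -> str:
    """Return the label of the special range containing `address`, if any."""
    ip = ipaddress.ip_address(address)
    for cidr, label in SPECIAL_RANGES:
        if ip in ipaddress.ip_network(cidr):
            return label
    return "no special range matched"

for candidate in ["10.20.30.40", "172.20.1.1", "172.32.0.1", "100.64.1.1", "8.8.8.8"]:
    print(f"{candidate:15} {classify(candidate)}")
```

Note that 172.32.0.1 falls through: the /12 only covers 172.16.0.0 to 172.31.255.255, a common off-by-one when eyeballing addresses.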
Subnetting in 60 seconds
# CIDR notation: address/prefix-length
192.168.1.0/24    ← /24 means first 24 bits are network, last 8 are host
                    So 192.168.1.0 to 192.168.1.255 (256 addresses)
                    Mask: 255.255.255.0
# /16
10.0.0.0/16       ← 10.0.0.0 to 10.0.255.255 (65 536 addresses)
# Calculate quickly: /n means 2^(32-n) addresses
/24 = 256
/16 = 65 536
/8  = 16 777 216
# The first and last in each subnet aren't usable for hosts:
# - first = network address
# - last = broadcast
# So /24 has 254 usable host IPs.
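The same arithmetic can be checked in a few lines of Python. A minimal sketch using the standard `ipaddress` module; `subnet_size` is a hypothetical helper that simply restates the 2^(32-n) rule.

```python
import ipaddress

def subnet_size(prefix_len: int) -> int:
    """Number of addresses in an IPv4 /prefix_len block: 2^(32 - n)."""
    return 2 ** (32 - prefix_len)

net = ipaddress.ip_network("192.168.1.0/24")
print(net.netmask)             # the mask in dotted decimal
print(net.network_address)     # first address, reserved for the network
print(net.broadcast_address)   # last address, reserved for broadcast
print(net.num_addresses)       # total addresses in the block

usable = list(net.hosts())     # hosts() leaves out network and broadcast
print(len(usable), usable[0], usable[-1])

for n in (24, 16, 8):
    print(f"/{n} = {subnet_size(n)}")
```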
Routing — how a packet finds its destination
Every machine has a routing table. Outbound packet → find the most-specific matching route → send it to that route's next-hop. Routers do this at line speed — modern ASICs make routing decisions in nanoseconds.
# Look at your machine's routing table
ip route        # Linux
netstat -rn     # macOS / BSD / Windows
# Sample output:
default via 192.168.1.1 dev wlan0   ← "no specific route? send everything to the gateway"
192.168.1.0/24 dev wlan0            ← "local network, send directly"
169.254.0.0/16 dev wlan0            ← "link-local, send directly"
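"Most-specific matching route" is longest-prefix match. Here is a toy version over the sample table above, assuming a plain list scan. Real routers use tries or hardware lookup tables, and the `ROUTES` list and `lookup` function are illustrative names only.

```python
import ipaddress

# (destination prefix, what to do), taken from the sample output above
ROUTES = [
    ("0.0.0.0/0", "via 192.168.1.1 dev wlan0"),   # "default" is just the /0 route
    ("192.168.1.0/24", "dev wlan0 (direct)"),
    ("169.254.0.0/16", "dev wlan0 (direct)"),
]

def lookup(dest: str) -> str:
    """Pick the matching route with the longest prefix."""
    ip = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(cidr), action)
        for cidr, action in ROUTES
        if ip in ipaddress.ip_network(cidr)
    ]
    # The /0 route matches every IPv4 address, so `matches` is never empty here
    network, action = max(matches, key=lambda m: m[0].prefixlen)
    return f"{network} {action}"

print(lookup("192.168.1.50"))   # on the local /24, delivered directly
print(lookup("8.8.8.8"))        # only the default route matches
```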
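Between ISPs (strictly, between autonomous systems), routes are exchanged via BGP (Border Gateway Protocol), the protocol that holds the global internet together. When a big BGP misconfiguration happens, whole networks vanish from everyone's routing tables for hours. (Facebook, October 2021, is the classic example: a bad configuration change withdrew its own routes and took its services offline for roughly six hours.)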
TCP — The Reliable Byte Stream
TCP's job is one sentence: "take a stream of bytes and reliably deliver them in order to a process on a remote machine". Sounds simple. Implementing it on top of an unreliable IP layer involves three-way handshakes, sequence numbers, sliding windows, congestion control, retransmission timers, and 35+ years of accumulated cleverness.
What TCP gives you
| Property | How it works |
|---|---|
| Connection-oriented | A 3-way handshake establishes that both sides are present and agree on initial state. |
| Reliable delivery | Lost packets are detected (via duplicate ACKs or a retransmission timeout) and retransmitted. |
| In-order delivery | Out-of-order arrivals are buffered until the gap fills in. The app sees a clean byte stream. |
| Flow control | The receiver advertises a "window" — how many bytes it can accept right now. Sender respects it. Prevents drowning slow receivers. |
| Congestion control | Algorithms (Reno, CUBIC, BBR) detect when the network is congested and slow down. |
| Full-duplex | Both sides can send at the same time independently. |
Sequence numbers — the magic glue
Every byte TCP sends has a sequence number.
Client: seq=1000, payload "GET /" (5 bytes, covers seq 1000..1004)
Server: ack=1005   "I got everything up to (but not including) 1005"
If the payload is split across two segments and the first one is lost:
  seq=1000 payload "GET "   (missing: lost packet)
  seq=1004 payload "/"      (arrives)
The server keeps ACKing 1000 ("I'm still waiting for byte 1000!"), once for every later segment that shows up.
After 3 such duplicate ACKs, the client retransmits the missing segment without waiting for a timeout (fast retransmit).
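The receiver's side of this is small enough to sketch. The key rule is that a TCP ACK is cumulative: it always names the first byte the receiver is still missing, no matter what arrived after the gap. The function below is a simplified model (segments are assumed to start exactly where the previous one ended), not how a real stack stores its reassembly queue.

```python
def cumulative_ack(expected: int, received: dict[int, bytes]) -> int:
    """Return the next byte the receiver is waiting for.

    `received` maps each buffered segment's starting sequence number
    to its payload. We walk forward while the next segment is present.
    """
    while expected in received:
        expected += len(received[expected])
    return expected

buffer: dict[int, bytes] = {}

buffer[1004] = b"/"                    # the second segment arrives first
print(cumulative_ack(1000, buffer))    # 1000: the gap at the front is still open

buffer[1000] = b"GET "                 # the retransmission fills the gap
print(cumulative_ack(1000, buffer))    # 1005: everything through byte 1004 is in
```

Out-of-order data is kept, not thrown away, which is why a single retransmission can move the ACK forward by more than one segment.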
The initial sequence number (ISN) used to be predictable on old OSes — leading to sequence prediction attacks (Mitnick used this against Shimomura in 1994, leading to one of the most famous hacks in history). Modern OSes use cryptographically-random ISNs.
TCP flags — the bits that change everything
| Flag | Meaning |
|---|---|
| SYN | Synchronize — open a connection. Used only at handshake. |
| ACK | Acknowledge — set on almost every packet after the first SYN. |
| FIN | Finish — graceful close. "I have no more data to send." |
| RST | Reset — abnormal close. "Forget this connection, drop it." |
| PSH | Push — deliver data to the app immediately, don't buffer. |
| URG | Urgent — out-of-band data. Almost nobody uses this anymore. |
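On the wire these flags are single bits in one byte of the TCP header. A minimal decoding sketch, using the standard bit positions for the six flags in the table (`decode_flags` is an illustrative helper name):

```python
# Bit values of the six classic TCP flags in the header's flags byte
TCP_FLAGS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}

def decode_flags(flags_byte: int) -> list[str]:
    """List the names of every flag set in `flags_byte`."""
    return [name for name, bit in TCP_FLAGS.items() if flags_byte & bit]

print(decode_flags(0x02))   # first packet of a handshake
print(decode_flags(0x12))   # the server's reply
print(decode_flags(0x18))   # a typical data segment
print(decode_flags(0x04))   # a reset
```

Packet-capture filters test these same bits, which is why "SYN set, ACK clear" is enough to pick out connection attempts.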
TCP state machine — the whole picture
                     +--------+
                     | CLOSED |
       listen()      +--------+
          | socket()                      | connect() → SYN
          v                               v
      +--------+                    +----------+
      | LISTEN |←───SYN/SYN-ACK─────| SYN_SENT |
      +--------+                    +----------+
          | SYN/SYN-ACK                   | ACK
          v                               v
   +--------------+                +-------------+
   | SYN_RECEIVED |───────ACK─────→| ESTABLISHED |
   +--------------+                +-------------+
                                    |            |
                           close()──┘            └──FIN from peer
                                    |            |
                               FIN_WAIT_1   CLOSE_WAIT
                                    |            |
                               FIN_WAIT_2   LAST_ACK
                                    |            |
                               TIME_WAIT    CLOSED
Flow Control & Congestion Control
TCP can't just blast bytes as fast as your NIC allows. It would melt routers in the middle, drop everyone else's traffic, and end up retransmitting half of what it sent. Congestion control is the genius that makes the internet possible.
Slow start, then AIMD
TCP starts with a small cwnd (congestion window): classically 1 MSS, typically 10 MSS on modern stacks. Every ACK grows it by one MSS, which doubles cwnd each round trip (slow start, exponential). When cwnd hits ssthresh, switch to additive growth (+1 MSS per RTT). On packet loss, halve cwnd (multiplicative decrease). This AIMD (Additive Increase, Multiplicative Decrease) produces TCP's iconic sawtooth.
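The sawtooth is easy to reproduce. Below is a deliberately simplified simulation: one step per round trip, cwnd counted in whole MSS units, the classic starting window of 1, and Reno-style halving on loss. The function name, the parameters and the choice of where loss happens are all assumptions for illustration; real stacks track far more state.

```python
def simulate_cwnd(rtts: int, ssthresh: int, loss_at: set[int]) -> list[int]:
    """Return cwnd (in MSS) at the start of each round trip."""
    cwnd = 1
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in loss_at:
            # Multiplicative decrease: halve, and remember the new threshold
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double per round trip, capped at ssthresh
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase, +1 MSS per round trip
            cwnd += 1
    return history

history = simulate_cwnd(rtts=12, ssthresh=16, loss_at={8})
for rtt, cwnd in enumerate(history):
    print(f"RTT {rtt:2}  cwnd {cwnd:2}  " + "#" * cwnd)
```

With these inputs the window climbs 1, 2, 4, 8, 16, creeps up to 20, then drops to 10 after the loss in round trip 8 and starts creeping again.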
The algorithms
| Algorithm | Year | Key idea |
|---|---|---|
| Tahoe | 1988 | First congestion control. Loss → cwnd=1, restart slow start. |
| Reno | 1990 | Tahoe + Fast Recovery: on 3 dup-ACKs, halve cwnd instead of starting from 1. |
| NewReno | 1996 | Better recovery from multiple losses in same window. |
| CUBIC | 2008 | Smoother cubic growth curve. Linux default since kernel 2.6.19 (late 2006); the well-known paper followed in 2008. |
| BBR | 2016 | Google's bandwidth-based: measures throughput + RTT, not loss. Dramatically better on long-fat networks. YouTube uses it. |
Flow control vs congestion control — same idea, different scope
| Term | What it does | Where it lives |
|---|---|---|
| Flow control | Receiver tells sender "my buffer is N bytes — don't send more than this". | Receive Window (rwnd) in every ACK. |
| Congestion control | Sender estimates network capacity and slows itself down. | Maintained internally as cwnd. |
Effective send window = min(rwnd, cwnd). The smaller one wins. Slow receivers → flow control limits. Slow networks → congestion control limits.
UDP — When Fast Beats Reliable
UDP is the other transport protocol you should know about. Tiny header (8 bytes vs TCP's 20). No handshake. No retransmission. No ordering. If a packet gets lost, it stays lost. UDP just fires datagrams and hopes for the best.
UDP header — eight bytes total
 0        8        16       24       32 bits
 +--------+--------+--------+--------+
 |   Source port   |    Dest port    |  ← 16 bits each
 +--------+--------+--------+--------+
 |     Length      |    Checksum     |  ← 16 bits each
 +--------+--------+--------+--------+
 |              Data...              |
 +-----------------------------------+
That's the entire protocol. Source port, destination port, length, checksum, payload. No state, no reliability, no flow control. Pure speed.
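Because the header is just four 16-bit fields, building and parsing one takes a single `struct` call each way. A sketch with made-up helper names; the checksum is left at 0, which IPv4 allows as "no checksum", whereas a real stack computes it over a pseudo-header plus the datagram.

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")   # source port, dest port, length, checksum

def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix `payload` with an 8-byte UDP header (checksum left at 0)."""
    length = UDP_HEADER.size + len(payload)   # the length field covers header + data
    return UDP_HEADER.pack(src_port, dst_port, length, 0) + payload

def parse_udp_datagram(datagram: bytes) -> dict:
    """Split a datagram back into its header fields and payload."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack(datagram[:8])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,
        "checksum": checksum,
        "payload": datagram[8:length],
    }

datagram = build_udp_datagram(53000, 53, b"hello")
print(len(datagram), datagram.hex())
print(parse_udp_datagram(datagram))
```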
When UDP wins
| Use case | Why UDP |
|---|---|
| DNS | Most queries fit in one packet. Reply, done. TCP would add 1-3 RTT overhead. |
| NTP | Time sync — one tiny exchange. |
| Video / VoIP | Lost packets are stale by the time TCP would retransmit. Better to skip and keep going. |
| Online games | Same as VoIP — old packets are useless. Custom reliability on top. |
| QUIC / HTTP/3 | Modern reinvention — UDP underneath but builds its own reliability + multiplexing on top. |
| SNMP | Polling network gear. Lightweight. |
| WireGuard | VPN — UDP for fast tunnelling. |
UDP's ugly secret: amplification attacks
TCP/IP Attacks — The Greatest Hits
TCP/IP was designed in the friendliest possible threat model: ARPANET was a small research network where everyone knew everyone. Attacking it would be like attacking your friend's LAN party. Security wasn't a goal. Decades later, we're paying for it.
SYN flood
Attacker spams SYN packets (spoofed source IP) at the target:
  SYN → server (state: SYN_RECEIVED, half-open)
  SYN → server (state: SYN_RECEIVED, half-open)
  SYN → server (state: SYN_RECEIVED, half-open)
  ...thousands per second...
Server's connection table fills up. Legitimate connections can't allocate state. Service is down.
Defence:
- SYN cookies: encode connection state in the SYN-ACK; no state until final ACK.
- tcp_max_syn_backlog tuning.
- SYN proxies in front of the service.
SYN flood is the OG TCP DoS. SYN cookies (Daniel Bernstein, 1996) are basically free defence and shipped in every modern OS.
TCP sequence prediction (Mitnick attack)
Old TCP stacks generated predictable ISNs (e.g., incrementing by a fixed amount per second). An attacker who could observe one SYN-ACK could guess future ISNs and craft a fake "ACK" to establish a spoofed connection — appearing to come from a trusted source. Kevin Mitnick used this against Tsutomu Shimomura on Christmas 1994. Fix: randomise ISNs cryptographically. Every modern OS does.
RST injection
Anyone on the path who can see the sequence numbers can inject a forged TCP RST packet — instantly tearing down the connection. The Great Firewall of China uses this to censor specific TCP streams. Mitigation: TLS doesn't prevent RST itself, but apps see the abrupt drop and can retry. QUIC over UDP can't be RST'd at the transport layer.
Session hijacking
If an attacker can sniff the SYN-ACK (they're on the same Wi-Fi, or upstream), they know seq + ack numbers and can inject packets into the live session. Defence: end-to-end encryption (TLS, SSH). TCP alone is naked.
Idle scan / Zombie scan (nmap -sI)
Clever stealth port scanning: attacker uses a third-party "zombie" host with predictable IPID. By observing the zombie's IPID before and after spoofed probes to the target, attacker determines if target's port is open — without ever sending packets from their own IP. Works only against hosts with sequential IPIDs (Windows older versions, some embedded gear). Mostly historical now.
Slowloris (HTTP-specific but TCP-shaped)
Open many connections to a web server but send headers very slowly — one byte every 10 seconds. Each connection ties up a thread. Modest hardware can take down an Apache prefork server. Defence: use async / event-driven servers (nginx, Apache event MPM), set strict client-header timeouts.
Tools — Your TCP/IP Toolkit
You'll spend a lot of your engineering career poking at TCP/IP. These are the tools you'll use.
tcpdump — packet capture from the CLI
# Capture HTTP traffic on any interface
sudo tcpdump -i any -n port 80
# Capture and save to a file for Wireshark
sudo tcpdump -i any -w capture.pcap port 443
# Show full packet contents (verbose, with hex)
sudo tcpdump -i any -nvvX port 53
# Specific source / destination
sudo tcpdump -i any -n 'host 8.8.8.8 and port 53'
# Just SYN packets (handshake initiations)
sudo tcpdump -i any -n 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
# Just RST packets (connection resets)
sudo tcpdump -i any -n 'tcp[tcpflags] & tcp-rst != 0'
ss — modern socket statistics
# All TCP connections (Linux replacement for netstat)
ss -tnp
# Listening sockets
ss -tlnp
# Show internal TCP info (cwnd, rtt, retransmits); add -o for timers, -e for extended detail
ss -tinp
# Connection counts by state
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c
# All UDP
ss -unp
Wireshark — the GUI
Same packet capture power as tcpdump, but with a UI. Open a .pcap file, filter with display filters (tcp.flags.syn==1 and tcp.flags.ack==0, tcp.analysis.retransmission), follow TCP streams, decode TLS with SSLKEYLOGFILE. Indispensable.
nmap — scan the world
# Quick TCP SYN scan (the default when run as root)
sudo nmap -sS target.com
# Full TCP connect (legitimate handshake; slower, but works without root)
nmap -sT target.com
# Service version detection
nmap -sV -p 22,80,443 target.com
# OS detection (needs root)
sudo nmap -O target.com
# All TCP ports + UDP (slow!); -sU on its own would scan UDP only
sudo nmap -p- -sS -sU target.com
# Aggressive: version, OS, scripts, traceroute
nmap -A target.com
hping3 — craft custom packets
# Send a SYN to port 80 (basic ping that works through firewalls)
sudo hping3 -S -p 80 target.com
# Generate a TCP flood (for stress-testing, your own servers only!)
sudo hping3 -i u1 -S -p 80 --rand-source target.com
# Traceroute via TCP (gets through most firewalls)
sudo hping3 -T -S -p 80 target.com
# Send a payload read from a file (-d sets how many bytes go in each packet)
sudo hping3 -E payload.txt -d 64 -p 9999 -S target.com
mtr — visual traceroute + ping in one
# Watch a path live, with stats on every hop
mtr target.com
# Report mode (10 cycles then exit, e.g. for sharing)
mtr -r -c 10 target.com
iperf3 — actual bandwidth measurement
# On the server:
iperf3 -s
# On the client (transfers max throughput for 10 seconds):
iperf3 -c server-ip
# Test UDP throughput
iperf3 -c server-ip -u -b 100M
OS-level kernel tunables (Linux)
# Current congestion control
sysctl net.ipv4.tcp_congestion_control
#   cubic   ← typical default
# What's available
sysctl net.ipv4.tcp_available_congestion_control
#   reno cubic bbr ...
# Switch to BBR (Google's modern algorithm, often faster on hi-RTT links)
echo "net.core.default_qdisc = fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control = bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Raise the cap on each listening socket's accept queue (the listen() backlog)
sudo sysctl -w net.core.somaxconn=4096
# Let new outgoing connections reuse sockets in TIME_WAIT (be careful with this in production!)
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
TCP/IP in the Real World
Theory is clean. Production is messy. Here's the lived reality of TCP/IP that interview questions never cover.
NAT — how 4 billion IPs became enough
Your home network has dozens of devices sharing one public IP. NAT (Network Address Translation) at your router rewrites source IP + port on outbound packets, and reverses it on the way back. To the internet, everything looks like it came from one machine.
| Type | What it does |
|---|---|
| Source NAT (SNAT) | Outbound: rewrite source. Used by routers / firewalls. |
| Destination NAT (DNAT) | Inbound: rewrite dest. Used for port forwarding. |
| PAT / NAPT | NAT that also rewrites ports → many internal clients can share one external IP. |
| CGNAT | Carrier-grade NAT. Your ISP NATs you too. Even your "public IP" is private to the ISP. Common in mobile networks. |
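The PAT / NAPT row is the one your home router actually implements, and at its core it is a pair of lookup tables. A toy model follows: the `Napt` class, the starting port of 40000 and the example addresses are invented for illustration, and real NAT also tracks the protocol, the remote endpoint and idle timeouts.

```python
import itertools

class Napt:
    """Toy port-translating NAT: many private (ip, port) pairs share one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._next_port = itertools.count(first_port)
        self._out = {}   # (private_ip, private_port) -> public_port
        self._in = {}    # public_port -> (private_ip, private_port)

    def outbound(self, src_ip: str, src_port: int) -> tuple[str, int]:
        """Rewrite the source of an outgoing packet, creating a mapping if needed."""
        key = (src_ip, src_port)
        if key not in self._out:
            public_port = next(self._next_port)
            self._out[key] = public_port
            self._in[public_port] = key
        return self.public_ip, self._out[key]

    def inbound(self, dst_port: int):
        """Find who a reply belongs to. None means no mapping, so the packet is dropped."""
        return self._in.get(dst_port)

nat = Napt("203.0.113.7")
print(nat.outbound("192.168.1.10", 51000))   # laptop
print(nat.outbound("192.168.1.11", 51000))   # phone, same source port, new public port
print(nat.inbound(40001))                    # reply finds its way back to the phone
print(nat.inbound(49999))                    # unsolicited packet, nobody home
```

The last line is also why NAT behaves like an accidental firewall: with no mapping, inbound packets have nowhere to go unless you configure port forwarding (DNAT).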
MTU and the fragmentation footgun
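MTU = Maximum Transmission Unit. Ethernet defaults to 1500 bytes. PPPoE (DSL): 1492. VPN tunnels: often 1380. If a packet is bigger than the path MTU, it either gets fragmented by a router (IPv4 without the DF flag) or dropped (IPv6, or IPv4 with DF set). A drop is supposed to come with an ICMP "Fragmentation Needed" / "Packet Too Big" message so the sender can shrink its packets. The footgun: when a firewall filters that ICMP, small packets get through, large ones vanish silently, and connections hang right after the handshake.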
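The number TCP actually cares about is the MSS, the largest payload that fits in one packet once the headers are subtracted. A quick sketch of the arithmetic, assuming minimum-size headers with no IP or TCP options (options such as TCP timestamps shrink the usable payload further):

```python
IPV4_HEADER = 20   # bytes, without options
IPV6_HEADER = 40   # bytes, fixed
TCP_HEADER = 20    # bytes, without options

def tcp_mss(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet of size `mtu`."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER

for name, mtu in [("Ethernet", 1500), ("PPPoE", 1492), ("VPN tunnel", 1380)]:
    print(f"{name:11} MTU {mtu}  MSS v4 {tcp_mss(mtu)}  MSS v6 {tcp_mss(mtu, ipv6=True)}")
```

If both ends assume 1500 but a tunnel in the middle only carries 1380, every full-size segment is 120 bytes too big for that hop.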
TIME_WAIT — the misunderstood state
After a connection closes, the side that called close() first stays in TIME_WAIT for 2× MSL (Maximum Segment Lifetime; Linux hardcodes the total to 60s). Why? To catch stragglers (delayed packets that might arrive after close) and prevent them from confusing a new connection on the same 4-tuple.
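On a busy server, you might see tens of thousands of TIME_WAIT entries. Tempting to "fix" with sysctl knobs. Tempting and risky: the protections are there for a reason. tcp_tw_reuse=1 only helps outgoing connections and relies on TCP timestamps; its old sibling tcp_tw_recycle broke clients behind NAT and was removed in Linux 4.12. Better fix: use long-lived connections (keep-alive) or HTTP/2/3 multiplexing so fewer connections close in the first place, and set SO_REUSEADDR so a restarted server can rebind its port while old sockets sit in TIME_WAIT.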
Keepalive — detecting dead peers
If the connection just sits idle, TCP doesn't notice if the peer disappeared. TCP keepalive sends periodic empty probes; no reply for ~9 probes → connection considered dead. Linux defaults are very loose (2 hours idle, then a probe every 75 seconds, 9 tries). Most apps tighten this with SO_KEEPALIVE plus per-socket tuning (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT on Linux), or implement application-layer heartbeats.
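Tightening keepalive per socket looks roughly like this in Python. The option names are the Linux ones; other platforms expose different constants, so each is guarded with `hasattr`. The 60 / 10 / 5 values are arbitrary examples, not recommendations, and `enable_keepalive` is an illustrative helper name.

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalive and, where the platform allows, tighten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds of idle time before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # unanswered probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keepalive on:", bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```

With these example values a dead peer is noticed after about 60 + 10 × 5 = 110 seconds instead of more than two hours.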
The C10K and C10M problems
1999: Dan Kegel asked "can a single server handle 10 000 concurrent connections?" Spawned the evolution of epoll/kqueue/IOCP, async runtimes (libuv, Tokio), event-driven web servers (nginx). 2013: Robert Graham asked "can it handle 10 MILLION?" Required kernel bypass (DPDK, XDP). The answer in both cases turned out to be "yes, but not the default way": C10K meant dropping one-thread-per-connection for better kernel interfaces, and C10M meant routing around the kernel's data path entirely.
Quick Reference Cheat Sheet
One-liners and constants that come up constantly.
Common ports you should recognise on sight
| Port | Service |
|---|---|
| 20 / 21 | FTP (data / control) |
| 22 | SSH |
| 23 | Telnet (don't use) |
| 25 | SMTP |
| 53 | DNS (UDP + TCP) |
| 67 / 68 | DHCP (server / client) |
| 80 | HTTP |
| 110 | POP3 |
| 111 | rpcbind / portmap |
| 123 | NTP |
| 143 | IMAP |
| 161 | SNMP |
| 389 | LDAP |
| 443 | HTTPS |
| 445 | SMB / CIFS |
| 465 / 587 | SMTPS / SMTP submission |
| 636 | LDAPS |
| 853 | DoT (DNS over TLS) |
| 993 | IMAPS |
| 995 | POP3S |
| 3306 | MySQL / MariaDB |
| 3389 | RDP |
| 5432 | PostgreSQL |
| 5672 | AMQP / RabbitMQ |
| 6379 | Redis |
| 8080 / 8443 | HTTP / HTTPS alternates |
| 9200 | Elasticsearch |
| 27017 | MongoDB |
Common TCP/UDP one-liners
# Test if a port is open from CLI
nc -zv target.com 443
# Or with bash only (the bare redirect just opens the connection, so it returns as soon as the handshake succeeds):
timeout 2 bash -c '</dev/tcp/target.com/443' && echo open
# Quick HTTP banner
printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' | nc -q 1 example.com 80
# Find which process owns a port
sudo lsof -i :8080
sudo ss -tlnp | grep :8080
# Show real-time bandwidth per connection
sudo iftop
sudo nethogs
# Test latency to a host
ping -c 4 target.com
# Find the path to a host
traceroute target.com
# Or mtr for live updating
# Continuously refresh active connections
watch -n 1 'ss -tan | head -20'
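When nc isn't installed, the port check is a few lines of Python. A minimal sketch: `port_is_open` is an illustrative name, and "open" here only means the TCP handshake completed within the timeout.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:   # refused, timed out, unreachable, or the name didn't resolve
        return False

if __name__ == "__main__":
    for port in (80, 443):
        state = "open" if port_is_open("example.com", port) else "closed/filtered"
        print(f"example.com:{port} {state}")
```

A False result can't tell a closed port (RST came back) from a filtered one (nothing came back). For that distinction you still want nmap.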
TCP state cheat-sheet
| State | Meaning |
|---|---|
| LISTEN | Server bound, waiting for SYN |
| SYN_SENT | Client sent SYN, waiting for SYN-ACK |
| SYN_RECEIVED | Server sent SYN-ACK, waiting for ACK |
| ESTABLISHED | Data flowing both ways |
| FIN_WAIT_1 | Local closed, waiting for peer's ACK |
| FIN_WAIT_2 | Local closed and ACKed; waiting for peer's FIN |
| CLOSE_WAIT | Peer closed; waiting for local app to call close() |
| LAST_ACK | Local FIN sent after CLOSE_WAIT; waiting for final ACK |
| TIME_WAIT | 2× MSL wait after close to catch stragglers |
| CLOSED | No connection |
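The states above chain together as a small finite-state machine. Here is a simplified sketch as a transition table. It deliberately leaves out the rarer paths (simultaneous open and close, the CLOSING state, resets), and the event names are informal labels chosen for this example.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("CLOSED", "listen"): "LISTEN",
    ("CLOSED", "connect"): "SYN_SENT",              # sends SYN
    ("LISTEN", "recv SYN"): "SYN_RECEIVED",         # sends SYN-ACK
    ("SYN_SENT", "recv SYN-ACK"): "ESTABLISHED",    # sends ACK
    ("SYN_RECEIVED", "recv ACK"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",         # sends FIN
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",      # sends ACK
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",        # sends ACK
    ("CLOSE_WAIT", "close"): "LAST_ACK",            # sends FIN
    ("LAST_ACK", "recv ACK"): "CLOSED",
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
}

def run(events: list[str], state: str = "CLOSED") -> str:
    """Feed events through the table and return the final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]   # KeyError = transition not modelled here
    return state

# The side that closes first ends up waiting in TIME_WAIT
print(run(["connect", "recv SYN-ACK", "close", "recv ACK", "recv FIN"]))
# The side that closes second goes straight back to CLOSED
print(run(["listen", "recv SYN", "recv ACK", "recv FIN", "close", "recv ACK"]))
```

A pile of sockets stuck in CLOSE_WAIT means the "close" event never arrives, which is an application bug rather than a network problem.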
Closing Thoughts
TCP/IP is the unsung infrastructure that runs literally everything you do online. The 1970s-vintage protocol holds up 21st-century traffic at 100+ Gbps over fibre, and somehow it still works.
Two big takeaways. Every problem is a layer problem: when something breaks, ask which layer is lying. And buffering, retransmission and congestion control are doing far more work than you can see: the bytes you send aren't the packets the wire carries. They're segmented, fragmented, retried and queued at every step.
Spend a weekend with tcpdump on your laptop. Open a few websites, ssh somewhere, run a video call. Watch the SYNs fly. The internet stops being magic and starts being a beautifully-engineered, slightly-broken machine. That's the goal.