TCP/IP — The Internet's Wiring, Deep Dive

How packets actually move across the internet — the layered stack, IP headers, the iconic TCP 3-way handshake, sequence numbers, congestion control, UDP, attacks (SYN flood, RST injection), modern realities (NAT, MTU, BBR), and the tools every engineer should know.

Packets, handshakes, congestion control: every byte explained

What TCP/IP actually is

Quick reality check: there's no such thing as "the internet". There's no central server, no master cable, no big computer in a vault somewhere. There are only billions of devices agreeing on a set of rules for shouting packets at each other.

That set of rules is TCP/IP. Two protocols (well, more like a family of them) that turned a Pentagon experiment in the 1970s into the most successful piece of software ever written. Everything you do online — every video, login, Tinder swipe, encrypted bank wire — is just packets moving through this stack.

Why "TCP/IP" is technically a lie

You'll hear "TCP/IP" as if it's one protocol. It isn't. IP (Internet Protocol) handles addressing and routing. TCP (Transmission Control Protocol) handles reliability on top of IP. They're different layers solving different problems. Most apps speak TCP, but plenty speak UDP, ICMP, SCTP, QUIC, or weirder things — all over IP.

"TCP/IP" persists because in 1983 ARPANET migrated from NCP to two new protocols at the same time. They were always shipped together. The name stuck.

The mental model: layers

shell
Your application (HTTP, SSH, DNS, anything…)
       ↓ sends bytes
TCP / UDP                     ← reliable streams vs. fire-and-forget datagrams
       ↓ wraps in a segment
IP                            ← addresses + routes between any two machines
       ↓ wraps in a packet
Ethernet / Wi-Fi / cellular   ← actual signal on a single physical link
       ↓ photons / electrons
THE WIRE

Each layer adds its own header (envelope), trusts the layer below to deliver something, and hands work to the layer above. Every network problem you ever debug is "which layer is lying to me?"

OSI vs TCP/IP — fight!

| OSI 7-layer (theory) | TCP/IP 4-layer (reality) | Examples |
|---|---|---|
| Application | Application | HTTP, SSH, DNS, SMTP, the apps you write |
| Presentation | (folded into app) | Encoding, encryption (TLS lives roughly here) |
| Session | (folded into app) | Sessions / state (cookies, JWTs) |
| Transport | Transport | TCP, UDP, QUIC, SCTP |
| Network | Internet | IP, ICMP, IPsec |
| Data Link | Link | Ethernet, Wi-Fi, PPP (single-hop framing) |
| Physical | Link | Wires, fibres, radio waves |

Networking textbooks teach the OSI 7-layer model. Real networks use the 4-layer TCP/IP model. The OSI model exists mostly so we have words for "presentation" and "session" — useful in conversation, not actually distinct on the wire.

IP — Addressing the Internet

IP is the postman of the internet. It accepts a chunk of data (called a packet), looks at the destination address, and figures out how to forward it toward that address — hop by hop, router by router. It makes no guarantees: packets can be lost, duplicated, delayed, or arrive out of order. Reliability is somebody else's problem (usually TCP's).

IPv4 — the address space we ran out of

IPv4 addresses are 32 bits: 192.0.2.1 is 4 bytes (192, 0, 2, 1), so there are at most 2^32 ≈ 4.3 billion addresses in theory. We allocated them faster than expected: IANA handed out its last free blocks in 2011, and the regional registries ran out over the following decade. The patches: NAT (sharing) and IPv6 (more bits).
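To make the 32-bit point concrete, here is a minimal Python sketch using the standard-library `ipaddress` module. `dotted_to_int` is a hypothetical helper that does the same packing by hand, one octet (8 bits) at a time.

```python
import ipaddress


def dotted_to_int(dotted: str) -> int:
    """Pack four decimal octets into one 32-bit integer by hand."""
    octets = [int(part) for part in dotted.split(".")]
    if len(octets) != 4 or not all(0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {dotted!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # make room for one more byte, then append it
    return value


addr = ipaddress.IPv4Address("192.0.2.1")
print(addr.packed)                 # b'\xc0\x00\x02\x01' (the 4 bytes on the wire)
print(int(addr))                   # 3221225985
print(dotted_to_int("192.0.2.1"))  # 3221225985
print(2 ** 32)                     # 4294967296 possible addresses
```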

The 20-byte IP header, byte by byte
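The 20-byte header can be sketched in Python with `struct`, assuming no IP options, DSCP/ECN set to zero and the DF bit set. `build_ipv4_header` and `internet_checksum` are hypothetical helper names; the checksum is the RFC 1071 one's-complement sum of the header's 16-bit words. The addresses and lengths are illustrative.

```python
import ipaddress
import struct

# Field layout of a 20-byte IPv4 header (no options), network byte order:
#   B  version (4 bits) + IHL (4 bits)    B  DSCP + ECN
#   H  total length                       H  identification
#   H  flags (3 bits) + fragment offset   B  TTL
#   B  protocol (6 = TCP, 17 = UDP)       H  header checksum
#   4s source address                     4s destination address
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit words, inverted."""
    if len(data) % 2:
        data += b"\x00"                             # pad to a whole number of words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)    # end-around carry
    return ~total & 0xFFFF


def build_ipv4_header(src: str, dst: str, payload_len: int,
                      proto: int = 6, ttl: int = 64, ident: int = 0) -> bytes:
    fields = [
        (4 << 4) | 5,                    # version 4, IHL 5 x 32-bit words = 20 bytes
        0,                               # DSCP / ECN
        IPV4_HEADER.size + payload_len,  # total length = header + payload
        ident,
        0x4000,                          # DF ("don't fragment") set, offset 0
        ttl,
        proto,
        0,                               # checksum is computed over a zeroed field
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
    ]
    fields[7] = internet_checksum(IPV4_HEADER.pack(*fields))
    return IPV4_HEADER.pack(*fields)


hdr = build_ipv4_header("192.168.0.1", "192.168.0.199", payload_len=95, proto=17)
print(hdr.hex())                    # 45000073000040004011b861c0a80001c0a800c7
print(hex(internet_checksum(hdr)))  # 0x0: the receiver's check passes
```

Running the finished header back through `internet_checksum` and getting 0 is exactly the check a receiver performs; anything else means the header was damaged in transit. Because the TTL changes at every hop, routers have to recompute this field each time.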

IPv6 — the 128-bit fix

shell
# IPv4 (32 bits, dotted decimal)
192.0.2.1

# IPv6 (128 bits, colon-hex)
2001:0db8:85a3:0000:0000:8a2e:0370:7334

# Same address compressed (skip leading zeros, replace longest run of 0s with ::)
2001:db8:85a3::8a2e:370:7334

128 bits is 2^128 ≈ 3.4 × 10^38 addresses. We will not run out. IPv6 also fixed plenty of warts in IPv4: routers never fragment (only the sender may), there is no header checksum (link and transport layers already check), hosts can configure their own addresses automatically (SLAAC), and IPsec support was originally mandatory (later relaxed to a recommendation).
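Python's standard `ipaddress` module applies the same compression rules, which makes it a handy way to check a hand-written address. A minimal sketch:

```python
import ipaddress

full = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
addr = ipaddress.IPv6Address(full)

print(addr.compressed)    # 2001:db8:85a3::8a2e:370:7334
print(addr.exploded)      # 2001:0db8:85a3:0000:0000:8a2e:0370:7334
print(len(addr.packed))   # 16 (bytes, i.e. 128 bits)
print(f"{2 ** 128:.2e}")  # 3.40e+38 possible addresses

# Parsing works in both directions: the short form is the same address.
print(ipaddress.IPv6Address("2001:db8:85a3::8a2e:370:7334") == addr)  # True
```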

IPv6 adoption is ~45% globally (Google's stats, 2024). It's the future you keep being told about. We're also told "next year is the year of IPv6", every year, since about 1998.

Special address ranges you should recognise

| Range | Used for |
|---|---|
| 10.0.0.0/8 | Private (RFC 1918): corporate LANs |
| 172.16.0.0/12 | Private (RFC 1918): Docker default, smaller LANs |
| 192.168.0.0/16 | Private (RFC 1918): most home routers |
| 127.0.0.0/8 | Loopback: usually 127.0.0.1 (localhost) |
| 169.254.0.0/16 | Link-local: auto-assigned when there is no DHCP |
| 100.64.0.0/10 | Carrier-grade NAT (CGNAT): ISPs share it among customers |
| 224.0.0.0/4 | Multicast |
| 0.0.0.0 | All addresses (when binding) / unspecified |
| 255.255.255.255 | Limited broadcast |
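A quick way to see which of these ranges an address falls into is a small lookup over the table, sketched here in Python with the standard `ipaddress` module. `classify` and its labels are hypothetical; spelling the ranges out explicitly keeps the result in step with the table above.

```python
import ipaddress

SPECIAL_RANGES = [  # none of these overlap, so order doesn't matter
    ("10.0.0.0/8", "private (RFC 1918)"),
    ("172.16.0.0/12", "private (RFC 1918)"),
    ("192.168.0.0/16", "private (RFC 1918)"),
    ("127.0.0.0/8", "loopback"),
    ("169.254.0.0/16", "link-local"),
    ("100.64.0.0/10", "carrier-grade NAT"),
    ("224.0.0.0/4", "multicast"),
    ("0.0.0.0/32", "unspecified"),
    ("255.255.255.255/32", "limited broadcast"),
]
NETWORKS = [(ipaddress.IPv4Network(cidr), label) for cidr, label in SPECIAL_RANGES]


def classify(address: str) -> str:
    ip = ipaddress.IPv4Address(address)
    for network, label in NETWORKS:
        if ip in network:
            return label
    return "public / other"


for a in ["10.1.2.3", "172.17.0.2", "127.0.0.1", "100.64.7.9", "8.8.8.8"]:
    print(f"{a:>12} -> {classify(a)}")
```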

Subnetting in 60 seconds

shell
# CIDR notation: address/prefix-length
192.168.1.0/24       ← /24 means first 24 bits are network, last 8 are host
                       So 192.168.1.0 to 192.168.1.255  (256 addresses)
                       Mask: 255.255.255.0

# /16
10.0.0.0/16          ← 10.0.0.0 to 10.0.255.255  (65 536 addresses)

# Calculate quickly: /n means 2^(32-n) addresses
/24 = 256
/16 = 65 536
/8  = 16 777 216

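# The first and last address in each subnet aren't usable for hosts:
#  - first = network address
#  - last  = broadcast
# So /24 has 254 usable host IPs. (Exceptions: a /31 point-to-point
# link uses both addresses, per RFC 3021, and a /32 is a single host.)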
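The same arithmetic, done with bit masks, can be sketched in Python. `describe_subnet` is a hypothetical helper; the standard `ipaddress` module is used only to convert between integers and dotted notation and, in the last lines, to cross-check the answer.

```python
import ipaddress


def describe_subnet(cidr: str) -> dict:
    """Work out a subnet's boundaries with plain integer bit masks."""
    address, prefix_text = cidr.split("/")
    prefix = int(prefix_text)
    host_bits = 32 - prefix
    mask = (0xFFFFFFFF << host_bits) & 0xFFFFFFFF   # `prefix` ones, then zeros
    addr = int(ipaddress.IPv4Address(address))
    network = addr & mask                           # clear the host bits
    broadcast = network | (~mask & 0xFFFFFFFF)      # set the host bits
    size = 2 ** host_bits                           # /n -> 2^(32-n) addresses
    return {
        "network": str(ipaddress.IPv4Address(network)),
        "broadcast": str(ipaddress.IPv4Address(broadcast)),
        "netmask": str(ipaddress.IPv4Address(mask)),
        "addresses": size,
        # Network and broadcast addresses are reserved, except on /31 and /32.
        "usable_hosts": size - 2 if prefix <= 30 else size,
    }


info = describe_subnet("192.168.1.37/24")   # any address inside the subnet works
print(info["network"], info["broadcast"], info["netmask"])
# 192.168.1.0 192.168.1.255 255.255.255.0
print(info["addresses"], info["usable_hosts"])   # 256 254

# Cross-check against the standard library for the /16 example.
net = ipaddress.ip_network("10.0.0.0/16")
print(net.num_addresses, net.broadcast_address)  # 65536 10.0.255.255
```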

Routing — how a packet finds its destination

Every machine has a routing table. Outbound packet → find the most-specific matching route → send it to that route's next-hop. Routers do this at line speed — modern ASICs make routing decisions in nanoseconds.

shell
# Look at your machine's routing table
ip route               # Linux
netstat -rn            # macOS / BSD / Windows

# Sample output:
default via 192.168.1.1 dev wlan0      ← "no specific route? send everything to the gateway"
192.168.1.0/24 dev wlan0               ← "local network — send directly"
169.254.0.0/16 dev wlan0               ← "link-local — send directly"
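The "most-specific matching route" rule is called longest-prefix match. A minimal Python sketch over a hypothetical table mirroring the sample output above (real routers use tries or TCAM hardware rather than a linear scan):

```python
import ipaddress

ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "via 192.168.1.1 dev wlan0"),  # default route
    (ipaddress.ip_network("192.168.1.0/24"), "dev wlan0"),
    (ipaddress.ip_network("169.254.0.0/16"), "dev wlan0"),
]


def lookup(destination: str):
    """Of all routes containing the address, pick the one with the longest prefix."""
    ip = ipaddress.ip_address(destination)
    candidates = [(net, hop) for net, hop in ROUTES if ip in net]
    if not candidates:
        raise LookupError(f"no route to {destination}")
    return max(candidates, key=lambda route: route[0].prefixlen)


for dst in ["192.168.1.42", "169.254.10.1", "203.0.113.5"]:
    net, hop = lookup(dst)
    print(f"{dst:>15} -> {net} {hop}")
```

The default route `0.0.0.0/0` matches everything, but with a prefix length of 0 it only wins when nothing more specific does.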

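Between ISPs, routes are exchanged via BGP (Border Gateway Protocol), the protocol that holds the global internet together. When a big BGP misconfiguration happens, large chunks of the internet can vanish for hours. (Facebook, October 2021, is the classic example: a faulty configuration change led to the routes for its DNS servers being withdrawn, taking Facebook, Instagram and WhatsApp offline for about six hours.)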

TCP — The Reliable Byte Stream

TCP's job is one sentence: "take a stream of bytes and reliably deliver them in order to a process on a remote machine". Sounds simple. Implementing it on top of an unreliable IP layer involves three-way handshakes, sequence numbers, sliding windows, congestion control, retransmission timers, and 40+ years of accumulated cleverness.

What TCP gives you

| Property | How it works |
|---|---|
| Connection-oriented | A 3-way handshake establishes that both sides are present and agree on initial state. |
| Reliable delivery | Lost packets are detected (via duplicate ACKs or a retransmission timeout) and retransmitted. |
| In-order delivery | Out-of-order arrivals are buffered until the gap fills in. The app sees a clean byte stream. |
| Flow control | The receiver advertises a "window": how many bytes it can accept right now. The sender respects it, so slow receivers don't drown. |
| Congestion control | Algorithms (Reno, CUBIC, BBR) detect when the network is congested and slow down. |
| Full-duplex | Both sides can send at the same time, independently. |

Sequence numbers — the magic glue

shell
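Every byte TCP sends has a sequence number.

Client:  seq=1000, payload "GET /"   (5 bytes: covers seq 1000..1004)
Server:  ack=1005  "I got everything up to (but not including) 1005"

Now suppose the same bytes travel as three segments and the middle one is lost:
  seq=1000  payload "GE"   (bytes 1000..1001)  → arrives; server sends ack=1002
  seq=1002  payload "T "   (bytes 1002..1003)  → LOST
  seq=1004  payload "/"    (byte 1004)         → arrives early; server sends ack=1002 again

The server keeps ACKing 1002 ("I'm still waiting for byte 1002!") for every
segment that arrives after the gap. After 3 such duplicate ACKs, the client
retransmits seq=1002 without waiting for its retransmission timer (fast retransmit).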
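The receiver side of that exchange can be simulated in a few lines of Python. This is a simplified sketch, assuming segments never partially overlap and ignoring SACK and retransmission timers. `cumulative_acks` and `fast_retransmit_seq` are hypothetical helpers, and the ISN of 1000 and the 3-byte segments are chosen only to keep the numbers readable.

```python
def cumulative_acks(arrivals, isn=1000):
    """Return the ACK number a receiver sends after each arriving segment.

    arrivals: (seq, payload) pairs in the order they reach the receiver.
    """
    next_expected = isn   # first byte not yet received in order
    out_of_order = {}     # seq -> payload, held until the gap fills
    acks = []
    for seq, payload in arrivals:
        if seq >= next_expected:          # ignore pure duplicates of old data
            out_of_order[seq] = payload
        # Deliver every contiguous segment starting at next_expected.
        while next_expected in out_of_order:
            next_expected += len(out_of_order.pop(next_expected))
        acks.append(next_expected)        # ACK = next byte expected
    return acks


def fast_retransmit_seq(acks, dup_threshold=3):
    """Seq a sender would resend after `dup_threshold` duplicate ACKs, else None."""
    dups = 0
    for previous, current in zip(acks, acks[1:]):
        dups = dups + 1 if current == previous else 0
        if dups == dup_threshold:
            return current
    return None


# "GET /index.html" sent as five 3-byte segments; the one at seq=1003 is lost.
segments = [(1000, b"GET"), (1003, b" /i"), (1006, b"nde"), (1009, b"x.h"), (1012, b"tml")]
arrivals = [s for s in segments if s[0] != 1003]
acks = cumulative_acks(arrivals)
print(acks)                       # [1003, 1003, 1003, 1003]
print(fast_retransmit_seq(acks))  # 1003

# Once the retransmission arrives, all the buffered bytes are delivered at once.
print(cumulative_acks(arrivals + [(1003, b" /i")]))  # [1003, 1003, 1003, 1003, 1015]
```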

The initial sequence number (ISN) used to be predictable on old OSes — leading to sequence prediction attacks (Mitnick used this against Shimomura in 1994, leading to one of the most famous hacks in history). Modern OSes use cryptographically-random ISNs.

TCP flags — the bits that change everything

| Flag | Meaning |
|---|---|
| SYN | Synchronize: open a connection. Used only during the handshake. |
| ACK | Acknowledge: set on almost every packet after the first SYN. |
| FIN | Finish: graceful close. "I have no more data to send." |
| RST | Reset: abnormal close. "Forget this connection, drop it." |
| PSH | Push: deliver data to the app immediately, don't buffer. |
| URG | Urgent: out-of-band data. Almost nobody uses this anymore. |
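The flags live in one byte of the TCP header, one bit each, which is what the `tcp[tcpflags]` filters in the tcpdump section test. A minimal Python decoder for the six flags in the table (the two ECN-related bits above them are ignored in this sketch; the helper names are hypothetical):

```python
# One bit per flag in the TCP header's flags byte.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def decode_flags(flags_byte: int) -> list:
    """Names of the flags set in a TCP flags byte, lowest bit first."""
    return [name for name, bit in TCP_FLAGS.items() if flags_byte & bit]


def is_connection_attempt(flags_byte: int) -> bool:
    """SYN set and ACK clear: the same test as tcpdump's 'tcp-syn != 0 and tcp-ack == 0'."""
    return bool(flags_byte & TCP_FLAGS["SYN"]) and not flags_byte & TCP_FLAGS["ACK"]


print(decode_flags(0x02))  # ['SYN']          first packet of the handshake
print(decode_flags(0x12))  # ['SYN', 'ACK']   the server's reply
print(decode_flags(0x11))  # ['FIN', 'ACK']   a graceful close
print(is_connection_attempt(0x02), is_connection_attempt(0x12))  # True False
```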

TCP state machine — the whole picture

shell
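                      +--------+
        +-------------| CLOSED |--------------+
        | listen()    +--------+              | connect() / send SYN
        v                                     v
   +--------+                            +----------+
   | LISTEN |                            | SYN_SENT |
   +--------+                            +----------+
        | recv SYN / send SYN-ACK             | recv SYN-ACK / send ACK
        v                                     v
+--------------+       recv ACK        +-------------+
| SYN_RECEIVED |---------------------->| ESTABLISHED |
+--------------+                       +-------------+
                                         |         |
                   close() / send FIN    |         |   recv FIN / send ACK
              +--------------------------+         +----------------+
              v                                                     v
       +------------+                                         +------------+
       | FIN_WAIT_1 |                                         | CLOSE_WAIT |
       +------------+                                         +------------+
              | recv ACK                                            | close() / send FIN
              v                                                     v
       +------------+                                          +----------+
       | FIN_WAIT_2 |                                          | LAST_ACK |
       +------------+                                          +----------+
              | recv FIN / send ACK                                 | recv ACK
              v                                                     v
        +-----------+                                           +--------+
        | TIME_WAIT |------------- 2 x MSL timeout ------------>| CLOSED |
        +-----------+                                           +--------+

(Simplified: simultaneous open/close and the CLOSING state are not shown.)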
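The happy paths through the state machine can be encoded as a transition table. This Python sketch covers only the normal open and close sequences (no simultaneous open/close, no RST handling), and the event strings are informal labels invented for the example, not socket API calls:

```python
# (state, event) -> next state, for the normal open/close paths only.
TRANSITIONS = {
    ("CLOSED", "listen"): "LISTEN",
    ("CLOSED", "connect / send SYN"): "SYN_SENT",
    ("LISTEN", "recv SYN / send SYN-ACK"): "SYN_RECEIVED",
    ("SYN_SENT", "recv SYN-ACK / send ACK"): "ESTABLISHED",
    ("SYN_RECEIVED", "recv ACK"): "ESTABLISHED",
    ("ESTABLISHED", "close / send FIN"): "FIN_WAIT_1",
    ("ESTABLISHED", "recv FIN / send ACK"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN / send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2*MSL timeout"): "CLOSED",
    ("CLOSE_WAIT", "close / send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(events, state="CLOSED"):
    """Walk the table, returning every state visited; reject illegal events."""
    path = [state]
    for event in events:
        if (state, event) not in TRANSITIONS:
            raise ValueError(f"illegal event {event!r} in state {state}")
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path


# The client closes first (active close); the server closes second (passive close).
client = run(["connect / send SYN", "recv SYN-ACK / send ACK", "close / send FIN",
              "recv ACK", "recv FIN / send ACK", "2*MSL timeout"])
server = run(["listen", "recv SYN / send SYN-ACK", "recv ACK",
              "recv FIN / send ACK", "close / send FIN", "recv ACK"])
print(" -> ".join(client))
print(" -> ".join(server))
```

Only the side that closes first passes through TIME_WAIT, which is why busy servers that close connections themselves accumulate so many of them.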

Flow Control & Congestion Control

TCP can't just blast bytes as fast as your NIC allows. It would melt routers in the middle, drop everyone else's traffic, and end up retransmitting half of what it sent. Congestion control is the genius that makes the internet possible.

Slow start, then AIMD

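Classic TCP starts with cwnd (congestion window) = 1 MSS; modern stacks start at 10 MSS (RFC 6928). During slow start, every ACK grows cwnd by 1 MSS, which doubles it every round trip (exponential growth). When cwnd reaches the slow-start threshold (ssthresh), TCP switches to additive growth: +1 MSS per RTT. When loss is signalled by duplicate ACKs, cwnd is halved (multiplicative decrease); a full retransmission timeout drops it back to 1 MSS. This AIMD (Additive Increase, Multiplicative Decrease) behaviour produces TCP's iconic sawtooth.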
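The sawtooth falls out of a few lines of arithmetic. This Python sketch models cwnd in whole MSS units, one step per RTT, and handles only loss detected by duplicate ACKs (Reno-style halving; a timeout, which would drop cwnd back to 1, isn't modelled). `simulate_cwnd` is a hypothetical helper, and the ssthresh value and loss rounds are arbitrary choices for illustration.

```python
def simulate_cwnd(rounds: int, ssthresh: int = 16, losses=(), initial_cwnd: int = 1) -> list:
    """Per-RTT cwnd trace, in MSS: slow start, then AIMD."""
    cwnd = initial_cwnd
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in losses:
            # Multiplicative decrease: halve (Reno-style reaction to 3 duplicate ACKs).
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: +1 MSS per ACK, so cwnd doubles each RTT (capped at ssthresh here).
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MSS per RTT.
            cwnd += 1
    return trace


print(simulate_cwnd(14, ssthresh=16, losses={7, 12}))
# [1, 2, 4, 8, 16, 17, 18, 19, 9, 10, 11, 12, 13, 6]
```

Plot the list and you get the familiar shape: a steep exponential ramp, a slow linear climb, then a cliff at each loss.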

The algorithms

| Algorithm | Year | Key idea |
|---|---|---|
| Tahoe | 1988 | First congestion control. Loss → cwnd=1, restart slow start. |
| Reno | 1990 | Tahoe + Fast Recovery: on 3 dup-ACKs, halve cwnd instead of starting from 1. |
| NewReno | 1996 | Better recovery from multiple losses in the same window. |
| CUBIC | 2005 | Smoother, cubic growth curve. Linux default since kernel 2.6.19 (2006). |
| BBR | 2016 | Google's model-based approach: estimates bottleneck bandwidth and minimum RTT instead of treating loss as the congestion signal. Often much better on long, high-bandwidth or lossy paths. YouTube uses it. |

Flow control vs congestion control — same idea, different scope

| Term | What it does | Where it lives |
|---|---|---|
| Flow control | Receiver tells sender "my buffer is N bytes, don't send more than this". | Receive window (rwnd), in every ACK. |
| Congestion control | Sender estimates network capacity and slows itself down. | Maintained internally as cwnd. |

The sender may have at most min(rwnd, cwnd) unacknowledged bytes in flight. The smaller one wins: slow receivers → flow control limits; slow or congested networks → congestion control limits.
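The min() rule also puts a hard ceiling on throughput: at most one window can be delivered per round trip, and keeping a path full needs a window of bandwidth × RTT bytes (the bandwidth-delay product). A minimal Python sketch of that arithmetic; the helper names are hypothetical, and the window sizes, 1460-byte MSS and 100 ms RTT are illustrative values.

```python
def effective_window(rwnd_bytes: int, cwnd_bytes: int) -> int:
    """Unacknowledged bytes the sender may have in flight right now."""
    return min(rwnd_bytes, cwnd_bytes)


def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """At most one window can be delivered per round trip."""
    return window_bytes * 8 / rtt_s


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the window needed to keep the path full."""
    return bandwidth_bps * rtt_s / 8


print(effective_window(rwnd_bytes=65_535, cwnd_bytes=10 * 1460))  # 14600: congestion-limited
print(max_throughput_bps(65_535, rtt_s=0.100) / 1e6)  # about 5.24 Mbit/s, whatever the link speed
print(bdp_bytes(1e9, rtt_s=0.100) / 1e6)              # about 12.5 MB in flight for 1 Gbit/s
```

That second line is why a fast link with a small window and a long RTT can feel slow: the window, not the wire, is the bottleneck.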

UDP — When Fast Beats Reliable

UDP is the other transport protocol you should know about. Tiny header (8 bytes vs TCP's minimum of 20). No handshake. No retransmission. No ordering. If a packet gets lost, it stays lost. UDP just fires datagrams and hopes for the best.

UDP header — eight bytes total

shell
0      8      16     24     32  bits
+------+------+------+------+
| Source port | Dest port   |     ← 16 bits each
+------+------+------+------+
| Length      | Checksum    |     ← 16 bits each
+------+------+------+------+
| Data...                   |
+---------------------------+

That's the entire protocol. Source port, destination port, length, checksum, payload. No state, no reliability, no flow control. Pure speed.
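Those four fields pack into 8 bytes with Python's `struct` module. This sketch, assuming IPv4, also computes the checksum, which covers an IPv4 "pseudo-header" (source, destination, protocol, length) as well as the UDP header and data. The helper names are hypothetical and the addresses come from the documentation ranges.

```python
import ipaddress
import struct


def ones_complement_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


def build_udp_datagram(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                       payload: bytes) -> bytes:
    length = 8 + len(payload)                     # header + data, in bytes
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    # IPv4 pseudo-header: source, destination, zero, protocol 17 (UDP), UDP length.
    pseudo = (ipaddress.IPv4Address(src_ip).packed
              + ipaddress.IPv4Address(dst_ip).packed
              + struct.pack("!BBH", 0, 17, length))
    checksum = ones_complement_checksum(pseudo + header + payload)
    if checksum == 0:
        checksum = 0xFFFF  # over IPv4, 0 on the wire means "no checksum computed"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def parse_udp(datagram: bytes) -> dict:
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {"src_port": src_port, "dst_port": dst_port, "length": length,
            "checksum": hex(checksum), "data": datagram[8:length]}


dgram = build_udp_datagram("192.0.2.10", "198.51.100.53", 53000, 53, b"hello")
print(len(dgram))  # 13: 8-byte header + 5-byte payload
print(parse_udp(dgram))
```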

When UDP wins

| Use case | Why UDP |
|---|---|
| DNS | Most queries and answers fit in one packet each. Reply, done. TCP would add a handshake round trip before the first query. |
| NTP | Time sync: one tiny exchange. |
| Video / VoIP | Lost packets are stale by the time TCP would retransmit them. Better to skip and keep going. |
| Online games | Same as VoIP: old packets are useless. Custom reliability on top where needed. |
| QUIC / HTTP/3 | Modern reinvention: UDP underneath, with its own reliability + multiplexing built on top. |
| SNMP | Polling network gear. Lightweight. |
| WireGuard | VPN: UDP for fast tunnelling. |

UDP's ugly secret: amplification attacks

UDP's connectionless nature makes it the favourite vector for DDoS amplification. Attacker spoofs source IP = victim, sends small query to an open server (DNS, NTP, memcached) → server replies with a much bigger response TO THE VICTIM. memcached attacks in 2018 hit 1.7 Tbps using ~50 000× amplification. Defence: don't run open services on the public internet, use BCP38 to drop spoofed source IPs at the ISP edge.
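The attacker's leverage is simple multiplication. A Python sketch with illustrative numbers chosen to match the ~50 000× factor above (a hypothetical 15-byte request drawing a 750 000-byte reply); it gives an idealised upper bound, since real attacks are also capped by the reflectors' own bandwidth.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    return response_bytes / request_bytes


def traffic_at_victim_bps(attacker_uplink_bps: float, factor: float) -> float:
    """Each spoofed request comes back at the victim `factor` times larger."""
    return attacker_uplink_bps * factor


factor = amplification_factor(request_bytes=15, response_bytes=750_000)
print(factor)                                       # 50000.0
print(traffic_at_victim_bps(100e6, factor) / 1e12)  # 5.0 (Tbit/s, from a 100 Mbit/s uplink)
```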

TCP/IP Attacks — The Greatest Hits

TCP/IP was designed in the friendliest possible threat model: ARPANET was a small research network where everyone knew everyone. Attacking it would be like attacking your friend's LAN party. Security wasn't a goal. Decades later, we're paying for it.

SYN flood

shell
Attacker spams SYN packets (spoofed source IP) at the target:
  SYN → server (state: SYN_RECEIVED, half-open)
  SYN → server (state: SYN_RECEIVED, half-open)
  SYN → server (state: SYN_RECEIVED, half-open)
  ...thousands per second...

Server's connection table fills up.
Legitimate connections can't allocate state. Service is down.

Defence:
  - SYN cookies: encode the connection state in the SYN-ACK's sequence number; keep no state until the final ACK.
  - Tuning net.ipv4.tcp_max_syn_backlog (a longer queue for half-open connections).
  - SYN proxies in front of the service.

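SYN flood is the OG TCP DoS. SYN cookies (Daniel J. Bernstein and Eric Schenk, 1996) cost almost nothing and are widely deployed; Linux enables them by default (net.ipv4.tcp_syncookies=1) and falls back to them only when the SYN queue overflows.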
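The trick is that the SYN-ACK's initial sequence number doubles as a tamper-proof note to self. Bernstein's original design packs a slowly incrementing counter, a 3-bit MSS code and a 24-bit keyed hash into the 32-bit ISN. The Python sketch below is a simplified variant, assuming HMAC-SHA256, a 2-bit MSS index, 64-second time slots and a hypothetical secret; it is not any OS's actual cookie format.

```python
import hashlib
import hmac

SECRET = b"hypothetical-server-secret"  # a real stack uses a random, rotated key
MSS_TABLE = [536, 1220, 1440, 1460]     # MSS values encodable in 2 bits (illustrative)
SLOT_SECONDS = 64                       # coarse time slot baked into the cookie


def _keyed_hash(four_tuple, slot: int) -> int:
    msg = repr((four_tuple, slot)).encode()
    return int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:4], "big")


def make_syn_cookie(four_tuple, client_mss: int, now: float) -> int:
    """Build the SYN-ACK's ISN: 30 bits of keyed hash plus a 2-bit MSS index."""
    slot = int(now) // SLOT_SECONDS
    mss_index = max((i for i, m in enumerate(MSS_TABLE) if m <= client_mss), default=0)
    return (_keyed_hash(four_tuple, slot) & ~0x3) | mss_index


def check_syn_cookie(four_tuple, ack_number: int, now: float, max_age_slots: int = 1):
    """On the final ACK, rebuild state from the cookie; return the MSS, or None if invalid."""
    cookie = (ack_number - 1) & 0xFFFFFFFF  # the client ACKs our ISN + 1
    for age in range(max_age_slots + 1):    # accept the current or previous slot
        slot = int(now) // SLOT_SECONDS - age
        if (_keyed_hash(four_tuple, slot) & ~0x3) == (cookie & ~0x3):
            return MSS_TABLE[cookie & 0x3]
    return None


conn = ("203.0.113.7", 51515, "192.0.2.1", 443)  # client IP/port, server IP/port
isn = make_syn_cookie(conn, client_mss=1460, now=1_700_000_000)
# No per-connection state was stored; the final ACK alone is enough to recover the MSS.
print(check_syn_cookie(conn, (isn + 1) & 0xFFFFFFFF, now=1_700_000_005))  # 1460
```

The cost shows up in what 2 (or 3) bits can't carry: historically, cookie-based connections lost TCP options such as window scaling unless the stack found somewhere else to encode them.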

TCP sequence prediction (Mitnick attack)

Old TCP stacks generated predictable ISNs (e.g., incrementing by a fixed amount per second). An attacker who could observe one SYN-ACK could guess future ISNs and craft a fake "ACK" to establish a spoofed connection — appearing to come from a trusted source. Kevin Mitnick used this against Tsutomu Shimomura on Christmas 1994. Fix: randomise ISNs cryptographically. Every modern OS does.
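The modern recipe is spelled out in RFC 6528: ISN = M + F(4-tuple, secret key), where M is a clock ticking every 4 microseconds and F is a keyed pseudo-random function. A minimal Python sketch, assuming HMAC-SHA256 as F and a per-process random key; `generate_isn` is a hypothetical helper.

```python
import hashlib
import hmac
import os
import time

SECRET_KEY = os.urandom(32)  # chosen once (e.g. at boot) and never revealed


def generate_isn(local_ip, local_port, remote_ip, remote_port, now_ns=None) -> int:
    """ISN = M + F(4-tuple, secret), in the spirit of RFC 6528."""
    if now_ns is None:
        now_ns = time.monotonic_ns()
    m = now_ns // 4_000  # M: a clock that ticks every 4 microseconds
    msg = f"{local_ip}|{local_port}|{remote_ip}|{remote_port}".encode()
    f = int.from_bytes(hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()[:4], "big")
    return (m + f) & 0xFFFFFFFF  # sequence numbers are 32 bits and wrap around


isn = generate_isn("192.0.2.1", 443, "203.0.113.7", 51515)
print(f"{isn:#010x}")  # differs on every run and for every 4-tuple
```

For a single 4-tuple the ISN still advances with the clock, so an old and a new incarnation of the same connection don't collide. But seeing the ISN of your own connection tells an attacker nothing about anyone else's, because F is different for every 4-tuple and the key is secret.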

RST injection

Anyone on the path who can see the sequence numbers can inject a forged TCP RST packet, instantly tearing down the connection. The Great Firewall of China uses this to censor specific TCP streams. Mitigation: TLS doesn't prevent RST itself (the TCP header travels unencrypted), but apps see the abrupt drop and can retry. QUIC encrypts and authenticates its transport layer, so there is no plaintext reset for a middlebox to forge, though a censor can still drop its UDP packets outright.

Session hijacking

If an attacker can sniff the traffic (they're on the same Wi-Fi, or upstream), they learn the current sequence and acknowledgement numbers and can inject packets into the live session. Defence: end-to-end encryption (TLS, SSH). TCP alone is naked.

Idle scan / Zombie scan (nmap -sI)

Clever stealth port scanning: the attacker uses a third-party "zombie" host with a predictable IPID (the IP header's Identification field). By observing the zombie's IPID before and after sending spoofed probes to the target, the attacker can tell whether the target's port is open, without ever sending packets from their own IP. This only works with zombies that use globally sequential IPIDs (older Windows versions, some embedded gear). Mostly historical now.

Slowloris (HTTP-specific but TCP-shaped)

Open many connections to a web server but send the headers very slowly, say one byte every 10 seconds. Each connection ties up a worker process or thread, so a single modest machine can take down an Apache prefork server. Defence: use async / event-driven servers (nginx, Apache event MPM) and set strict client-header timeouts.
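The "strict timeout" defence is subtle: a per-read timeout doesn't help, because a Slowloris client sends a byte just often enough to keep each individual read alive. What works is one overall deadline for the whole header block. A minimal Python `asyncio` sketch, with `read_request_head` as a hypothetical helper and hand-fed `StreamReader` objects standing in for real sockets:

```python
import asyncio


async def read_request_head(reader: asyncio.StreamReader, deadline_s: float = 10.0) -> bytes:
    """Read up to the blank line ending the HTTP headers, with ONE overall deadline."""
    return await asyncio.wait_for(reader.readuntil(b"\r\n\r\n"), timeout=deadline_s)


async def demo():
    polite = asyncio.StreamReader()
    polite.feed_data(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    head = await read_request_head(polite, deadline_s=1.0)

    slow = asyncio.StreamReader()
    slow.feed_data(b"GET / HTTP/1.1\r\nX-a: ")  # ...and the rest never arrives in time
    try:
        await read_request_head(slow, deadline_s=0.2)
        dropped = False
    except asyncio.TimeoutError:
        dropped = True                           # a real server would close the socket here
    return head, dropped


head, slow_client_dropped = asyncio.run(demo())
print(head)
print("slow client dropped:", slow_client_dropped)  # True
```

In a real server the same call would sit inside the handler passed to `asyncio.start_server`, closing the writer on timeout. The reader's default buffer limit (64 KiB) also makes `readuntil` raise `LimitOverrunError` on absurdly large headers, which caps memory per connection.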

Tools — Your TCP/IP Toolkit

You'll spend a lot of your engineering career poking at TCP/IP. These are the tools you'll use.

tcpdump — packet capture from the CLI

bash
# Capture HTTP traffic on any interface
sudo tcpdump -i any -n port 80

# Capture and save to a file for Wireshark
sudo tcpdump -i any -w capture.pcap port 443

# Show full packet contents (verbose, with hex)
sudo tcpdump -i any -nvvX port 53

# Specific source / destination
sudo tcpdump -i any -n 'host 8.8.8.8 and port 53'

# Just SYN packets (handshake initiations)
sudo tcpdump -i any -n 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'

# Just RST packets (connection resets)
sudo tcpdump -i any -n 'tcp[tcpflags] & tcp-rst != 0'

ss — modern socket statistics

bash
# Non-listening TCP sockets with owning processes (Linux replacement for netstat; add -a to include listeners)
ss -tnp

# Listening sockets
ss -tlnp

# Internal TCP info per connection (rtt, cwnd, retransmits); add -o for timers, -e for extended info
ss -tinp

# Connection counts by state
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c

# All UDP sockets, including unconnected ones
ss -uanp

Wireshark — the GUI

Same packet capture power as tcpdump, but with a UI. Open a .pcap file, filter with display filters (tcp.flags.syn==1 and tcp.flags.ack==0, tcp.analysis.retransmission), follow TCP streams, decode TLS with SSLKEYLOGFILE. Indispensable.

nmap — scan the world

shell
# Quick TCP SYN scan (the default scan type when run as root)
sudo nmap -sS target.com

# Full TCP connect (completes the handshake; slower, but works without root)
nmap -sT target.com

# Port version detection
nmap -sV -p 22,80,443 target.com

# OS detection (needs root)
sudo nmap -O target.com

# All TCP and UDP ports (very slow! -sU on its own would scan only UDP)
sudo nmap -sS -sU -p- target.com

# Aggressive: version, OS, scripts, traceroute
sudo nmap -A target.com

hping3 — craft custom packets

bash
# Send a SYN to port 80 (a "ping" that often gets through firewalls that drop ICMP)
sudo hping3 -S -p 80 target.com

# Generate a TCP SYN flood (for stress-testing your own servers only!)
sudo hping3 -i u1 -S -p 80 --rand-source target.com

# Traceroute via TCP SYNs to port 80 (often passes firewalls that block ICMP/UDP traceroute)
sudo hping3 -T -S -p 80 target.com

# Send a file's contents as the payload (-E reads the file, -d sets the payload size in bytes)
sudo hping3 -E payload.txt -d 64 -p 9999 -S target.com

mtr — visual traceroute + ping in one

shell
# Watch a path live, with stats on every hop
mtr target.com

# Report mode (10 cycles then exit, e.g. for sharing)
mtr -r -c 10 target.com

iperf3 — actual bandwidth measurement

shell
# On the server:
iperf3 -s

# On the client (transfers max throughput for 10 seconds):
iperf3 -c server-ip

# Test UDP throughput
iperf3 -c server-ip -u -b 100M

OS-level kernel tunables (Linux)

bash
# Current congestion control
sysctl net.ipv4.tcp_congestion_control
# cubic   ← typical default

# What's available
sysctl net.ipv4.tcp_available_congestion_control
# reno cubic bbr ...

# Switch to BBR (Google's modern algorithm; often faster on high-RTT links)
echo "net.core.default_qdisc = fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control = bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Raise the cap on each listening socket's accept queue (the listen() backlog)
sudo sysctl -w net.core.somaxconn=4096

# Let new outgoing connections reuse sockets still in TIME_WAIT (understand it before using it in production!)
sudo sysctl -w net.ipv4.tcp_tw_reuse=1

TCP/IP in the Real World

Theory is clean. Production is messy. Here's the lived reality of TCP/IP that interview questions never cover.

NAT — how 4 billion IPs became enough

Your home network has dozens of devices sharing one public IP. NAT (Network Address Translation) at your router rewrites source IP + port on outbound packets, and reverses it on the way back. To the internet, everything looks like it came from one machine.

| Type | What it does |
|---|---|
| Source NAT (SNAT) | Outbound: rewrite source. Used by routers / firewalls. |
| Destination NAT (DNAT) | Inbound: rewrite destination. Used for port forwarding. |
| PAT / NAPT | NAT that also rewrites ports, so many internal clients can share one external IP. |
| CGNAT | Carrier-grade NAT. Your ISP NATs you too: even the "public IP" on your router is private to the ISP (often from 100.64.0.0/10). Common in mobile networks. |

CGNAT makes unsolicited inbound connections impossible. Any service expecting users to "open port 80" no longer works for CGNAT customers. Workarounds: TURN servers (WebRTC), reverse tunnels (ngrok), hole-punching.
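The core of PAT is a two-way lookup table. This toy Python sketch (a hypothetical `ToyNAT` class with documentation-range addresses) keys each mapping on the remote endpoint too, roughly like a strict NAT; real devices also expire idle entries and differ in exactly how they pick and share ports. It shows why unsolicited inbound traffic has nowhere to go:

```python
import itertools


class ToyNAT:
    """A toy PAT table: many (private IP, port) pairs share one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._free_ports = itertools.count(first_port)
        self.outbound = {}  # (private_ip, private_port, remote) -> public_port
        self.inbound = {}   # (public_port, remote) -> (private_ip, private_port)

    def translate_out(self, src_ip: str, src_port: int, remote: tuple) -> tuple:
        """Rewrite an outgoing packet's source; the first packet creates the mapping."""
        key = (src_ip, src_port, remote)
        if key not in self.outbound:
            public_port = next(self._free_ports)
            self.outbound[key] = public_port
            self.inbound[(public_port, remote)] = (src_ip, src_port)
        return self.public_ip, self.outbound[key]

    def translate_in(self, dst_port: int, remote: tuple):
        """Map a reply back inside, or return None (drop) if no mapping exists."""
        return self.inbound.get((dst_port, remote))


nat = ToyNAT("198.51.100.20")
web = ("203.0.113.80", 443)
print(nat.translate_out("192.168.1.10", 51000, web))   # ('198.51.100.20', 40000)
print(nat.translate_out("192.168.1.11", 51000, web))   # ('198.51.100.20', 40001)
print(nat.translate_in(40001, web))                    # ('192.168.1.11', 51000)
print(nat.translate_in(8080, ("203.0.113.99", 1234)))  # None: unsolicited, nowhere to send it
```

Under CGNAT the same thing happens twice: once in your router, then again in the ISP's box.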

MTU and the fragmentation footgun

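MTU = Maximum Transmission Unit. Ethernet defaults to 1500 bytes, PPPoE (DSL) to 1492, and VPN tunnels go lower still depending on their overhead (often around 1380-1420). If a packet is bigger than the path MTU, it either gets fragmented by a router (IPv4 without the DF flag) or dropped with an ICMP error sent back to the sender (IPv6, or IPv4 with DF set).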

Path MTU Discovery (PMTUD) relies on those ICMP errors ("Fragmentation Needed" in IPv4, "Packet Too Big" in IPv6) to find the smallest MTU on the path. Many firewalls block all ICMP, silently breaking PMTUD. Symptoms: small packets work, large transfers stall. Hunting these down is its own special hell.
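The MTU directly caps the TCP payload per packet (the MSS): subtract the IP header (20 bytes for IPv4, 40 for IPv6) and the 20-byte base TCP header. A small Python sketch of that arithmetic, assuming no IP or TCP options; the helper names are hypothetical, and 1420 (WireGuard's usual default MTU) stands in as an example tunnel value.

```python
IP_HEADER = {4: 20, 6: 40}  # base header sizes in bytes, no options
TCP_HEADER = 20             # base TCP header; options such as timestamps take more
ICMP_ECHO_HEADER = 8


def tcp_mss(mtu: int, ip_version: int = 4) -> int:
    """Largest TCP payload that fits in one unfragmented packet of this MTU."""
    return mtu - IP_HEADER[ip_version] - TCP_HEADER


def ping_payload_for(mtu: int) -> int:
    """ICMP echo payload that makes an IPv4 ping packet exactly `mtu` bytes."""
    return mtu - IP_HEADER[4] - ICMP_ECHO_HEADER


for link, mtu in [("Ethernet", 1500), ("PPPoE", 1492), ("WireGuard", 1420)]:
    print(f"{link:9} MTU {mtu}: MSS {tcp_mss(mtu)} (IPv4), {tcp_mss(mtu, 6)} (IPv6)")
# Ethernet  MTU 1500: MSS 1460 (IPv4), 1440 (IPv6)
# PPPoE     MTU 1492: MSS 1452 (IPv4), 1432 (IPv6)
# WireGuard MTU 1420: MSS 1380 (IPv4), 1360 (IPv6)

print(ping_payload_for(1500))  # 1472
```

That last number is handy for probing a path by hand: on Linux, `ping -M do -s 1472 target.com` sends a full 1500-byte IPv4 packet with DF set, and fails if some hop along the way has a smaller MTU.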

TIME_WAIT — the misunderstood state

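After a connection closes, the side that closed first (the active closer) stays in TIME_WAIT for 2 × MSL (Maximum Segment Lifetime). RFC 793 suggested an MSL of 2 minutes; Linux hard-codes TIME_WAIT at 60 seconds. Why wait at all? To resend the final ACK if the peer's FIN is retransmitted, and to catch stragglers (delayed packets that might arrive after close) before they can confuse a new connection on the same 4-tuple.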

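On a busy server, you might see tens of thousands of TIME_WAIT entries. It's tempting to "fix" this with tcp_tw_reuse=1, but understand it first: it only applies to new outgoing connections and depends on TCP timestamps (its cousin tcp_tw_recycle broke clients behind NAT and was removed in Linux 4.12). Better fixes: use long-lived connections (keep-alive) or HTTP/2/3 multiplexing, and set SO_REUSEADDR on listening sockets so a restarted server can bind its port while old connections are still in TIME_WAIT.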
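The classic TIME_WAIT annoyance is "Address already in use" when restarting a server while its old connections linger. Setting SO_REUSEADDR on the listening socket is the usual fix. A Python sketch of that restart scenario on the loopback interface, with hypothetical helper names (SO_REUSEADDR has looser semantics on Windows, so treat this as a POSIX-oriented example):

```python
import socket


def make_listener(port: int = 0, host: str = "127.0.0.1") -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Permit bind() while old connections on this port are still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def restart_on_same_port() -> tuple:
    first = make_listener()            # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = first.accept()
    conn.close()                       # server closes first, so its side ends in TIME_WAIT
    client.close()
    first.close()
    second = make_listener(port)       # rebinding the same port works thanks to SO_REUSEADDR
    reopened = second.getsockname()[1]
    second.close()
    return port, reopened


old_port, new_port = restart_on_same_port()
print(old_port == new_port)  # True
```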

Keepalive — detecting dead peers

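If the connection just sits idle, TCP doesn't notice if the peer disappeared. TCP keepalive sends periodic empty probes; if enough go unanswered (9 by default on Linux), the connection is considered dead. Linux defaults are very loose: 2 hours idle before the first probe, then probes 75 seconds apart, so a vanished peer can go unnoticed for over two hours. Most apps tighten this by enabling SO_KEEPALIVE and tuning the idle time, probe interval and probe count (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT on Linux), or implement application-layer heartbeats.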
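A Python sketch of that tuning. Option names differ between platforms (Linux uses TCP_KEEPIDLE, macOS calls the idle time TCP_KEEPALIVE), so each one is guarded with `hasattr`. `enable_keepalive` and `worst_case_detection_s` are hypothetical helpers, and the 60 s / 10 s / 5-probe values are example settings, not recommendations.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive and tighten its timers where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):     # idle time before the first probe (Linux)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    elif hasattr(socket, "TCP_KEEPALIVE"):  # the same setting under its macOS name
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):    # gap between unanswered probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):      # unanswered probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def worst_case_detection_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Seconds from the last traffic until a silent peer is declared dead."""
    return idle_s + interval_s * probes


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    enable_keepalive(sock)
    print("SO_KEEPALIVE on:", bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))

print(worst_case_detection_s(7200, 75, 9))  # 7875 s (about 2 h 11 min): Linux defaults
print(worst_case_detection_s(60, 10, 5))    # 110 s with the example settings above
```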

The C10K and C10M problems

1999: Dan Kegel asked "can a single server handle 10 000 concurrent connections?" The question drove the evolution of epoll/kqueue/IOCP, async runtimes (libuv, Tokio) and event-driven web servers (nginx). 2013: Robert Graham asked "can it handle 10 MILLION?" That required taking the kernel out of the packet path with kernel-bypass frameworks such as DPDK, with in-kernel fast paths like XDP arriving later. In both cases the answer turned out to be "yes, but the kernel is the bottleneck": first fix the kernel's interfaces, then route around it.

Quick Reference Cheat Sheet

One-liners and constants that come up constantly.

Common ports you should recognise on sight

| Port | Service |
|---|---|
| 20 / 21 | FTP (data / control) |
| 22 | SSH |
| 23 | Telnet (don't use) |
| 25 | SMTP |
| 53 | DNS (UDP + TCP) |
| 67 / 68 | DHCP (server / client) |
| 80 | HTTP |
| 110 | POP3 |
| 111 | rpcbind / portmap |
| 123 | NTP |
| 143 | IMAP |
| 161 | SNMP |
| 389 | LDAP |
| 443 | HTTPS |
| 445 | SMB / CIFS |
| 465 / 587 | SMTPS / SMTP submission |
| 636 | LDAPS |
| 853 | DoT (DNS over TLS) |
| 993 | IMAPS |
| 995 | POP3S |
| 3306 | MySQL / MariaDB |
| 3389 | RDP |
| 5432 | PostgreSQL |
| 5672 | AMQP / RabbitMQ |
| 6379 | Redis |
| 8080 / 8443 | HTTP / HTTPS alternates |
| 9200 | Elasticsearch |
| 27017 | MongoDB |

Common TCP/UDP one-liners

bash
# Test if a port is open from CLI
nc -zv target.com 443
# Or:
timeout 2 bash -c '</dev/tcp/target.com/443' && echo open

# Quick HTTP banner
echo -e 'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n' | nc -q 1 example.com 80

# Find which process owns a port
sudo lsof -i :8080
sudo ss -tlnp | grep :8080

# Show real-time bandwidth per connection
sudo iftop
sudo nethogs

# Test latency to a host
ping -c 4 target.com

# Find the path to a host
traceroute target.com
# Or mtr for live updating

# Continuously refresh active connections
watch -n 1 'ss -tan | head -20'
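For scripts, the `nc -zv` check translates directly to Python's `socket.create_connection`. A minimal sketch; `port_is_open` is a hypothetical helper, and the demo probes a throwaway listener on 127.0.0.1 so it runs without network access.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Like `nc -z`: True if a TCP handshake with host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, timed out, or the name didn't resolve
        return False


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    test_port = listener.getsockname()[1]
    open_while_listening = port_is_open("127.0.0.1", test_port)

closed_after = port_is_open("127.0.0.1", test_port)
print(open_while_listening, closed_after)  # True False (nothing listens any more)
```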

TCP state cheat-sheet

| State | Meaning |
|---|---|
| LISTEN | Server bound, waiting for SYN |
| SYN_SENT | Client sent SYN, waiting for SYN-ACK |
| SYN_RECEIVED | Server sent SYN-ACK, waiting for ACK |
| ESTABLISHED | Data flowing both ways |
| FIN_WAIT_1 | Local closed, waiting for peer's ACK |
| FIN_WAIT_2 | Local closed and ACKed; waiting for peer's FIN |
| CLOSE_WAIT | Peer closed; waiting for local app to call close() |
| LAST_ACK | Local FIN sent after CLOSE_WAIT; waiting for final ACK |
| TIME_WAIT | 2 × MSL wait after close to catch stragglers |
| CLOSED | No connection |

Closing Thoughts

TCP/IP is the unsung infrastructure that runs literally everything you do online. The 1970s-vintage protocol holds up 21st-century traffic at 100+ Gbps over fibre, and somehow it still works.

Two big takeaways. Every problem is a layer problem: when something breaks, ask which layer is lying. And retransmission, congestion control and the middleboxes along the path are doing far more work than you can see. The bytes you write aren't carried the way you imagine; they're segmented, coalesced, fragmented, retried, rewritten by NATs and queued at every step.

Spend a weekend with tcpdump on your laptop. Open a few websites, ssh somewhere, run a video call. Watch the SYNs fly. The internet stops being magic and starts being a beautifully-engineered, slightly-broken machine. That's the goal.
