→ 9,915,000 latency measurements across 75 experiments (5 message sizes × 5 concurrency levels × 3 protocols). All use identical protobuf serialization — protocol overhead is the only variable.
→ TCP has lowest average latency (18µs baseline) but catastrophic P99 tail — 121× amplification (P99=1,049ms vs P50=8.9ms) at 64KB, c=500. Zero application-level flow control.
→ gRPC wins under stress: P99/P50 = 2× at all concurrency levels. TCP and REST exhibit a "knee" between c=10 and c=50 where tail ratio jumps to 24× and 29× respectively.
→ REST generates 914K context switches at c=500 — 7.5× more than TCP. Super-linear growth from HTTP/1.1's one-goroutine-per-connection model.
→ REST and gRPC make 36–41× more futex calls than TCP. Synchronization overhead — not protocol parsing — is the dominant cost.
→ Function-level profiling: REST spends 5× more CPU in GC than gRPC (3.71% vs 0.71%) due to short-lived header allocations. gRPC spends 63% more in goroutine scheduler (8.31% vs 5.11%) from HTTP/2's dedicated goroutines.
→ Three-zone page fault behavior discovered across message sizes — Zone 1 (<2KB): gRPC overhead dominates; Zone 2 (2–31KB): REST dynamic buffer growth dominates; Zone 3 (≥32KB): Go allocator _MaxSmallSize=32768 boundary causes step-function page fault spike confirmed by two independent controlled experiments.
→ Dense latency sweep (512B to 64KB, 13 sizes) confirms both crossovers in end-to-end latency: gradual transition at ~2KB (architectural) and sharp step at ~32KB (hard allocator boundary). REST/gRPC ordering is non-monotonic across message sizes.
121×: TCP tail amplification at 64KB, c=500. P99 = 1,049ms while P50 = 8.9ms. No flow control → 500 goroutines simultaneously saturate the kernel send buffer.
2×: gRPC P99/P50 ratio across all concurrency levels. HTTP/2 WINDOW_UPDATE frames enforce backpressure — prevents unbounded queue buildup regardless of client count.
914K: REST context switches at c=500 — 7.5× more than TCP. HTTP/1.1's per-connection goroutine model generates super-linear scheduling pressure under load.
36–41×: more futex (mutex lock) calls in REST/gRPC vs TCP. Synchronization overhead dominates — not parsing, not serialization, not framing.
32KB: Go runtime allocator boundary — page fault step function confirmed by two independent controlled experiments (frame-size sweep + GOGC=off). Structural to _MaxSmallSize; not addressable by GC tuning.
[context] why does protocol choice matter at the kernel level?
In distributed systems, the protocol stack sits on the critical path of every request. The choice between TCP, REST, and gRPC is not merely an API preference — it is a binding contract with the OS scheduler, the kernel network stack, and the memory allocator.
The gap in existing benchmarks: Most protocol comparisons report only end-to-end latency under a single load condition. They do not explain why protocols differ, nor how overhead evolves with message size and concurrency. Without this, protocol selection is intuition-driven rather than evidence-based. This work instruments at the hardware counter, flame graph, and syscall level to expose root causes rather than symptoms.
design principle — uniform serialization
All three protocols use protobuf serialization, including raw TCP (with a 4-byte length-prefix frame). This deliberately isolates protocol overhead as the only variable. Latency differences are attributable to framing, flow control, and multiplexing — not encoding cost. Function-level perf data confirms that protobuf deserialization appears in both REST and gRPC call graphs at equivalent depth, validating this design choice.
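For reference, here is a minimal sketch of that length-prefix framing, assuming a generated protobuf type behind the proto.Message interface; the function names are illustrative and this is not the benchmark's exact client code:

// framing.go: 4-byte big-endian length prefix around a protobuf body,
// as used by the raw-TCP variant (illustrative sketch only).
package framing

import (
    "encoding/binary"
    "io"
    "net"

    "google.golang.org/protobuf/proto"
)

// writeFrame marshals msg and prepends its length as a 4-byte big-endian prefix.
func writeFrame(conn net.Conn, msg proto.Message) error {
    body, err := proto.Marshal(msg)
    if err != nil {
        return err
    }
    var prefix [4]byte
    binary.BigEndian.PutUint32(prefix[:], uint32(len(body)))
    if _, err := conn.Write(prefix[:]); err != nil {
        return err
    }
    _, err = conn.Write(body)
    return err
}

// readFrame reads one length-prefixed frame and unmarshals it into msg.
func readFrame(conn net.Conn, msg proto.Message) error {
    var prefix [4]byte
    if _, err := io.ReadFull(conn, prefix[:]); err != nil {
        return err
    }
    body := make([]byte, binary.BigEndian.Uint32(prefix[:]))
    if _, err := io.ReadFull(conn, body); err != nil {
        return err
    }
    return proto.Unmarshal(body, msg)
}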
Inspired by Raghavan et al.'s network stack overhead analysis, which showed that even loopback communication incurs measurable kernel crossing costs. Our contribution extends this with per-protocol hardware counter comparison and controlled anomaly investigation.
[experimental_setup]
┌─────────────────────────────────────────────────────────────────────┐
│ Platform : Ubuntu 24.04 LTS | Kernel : 6.8.0-101-generic │
│ CPU : Hybrid P/E-core | P-cores: 4500 MHz max │
│ Language : Go 1.25 | Serialization: protobuf (all 3) │
└─────────────────────────────────────────────────────────────────────┘
Message Sizes → 64B 256B 1KB 4KB 64KB (+ 13-point dense sweep)
Concurrency → 1 10 50 100 500 (concurrent clients)
Protocols → TCP (raw) REST (HTTP/1.1) gRPC (HTTP/2)
75 configurations × (concurrency × 1,000 requests) = 9,915,000 samples
CPU Pinning → Server pinned to CPU1 (P-core, 4500MHz) via taskset
Client allowed on CPUs 0,2,3,4,5,6,7 (P-cores only)
E-cores (2500MHz) excluded entirely — no mixed-core noise
Priority → sudo nice -n -20 (highest OS scheduling priority)
Isolation → One protocol benchmarked at a time — others stopped
Warmup → 1 warmup request per client before measurement begins
HW Counters → kernel.perf_event_paranoid = 1
[tools] profiling stack
Tool            | Version    | Purpose
perf stat       | 6.8.12     | Hardware counters: CPU cycles, instructions, IPC, cache refs/misses, branch misses, context switches, page faults, user/sys time split
perf record     | 6.8.12     | CPU time sampling at 1KHz with call graph capture for flame graph generation and function-level breakdown
perf report     | 6.8.12     | Function-level CPU time attribution — attached to running client processes (120K–137K samples at 4KB, c=1)
perf trace      | 6.8.12     | Syscall frequency counting: futex, read, write, epoll_pwait, nanosleep, sched_yield
FlameGraph      | Gregg v1.0 | stackcollapse-perf.pl + flamegraph.pl — visual CPU time attribution per protocol
Go              | 1.25.0     | All protocol implementations: net (TCP), net/http (REST), google.golang.org/grpc (gRPC)
protoc          | 25.1       | Protocol Buffer code generation — identical Payload message used across all three
taskset         | 2.39.3     | CPU core pinning — server to CPU1, clients to CPUs 0,2,3,4,5,6,7
Python / pandas | 3.12       | 9.9M sample aggregation, P50/P99/P999 computation, zone analysis plots
why Go over C++?
Go provides production-quality libraries for all three protocols under one consistent runtime — same GC, same scheduler, same allocator. Protocol differences are not contaminated by implementation quality gaps. Go's M:N goroutine model keeps client-side scheduling noise lower than 500 raw pthreads in C++ would at high concurrency.
Tradeoff acknowledged: Go runtime overhead (GC, goroutine scheduler) is inseparable from protocol overhead without a bare-metal baseline. A C++ implementation is left as future work. This limitation is partially addressed by function-level perf data which directly attributes GC and scheduler costs per protocol.
[implementation] protocol details
All three implement an identical echo service. Client sends a protobuf Payload{id, data[]byte, timestamp}. Server echoes with updated timestamp. Same logical work — overhead differences are purely protocol mechanics.
Protocol | Framing                                          | Connection Model                                      | Flow Control
TCP      | 4-byte big-endian length prefix + protobuf body  | 1 persistent connection per client                    | None (kernel TCP only)
REST     | HTTP/1.1 headers + binary protobuf body to /echo | net/http persistent connections with connection pool | None
gRPC     | HTTP/2 binary frames + HPACK header compression  | Stream multiplexing over shared connection(s)         | WINDOW_UPDATE per stream
serialization control
By using binary protobuf on REST instead of JSON, serialization cost is identical across all three. This is confirmed in function-level perf data: protobuf deserialization is visible in both call graphs at equivalent depth.
[protocol_profiles]
TCP (raw)
4-byte length-prefix framing
No application-level flow control
1 persistent connection per client
Zero protocol state management
80% execution in kernel space
0.13 futex calls/request
→ Fastest average. Worst tail. Best CPU efficiency. Kernel-dominated.
→ gRPC (HTTP/2): Best tail latency. Best under stress. Highest baseline cost. Cache-hungry at scale.
[baseline_results]
Page fault / latency zone summary — REST vs gRPC ordering is non-monotonic:
Zone 1: <2KB gRPC overhead > REST
Zone 2: 2–31KB gRPC faster (REST bufio growth)
Zone 3: ≥32KB Go allocator boundary hit
Each zone has a distinct confirmed root cause — see anomalies section for kernel-level investigation.
Baseline latency vs message size (concurrency = 1)
Message Size | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP | gRPC/TCP
64B          | 18.8     | 35.8      | 47.9      | 1.9×     | 2.5×
256B         | 17.3     | 34.2      | 45.0      | 2.0×     | 2.6×
1KB          | 18.5     | 42.3      | 46.3      | 2.3×     | 2.5×
4KB ⚠        | 21.2     | 55.4      | 50.4      | 2.6×     | 2.4×
64KB         | 56.6     | 199.1     | 216.3     | 3.5×     | 3.8×
TCP consistently 2–3.8× faster. At 4KB, REST (55.4µs) exceeds gRPC (50.4µs) — reversing expected ordering. This is the Zone 1→2 crossover in latency, confirmed by dense sweep (see dense latency section). See anomaly section for hardware counter confirmation.
64KB gap widens to 3.8×
At 64KB, payloads exceed L1 cache capacity — cache evictions and memory bandwidth become the dominant cost. TCP, which does no user-space buffer manipulation, is insulated from this effect. REST and gRPC both perform extensive user-space buffer reads and copies, magnifying the cache pressure.
Latency vs concurrency (64B messages) — linear scaling at small messages
Concurrency | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP
1           | 18.8     | 35.8      | 47.9      | 1.9×
10          | 94.7     | 145.4     | 194.6     | 1.5×
50          | 486.7    | 777.7     | 889.2     | 1.6×
100         | 936.8    | 1,631.0   | 1,848.3   | 1.7×
500         | 4,996.8  | 9,023.5   | 10,830.5  | 1.8×
At 64B all three scale approximately linearly. The REST/TCP ratio stays stable at 1.5–1.9× — indicating REST overhead is largely fixed per request at small message sizes. This is average latency; tail behavior tells a completely different story — see next section.
Throughput analysis
At concurrency=1 with 64B messages, TCP achieves approximately 49,000 req/s, versus 24,000 for REST and 21,000 for gRPC. At higher concurrency, all three protocols converge — the bottleneck shifts from protocol overhead to server CPU capacity.
throughput vs tail latency tradeoff
The semaphore-based TCP optimization costs only 7% throughput (10,466 → 9,703 req/s) while reducing tail ratio from 121× to 18× and average latency by 8.7×. Throughput and tail latency are decoupled by queue discipline, not by fundamental protocol limits.
[tail_latency_and_amplification]
protocol inversion effect — the most important result
High concurrency, large messages (P99 tail latency): gRPC (186ms) << TCP (1,049ms) < REST (1,202ms)
The protocol ranked best for average-case latency has the worst P99 tail latency at scale. Performance ranking completely inverts.
P99/P50 tail amplification ratio — full table at 64KB
The "knee" occurs between c=10 and c=50 for TCP and REST. gRPC remains flat throughout.
Protocol | Conc. | P50 (µs) | P99 (µs)  | P99/P50
TCP      | 1     | 41       | 128       | 3.1×
TCP      | 10    | 479      | 897       | 1.9×
TCP      | 50    | 1,753    | 42,010    | 24×
TCP      | 100   | 3,489    | 140,215   | 40×
TCP      | 500   | 8,915    | 1,083,462 | 121×
REST     | 1     | 159      | 395       | 2.5×
REST     | 10    | 1,487    | 12,105    | 8.1×
REST     | 50    | 5,877    | 167,801   | 29×
REST     | 100   | 8,108    | 391,420   | 48×
REST     | 500   | 16,887   | 1,177,731 | 70×
gRPC     | 1     | 195      | 422       | 2.2×
gRPC     | 10    | 1,775    | 3,066     | 1.7×
gRPC     | 50    | 10,424   | 18,781    | 1.8×
gRPC     | 100   | 21,276   | 33,193    | 1.6×
gRPC     | 500   | 95,729   | 186,705   | 2.0×
mechanism — why TCP and REST collapse
At c=500 with 64KB messages, all 500 goroutines simultaneously compete for kernel send buffer space. Without application-level backpressure, late-arriving requests encounter a fully saturated send buffer and wait in the kernel queue. gRPC's HTTP/2 WINDOW_UPDATE frames explicitly limit in-flight unacknowledged data per stream — senders block at the application layer before hitting the kernel, converting unbounded kernel queuing into bounded, controlled waiting.
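For concreteness, the windows that implement this backpressure are exposed as dial options in grpc-go; a minimal sketch, with an address and window sizes that are illustrative rather than the benchmark's actual configuration:

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Per-stream and per-connection flow-control windows bound how much
    // unacknowledged data may be in flight before the sender blocks at the
    // application layer (values are illustrative, not the benchmark's).
    conn, err := grpc.Dial("localhost:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithInitialWindowSize(64<<10),    // per-stream window
        grpc.WithInitialConnWindowSize(1<<20), // per-connection window
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    // ... issue RPCs over conn ...
}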
[kernel_profiling]
Hardware performance counters — perf stat (64B, c=1, 10K requests)
Counter          | TCP         | REST        | gRPC        | REST/TCP | gRPC/TCP
CPU Cycles       | 567M        | 1,350M      | 1,539M      | 2.4×     | 2.7×
Instructions     | 751M        | 1,549M      | 1,674M      | 2.1×     | 2.2×
IPC              | 1.32        | 1.14        | 1.08        | —        | —
Cache References | 2.4M        | 14.5M       | 18.6M       | 6.0×     | 7.7×
Cache Misses     | 222K        | 456K        | 548K        | 2.1×     | 2.5×
Branch Misses    | 2.2M        | 3.8M        | 4.6M        | 1.7×     | 2.1×
Context Switches | 14,708      | 42,629      | 45,193      | 2.9×     | 3.1×
User Time        | 33ms (20%)  | 187ms (46%) | 245ms (53%) | —        | —
Sys Time         | 131ms (80%) | 215ms (54%) | 210ms (47%) | —        | —
user vs kernel split — where work happens
TCP: 80% sys — kernel-dominated at all concurrency levels. This ratio is constant regardless of load.
REST: 54% sys at c=1 → 35% sys at c=500 — protocol logic and connection management take over at scale.
gRPC: 47% sys at c=1 → 26% sys at c=500 — HTTP/2 state machine, stream coordination, and GC dominate at scale.
At scale, the bottleneck in REST and gRPC is not the kernel network stack — it is their own user-space abstractions.
Hardware counters at intermediate message sizes — explains the 4KB REST/gRPC crossover
To understand why REST latency exceeds gRPC at 4KB, hardware counters were collected at 1KB, 4KB, and 16KB with c=1 and 10,000 requests.
Metric           | 1KB    | 4KB    | 16KB
REST cycles      | 1,502M | 2,322M | 4,076M
gRPC cycles      | 1,714M | 2,020M | 2,898M
REST cache-refs  | 20.2M  | 40.5M  | 106.2M
gRPC cache-refs  | 24.7M  | 34.2M  | 66.7M
REST cache-miss% | 7.7%   | 8.3%   | 5.9%
gRPC cache-miss% | 6.4%   | 6.2%   | 4.9%
REST page-faults | 3,560  | 6,136  | 15,387
gRPC page-faults | 2,974  | 3,868  | 6,251
confirmed root cause — net/http per-request buffer allocation
At 1KB, gRPC consumes 14% more cycles than REST due to HTTP/2 connection setup — HPACK tables, stream state, flow control structures.
Above 2KB this inverts: REST's net/http transport allocates a fresh response buffer per request sized to each payload, increasing heap pressure and reducing cache locality compared to gRPC's pre-allocated fixed-size frame buffers via tieredBufferPool. At 4KB, REST consumes 15% more cycles and 18% more cache references. At 16KB the gap reaches 40% more cycles and 59% more cache references.
gRPC's frame buffers are pre-allocated and reused across requests — no dynamic resizing per request.
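The two allocation patterns can be sketched side by side. gRPC's tieredBufferPool is internal to grpc-go, so sync.Pool stands in here purely to illustrate the reuse pattern, and handleFresh loosely mimics the per-request allocation behavior described above:

package bufreuse

import "sync"

// Pooled pattern: fixed-size buffers are reused across requests, so the hot
// path rarely asks the allocator (and hence the OS) for new heap pages.
var framePool = sync.Pool{
    New: func() any { return make([]byte, 32<<10) },
}

func handleWithPool(payload []byte) { // assumes payload fits in 32 KB
    buf := framePool.Get().([]byte)
    defer framePool.Put(buf)
    copy(buf, payload)
    // ... process buf[:len(payload)] ...
}

// Per-request pattern: a fresh, payload-sized allocation on every request
// keeps the allocator busy and creates short-lived garbage for the GC.
func handleFresh(payload []byte) {
    buf := make([]byte, len(payload))
    copy(buf, payload)
    // ... process buf ...
}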
Function-level CPU breakdown — perf report (4KB, c=1)
Attached perf record to running client processes (120K–137K samples). Both protocols spend ~77–79% blocked waiting on network I/O. Of active CPU samples:
Category                                 | REST % | gRPC % | Implication
Protocol layer (HTTP framing/parsing)    | 4.02   | 3.56   | Comparable — framing is not the bottleneck
Memory allocation (mallocgc)             | 4.91   | 3.29   | REST allocates more per request
Garbage collection (gcDrain, scanobject) | 3.71   | 0.71   | 5× more GC in REST
Memory copy (memmove)                    | 1.32   | 2.20   | gRPC copies data between frame buffers
Go scheduler (stealWork, findRunnable)   | 5.11   | 8.31   | gRPC 63% more scheduling
Kernel sync (futex, psi)                 | 4.27   | 4.73   | Similar — both lock-heavy
I/O wait (blocked on network)            | ~77%   | ~79%   | Dominant cost in both — loopback I/O
REST GC cost — why 5× higher
HTTP/1.1 response parsing allocates many short-lived objects per request — header maps, string slices, bodyEOFSignal wrappers. The GC must trace and collect all of these. REST provides no equivalent to gRPC's buffer pool.
gRPC scheduler cost — why 63% higher
HTTP/2 maintains dedicated goroutines: loopyWriter, frame reader, keepalive pinger. All require scheduling. runtime.procyield appears only in gRPC (0.94%) — confirms mutex spin-waiting from stream table locking.
Flame graph analysis — call stack depth and structure
Generated using perf record at 1KHz + call graph capture + FlameGraph tools. Wider = more CPU time. Flame graph file sizes alone reflect complexity: TCP 149KB, REST 241KB, gRPC 332KB.
TCP — shallow, kernel-dominated
2–3 user-space frames before kernel. No visible futex calls anywhere in the graph — zero synchronization overhead. Most CPU time is the kernel doing actual network work.
REST — deep HTTP/1.1 pipeline with synchronization
6 user-space frames before reaching kernel: Client → Transport → persistConn → bufio → net.Conn → syscall. Each layer reads/writes HTTP header buffers — directly explains the 6× cache reference increase. Visible futex_wait call sites from connection pool locking per request.
gRPC — distributed CPU, most complex call graph
8+ user-space frames. CPU time spread across multiple distinct subsystems: HTTP/2 transport, protobuf serialization, stream management, flow control, keepalive pings — each visible as separate towers. Multiple futex call sites from stream table locking. tieredBufferPool.Get visible as a distinct allocation pattern contrasting with REST's scattered mallocgc.
Each futex call is a mutex lock acquisition or release — pure synchronization overhead with no application work. sched_yield = 0 for TCP proves zero lock contention. A thread yields only when spinning on a lock it cannot acquire — REST and gRPC do this regularly, confirming the futex calls represent real contention.
[concurrency_scaling_analysis]
Context switches vs concurrency — REST super-linear collapse
REST context switches grow super-linearly: 3.1× TCP at c=1 → 7.5× at c=500. HTTP/1.1's one-goroutine-per-connection model means 500 goroutines all compete for the same connection pool mutex. At c=50, gRPC (18,765) is nearly identical to TCP (16,657) — HTTP/2 multiplexing pays off here.
Cache misses vs concurrency — gRPC's hidden scaling cost
Concurrency | TCP (K) | REST (K) | gRPC (K) | gRPC/TCP
1           | 113     | 236      | 273      | 2.4×
10          | 254     | 443      | 492      | 1.9×
50          | 588     | 1,593    | 3,938    | 6.7×
100         | 1,095   | 6,479    | 15,758   | 14.4×
500         | 12,088  | 174,887  | 331,541  | 27.4×
gRPC wins on context switches at high concurrency, but loses badly on cache misses. At c=500, gRPC generates 27× more cache misses than TCP. HTTP/2 per-stream state (flow control windows, HPACK tables) for 500 concurrent streams far exceeds L1/L2 capacity — constant cache thrashing. This partially offsets gRPC's multiplexing benefits.
User vs system time evolution at low and high concurrency
Protocol | User (c=1) | Sys (c=1) | User (c=500) | Sys (c=500)
TCP      | 20%        | 80%       | 20%          | 80%
REST     | 42%        | 58%       | 65%          | 35%
gRPC     | 44%        | 56%       | 74%          | 26%
TCP's user/sys split is constant at all concurrency levels — always kernel-dominated. REST and gRPC shift toward user space at scale: the bottleneck becomes their own protocol logic and synchronization, not the kernel network stack.
[anomalies_and_root_causes]
Three anomalies were investigated at the kernel level. Each was traced to a confirmed root cause with controlled experiments to rule out alternative hypotheses.
Anomaly 1: Page fault step function at 32KB — Go allocator boundary
Observed: gRPC page fault count jumped discontinuously at exactly 32KB message size — a step function, not a gradual increase.
confirmed root cause — Go allocator slab/heap boundary
Go runtime constant: _MaxSmallSize = 32768 (runtime/malloc.go)
Below 32KB → thread-local size-class pool (mcache) → slab page already mapped → no fault.
At or above 32KB → large-object allocator (mheap) → fresh heap page requested from OS → guaranteed fault per allocation.
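This path change can be reproduced outside the benchmark harness with a small Linux-only sketch that counts minor page faults for allocations just below and just above the threshold (my own illustration, not the controlled experiments described below):

package main

import (
    "fmt"
    "runtime/debug"
    "syscall"
)

// minorFaults returns the process's minor page-fault count so far (Linux).
func minorFaults() int64 {
    var ru syscall.Rusage
    syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
    return ru.Minflt
}

// faultsFor allocates n buffers of the given size and reports the minor
// faults incurred: sizes below 32768 are served from mcache size classes,
// sizes at or above it go through the mheap large-object path.
func faultsFor(size, n int) int64 {
    keep := make([][]byte, 0, n) // keep buffers live so pages stay mapped
    before := minorFaults()
    for i := 0; i < n; i++ {
        keep = append(keep, make([]byte, size))
    }
    faults := minorFaults() - before
    _ = keep
    return faults
}

func main() {
    debug.SetGCPercent(-1) // rough stand-in for the GOGC=off control condition
    for _, size := range []int{31 << 10, 32 << 10, 33 << 10} {
        fmt.Printf("size=%6d B  minor faults=%d\n", size, faultsFor(size, 5000))
    }
}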
alternative hypothesis 1 ruled out — HTTP/2 frame splitting
gRPC's default write buffer is ~16KB. Initial hypothesis: messages crossing frame boundaries trigger additional memory operations. Tested with 8KB, 16KB, and 32KB frame size configurations. All three show identical step functions at exactly 32KB. Frame size has zero effect — HTTP/2 framing ruled out.
alternative hypothesis 2 ruled out — GC pressure
GOGC=off experiment: page fault sweep with garbage collection disabled. Absolute fault counts rise (~95K as GC no longer reclaims pages). The step function at 32KB persists in both conditions — confirming cause is allocator path change, not GC behavior.
Size | Normal GC | GOGC=off | Step preserved?
29KB | 3,548     | 95,433   | —
31KB | 3,327     | 95,364   | —
32KB | 16,214    | 96,414   | Yes
33KB | 17,469    | 116,227  | Yes
Anomaly 2: Page fault three-zone behavior across message sizes
Observed: REST vs gRPC page fault ordering is non-monotonic across message sizes — three behaviorally distinct zones.
Zone 1: <2KB gRPC overhead > REST
Zone 2: 2–31KB REST overhead > gRPC
Zone 3: ≥32KB gRPC overhead rises sharply
Size | REST PF | gRPC PF | REST/gRPC | Zone
512B | 3,573   | 3,615   | 1.0×      | Zone 1
1KB  | 3,360   | 3,363   | 1.0×      | Zone 1
2KB  | 4,688   | 3,674   | 1.3×      | Zone 2
4KB  | 7,178   | 3,706   | 1.9×      | Zone 2
8KB  | 8,513   | 5,362   | 1.6×      | Zone 2
16KB | 17,185  | 5,889   | 2.9×      | Zone 2
32KB | 23,736  | 19,326  | 1.2×      | Zone 3 onset
64KB | 33,125  | 25,035  | 1.3×      | Zone 3
Zone   | Range  | Root Cause
Zone 1 | <2KB   | gRPC's fixed HTTP/2 setup overhead (HPACK table initialization, stream state, flow control) generates comparable or more page faults than REST's simple handling at small sizes
Zone 2 | 2–31KB | REST's net/http transport allocates a fresh response buffer per request; gRPC's frame buffers are pre-allocated via tieredBufferPool — REST page faults grow rapidly while gRPC stays flat at ~4K–6K
Zone 3 | ≥32KB  | Go allocator boundary: both protocols cross _MaxSmallSize=32768. gRPC's per-stream state amplifies the large-object allocation cost, narrowing the gap that existed in Zone 2
continuous vs fragmented memory — the architectural root
HTTP/1.1 allocates the response body as one contiguous buffer. HTTP/2 pre-allocates fixed-size frame buffers via tieredBufferPool and reuses them. Below 32KB, gRPC's reuse strategy generates fewer page faults. At exactly 32KB, both protocols' message buffers cross Go's large-object threshold — the pre-allocation advantage disappears. Above 32KB, the large-object allocator dominates both, and gRPC's additional per-stream state means it hits the threshold more frequently.
Anomaly 3: REST latency exceeds gRPC at 4KB (Zone 1→2 crossover in latency)
At 4KB, c=1: REST (55.4µs) > gRPC (50.4µs) — reversing expected ordering. Reproduced across multiple independent runs. Confirmed by hardware counter analysis showing REST consuming 15% more cycles and 18% more cache references at 4KB than gRPC — the inverse of the 1KB relationship.
At 1KB (Zone 1 — gRPC more expensive)
REST: 42.3µs    gRPC: 46.3µs
REST PF: 3,360    gRPC PF: 3,363
gRPC has more fixed overhead; REST wins on latency.
At 4KB (Zone 2 — REST more expensive)
REST: 55.4µs    gRPC: 50.4µs
REST PF: 7,178    gRPC PF: 3,706
REST's per-request buffer allocation overtakes gRPC's fixed setup cost; gRPC wins on latency.
The original 5-point message size benchmark was extended to 13 sizes to confirm both crossovers directly in end-to-end latency and to show exactly where time is being spent as message size grows.
Full latency sweep — all three protocols, 13 message sizes, c=1
Size   | TCP (µs) | REST (µs) | gRPC (µs) | Zone     | Note
512B   | 19       | 37        | 44        | Zone 1   | gRPC 7µs slower than REST
1KB ←  | 18       | 39        | 42        | Zone 1→2 | gap closing — only 3µs
2KB    | 19       | 43        | 44        | Zone 2   | nearly tied
3KB    | 20       | 44        | 45        | Zone 2   | REST just edges ahead
4KB ⚠  | 21       | 53        | 47        | Zone 2   | REST clearly slower — anomaly confirmed
6KB    | 21       | 56        | 48        | Zone 2   | gap growing
8KB    | 21       | 63        | 51        | Zone 2   | REST 23% slower than gRPC
12KB   | 23       | 76        | 53        | Zone 2   | REST 43% slower
16KB   | 26       | 85        | 62        | Zone 2   | peak gRPC advantage
24KB   | 32       | 102       | 70        | Zone 2   | gRPC still 32µs faster
32KB ← | 32       | 121       | 171       | Zone 3   | sharp flip — gRPC 50µs SLOWER than REST
48KB   | 40       | 147       | 185       | Zone 3   | REST now faster than gRPC
64KB   | 48       | 180       | 199       | Zone 3   | REST 19µs faster than gRPC
two crossovers confirmed in latency
Crossover 1 (Zone 1→2, around 1–2KB): GRADUAL
Gap closes over 4 data points (512B→1KB→2KB→3KB→4KB). This matches the architectural cause — HTTP/2 fixed setup cost is gradually overtaken by REST's growing buffer allocation cost. Gradual = structural/architectural difference.
Crossover 2 (Zone 2→3, exactly 24KB→32KB): SHARP
Between 24KB and 32KB, gRPC goes from 32µs faster to 50µs slower — an 82µs swing in one step. This matches the page fault step function exactly. Sharp = hard boundary (Go allocator _MaxSmallSize). Two independent mechanisms, two distinct crossover signatures.
TCP monotonically fastest — never crosses
TCP latency grows from 19µs to 48µs across the entire 512B–64KB range. It never inverts with either protocol. The REST/gRPC inversions are entirely absent in TCP because TCP has no user-space buffer management overhead — its latency growth is driven purely by data transfer cost.
[optimization_semaphore_flow_control]
The tail amplification analysis identified TCP's 121× P99/P50 ratio as directly caused by the absence of application-level flow control. This was validated by implementing a minimal semaphore-based fix and measuring the effect.
[implementation] the fix — 5 lines of Go
root cause
At c=500 with 64KB messages, all 500 goroutines immediately call write on their TCP connections. The kernel send buffer fills rapidly. Late-arriving goroutines block inside the kernel, accumulating into a queue that grows unboundedly. Request latency is proportional to queue position — the last goroutine waits for all 499 others to drain.
// Original — all 500 goroutines blast simultaneously
go func() { sendRecv(conn, data) }() // × 500

// Optimized — semaphore limits max in-flight to 50
sem := make(chan struct{}, maxInflight) // maxInflight = 50
go func() {
    sem <- struct{}{} // acquire — blocks when 50 already in-flight
    sendRecv(conn, data)
    <-sem // release
}()
With maxInflight=50, at most 50 goroutines hold the semaphore at any moment. The remaining 450 block on the Go channel — outside the kernel, in user space — without touching the kernel send buffer. Queue discipline moves from kernel to application layer, making it measurable and controllable.
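An equivalent formulation using golang.org/x/sync/semaphore is shown below for reference; the channel-based version above is what was actually benchmarked (imports of context and golang.org/x/sync/semaphore assumed):

// Same queue discipline, same bound on in-flight requests: a weighted
// semaphore instead of a buffered channel.
sem := semaphore.NewWeighted(int64(maxInflight)) // maxInflight = 50
go func() {
    if err := sem.Acquire(context.Background(), 1); err != nil {
        return // only fails if the context is cancelled
    }
    defer sem.Release(1)
    sendRecv(conn, data)
}()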
The semaphore limits the sender side but provides no receiver-driven backpressure. The 50 active goroutines still write without acknowledgment from the server that it has consumed the data. gRPC's WINDOW_UPDATE mechanism is receiver-driven: the server explicitly signals how much more data it can accept. A full solution requires receiver-driven flow control analogous to HTTP/2 — left as future work.
page fault reduction — explained
Before: 500 goroutines simultaneously allocate 64KB response buffers → 500 large-object allocations → 1.5M page faults. After: at most 50 goroutines hold buffers at once → 47× fewer page faults. Confirms the original page fault explosion was caused by simultaneous large-object allocations, not the protocol itself.
[discussion]
The abstraction cost — quantified per layer
REST over HTTP/1.1 adds 6 user-space function call layers over raw TCP. This translates to 6× more memory accesses, 3× more context switches, and 36× more mutex lock operations for identical 64B payloads. The overhead is largely additive and fixed per request at small message sizes — REST is essentially a constant 2× overhead tax on TCP.
gRPC introduces greater per-request complexity (2.7× more CPU cycles, 8 user-space layers), but its architectural features prevent catastrophic degradation at scale. The additional overhead is a worthwhile investment when P99 tail latency matters.
key design insight
There is no universally best protocol. The ranking depends entirely on the workload regime. The decision point is not average latency — it is what happens at the tail under peak load. Choosing based on microbenchmarks (average latency at low concurrency) can produce catastrophically wrong decisions for production deployments.
Methodology validity and limitations
acknowledged limitations
Go runtime inseparability: GC and goroutine scheduler costs cannot be fully isolated from protocol costs without a bare-metal C++ baseline. Partially mitigated by function-level perf attribution per protocol.
Single-node only: Network latency would dominate µs-level differences in a distributed deployment. Relative ratios, tail latency patterns, and synchronization costs remain applicable.
Symmetric echo design: Both send-path and receive-path overhead are combined in each measurement. Asymmetric designs (small ACK response, or large server-initiated response) would isolate each direction independently — left as future work.
Client-side profiling only: All perf stat, flame graph, and syscall measurements were collected on the client process. Server-side profiling would reveal server-side protocol overhead separately — left as future work.
Go allocator threshold: Definitive confirmation of the 32KB hypothesis requires recompiling the Go runtime with a modified _MaxSmallSize. The two controlled experiments (frame-size sweep and GOGC=off) provide strong but indirect evidence.
Warmup: One warmup request per client. gRPC's HPACK table initialization and HTTP/2 stream 0 setup may not be fully amortized in a single warmup request.
[when_to_use]
Use TCP when:
You control both endpoints
Payloads are small and uniform (<2KB)
Concurrency is bounded (<50)
Minimum median latency is critical
Tail latency controlled via semaphore
→ HFT order entry, internal hot paths, game servers, custom control planes
Use REST when:
Simplicity and debuggability matter
Concurrency is low (<50)
Broad client compatibility needed
2× TCP overhead is acceptable
Message size stays below 2KB
→ Public APIs, low-traffic internal services, human-readable debugging
Use gRPC when:
P99 tail latency SLAs exist
Concurrency is high (>50)
Payloads vary in size (2KB–32KB sweet spot)
Streaming RPCs needed
Internal microservice mesh
→ Latency-sensitive distributed systems, ML inference serving, high-concurrency backends
one-line protocol selection rule
If your system has tail latency SLAs and sees burst concurrency above 50 clients: use gRPC.
If you own both endpoints and concurrency is controlled: TCP + semaphore flow control.
REST is the right default only when neither of the above applies.
systems design principle
Average latency is what your system does on a good day.
P99 latency is what your system does to your users on a normal day.
P99 at c=500 is what your system does during a traffic spike — the moment that matters most.
TCP's 121× P99/P50 ratio means the slowest 1% of requests are 121× worse than the median. For a service handling 10,000 req/s, that is 100 requests per second experiencing 1-second+ latency. Optimize for tail, not mean.