→ 9,915,000 latency measurements across 75 experiments (5 message sizes × 5 concurrency levels × 3 protocols). All use identical protobuf serialization — protocol overhead is the only variable.
→ TCP has lowest average latency (18µs baseline) but catastrophic P99 tail — 121× amplification (P99=1,049ms vs P50=8.9ms) at 64KB, c=500. Zero application-level flow control.
→ gRPC wins under stress: P99/P50 = 2× at all concurrency levels. TCP and REST exhibit a "knee" between c=10 and c=50 where tail ratio jumps to 24× and 29× respectively.
→ REST generates 914K context switches at c=500 — 7.5× more than TCP. Super-linear growth from HTTP/1.1's one-goroutine-per-connection model.
→ REST and gRPC make 36–41× more futex calls than TCP. Synchronization overhead — not protocol parsing — is the dominant cost.
→ Function-level profiling: REST spends 5× more CPU in GC than gRPC (3.71% vs 0.71%) due to short-lived header allocations. gRPC spends 63% more in goroutine scheduler (8.31% vs 5.11%) from HTTP/2's dedicated goroutines.
→ Three-zone page fault behavior discovered across message sizes — Zone 1 (<2KB): gRPC overhead dominates; Zone 2 (2–31KB): REST dynamic buffer growth dominates; Zone 3 (≥32KB): Go allocator _MaxSmallSize=32768 boundary causes step-function page fault spike confirmed by two independent controlled experiments.
→ Dense latency sweep (512B to 64KB, 13 sizes) confirms both crossovers in end-to-end latency: gradual transition at ~2KB (architectural) and sharp step at ~32KB (hard allocator boundary). REST/gRPC ordering is non-monotonic across message sizes.
121×: TCP tail amplification at 64KB, c=500. P99 = 1,049ms while P50 = 8.9ms. No flow control → 500 goroutines simultaneously saturate the kernel send buffer.
2×: gRPC P99/P50 ratio across all concurrency levels. HTTP/2 WINDOW_UPDATE frames enforce backpressure — prevents unbounded queue buildup regardless of client count.
914K: REST context switches at c=500 — 7.5× more than TCP. HTTP/1.1's per-connection goroutine model generates super-linear scheduling pressure under load.
36–41×: more futex (mutex lock) calls in REST/gRPC vs TCP. Synchronization overhead dominates — not parsing, not serialization, not framing.
32KB: Go runtime allocator boundary — page fault step function confirmed by two independent controlled experiments (frame-size sweep + GOGC=off). Structural to _MaxSmallSize; not addressable by GC tuning.
[context] why does protocol choice matter at the kernel level?
In distributed systems, the protocol stack sits on the critical path of every request. The choice between TCP, REST, and gRPC is not merely an API preference — it is a binding contract with the OS scheduler, the kernel network stack, and the memory allocator.
The gap in existing benchmarks: Most protocol comparisons report only end-to-end latency under a single load condition. They do not explain why protocols differ, nor how overhead evolves with message size and concurrency. Without this, protocol selection is intuition-driven rather than evidence-based. This work instruments at the hardware counter, flame graph, and syscall level to expose root causes rather than symptoms.
design principle — uniform serialization
All three protocols use protobuf serialization, including raw TCP (with a 4-byte length-prefix frame). This deliberately isolates protocol overhead as the only variable. Latency differences are attributable to framing, flow control, and multiplexing — not encoding cost. Function-level perf data confirms that protobuf deserialization appears in both REST and gRPC call graphs at equivalent depth, validating this design choice.
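For reference, here is a minimal sketch of that length-prefix framing, assuming a generated protobuf type behind the proto.Message interface; the function names are illustrative and this is not the benchmark's exact client code:

// framing.go: 4-byte big-endian length prefix around a protobuf body,
// as used by the raw-TCP variant (illustrative sketch only).
package framing

import (
    "encoding/binary"
    "io"
    "net"

    "google.golang.org/protobuf/proto"
)

// writeFrame marshals msg and prepends its length as a 4-byte big-endian prefix.
func writeFrame(conn net.Conn, msg proto.Message) error {
    body, err := proto.Marshal(msg)
    if err != nil {
        return err
    }
    var prefix [4]byte
    binary.BigEndian.PutUint32(prefix[:], uint32(len(body)))
    if _, err := conn.Write(prefix[:]); err != nil {
        return err
    }
    _, err = conn.Write(body)
    return err
}

// readFrame reads one length-prefixed frame and unmarshals it into msg.
func readFrame(conn net.Conn, msg proto.Message) error {
    var prefix [4]byte
    if _, err := io.ReadFull(conn, prefix[:]); err != nil {
        return err
    }
    body := make([]byte, binary.BigEndian.Uint32(prefix[:]))
    if _, err := io.ReadFull(conn, body); err != nil {
        return err
    }
    return proto.Unmarshal(body, msg)
}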
Inspired by Raghavan et al.'s network stack overhead analysis, which showed that even loopback communication incurs measurable kernel crossing costs. Our contribution extends this with per-protocol hardware counter comparison and controlled anomaly investigation.
[experimental_setup]
┌─────────────────────────────────────────────────────────────────────┐
│ Platform : Ubuntu 24.04 LTS | Kernel : 6.8.0-101-generic │
│ CPU : Hybrid P/E-core | P-cores: 4500 MHz max │
│ Language : Go 1.25 | Serialization: protobuf (all 3) │
└─────────────────────────────────────────────────────────────────────┘
Message Sizes → 64B 256B 1KB 4KB 64KB (+ 13-point dense sweep)
Concurrency → 1 10 50 100 500 (concurrent clients)
Protocols → TCP (raw) REST (HTTP/1.1) gRPC (HTTP/2)
75 configurations × (concurrency × 1,000 requests) = 9,915,000 samples
CPU Pinning → Server pinned to CPU1 (P-core, 4500MHz) via taskset
Client allowed on CPUs 0,2,3,4,5,6,7 (P-cores only)
E-cores (2500MHz) excluded entirely — no mixed-core noise
Priority → sudo nice -n -20 (highest OS scheduling priority)
Isolation → One protocol benchmarked at a time — others stopped
Warmup → 1 warmup request per client before measurement begins
HW Counters → kernel.perf_event_paranoid = 1
[tools] profiling stack
Tool            | Version    | Purpose
perf stat       | 6.8.12     | Hardware counters: CPU cycles, instructions, IPC, cache refs/misses, branch misses, context switches, page faults, user/sys time split
perf record     | 6.8.12     | CPU time sampling at 1KHz with call graph capture for flame graph generation and function-level breakdown
perf report     | 6.8.12     | Function-level CPU time attribution — attached to running client processes (120K–137K samples at 4KB, c=1)
perf trace      | 6.8.12     | Syscall frequency counting: futex, read, write, epoll_pwait, nanosleep, sched_yield
FlameGraph      | Gregg v1.0 | stackcollapse-perf.pl + flamegraph.pl — visual CPU time attribution per protocol
Go              | 1.25.0     | All protocol implementations: net (TCP), net/http (REST), google.golang.org/grpc (gRPC)
protoc          | 25.1       | Protocol Buffer code generation — identical Payload message used across all three
taskset         | 2.39.3     | CPU core pinning — server to CPU1, clients to CPUs 0,2,3,4,5,6,7
Python / pandas | 3.12       | 9.9M sample aggregation, P50/P99/P999 computation, zone analysis plots
why Go over C++?
Go provides production-quality libraries for all three protocols under one consistent runtime — same GC, same scheduler, same allocator. Protocol differences are not contaminated by implementation quality gaps. Go's M:N goroutine model keeps client-side scheduling noise lower than 500 raw pthreads in C++ would at high concurrency.
Tradeoff acknowledged: Go runtime overhead (GC, goroutine scheduler) is inseparable from protocol overhead without a bare-metal baseline. A C++ implementation is left as future work. This limitation is partially addressed by function-level perf data which directly attributes GC and scheduler costs per protocol.
[implementation] protocol details
All three implement an identical echo service. Client sends a protobuf Payload{id, data[]byte, timestamp}. Server echoes with updated timestamp. Same logical work — overhead differences are purely protocol mechanics.
Protocol | Framing                                          | Connection Model                                      | Flow Control
TCP      | 4-byte big-endian length prefix + protobuf body  | 1 persistent connection per client                    | None (kernel TCP only)
REST     | HTTP/1.1 headers + binary protobuf body to /echo | net/http persistent connections with connection pool | None
gRPC     | HTTP/2 binary frames + HPACK header compression  | Stream multiplexing over shared connection(s)         | WINDOW_UPDATE per stream
serialization control
By using binary protobuf on REST instead of JSON, serialization cost is identical across all three. This is confirmed in function-level perf data: protobuf deserialization is visible in both call graphs at equivalent depth.
[protocol_profiles]
TCP (raw)
4-byte length-prefix framing
No application-level flow control
1 persistent connection per client
Zero protocol state management
80% execution in kernel space
0.13 futex calls/request
→ Fastest average. Worst tail. Best CPU efficiency. Kernel-dominated.
→ gRPC (HTTP/2): Best tail latency. Best under stress. Highest baseline cost. Cache-hungry at scale.
[baseline_results]
Page fault / latency zone summary — REST vs gRPC ordering is non-monotonic:
Zone 1: <2KB gRPC overhead > REST
Zone 2: 2–31KB gRPC faster (REST bufio growth)
Zone 3: ≥32KB Go allocator boundary hit
Each zone has a distinct confirmed root cause — see anomalies section for kernel-level investigation.
Baseline latency vs message size (concurrency = 1)
Message Size | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP | gRPC/TCP
64B          | 18.8     | 35.8      | 47.9      | 1.9×     | 2.5×
256B         | 17.3     | 34.2      | 45.0      | 2.0×     | 2.6×
1KB          | 18.5     | 42.3      | 46.3      | 2.3×     | 2.5×
4KB ⚠        | 21.2     | 55.4      | 50.4      | 2.6×     | 2.4×
64KB         | 56.6     | 199.1     | 216.3     | 3.5×     | 3.8×
TCP consistently 2–3.8× faster. At 4KB, REST (55.4µs) exceeds gRPC (50.4µs) — reversing expected ordering. This is the Zone 1→2 crossover in latency, confirmed by dense sweep (see dense latency section). See anomaly section for hardware counter confirmation.
64KB gap widens to 3.8×
At 64KB, payloads exceed L1 cache capacity — cache evictions and memory bandwidth become the dominant cost. TCP, which does no user-space buffer manipulation, is insulated from this effect. REST and gRPC both perform extensive user-space buffer reads and copies, magnifying the cache pressure.
Latency vs concurrency (64B messages) — linear scaling at small messages
Concurrency | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP
1           | 18.8     | 35.8      | 47.9      | 1.9×
10          | 94.7     | 145.4     | 194.6     | 1.5×
50          | 486.7    | 777.7     | 889.2     | 1.6×
100         | 936.8    | 1,631.0   | 1,848.3   | 1.7×
500         | 4,996.8  | 9,023.5   | 10,830.5  | 1.8×
At 64B all three scale approximately linearly. The REST/TCP ratio stays stable at 1.5–1.9× — indicating REST overhead is largely fixed per request at small message sizes. This is average latency; tail behavior tells a completely different story — see next section.
Throughput analysis
At concurrency=1 with 64B messages, TCP achieves approximately 49,000 req/s, versus 24,000 for REST and 21,000 for gRPC. At higher concurrency, all three protocols converge — the bottleneck shifts from protocol overhead to server CPU capacity.
throughput vs tail latency tradeoff
The semaphore-based TCP optimization costs only 7% throughput (10,466 → 9,703 req/s) while reducing tail ratio from 121× to 18× and average latency by 8.7×. Throughput and tail latency are decoupled by queue discipline, not by fundamental protocol limits.
[tail_latency_and_amplification]
protocol inversion effect — the most important result
High concurrency, large messages (P99 tail latency): gRPC (186ms) << TCP (1,049ms) < REST (1,202ms)
The protocol ranked best for average-case latency has the worst P99 tail latency at scale. Performance ranking completely inverts.
P99/P50 tail amplification ratio — full table at 64KB
The "knee" occurs between c=10 and c=50 for TCP and REST. gRPC remains flat throughout.
Protocol | Conc. | P50 (µs) | P99 (µs)  | P99/P50
TCP      | 1     | 41       | 128       | 3.1×
TCP      | 10    | 479      | 897       | 1.9×
TCP      | 50    | 1,753    | 42,010    | 24×
TCP      | 100   | 3,489    | 140,215   | 40×
TCP      | 500   | 8,915    | 1,083,462 | 121×
REST     | 1     | 159      | 395       | 2.5×
REST     | 10    | 1,487    | 12,105    | 8.1×
REST     | 50    | 5,877    | 167,801   | 29×
REST     | 100   | 8,108    | 391,420   | 48×
REST     | 500   | 16,887   | 1,177,731 | 70×
gRPC     | 1     | 195      | 422       | 2.2×
gRPC     | 10    | 1,775    | 3,066     | 1.7×
gRPC     | 50    | 10,424   | 18,781    | 1.8×
gRPC     | 100   | 21,276   | 33,193    | 1.6×
gRPC     | 500   | 95,729   | 186,705   | 2.0×
mechanism — why TCP and REST collapse
At c=500 with 64KB messages, all 500 goroutines simultaneously compete for kernel send buffer space. Without application-level backpressure, late-arriving requests encounter a fully saturated send buffer and wait in the kernel queue. gRPC's HTTP/2 WINDOW_UPDATE frames explicitly limit in-flight unacknowledged data per stream — senders block at the application layer before hitting the kernel, converting unbounded kernel queuing into bounded, controlled waiting.
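For concreteness, the windows that implement this backpressure are exposed as dial options in grpc-go; a minimal sketch, with an address and window sizes that are illustrative rather than the benchmark's actual configuration:

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Per-stream and per-connection flow-control windows bound how much
    // unacknowledged data may be in flight before the sender blocks at the
    // application layer (values are illustrative, not the benchmark's).
    conn, err := grpc.Dial("localhost:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithInitialWindowSize(64<<10),    // per-stream window
        grpc.WithInitialConnWindowSize(1<<20), // per-connection window
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    // ... issue RPCs over conn ...
}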
[kernel_profiling]
Hardware performance counters — perf stat (64B, c=1, 10K requests)
Counter          | TCP         | REST        | gRPC        | REST/TCP | gRPC/TCP
CPU Cycles       | 567M        | 1,350M      | 1,539M      | 2.4×     | 2.7×
Instructions     | 751M        | 1,549M      | 1,674M      | 2.1×     | 2.2×
IPC              | 1.32        | 1.14        | 1.08        | —        | —
Cache References | 2.4M        | 14.5M       | 18.6M       | 6.0×     | 7.7×
Cache Misses     | 222K        | 456K        | 548K        | 2.1×     | 2.5×
Branch Misses    | 2.2M        | 3.8M        | 4.6M        | 1.7×     | 2.1×
Context Switches | 14,708      | 42,629      | 45,193      | 2.9×     | 3.1×
User Time        | 33ms (20%)  | 187ms (46%) | 245ms (53%) | —        | —
Sys Time         | 131ms (80%) | 215ms (54%) | 210ms (47%) | —        | —
user vs kernel split — where work happens
TCP: 80% sys — kernel-dominated at all concurrency levels. This ratio is constant regardless of load.
REST: 54% sys at c=1 → 35% sys at c=500 — protocol logic and connection management take over at scale.
gRPC: 47% sys at c=1 → 26% sys at c=500 — HTTP/2 state machine, stream coordination, and GC dominate at scale.
At scale, the bottleneck in REST and gRPC is not the kernel network stack — it is their own user-space abstractions.
Hardware counters at intermediate message sizes — explains the 4KB REST/gRPC crossover
To understand why REST latency exceeds gRPC at 4KB, hardware counters were collected at 1KB, 4KB, and 16KB with c=1 and 10,000 requests.
Metric           | 1KB    | 4KB    | 16KB
REST cycles      | 1,502M | 2,322M | 4,076M
gRPC cycles      | 1,714M | 2,020M | 2,898M
REST cache-refs  | 20.2M  | 40.5M  | 106.2M
gRPC cache-refs  | 24.7M  | 34.2M  | 66.7M
REST cache-miss% | 7.7%   | 8.3%   | 5.9%
gRPC cache-miss% | 6.4%   | 6.2%   | 4.9%
REST page-faults | 3,560  | 6,136  | 15,387
gRPC page-faults | 2,974  | 3,868  | 6,251
confirmed root cause — net/http per-request buffer allocation
At 1KB, gRPC consumes 14% more cycles than REST due to HTTP/2 connection setup — HPACK tables, stream state, flow control structures.
Above 2KB this inverts: REST's net/http transport allocates a fresh response buffer per request sized to each payload, increasing heap pressure and reducing cache locality compared to gRPC's pre-allocated fixed-size frame buffers via tieredBufferPool. At 4KB, REST consumes 15% more cycles and 18% more cache references. At 16KB the gap reaches 40% more cycles and 59% more cache references.
gRPC's frame buffers are pre-allocated and reused across requests — no dynamic resizing per request.
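The two allocation patterns can be sketched side by side. gRPC's tieredBufferPool is internal to grpc-go, so sync.Pool stands in here purely to illustrate the reuse pattern, and handleFresh loosely mimics the per-request allocation behavior described above:

package bufreuse

import "sync"

// Pooled pattern: fixed-size buffers are reused across requests, so the hot
// path rarely asks the allocator (and hence the OS) for new heap pages.
var framePool = sync.Pool{
    New: func() any { return make([]byte, 32<<10) },
}

func handleWithPool(payload []byte) { // assumes payload fits in 32 KB
    buf := framePool.Get().([]byte)
    defer framePool.Put(buf)
    copy(buf, payload)
    // ... process buf[:len(payload)] ...
}

// Per-request pattern: a fresh, payload-sized allocation on every request
// keeps the allocator busy and creates short-lived garbage for the GC.
func handleFresh(payload []byte) {
    buf := make([]byte, len(payload))
    copy(buf, payload)
    // ... process buf ...
}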
Function-level CPU breakdown — perf report (4KB, c=1)
Attached perf record to running client processes (120K–137K samples). Both protocols spend ~77–79% blocked waiting on network I/O. Of active CPU samples:
Category                                 | REST % | gRPC % | Implication
Protocol layer (HTTP framing/parsing)    | 4.02   | 3.56   | Comparable — framing is not the bottleneck
Memory allocation (mallocgc)             | 4.91   | 3.29   | REST allocates more per request
Garbage collection (gcDrain, scanobject) | 3.71   | 0.71   | 5× more GC in REST
Memory copy (memmove)                    | 1.32   | 2.20   | gRPC copies data between frame buffers
Go scheduler (stealWork, findRunnable)   | 5.11   | 8.31   | gRPC 63% more scheduling
Kernel sync (futex, psi)                 | 4.27   | 4.73   | Similar — both lock-heavy
I/O wait (blocked on network)            | ~77%   | ~79%   | Dominant cost in both — loopback I/O
REST GC cost — why 5× higher
HTTP/1.1 response parsing allocates many short-lived objects per request — header maps, string slices, bodyEOFSignal wrappers. The GC must trace and collect all of these. REST provides no equivalent to gRPC's buffer pool.
gRPC scheduler cost — why 63% higher
HTTP/2 maintains dedicated goroutines: loopyWriter, frame reader, keepalive pinger. All require scheduling. runtime.procyield appears only in gRPC (0.94%) — confirms mutex spin-waiting from stream table locking.
Flame graph analysis — call stack depth and structure
Generated using perf record at 1KHz + call graph capture + FlameGraph tools. Wider = more CPU time. Flame graph file sizes alone reflect complexity: TCP 149KB, REST 241KB, gRPC 332KB.
TCP — shallow, kernel-dominated
2–3 user-space frames before kernel. No visible futex calls anywhere in the graph — zero synchronization overhead. Most CPU time is the kernel doing actual network work.
REST — deep HTTP/1.1 pipeline with synchronization
6 user-space frames before reaching kernel: Client → Transport → persistConn → bufio → net.Conn → syscall. Each layer reads/writes HTTP header buffers — directly explains the 6× cache reference increase. Visible futex_wait call sites from connection pool locking per request.
gRPC — distributed CPU, most complex call graph
8+ user-space frames. CPU time spread across multiple distinct subsystems: HTTP/2 transport, protobuf serialization, stream management, flow control, keepalive pings — each visible as separate towers. Multiple futex call sites from stream table locking. tieredBufferPool.Get visible as a distinct allocation pattern contrasting with REST's scattered mallocgc.
Each futex call is a mutex lock acquisition or release — pure synchronization overhead with no application work. sched_yield = 0 for TCP proves zero lock contention. A thread yields only when spinning on a lock it cannot acquire — REST and gRPC do this regularly, confirming the futex calls represent real contention.
[concurrency_scaling_analysis]
Context switches vs concurrency — REST super-linear collapse
REST context switches grow super-linearly: 3.1× TCP at c=1 → 7.5× at c=500. HTTP/1.1's one-goroutine-per-connection model means 500 goroutines all compete for the same connection pool mutex. At c=50, gRPC (18,765) is nearly identical to TCP (16,657) — HTTP/2 multiplexing pays off here.
Cache misses vs concurrency — gRPC's hidden scaling cost
Concurrency | TCP (K) | REST (K) | gRPC (K) | gRPC/TCP
1           | 113     | 236      | 273      | 2.4×
10          | 254     | 443      | 492      | 1.9×
50          | 588     | 1,593    | 3,938    | 6.7×
100         | 1,095   | 6,479    | 15,758   | 14.4×
500         | 12,088  | 174,887  | 331,541  | 27.4×
gRPC wins on context switches at high concurrency, but loses badly on cache misses. At c=500, gRPC generates 27× more cache misses than TCP. HTTP/2 per-stream state (flow control windows, HPACK tables) for 500 concurrent streams far exceeds L1/L2 capacity — constant cache thrashing. This partially offsets gRPC's multiplexing benefits.
User vs system time evolution at low and high concurrency
Protocol | User (c=1) | Sys (c=1) | User (c=500) | Sys (c=500)
TCP      | 20%        | 80%       | 20%          | 80%
REST     | 42%        | 58%       | 65%          | 35%
gRPC     | 44%        | 56%       | 74%          | 26%
TCP's user/sys split is constant at all concurrency levels — always kernel-dominated. REST and gRPC shift toward user space at scale: the bottleneck becomes their own protocol logic and synchronization, not the kernel network stack.
[anomalies_and_root_causes]
Three anomalies were investigated at the kernel level. Each was traced to a confirmed root cause with controlled experiments to rule out alternative hypotheses.
Anomaly 1: Page fault step function at 32KB — Go allocator boundary
Observed: gRPC page fault count jumped discontinuously at exactly 32KB message size — a step function, not a gradual increase.
confirmed root cause — Go allocator slab/heap boundary
Go runtime constant: _MaxSmallSize = 32768 (runtime/malloc.go)
Below 32KB → thread-local size-class pool (mcache) → slab page already mapped → no fault.
At or above 32KB → large-object allocator (mheap) → fresh heap page requested from OS → guaranteed fault per allocation.
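This path change can be reproduced outside the benchmark harness with a small Linux-only sketch that counts minor page faults for allocations just below and just above the threshold (my own illustration, not the controlled experiments described below):

package main

import (
    "fmt"
    "runtime/debug"
    "syscall"
)

// minorFaults returns the process's minor page-fault count so far (Linux).
func minorFaults() int64 {
    var ru syscall.Rusage
    syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
    return ru.Minflt
}

// faultsFor allocates n buffers of the given size and reports the minor
// faults incurred: sizes below 32768 are served from mcache size classes,
// sizes at or above it go through the mheap large-object path.
func faultsFor(size, n int) int64 {
    keep := make([][]byte, 0, n) // keep buffers live so pages stay mapped
    before := minorFaults()
    for i := 0; i < n; i++ {
        keep = append(keep, make([]byte, size))
    }
    faults := minorFaults() - before
    _ = keep
    return faults
}

func main() {
    debug.SetGCPercent(-1) // rough stand-in for the GOGC=off control condition
    for _, size := range []int{31 << 10, 32 << 10, 33 << 10} {
        fmt.Printf("size=%6d B  minor faults=%d\n", size, faultsFor(size, 5000))
    }
}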
alternative hypothesis 1 ruled out — HTTP/2 frame splitting
gRPC's default write buffer is ~16KB. Initial hypothesis: messages crossing frame boundaries trigger additional memory operations. Tested with 8KB, 16KB, and 32KB frame size configurations. All three show identical step functions at exactly 32KB. Frame size has zero effect — HTTP/2 framing ruled out.
alternative hypothesis 2 ruled out — GC pressure
GOGC=off experiment: page fault sweep with garbage collection disabled. Absolute fault counts rise (~95K as GC no longer reclaims pages). The step function at 32KB persists in both conditions — confirming cause is allocator path change, not GC behavior.
Size | Normal GC | GOGC=off | Step preserved?
29KB | 3,548     | 95,433   | —
31KB | 3,327     | 95,364   | —
32KB | 16,214    | 96,414   | Yes
33KB | 17,469    | 116,227  | Yes
Anomaly 2: Page fault three-zone behavior across message sizes
Observed: REST vs gRPC page fault ordering is non-monotonic across message sizes — three behaviorally distinct zones.
Zone 1: <2KB gRPC overhead > REST
Zone 2: 2–31KB REST overhead > gRPC
Zone 3: ≥32KB gRPC overhead rises sharply
Size | REST PF | gRPC PF | REST/gRPC | Zone
512B | 3,573   | 3,615   | 1.0×      | Zone 1
1KB  | 3,360   | 3,363   | 1.0×      | Zone 1
2KB  | 4,688   | 3,674   | 1.3×      | Zone 2
4KB  | 7,178   | 3,706   | 1.9×      | Zone 2
8KB  | 8,513   | 5,362   | 1.6×      | Zone 2
16KB | 17,185  | 5,889   | 2.9×      | Zone 2
32KB | 23,736  | 19,326  | 1.2×      | Zone 3 onset
64KB | 33,125  | 25,035  | 1.3×      | Zone 3
Zone   | Range  | Root Cause
Zone 1 | <2KB   | gRPC's fixed HTTP/2 setup overhead (HPACK table initialization, stream state, flow control) generates comparable or more page faults than REST's simple handling at small sizes
Zone 2 | 2–31KB | REST's net/http transport allocates a fresh response buffer per request; gRPC's frame buffers are pre-allocated via tieredBufferPool — REST page faults grow rapidly while gRPC stays flat at ~4K–6K
Zone 3 | ≥32KB  | Go allocator boundary: both protocols cross _MaxSmallSize=32768. gRPC's per-stream state amplifies the large-object allocation cost, narrowing the gap that existed in Zone 2
continuous vs fragmented memory — the architectural root
HTTP/1.1 allocates the response body as one contiguous buffer. HTTP/2 pre-allocates fixed-size frame buffers via tieredBufferPool and reuses them. Below 32KB, gRPC's reuse strategy generates fewer page faults. At exactly 32KB, both protocols' message buffers cross Go's large-object threshold — the pre-allocation advantage disappears. Above 32KB, the large-object allocator dominates both, and gRPC's additional per-stream state means it hits the threshold more frequently.
Anomaly 3: REST latency exceeds gRPC at 4KB (Zone 1→2 crossover in latency)
At 4KB, c=1: REST (55.4µs) > gRPC (50.4µs) — reversing expected ordering. Reproduced across multiple independent runs. Confirmed by hardware counter analysis showing REST consuming 15% more cycles and 18% more cache references at 4KB than gRPC — the inverse of the 1KB relationship.
At 1KB (Zone 1 — gRPC more expensive)
REST: 42.3µs    gRPC: 46.3µs
REST PF: 3,360    gRPC PF: 3,363
gRPC has more fixed overhead; REST wins on latency.
At 4KB (Zone 2 — REST more expensive)
REST: 55.4µs    gRPC: 50.4µs
REST PF: 7,178    gRPC PF: 3,706
REST's per-request buffer allocation overtakes gRPC's fixed setup cost; gRPC wins on latency.
The original 5-point message size benchmark was extended to 13 sizes to confirm both crossovers directly in end-to-end latency and to show exactly where time is being spent as message size grows.
Full latency sweep — all three protocols, 13 message sizes, c=1
Size   | TCP (µs) | REST (µs) | gRPC (µs) | Zone     | Note
512B   | 19       | 37        | 44        | Zone 1   | gRPC 7µs slower than REST
1KB ←  | 18       | 39        | 42        | Zone 1→2 | gap closing — only 3µs
2KB    | 19       | 43        | 44        | Zone 2   | nearly tied
3KB    | 20       | 44        | 45        | Zone 2   | REST just edges ahead
4KB ⚠  | 21       | 53        | 47        | Zone 2   | REST clearly slower — anomaly confirmed
6KB    | 21       | 56        | 48        | Zone 2   | gap growing
8KB    | 21       | 63        | 51        | Zone 2   | REST 23% slower than gRPC
12KB   | 23       | 76        | 53        | Zone 2   | REST 43% slower
16KB   | 26       | 85        | 62        | Zone 2   | peak gRPC advantage
24KB   | 32       | 102       | 70        | Zone 2   | gRPC still 32µs faster
32KB ← | 32       | 121       | 171       | Zone 3   | sharp flip — gRPC 50µs SLOWER than REST
48KB   | 40       | 147       | 185       | Zone 3   | REST now faster than gRPC
64KB   | 48       | 180       | 199       | Zone 3   | REST 19µs faster than gRPC
two crossovers confirmed in latency
Crossover 1 (Zone 1→2, around 1–2KB): GRADUAL
Gap closes over 4 data points (512B→1KB→2KB→3KB→4KB). This matches the architectural cause — HTTP/2 fixed setup cost is gradually overtaken by REST's growing buffer allocation cost. Gradual = structural/architectural difference.
Crossover 2 (Zone 2→3, exactly 24KB→32KB): SHARP
Between 24KB and 32KB, gRPC goes from 32µs faster to 50µs slower — an 82µs swing in one step. This matches the page fault step function exactly. Sharp = hard boundary (Go allocator _MaxSmallSize). Two independent mechanisms, two distinct crossover signatures.
TCP monotonically fastest — never crosses
TCP latency grows from 19µs to 48µs across the entire 512B–64KB range. It never inverts with either protocol. The REST/gRPC inversions are entirely absent in TCP because TCP has no user-space buffer management overhead — its latency growth is driven purely by data transfer cost.
[optimization_semaphore_flow_control]
The tail amplification analysis identified TCP's 121× P99/P50 ratio as directly caused by the absence of application-level flow control. This was validated by implementing a minimal semaphore-based fix and measuring the effect.
[implementation] the fix — 5 lines of Go
root cause
At c=500 with 64KB messages, all 500 goroutines immediately call write on their TCP connections. The kernel send buffer fills rapidly. Late-arriving goroutines block inside the kernel, accumulating into a queue that grows unboundedly. Request latency is proportional to queue position — the last goroutine waits for all 499 others to drain.
// Original — all 500 goroutines blast simultaneously
go func() { sendRecv(conn, data) }() // × 500

// Optimized — semaphore limits max in-flight to 50
sem := make(chan struct{}, maxInflight) // maxInflight = 50
go func() {
    sem <- struct{}{} // acquire — blocks when 50 already in-flight
    sendRecv(conn, data)
    <-sem // release
}()
With maxInflight=50, at most 50 goroutines hold the semaphore at any moment. The remaining 450 block on the Go channel — outside the kernel, in user space — without touching the kernel send buffer. Queue discipline moves from kernel to application layer, making it measurable and controllable.
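An equivalent formulation using golang.org/x/sync/semaphore is shown below for reference; the channel-based version above is what was actually benchmarked (imports of context and golang.org/x/sync/semaphore assumed):

// Same queue discipline, same bound on in-flight requests: a weighted
// semaphore instead of a buffered channel.
sem := semaphore.NewWeighted(int64(maxInflight)) // maxInflight = 50
go func() {
    if err := sem.Acquire(context.Background(), 1); err != nil {
        return // only fails if the context is cancelled
    }
    defer sem.Release(1)
    sendRecv(conn, data)
}()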
The semaphore limits the sender side but provides no receiver-driven backpressure. The 50 active goroutines still write without acknowledgment from the server that it has consumed the data. gRPC's WINDOW_UPDATE mechanism is receiver-driven: the server explicitly signals how much more data it can accept. A full solution requires receiver-driven flow control analogous to HTTP/2 — left as future work.
page fault reduction — explained
Before: 500 goroutines simultaneously allocate 64KB response buffers → 500 large-object allocations → 1.5M page faults. After: at most 50 goroutines hold buffers at once → 47× fewer page faults. Confirms the original page fault explosion was caused by simultaneous large-object allocations, not the protocol itself.
[discussion]
The abstraction cost — quantified per layer
REST over HTTP/1.1 adds 6 user-space function call layers over raw TCP. This translates to 6× more memory accesses, 3× more context switches, and 36× more mutex lock operations for identical 64B payloads. The overhead is largely additive and fixed per request at small message sizes — REST is essentially a constant 2× overhead tax on TCP.
gRPC introduces greater per-request complexity (2.7× more CPU cycles, 8 user-space layers), but its architectural features prevent catastrophic degradation at scale. The additional overhead is a worthwhile investment when P99 tail latency matters.
key design insight
There is no universally best protocol. The ranking depends entirely on the workload regime. The decision point is not average latency — it is what happens at the tail under peak load. Choosing based on microbenchmarks (average latency at low concurrency) can produce catastrophically wrong decisions for production deployments.
Methodology validity and limitations
acknowledged limitations
Go runtime inseparability: GC and goroutine scheduler costs cannot be fully isolated from protocol costs without a bare-metal C++ baseline. Partially mitigated by function-level perf attribution per protocol.
Single-node only: Network latency would dominate µs-level differences in a distributed deployment. Relative ratios, tail latency patterns, and synchronization costs remain applicable.
Symmetric echo design: Both send-path and receive-path overhead are combined in each measurement. Asymmetric designs (small ACK response, or large server-initiated response) would isolate each direction independently — left as future work.
Client-side profiling only: All perf stat, flame graph, and syscall measurements were collected on the client process. Server-side profiling would reveal server-side protocol overhead separately — left as future work.
Go allocator threshold: Definitive confirmation of the 32KB hypothesis requires recompiling the Go runtime with a modified _MaxSmallSize. The two controlled experiments (frame-size sweep and GOGC=off) provide strong but indirect evidence.
Warmup: One warmup request per client. gRPC's HPACK table initialization and HTTP/2 stream 0 setup may not be fully amortized in a single warmup request.
[when_to_use]
Use TCP when:
You control both endpoints
Payloads are small and uniform (<2KB)
Concurrency is bounded (<50)
Minimum median latency is critical
Tail latency controlled via semaphore
→ HFT order entry, internal hot paths, game servers, custom control planes
Use REST when:
Simplicity and debuggability matter
Concurrency is low (<50)
Broad client compatibility needed
2× TCP overhead is acceptable
Message size stays below 2KB
→ Public APIs, low-traffic internal services, human-readable debugging
Use gRPC when:
P99 tail latency SLAs exist
Concurrency is high (>50)
Payloads vary in size (2KB–32KB sweet spot)
Streaming RPCs needed
Internal microservice mesh
→ Latency-sensitive distributed systems, ML inference serving, high-concurrency backends
one-line protocol selection rule
If your system has tail latency SLAs and sees burst concurrency above 50 clients: use gRPC.
If you own both endpoints and concurrency is controlled: TCP + semaphore flow control.
REST is the right default only when neither of the above applies.
systems design principle
Average latency is what your system does on a good day.
P99 latency is what your system does to your users on a normal day.
P99 at c=500 is what your system does during a traffic spike — the moment that matters most.
TCP's 121× P99/P50 ratio means the slowest 1% of requests are 121× worse than the median. For a service handling 10,000 req/s, that is 100 requests per second experiencing 1-second+ latency. Optimize for tail, not mean.