Protocol Performance Analysis

TCP vs REST vs gRPC — kernel-level characterization on a single-node Linux system · IIIT Bangalore · SSP Course Project
SYSTEMS_INFRA KERNEL_PROFILING HW_COUNTERS Go perf_6.8 FlameGraph protobuf taskset pandas LINUX_KERNEL_6.8.0 Ubuntu_24.04 9.9M_SAMPLES 75_EXPERIMENTS
[tldr]
121× TCP tail amplification at 64KB, c=500. P99 = 1,049ms while P50 = 8.9ms. No flow control → 500 goroutines simultaneously saturate the kernel send buffer.
≤2.2× gRPC P99/P50 ratio across all concurrency levels. HTTP/2 WINDOW_UPDATE frames enforce backpressure — prevents unbounded queue buildup regardless of client count.
914K REST context switches at c=500 — 7.5× more than TCP. HTTP/1.1's per-connection goroutine model generates super-linear scheduling pressure under load.
36–41× More futex (mutex lock) calls in REST/gRPC vs TCP. Synchronization overhead dominates — not parsing, not serialization, not framing.
32KB Go runtime allocator boundary — page fault step function confirmed by two independent controlled experiments (frame-size sweep + GOGC=off). Structural to _MaxSmallSize; not addressable by GC tuning.
8.7× Average latency improvement from semaphore-based TCP flow control. Simple channel semaphore limits in-flight goroutines; converts unbounded kernel queuing into bounded, measurable waiting.
[context] why does protocol choice matter at the kernel level?

In distributed systems, the protocol stack sits on the critical path of every request. The choice between TCP, REST, and gRPC is not merely an API preference — it is a binding contract with the OS scheduler, the kernel network stack, and the memory allocator.

The gap in existing benchmarks: Most protocol comparisons report only end-to-end latency under a single load condition. They do not explain why protocols differ, nor how overhead evolves with message size and concurrency. Without this, protocol selection is intuition-driven rather than evidence-based. This work instruments at the hardware counter, flame graph, and syscall level to expose root causes rather than symptoms.

design principle — uniform serialization
All three protocols use protobuf serialization, including raw TCP (with a 4-byte length-prefix frame). This deliberately isolates protocol overhead as the only variable. Latency differences are attributable to framing, flow control, and multiplexing — not encoding cost. Function-level perf data confirms that protobuf deserialization appears in both REST and gRPC call graphs at equivalent depth, validating this design choice.

This work was inspired by Raghavan et al.'s network stack overhead analysis, which showed that even loopback communication incurs measurable kernel crossing costs. Our contribution extends that analysis with per-protocol hardware counter comparison and controlled anomaly investigation.

[experimental_setup]
┌─────────────────────────────────────────────────────────────────────┐
│ Platform : Ubuntu 24.04 LTS     | Kernel        : 6.8.0-101-generic │
│ CPU      : Hybrid P/E-core      | P-cores       : 4500 MHz max      │
│ Language : Go 1.25              | Serialization : protobuf (all 3)  │
└─────────────────────────────────────────────────────────────────────┘

Message Sizes → 64B  256B  1KB  4KB  64KB  (+ 13-point dense sweep)
Concurrency   → 1  10  50  100  500  (concurrent clients)
Protocols     → TCP (raw)  REST (HTTP/1.1)  gRPC (HTTP/2)

75 configurations × (concurrency × 1,000 requests) = 9,915,000 samples

CPU Pinning → Server pinned to CPU1 (P-core, 4500MHz) via taskset
              Client allowed on CPUs 0,2,3,4,5,6,7 (P-cores only)
              E-cores (2500MHz) excluded entirely — no mixed-core noise
Priority    → sudo nice -n -20 (highest OS scheduling priority)
Isolation   → One protocol benchmarked at a time — others stopped
Warmup      → 1 warmup request per client before measurement begins
HW Counters → kernel.perf_event_paranoid = 1
[tools] profiling stack
| Tool | Version | Purpose |
|---|---|---|
| perf stat | 6.8.12 | Hardware counters: CPU cycles, instructions, IPC, cache refs/misses, branch misses, context switches, page faults, user/sys time split |
| perf record | 6.8.12 | CPU time sampling at 1KHz with call graph capture for flame graph generation and function-level breakdown |
| perf report | 6.8.12 | Function-level CPU time attribution — attached to running client processes (120K–137K samples at 4KB, c=1) |
| perf trace | 6.8.12 | Syscall frequency counting: futex, read, write, epoll_pwait, nanosleep, sched_yield |
| FlameGraph | Gregg v1.0 | stackcollapse-perf.pl + flamegraph.pl — visual CPU time attribution per protocol |
| Go | 1.25.0 | All protocol implementations: net (TCP), net/http (REST), google.golang.org/grpc (gRPC) |
| protoc | 25.1 | Protocol Buffer code generation — identical Payload message used across all three |
| taskset | 2.39.3 | CPU core pinning — server to CPU1, clients to CPUs 0,2,3,4,5,6,7 |
| Python / pandas | 3.12 | 9.9M sample aggregation, P50/P99/P999 computation, zone analysis plots |
why Go over C++?
Go provides production-quality libraries for all three protocols under one consistent runtime — same GC, same scheduler, same allocator. Protocol differences are not contaminated by implementation-quality gaps. Go's M:N goroutine model also keeps client-side scheduling noise at high concurrency lower than 500 raw pthreads in C++ would produce.

Tradeoff acknowledged: Go runtime overhead (GC, goroutine scheduler) is inseparable from protocol overhead without a bare-metal baseline. A C++ implementation is left as future work. This limitation is partially addressed by function-level perf data which directly attributes GC and scheduler costs per protocol.
[implementation] protocol details

All three implement an identical echo service. Client sends a protobuf Payload{id, data[]byte, timestamp}. Server echoes with updated timestamp. Same logical work — overhead differences are purely protocol mechanics.

| Protocol | Framing | Connection Model | Flow Control |
|---|---|---|---|
| TCP | 4-byte big-endian length prefix + protobuf body | 1 persistent connection per client | None (kernel TCP only) |
| REST | HTTP/1.1 headers + binary protobuf body to /echo | net/http persistent connections with connection pool | None |
| gRPC | HTTP/2 binary frames + HPACK header compression | Stream multiplexing over shared connection(s) | WINDOW_UPDATE per stream |
serialization control
By using binary protobuf on REST instead of JSON, serialization cost is identical across all three. This is confirmed in function-level perf data: protobuf deserialization is visible in both call graphs at equivalent depth.
[protocol_profiles]
TCP (raw)
→ Fastest average. Worst tail. Best CPU efficiency. Kernel-dominated.
REST (HTTP/1.1)
→ 2× TCP overhead. Scales poorly. Super-linear context switch growth.
gRPC (HTTP/2)
→ Best tail latency. Best under stress. Highest baseline cost. Cache-hungry at scale.
[baseline_results]

Page fault / latency zone summary — REST vs gRPC ordering is non-monotonic:

Zone 1: <2KB — gRPC overhead > REST
Zone 2: 2–31KB — gRPC faster (REST bufio growth)
Zone 3: ≥32KB — Go allocator boundary hit

Each zone has a distinct confirmed root cause — see anomalies section for kernel-level investigation.

Baseline latency vs message size (concurrency = 1)
| Message Size | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP | gRPC/TCP |
|---|---|---|---|---|---|
| 64B | 18.8 | 35.8 | 47.9 | 1.9× | 2.5× |
| 256B | 17.3 | 34.2 | 45.0 | 2.0× | 2.6× |
| 1KB | 18.5 | 42.3 | 46.3 | 2.3× | 2.5× |
| 4KB ⚠ | 21.2 | 55.4 | 50.4 | 2.6× | 2.4× |
| 64KB | 56.6 | 199.1 | 216.3 | 3.5× | 3.8× |

TCP consistently 2–3.8× faster. At 4KB, REST (55.4µs) exceeds gRPC (50.4µs) — reversing expected ordering. This is the Zone 1→2 crossover in latency, confirmed by dense sweep (see dense latency section). See anomaly section for hardware counter confirmation.

64KB gap widens to 3.8×
At 64KB, payloads exceed L1 cache capacity — cache evictions and memory bandwidth become the dominant cost. TCP, which does no user-space buffer manipulation, is insulated from this effect. REST and gRPC both perform extensive user-space buffer reads and copies, magnifying the cache pressure.
Latency vs concurrency (64B messages) — linear scaling at small messages
| Concurrency | TCP (µs) | REST (µs) | gRPC (µs) | REST/TCP |
|---|---|---|---|---|
| 1 | 18.8 | 35.8 | 47.9 | 1.9× |
| 10 | 94.7 | 145.4 | 194.6 | 1.5× |
| 50 | 486.7 | 777.7 | 889.2 | 1.6× |
| 100 | 936.8 | 1,631.0 | 1,848.3 | 1.7× |
| 500 | 4,996.8 | 9,023.5 | 10,830.5 | 1.8× |

At 64B all three scale approximately linearly. The REST/TCP ratio stays stable at 1.5–1.9× — indicating REST overhead is largely fixed per request at small message sizes. This is average latency; tail behavior tells a completely different story — see next section.

Throughput analysis

At concurrency=1 with 64B messages, TCP achieves approximately 49,000 req/s, versus 24,000 for REST and 21,000 for gRPC. At higher concurrency, all three protocols converge — the bottleneck shifts from protocol overhead to server CPU capacity.

throughput vs tail latency tradeoff
The semaphore-based TCP optimization costs only 7% throughput (10,466 → 9,703 req/s) while reducing tail ratio from 121× to 18× and average latency by 8.7×. Throughput and tail latency are decoupled by queue discipline, not by fundamental protocol limits.
[tail_latency_and_amplification]
protocol inversion effect — the most important result
Low concurrency, small messages (avg latency):
  TCP (18µs) < REST (35µs) < gRPC (47µs)

High concurrency, large messages (P99 tail latency):
  gRPC (186ms) << TCP (1,049ms) < REST (1,202ms)

The protocol ranked best for average-case latency has the worst P99 tail latency at scale. Performance ranking completely inverts.
P99/P50 tail amplification ratio — full table at 64KB

The "knee" occurs between c=10 and c=50 for TCP and REST. gRPC remains flat throughout.

| Protocol | Conc. | P50 (µs) | P99 (µs) | P99/P50 |
|---|---|---|---|---|
| TCP | 1 | 41 | 128 | 3.1× |
| TCP | 10 | 479 | 897 | 1.9× |
| TCP | 50 | 1,753 | 42,010 | 24× |
| TCP | 100 | 3,489 | 140,215 | 40× |
| TCP | 500 | 8,915 | 1,083,462 | 121× |
| REST | 1 | 159 | 395 | 2.5× |
| REST | 10 | 1,487 | 12,105 | 8.1× |
| REST | 50 | 5,877 | 167,801 | 29× |
| REST | 100 | 8,108 | 391,420 | 48× |
| REST | 500 | 16,887 | 1,177,731 | 70× |
| gRPC | 1 | 195 | 422 | 2.2× |
| gRPC | 10 | 1,775 | 3,066 | 1.7× |
| gRPC | 50 | 10,424 | 18,781 | 1.8× |
| gRPC | 100 | 21,276 | 33,193 | 1.6× |
| gRPC | 500 | 95,729 | 186,705 | 2.0× |
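The amplification ratios in this table come from per-request latency samples. A minimal sketch of the computation, assuming a nearest-rank percentile estimator (the report does not state which estimator its pandas pipeline uses); the synthetic distribution below is illustrative, shaped to echo the 121× TCP row:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile of latency samples
// (in µs). Nearest-rank is one common convention; other estimators
// interpolate between adjacent samples.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(p/100*float64(len(s))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(s) {
		rank = len(s) - 1
	}
	return s[rank]
}

func main() {
	// Illustrative: a mostly-fast distribution with a heavy tail —
	// 98% of requests at 10µs, the slowest 2% stuck behind a full
	// kernel send buffer at 1,210µs.
	samples := make([]float64, 1000)
	for i := range samples {
		samples[i] = 10
	}
	for i := 980; i < 1000; i++ {
		samples[i] = 1210
	}
	p50, p99 := percentile(samples, 50), percentile(samples, 99)
	fmt.Printf("P50=%.0fµs P99=%.0fµs amplification=%.0f×\n", p50, p99, p99/p50)
}
```

The point of the sketch: a tail ratio like 121× says nothing about the median — it measures only what the slowest requests experience.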
mechanism — why TCP and REST collapse
At c=500 with 64KB messages, all 500 goroutines simultaneously compete for kernel send buffer space. Without application-level backpressure, late-arriving requests encounter a fully saturated send buffer and wait in the kernel queue. gRPC's HTTP/2 WINDOW_UPDATE frames explicitly limit in-flight unacknowledged data per stream — senders block at the application layer before hitting the kernel, converting unbounded kernel queuing into bounded, controlled waiting.
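The receiver-driven idea can be sketched with a Go channel acting as a credit pool — a toy analogue of WINDOW_UPDATE, not gRPC's actual implementation (real HTTP/2 flow control counts bytes per stream, not messages; the type and method names are invented):

```go
package main

import "fmt"

// window is a toy credit pool: each credit permits one in-flight message.
type window struct{ credits chan struct{} }

func newWindow(n int) *window {
	w := &window{credits: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		w.credits <- struct{}{}
	}
	return w
}

// acquire blocks the sender until the receiver has granted a credit.
func (w *window) acquire() { <-w.credits }

// release is the receiver's WINDOW_UPDATE analogue: a credit is granted
// back only after the data has actually been consumed.
func (w *window) release() { w.credits <- struct{}{} }

func main() {
	w := newWindow(2)
	w.acquire()
	w.acquire()
	// A third send would now block — queuing happens here, in user space,
	// instead of inside the kernel send buffer.
	select {
	case <-w.credits:
		fmt.Println("unexpected credit")
	default:
		fmt.Println("sender blocked: window exhausted")
	}
	w.release() // receiver consumed one message
	w.acquire() // sender proceeds
	fmt.Println("sender resumed after WINDOW_UPDATE-style grant")
}
```

The key property is that the receiver, not the sender, decides when more data may flow — exactly what the sender-side semaphore in the optimization section cannot provide.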
[kernel_profiling]
Hardware performance counters — perf stat (64B, c=1, 10K requests)
| Counter | TCP | REST | gRPC | REST/TCP | gRPC/TCP |
|---|---|---|---|---|---|
| CPU Cycles | 567M | 1,350M | 1,539M | 2.4× | 2.7× |
| Instructions | 751M | 1,549M | 1,674M | 2.1× | 2.2× |
| IPC | 1.32 | 1.14 | 1.08 | — | — |
| Cache References | 2.4M | 14.5M | 18.6M | 6.0× | 7.7× |
| Cache Misses | 222K | 456K | 548K | 2.1× | 2.5× |
| Branch Misses | 2.2M | 3.8M | 4.6M | 1.7× | 2.1× |
| Context Switches | 14,708 | 42,629 | 45,193 | 2.9× | 3.1× |
| User Time | 33ms (20%) | 187ms (46%) | 245ms (53%) | — | — |
| Sys Time | 131ms (80%) | 215ms (54%) | 210ms (47%) | — | — |
user vs kernel split — where work happens
TCP: 80% sys — kernel-dominated at all concurrency levels. This ratio is constant regardless of load.
REST: 46% sys at c=1 → 35% sys at c=500 — protocol logic and connection management take over at scale.
gRPC: 47% sys at c=1 → 26% sys at c=500 — HTTP/2 state machine, stream coordination, and GC dominate at scale.

At scale, the bottleneck in REST and gRPC is not the kernel network stack — it is their own user-space abstractions.
Hardware counters at intermediate message sizes — explains the 4KB REST/gRPC crossover

To understand why REST latency exceeds gRPC at 4KB, hardware counters were collected at 1KB, 4KB, and 16KB with c=1 and 10,000 requests.

| Metric | 1KB | 4KB | 16KB |
|---|---|---|---|
| REST cycles | 1,502M | 2,322M | 4,076M |
| gRPC cycles | 1,714M | 2,020M | 2,898M |
| REST cache-refs | 20.2M | 40.5M | 106.2M |
| gRPC cache-refs | 24.7M | 34.2M | 66.7M |
| REST cache-miss% | 7.7% | 8.3% | 5.9% |
| gRPC cache-miss% | 6.4% | 6.2% | 4.9% |
| REST page-faults | 3,560 | 6,136 | 15,387 |
| gRPC page-faults | 2,974 | 3,868 | 6,251 |
confirmed root cause — net/http per-request buffer allocation
At 1KB, gRPC consumes 14% more cycles than REST due to HTTP/2 connection setup — HPACK tables, stream state, flow control structures.

Above 2KB this inverts: REST's net/http transport allocates a fresh response buffer per request sized to each payload, increasing heap pressure and reducing cache locality compared to gRPC's pre-allocated fixed-size frame buffers via tieredBufferPool. At 4KB, REST consumes 15% more cycles and 18% more cache references. At 16KB the gap reaches 40% more cycles and 59% more cache references.

gRPC's frame buffers are pre-allocated and reused across requests — no dynamic resizing per request.
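The two allocation strategies can be contrasted in a few lines, using sync.Pool as a stand-in for gRPC's tieredBufferPool (the real pool keeps multiple size tiers; a single 32KB tier keeps the sketch minimal, and both handler names are invented):

```go
package main

import (
	"fmt"
	"sync"
)

// framePool mimics the reuse strategy: fixed-size buffers are recycled
// across requests instead of allocated fresh per response.
var framePool = sync.Pool{
	New: func() any { return make([]byte, 32<<10) }, // one 32KB tier
}

// handleWithPool borrows a frame buffer, uses it, and returns it —
// the gRPC-style path: no per-request heap growth once the pool is warm.
func handleWithPool(payload []byte) int {
	buf := framePool.Get().([]byte)
	defer framePool.Put(buf)
	return copy(buf, payload)
}

// handleFresh allocates a response buffer sized to each payload —
// the net/http-style path identified above as the Zone 2 cost.
func handleFresh(payload []byte) int {
	buf := make([]byte, len(payload))
	return copy(buf, payload)
}

func main() {
	payload := make([]byte, 4<<10) // a 4KB request, mid-Zone 2
	fmt.Println(handleWithPool(payload), handleFresh(payload)) // 4096 4096
}
```

Both handlers do identical logical work; only the fresh-allocation path generates per-request heap pressure, which is what the cycle and cache-reference gap above measures.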
Function-level CPU breakdown — perf report (4KB, c=1)

Attached perf record to running client processes (120K–137K samples). Both protocols spend ~77–79% of samples blocked waiting on network I/O. Breakdown of sampled time:

| Category | REST % | gRPC % | Implication |
|---|---|---|---|
| Protocol layer (HTTP framing/parsing) | 4.02 | 3.56 | Comparable — framing is not the bottleneck |
| Memory allocation (mallocgc) | 4.91 | 3.29 | REST allocates more per request |
| Garbage collection (gcDrain, scanobject) | 3.71 | 0.71 | 5× more GC in REST |
| Memory copy (memmove) | 1.32 | 2.20 | gRPC copies data between frame buffers |
| Go scheduler (stealWork, findRunnable) | 5.11 | 8.31 | gRPC 63% more scheduling |
| Kernel sync (futex, psi) | 4.27 | 4.73 | Similar — both lock-heavy |
| I/O wait (blocked on network) | ~77% | ~79% | Dominant cost in both — loopback I/O |
REST GC cost — why 5× higher
HTTP/1.1 response parsing allocates many short-lived objects per request — header maps, string slices, bodyEOFSignal wrappers. The GC must trace and collect all of these. REST provides no equivalent to gRPC's buffer pool.
gRPC scheduler cost — why 63% higher
HTTP/2 maintains dedicated goroutines: loopyWriter, frame reader, keepalive pinger. All require scheduling. runtime.procyield appears only in gRPC (0.94%) — confirms mutex spin-waiting from stream table locking.
Flame graph analysis — call stack depth and structure

Generated using perf record at 1KHz + call graph capture + FlameGraph tools. Wider = more CPU time. Flame graph file sizes alone reflect complexity: TCP 149KB, REST 241KB, gRPC 332KB.

TCP — shallow, kernel-dominated
2–3 user-space frames before kernel. No visible futex calls anywhere in the graph — zero synchronization overhead. Most CPU time is the kernel doing actual network work.
REST — deep HTTP/1.1 pipeline with synchronization
6 user-space frames before reaching kernel: Client → Transport → persistConn → bufio → net.Conn → syscall. Each layer reads/writes HTTP header buffers — directly explains the 6× cache reference increase. Visible futex_wait call sites from connection pool locking per request.
gRPC — distributed CPU, most complex call graph
8+ user-space frames. CPU time spread across multiple distinct subsystems: HTTP/2 transport, protobuf serialization, stream management, flow control, keepalive pings — each visible as separate towers. Multiple futex call sites from stream table locking. tieredBufferPool.Get visible as a distinct allocation pattern contrasting with REST's scattered mallocgc.
Syscall analysis — perf trace (64B, c=1, 10K requests)
| Syscall | TCP | REST | gRPC | Significance |
|---|---|---|---|---|
| futex | 1,335 | 48,256 | 55,332 | 36–41× higher in REST/gRPC |
| read | 29,644 | 20,305 | 26,384 | TCP does most raw I/O directly |
| write | 19,836 | 10,381 | 14,137 | REST/gRPC buffer and batch writes |
| epoll_pwait | 19,499 | 18,899 | 32,667 | gRPC monitors more I/O event sources |
| nanosleep | 3,877 | 8,812 | 6,932 | lock-contention sleep-and-retry |
| sched_yield | 0 | 31 | 51 | TCP never contends on any lock |
futex calls per request — the dominant overhead metric
TCP: 0.13 futex/request  |  REST: 4.8/request  |  gRPC: 5.5/request

Each futex call is a mutex lock acquisition or release — pure synchronization overhead with no application work. sched_yield = 0 for TCP proves zero lock contention. A thread yields only when spinning on a lock it cannot acquire — REST and gRPC do this regularly, confirming the futex calls represent real contention.
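The contention pattern behind these counts can be reproduced qualitatively. The sketch below drives many goroutines through one shared sync.Mutex — the shape of REST's connection-pool lock and gRPC's stream-table lock. Run under perf trace (as this section does), the contended version shows futex traffic that an uncontended single-goroutine loop lacks; the workload itself is illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// contend has `goroutines` goroutines hammer one mutex. Under contention,
// sync.Mutex parks waiters via futex(2); an uncontended Lock/Unlock pair
// stays entirely in user space — which is why raw TCP, with no shared
// locks on its hot path, shows almost no futex calls.
func contend(goroutines, iters int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	// 500 goroutines mirrors the benchmark's highest concurrency level.
	fmt.Println(contend(500, 1000)) // 500000 increments, heavy futex traffic
}
```

The increment is trivial on purpose: every cycle beyond the addition itself is synchronization overhead, which is exactly what the futex/request metric isolates.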
[concurrency_scaling_analysis]
Context switches vs concurrency — REST super-linear collapse
| Concurrency | TCP | REST | gRPC | REST/TCP |
|---|---|---|---|---|
| 1 | 1,439 | 4,437 | 5,127 | 3.1× |
| 10 | 2,676 | 14,897 | 17,649 | 5.6× |
| 50 | 16,657 | 72,125 | 18,765 | 4.3× |
| 100 | 38,281 | 149,329 | 31,966 | 3.9× |
| 500 | 122,112 | 914,456 | 165,840 | 7.5× |
REST super-linear growth — HTTP/1.1 architectural limit
REST context switches grow super-linearly: 3.1× TCP at c=1 → 7.5× at c=500. HTTP/1.1's one-goroutine-per-connection model means 500 goroutines all compete for the same connection pool mutex. At c=50, gRPC (18,765) is nearly identical to TCP (16,657) — HTTP/2 multiplexing pays off here.
Cache misses vs concurrency — gRPC's hidden scaling cost
| Concurrency | TCP (K) | REST (K) | gRPC (K) | gRPC/TCP |
|---|---|---|---|---|
| 1 | 113 | 236 | 273 | 2.4× |
| 10 | 254 | 443 | 492 | 1.9× |
| 50 | 588 | 1,593 | 3,938 | 6.7× |
| 100 | 1,095 | 6,479 | 15,758 | 14.4× |
| 500 | 12,088 | 174,887 | 331,541 | 27.4× |

gRPC wins on context switches at high concurrency, but loses badly on cache misses. At c=500, gRPC generates 27× more cache misses than TCP. HTTP/2 per-stream state (flow control windows, HPACK tables) for 500 concurrent streams far exceeds L1/L2 capacity — constant cache thrashing. This partially offsets gRPC's multiplexing benefits.

User vs system time evolution at low and high concurrency
| Protocol | User (c=1) | Sys (c=1) | User (c=500) | Sys (c=500) |
|---|---|---|---|---|
| TCP | 20% | 80% | 20% | 80% |
| REST | 42% | 58% | 65% | 35% |
| gRPC | 44% | 56% | 74% | 26% |

TCP's user/sys split is constant at all concurrency levels — always kernel-dominated. REST and gRPC shift toward user space at scale: the bottleneck becomes their own protocol logic and synchronization, not the kernel network stack.

[anomalies_and_root_causes]

Three anomalies were investigated at the kernel level. Each was traced to a confirmed root cause with controlled experiments to rule out alternative hypotheses.

Anomaly 1: Page fault step function at 32KB — Go allocator boundary

Observed: gRPC page fault count jumped discontinuously at exactly 32KB message size — a step function, not a gradual increase.

At 31KB — all three gRPC frame sizes
gRPC-8KB frame: ~4,117 page faults
gRPC-16KB frame: ~4,106 page faults
gRPC-32KB frame: ~3,394 page faults

→ Low and stable across all variants
At 32KB — all three gRPC frame sizes
gRPC-8KB frame: ~16,828 page faults
gRPC-16KB frame: ~14,043 page faults
gRPC-32KB frame: ~16,081 page faults

→ 4× simultaneous jump in all variants
confirmed root cause — Go allocator slab/heap boundary
Go runtime constant: _MaxSmallSize = 32768 (runtime/malloc.go)

Below 32KB → thread-local size-class pool (mcache) → slab page already mapped → no fault.
At or above 32KB → large-object allocator (mheap) → fresh heap page requested from OS → guaranteed fault per allocation.
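A simplified model of that dispatch (the real runtime/malloc.go also rounds sizes to size classes and special-cases pointer-free allocations; strictly, allocations larger than _MaxSmallSize take the mheap path, and a 32KB message's buffer crosses it once protobuf and framing overhead are added — the function here is illustrative, not a runtime API):

```go
package main

import "fmt"

// maxSmallSize mirrors _MaxSmallSize in runtime/malloc.go.
const maxSmallSize = 32768

// allocPath is a simplified model of the runtime's dispatch: sizes up to
// maxSmallSize are served from the thread-local mcache, whose slab pages
// are already mapped (no fault); anything larger goes to mheap, which
// requests fresh pages from the OS.
func allocPath(size int) string {
	if size <= maxSmallSize {
		return "mcache (small object, no new page fault)"
	}
	return "mheap (large object, fresh page fault per allocation)"
}

func main() {
	// A 31KB message buffer stays comfortably in the small-object path;
	// a 32KB payload plus per-message overhead lands just past the
	// boundary — hence the step function observed at 32KB.
	fmt.Println(allocPath(31 << 10))
	fmt.Println(allocPath(32<<10 + 64))
}
```

This is why the anomaly is structural: no GC tuning changes which side of the threshold a 32KB-plus-overhead buffer falls on.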
alternative hypothesis 1 ruled out — HTTP/2 frame splitting
gRPC's default write buffer is ~16KB. Initial hypothesis: messages crossing frame boundaries trigger additional memory operations. Tested with 8KB, 16KB, and 32KB frame size configurations. All three show identical step functions at exactly 32KB. Frame size has zero effect — HTTP/2 framing ruled out.
alternative hypothesis 2 ruled out — GC pressure
GOGC=off experiment: page fault sweep with garbage collection disabled. Absolute fault counts rise (to ~95K, since the GC no longer reclaims pages), but the step function at 32KB persists in both conditions — confirming the cause is the allocator path change, not GC behavior.
| Size | Normal GC | GOGC=off | Step preserved? |
|---|---|---|---|
| 29KB | 3,548 | 95,433 | — |
| 31KB | 3,327 | 95,364 | — |
| 32KB | 16,214 | 96,414 | Yes |
| 33KB | 17,469 | 116,227 | Yes |
Anomaly 2: Page fault three-zone behavior across message sizes

Observed: REST vs gRPC page fault ordering is non-monotonic across message sizes — three behaviorally distinct zones.

Zone 1: <2KB — gRPC overhead > REST
Zone 2: 2–31KB — REST overhead > gRPC
Zone 3: ≥32KB — gRPC overhead rises sharply
| Size | REST PF | gRPC PF | REST/gRPC | Zone |
|---|---|---|---|---|
| 512B | 3,573 | 3,615 | 1.0× | Zone 1 |
| 1KB | 3,360 | 3,363 | 1.0× | Zone 1 |
| 2KB | 4,688 | 3,674 | 1.3× | Zone 2 |
| 4KB | 7,178 | 3,706 | 1.9× | Zone 2 |
| 8KB | 8,513 | 5,362 | 1.6× | Zone 2 |
| 16KB | 17,185 | 5,889 | 2.9× | Zone 2 |
| 32KB | 23,736 | 19,326 | 1.2× | Zone 3 onset |
| 64KB | 33,125 | 25,035 | 1.3× | Zone 3 |
| Zone | Range | Root Cause |
|---|---|---|
| Zone 1 | <2KB | gRPC's fixed HTTP/2 setup overhead (HPACK table initialization, stream state, flow control) generates comparable or more page faults than REST's simple handling at small sizes |
| Zone 2 | 2–31KB | REST's net/http transport allocates a fresh response buffer per request. gRPC's frame buffers are pre-allocated via tieredBufferPool — REST page faults grow rapidly, gRPC stays flat at ~4K–6K |
| Zone 3 | ≥32KB | Go allocator boundary: both protocols cross _MaxSmallSize=32768. gRPC's per-stream state amplifies the large-object allocation cost, narrowing the gap that existed in Zone 2 |
continuous vs fragmented memory — the architectural root
HTTP/1.1 allocates the response body as one contiguous buffer. HTTP/2 pre-allocates fixed-size frame buffers via tieredBufferPool and reuses them. Below 32KB, gRPC's reuse strategy generates fewer page faults. At exactly 32KB, both protocols' message buffers cross Go's large-object threshold — the pre-allocation advantage disappears. Above 32KB, the large-object allocator dominates both, and gRPC's additional per-stream state means it hits the threshold more frequently.
Anomaly 3: REST latency exceeds gRPC at 4KB (Zone 1→2 crossover in latency)

At 4KB, c=1: REST (55.4µs) > gRPC (50.4µs) — reversing expected ordering. Reproduced across multiple independent runs. Confirmed by hardware counter analysis showing REST consuming 15% more cycles and 18% more cache references at 4KB than gRPC — the inverse of the 1KB relationship.

At 1KB (Zone 1 — gRPC more expensive)
REST: 42.3µs   gRPC: 46.3µs
REST PF: 3,360   gRPC PF: 3,363
gRPC has more fixed overhead, REST wins on latency.
At 4KB (Zone 2 — REST more expensive)
REST: 55.4µs   gRPC: 50.4µs
REST PF: 7,178   gRPC PF: 3,706
Buffer allocation overhead overtakes gRPC's setup cost.
[dense_latency_sweep — 512B to 64KB]

The original 5-point message size benchmark was extended to 13 sizes to confirm both crossovers directly in end-to-end latency. This answers the professor's question about where time is spent across message sizes.

Full latency sweep — all three protocols, 13 message sizes, c=1
| Size | TCP (µs) | REST (µs) | gRPC (µs) | Zone | Note |
|---|---|---|---|---|---|
| 512B | 19 | 37 | 44 | Zone 1 | gRPC 7µs slower than REST |
| 1KB ← | 18 | 39 | 42 | Zone 1→2 | gap closing — only 3µs |
| 2KB | 19 | 43 | 44 | Zone 2 | nearly tied |
| 3KB | 20 | 44 | 45 | Zone 2 | REST just edges ahead |
| 4KB ⚠ | 21 | 53 | 47 | Zone 2 | REST clearly slower — anomaly confirmed |
| 6KB | 21 | 56 | 48 | Zone 2 | gap growing |
| 8KB | 21 | 63 | 51 | Zone 2 | REST 23% slower than gRPC |
| 12KB | 23 | 76 | 53 | Zone 2 | REST 43% slower |
| 16KB | 26 | 85 | 62 | Zone 2 | peak gRPC advantage |
| 24KB | 32 | 102 | 70 | Zone 2 | gRPC still 32µs faster |
| 32KB ← | 32 | 121 | 171 | Zone 3 | sharp flip — gRPC 50µs SLOWER than REST |
| 48KB | 40 | 147 | 185 | Zone 3 | REST now faster than gRPC |
| 64KB | 48 | 180 | 199 | Zone 3 | REST 19µs faster than gRPC |
two crossovers confirmed in latency
Crossover 1 (Zone 1→2, around 1–2KB): GRADUAL
Gap closes over 4 data points (512B→1KB→2KB→3KB→4KB). This matches the architectural cause — HTTP/2 fixed setup cost is gradually overtaken by REST's growing buffer allocation cost. Gradual = structural/architectural difference.

Crossover 2 (Zone 2→3, exactly 24KB→32KB): SHARP
Between 24KB and 32KB, gRPC goes from 32µs faster to 50µs slower — an 82µs swing in one step. This matches the page fault step function exactly. Sharp = hard boundary (Go allocator _MaxSmallSize). Two independent mechanisms, two distinct crossover signatures.
TCP monotonically fastest — never crosses
TCP latency grows from 19µs to 48µs across the entire 512B–64KB range. It never inverts with either protocol. The REST/gRPC inversions are entirely absent in TCP because TCP has no user-space buffer management overhead — its latency growth is driven purely by data transfer cost.
[optimization_semaphore_flow_control]

The tail amplification analysis identified TCP's 121× P99/P50 ratio as directly caused by the absence of application-level flow control. This was validated by implementing a minimal semaphore-based fix and measuring the effect.

[implementation] the fix — 5 lines of Go
root cause
At c=500 with 64KB messages, all 500 goroutines immediately call write on their TCP connections. The kernel send buffer fills rapidly. Late-arriving goroutines block inside the kernel, accumulating into a queue that grows unboundedly. Request latency is proportional to queue position — the last goroutine waits for all 499 others to drain.
```go
// Original — all 500 goroutines blast simultaneously
go func() { sendRecv(conn, data) }() // × 500

// Optimized — semaphore limits max in-flight to 50
sem := make(chan struct{}, maxInflight) // maxInflight = 50
go func() {
    sem <- struct{}{}    // acquire — blocks when 50 already in-flight
    sendRecv(conn, data)
    <-sem                // release
}()
```

With maxInflight=50, at most 50 goroutines hold the semaphore at any moment. The remaining 450 block on the Go channel — outside the kernel, in user space — without touching the kernel send buffer. Queue discipline moves from kernel to application layer, making it measurable and controllable.

Results — before vs after (64KB, c=500)
Before (raw TCP, no flow control)
44,735µs avg latency · 121× P99/P50 tail ratio · 1,083,462µs P99 latency · 1,557,815 page faults
After (semaphore, maxInflight=50)
5,132µs avg latency (8.7× better) · 18× P99/P50 tail ratio (6.7× better) · 64,388µs P99 latency (16.8× better) · 32,961 page faults (47× fewer)
8.7× avg latency improvement
16.8× P99 improvement
47× fewer page faults
7% throughput cost
why 18× residual — not 2× like gRPC
The semaphore limits the sender side but provides no receiver-driven backpressure. The 50 active goroutines still write without acknowledgment from the server that it has consumed the data. gRPC's WINDOW_UPDATE mechanism is receiver-driven: the server explicitly signals how much more data it can accept. A full solution requires receiver-driven flow control analogous to HTTP/2 — left as future work.
page fault reduction — explained
Before: 500 goroutines simultaneously allocate 64KB response buffers → 500 large-object allocations → 1.5M page faults. After: at most 50 goroutines hold buffers at once → 47× fewer page faults. Confirms the original page fault explosion was caused by simultaneous large-object allocations, not the protocol itself.
[discussion]
The abstraction cost — quantified per layer

REST over HTTP/1.1 adds 6 user-space function call layers over raw TCP. This translates to 6× more memory accesses, 3× more context switches, and 36× more mutex lock operations for identical 64B payloads. The overhead is largely additive and fixed per request at small message sizes — REST is essentially a constant 2× overhead tax on TCP.

gRPC introduces greater per-request complexity (2.7× more CPU cycles, 8 user-space layers), but its architectural features prevent catastrophic degradation at scale. The additional overhead is a worthwhile investment when P99 tail latency matters.

key design insight
There is no universally best protocol. The ranking depends entirely on the workload regime. The decision point is not average latency — it is what happens at the tail under peak load. Choosing based on microbenchmarks (average latency at low concurrency) can produce catastrophically wrong decisions for production deployments.
Methodology validity and limitations
acknowledged limitations
[when_to_use]
Use TCP when:
→ HFT order entry, internal hot paths, game servers, custom control planes
Use REST when:
→ Public APIs, low-traffic internal services, human-readable debugging
Use gRPC when:
→ Latency-sensitive distributed systems, ML inference serving, high-concurrency backends
one-line protocol selection rule
If your system has tail latency SLAs and sees burst concurrency above 50 clients: use gRPC.
If you own both endpoints and concurrency is controlled: TCP + semaphore flow control.
REST is the right default only when neither of the above applies.
systems design principle
Average latency is what your system does on a good day.
P99 latency is what your system does to your users on a normal day.
P99 at c=500 is what your system does during a traffic spike — the moment that matters most.

TCP's 121× P99/P50 ratio means the slowest 1% of requests are 121× worse than the median. For a service handling 10,000 req/s, that is 100 requests per second experiencing 1-second+ latency. Optimize for tail, not mean.