Lock-Free in Java: Scenario 07 - Per-Core Sharded Processing
Part 1: The 2AM Crisis That Changed Everything
Sharding is what you reach for when contention is not an implementation detail anymore, but the defining cost of the system.
A single shared buffer can look reasonable in design diagrams and still fail badly on modern multi-core hardware. Once dozens of threads fight over one coordination point, throughput stops scaling, cache lines bounce between cores, and the scheduler starts doing more work than the application.
Per-core sharding addresses that by changing the question entirely. Instead of asking how to coordinate many threads around one hot structure, it asks how little coordination the workload can tolerate. This article walks through that shift: the baseline shared buffer, the per-core alternative, and the practical trade-offs that appear once you isolate work by shard instead of centralizing it.
Part 2: Why Contention Kills Performance
Before diving into the solution, we need to understand the problem at a fundamental level. Contention isn't just "threads waiting" - it's a cascade of performance-destroying effects that compound under load.
The Anatomy of Lock Contention
When multiple threads compete for a shared resource protected by a lock, several things happen:
1. Thread Parking and Context Switches
When a thread tries to acquire a lock that another thread holds, it eventually "parks" - tells the operating system to stop scheduling it until the lock is available. This parking operation isn't free:
The actual lock operation (compare-and-swap) takes about 50 nanoseconds. But the context switch overhead is 2,000-6,000 nanoseconds - 40-120x more expensive than the operation we're trying to protect.
2. Lock Convoy Effect
Here's where it gets really insidious. Consider four threads all trying to acquire the same lock. Thread-1 holds it while Threads 2-4 park; each release wakes exactly one waiter, which runs briefly, and the cycle repeats.
Each thread pays the full context switch penalty. They form a "convoy" - processing sequentially with the worst possible overhead. Four threads that should provide 4x parallelism instead deliver less throughput than a single thread, because they spend most of their time in scheduling overhead.
3. Cache Line Bouncing
Modern CPUs have per-core caches organized in a hierarchy (L1, L2, L3). When data is modified, the cache line containing that data must be invalidated in all other cores. This invalidation is coordinated by the cache-coherency protocol, typically MESI (Modified, Exclusive, Shared, Invalid).
A lock's internal state (whether it's held, by whom, who's waiting) lives in memory. When Thread-1 acquires the lock, it modifies this state. That modification invalidates the cache line in all other cores. When Thread-2 tries to acquire the lock, it must fetch the cache line from Thread-1's cache (or main memory) - an operation costing 40-100+ nanoseconds depending on CPU topology.
Each bounce costs 40-100ns. With 64 cores competing, the lock state might bounce dozens of times per lock acquisition. The CPU's interconnect becomes saturated with coherency traffic rather than useful work.
The CAS Retry Storm
Lock-free algorithms use Compare-And-Swap (CAS) operations instead of locks. But they're not immune to contention:
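The canonical shape is a read-modify-CAS loop; here is a minimal sketch of a lock-free counter increment:

```java
import java.util.concurrent.atomic.AtomicLong;

final AtomicLong counter = new AtomicLong();

void increment() {
    for (;;) {
        long current = counter.get();          // read
        long next = current + 1;               // modify
        if (counter.compareAndSet(current, next)) {
            return;                            // success - progress made
        }
        // CAS failed: another thread won the race; retry with a fresh read
    }
}
```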
Under low contention, this works beautifully - one CAS, done. Under high contention, something nasty happens: all N threads read the same value, all attempt the CAS, exactly one succeeds, and the remaining N-1 loop and retry, wave after wave.
This is a CAS retry storm. Instead of O(N) operations for N increments, we get O(N^2). Worse, each failed CAS still bounces cache lines, generating massive memory traffic with zero progress.
Quantifying the Damage
In our transaction processing system, I measured the following before we fixed the problem:
The system was spending 73% of its cycles on locking overhead and only 27% on actual work. The 8% CPU utilization showed that most cores were parked, waiting. The L3 cache miss rate of 38% indicated massive cache-line bouncing.
This is what contention looks like at scale. And the only way to fix it is to eliminate the contention itself.
Part 3: The Insight - Eliminating Contention Through Sharding
The night after our emergency meeting, I couldn't sleep. I kept turning the problem over in my mind. We had tried:
- Reducing lock hold time (already minimized)
- Using StampedLock for optimistic reads (didn't help - we were write-heavy)
- Using lock striping (helped somewhat, but not enough)
- Going lock-free with CAS (CAS retry storms under high contention)
None of these approaches addressed the fundamental issue: all threads were fighting over the same resource. Whether that resource was protected by a lock, a CAS variable, or anything else, the contention remained.
Then it hit me. What if we didn't have one shared resource? What if we had 64 shared resources - one per core?
The idea was simple: instead of a single buffer that all threads write to, create multiple buffers. Assign each thread to a specific buffer based on some deterministic mapping. Now threads assigned to different shards never compete with each other.
Before: Single Shared Buffer (contention!)
After: Per-Core Sharded Buffers (zero contention!)
This is sharding - partitioning a shared resource into independent pieces that can be accessed without coordination. It's the same principle that makes distributed databases scale: instead of one big lock, many small locks (or better, no locks at all).
The Key Insight: Thread Affinity
For sharding to work well, we need stable thread-to-shard assignments. If threads randomly pick shards, we're back to contention. The insight is to use the thread's identity to deterministically select a shard.
In Java, every thread has a unique ID accessible via Thread.currentThread().getId(). This ID is stable for the thread's lifetime. We can use it to map threads to shards:
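A minimal sketch of that mapping (shardCount is assumed to be a power of 2):

```java
// Deterministic thread-to-shard mapping: same thread, same shard, every time.
int shardIndex = (int) (Thread.currentThread().getId() & (shardCount - 1));
```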
The bitwise AND with (shardCount - 1) works because shardCount is a power of 2 (we'll ensure this). It's equivalent to modulo but much faster.
This mapping ensures:
- The same thread always writes to the same shard
- Different threads write to potentially different shards
- The mapping is O(1) with no memory access required
Choosing the Shard Count
How many shards should we have? There's a trade-off:
Too few shards: Some shards will be shared by multiple threads, reintroducing contention.
Too many shards: Memory overhead increases, and the consumer has more work round-robining between shards.
The sweet spot is typically:
- Minimum: Number of cores (so each core can have its own shard)
- Maximum: Number of producer threads (so each thread has its own shard)
For a 64-core machine with 64 producer threads, 64 shards is ideal. Each thread gets exclusive access to its shard - zero contention.
But what if we have 64 threads and only 8 shards? Then on average, 8 threads share each shard. Contention is reduced 8x compared to a single buffer, but not eliminated. We call this the "contention factor."
Round-Robin Consumption
With multiple shards, the consumer must check all of them. A simple approach is round-robin:
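A sketch of what that loop can look like (shards, poll(), and the generic type T are illustrative names, not the original source):

```java
// Round-robin over shards; returns null if every shard is empty this pass.
private int consumerShardIndex = 0;

T pollNext() {
    for (int i = 0; i < shards.length; i++) {
        T item = shards[consumerShardIndex].poll();   // non-blocking poll
        consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
        if (item != null) {
            return item;
        }
    }
    return null;  // all shards empty
}
```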
This has two nice properties:
- Fairness: No shard gets starved
- Batching opportunity: The consumer can drain multiple items from a shard before moving on
The trade-off is that consumption isn't strictly FIFO across the entire system - items might be consumed out of arrival order if they landed in different shards. For most systems, this is acceptable.
Part 4: The Single Shared Buffer - Our Baseline
Before building the sharded solution, let's examine exactly what we're replacing. Understanding the baseline in detail reveals the specific costs we're trying to eliminate.
Implementation
Here's the single shared buffer implementation we started with:
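What follows is a representative sketch rather than the production listing; the names are illustrative, but the structure matches the analysis below:

```java
import java.util.concurrent.locks.ReentrantLock;

public final class SharedBuffer<T> {
    private final Object[] items;
    private final ReentrantLock lock = new ReentrantLock();
    private int head;   // next slot to write
    private int tail;   // next slot to read
    private int count;  // number of occupied slots

    public SharedBuffer(int capacity) {
        this.items = new Object[capacity];
    }

    public boolean offer(T item) {
        lock.lock();
        try {
            if (count == items.length) {
                return false;                   // buffer full
            }
            items[head] = item;
            head = (head + 1) % items.length;   // FIFO ring
            count++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        lock.lock();
        try {
            if (count == 0) {
                return null;                    // buffer empty
            }
            T item = (T) items[tail];
            items[tail] = null;                 // help the GC
            tail = (tail + 1) % items.length;
            count--;
            return item;
        } finally {
            lock.unlock();
        }
    }
}
```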
Analysis
This implementation is correct. It provides thread safety through a ReentrantLock, prevents buffer overflows, maintains FIFO ordering, and helps the garbage collector by nulling consumed slots.
But let's trace through what happens when 4 threads call offer() simultaneously: one thread does ~50ns of useful work inside the lock while each of the other three burns thousands of nanoseconds parking and context switching.
That works out to roughly 3.4% efficiency. The other 96.6% was overhead.
Memory Layout Analysis
Let's examine the object layout using JOL (Java Object Layout):
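A dump like the one discussed below can be produced with JOL roughly like this (the exact output varies by JVM and version):

```java
// Requires the org.openjdk.jol:jol-core dependency.
import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {
    public static void main(String[] args) {
        // Prints field offsets and padding for the buffer class.
        System.out.println(ClassLayout.parseClass(SharedBuffer.class).toPrintable());
    }
}
```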
Notice that head, tail, count, and lock are all within 32 bytes of each other - they fit on a single 64-byte cache line. This means:
- When a producer updates head, it invalidates the consumer's cached tail
- When the consumer updates tail, it invalidates the producer's cached head
- Every lock acquisition invalidates the cache lines for all competing threads
This is false sharing at its worst. Fields that should be independent are sharing cache lines, causing unnecessary invalidation traffic.
Benchmark Results
Using JMH with 64 producer threads on a 64-core machine, the single buffer's median offer() latency came out at 298ns - acceptable on its own. But look at the tail:
- p99: 2.8 microseconds (10x median)
- p99.9: 18 microseconds (60x median)
- p99.99: 67 microseconds (225x median)
This tail latency is the lock convoy effect manifesting. Some unlucky threads wait through multiple context switch cycles before they can proceed.
Part 5: The Per-Core Sharded Buffer - Our Solution
Now let's build the sharded solution. The design goals are:
- Eliminate contention between producer threads
- Maintain lock-free (or very-low-contention) access paths
- Keep the implementation simple and debuggable
- Provide O(1) shard selection
Architecture
Implementation
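The production listing is longer than fits here; the following is a condensed sketch of the structure the rest of this part discusses - per-slot sequence numbers, a CAS-claimed head per shard, a consumer-only tail, and power-of-2 masks throughout. Cache-line padding is omitted for brevity (see decision 3 below), and all names are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

public final class ShardedBuffer<T> {

    // One shard: a small ring buffer with per-slot sequence numbers.
    // capacityPerShard is assumed to be a power of 2.
    static final class Shard<T> {
        final AtomicReferenceArray<T> slots;
        final AtomicLongArray sequences;            // per-slot publication sequence
        final AtomicLong head = new AtomicLong();   // producer claim counter
        long tail;                                  // consumer-only; no sync needed
        final int mask;

        Shard(int capacity) {
            slots = new AtomicReferenceArray<>(capacity);
            sequences = new AtomicLongArray(capacity);
            for (int i = 0; i < capacity; i++) sequences.set(i, i);
            mask = capacity - 1;
        }

        boolean offer(T item) {
            for (;;) {
                long h = head.get();
                int idx = (int) (h & mask);
                long seq = sequences.get(idx);
                if (seq == h) {
                    if (head.compareAndSet(h, h + 1)) {  // claim the slot
                        slots.set(idx, item);            // write data
                        sequences.set(idx, h + 1);       // publish to the consumer
                        return true;
                    }
                } else if (seq < h) {
                    return false;                        // shard full
                }
                // seq > h: our head was stale; loop and re-read
            }
        }

        T poll() {
            int idx = (int) (tail & mask);
            if (sequences.get(idx) != tail + 1) return null;  // nothing published yet
            T item = slots.get(idx);
            slots.set(idx, null);                   // help the GC
            sequences.set(idx, tail + mask + 1);    // recycle slot for the next lap
            tail++;
            return item;
        }
    }

    private final Shard<T>[] shards;
    private int consumerShardIndex;                 // consumer-only state

    @SuppressWarnings("unchecked")
    public ShardedBuffer(int shardCount, int capacityPerShard) {
        if (Integer.bitCount(shardCount) != 1)
            throw new IllegalArgumentException("shardCount must be a power of 2");
        shards = new Shard[shardCount];
        for (int i = 0; i < shardCount; i++) shards[i] = new Shard<>(capacityPerShard);
    }

    // O(1) shard selection: the same thread always lands on the same shard.
    public boolean offer(T item) {
        int idx = (int) (Thread.currentThread().getId() & (shards.length - 1));
        return shards[idx].offer(item);
    }

    // Round-robin drain across shards; null if everything is empty this pass.
    public T poll() {
        for (int i = 0; i < shards.length; i++) {
            T item = shards[consumerShardIndex].poll();
            consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
            if (item != null) return item;
        }
        return null;
    }
}
```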
Key Design Decisions
1. Power-of-2 Shard Count
This enables fast shard selection using bitwise AND instead of modulo:
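Side by side, the two forms (threadId is assumed non-negative, shardCount a power of 2):

```java
int viaModulo = (int) (threadId % shardCount);        // compiles to IDIV
int viaMask   = (int) (threadId & (shardCount - 1));  // single AND
```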
On x86-64, modulo by a variable divisor compiles to an IDIV instruction (~40 cycles), while the bitwise AND is a single AND instruction (~1 cycle).
2. Thread ID Hashing
Thread IDs in Java are sequential starting from 1. For most applications, consecutive threads will map to consecutive shards, providing good distribution. If you need better distribution (e.g., thread pool recycling), consider using a hash:
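One illustrative option is a 64-bit mixing step (the constant below is the golden-ratio multiplier used by several common mixers; any function with good avalanche behavior works):

```java
// Hash-based shard selection - spreads clustered thread IDs (illustrative).
static int shardFor(long threadId, int shardCount) {
    long h = threadId * 0x9E3779B97F4A7C15L;   // golden-ratio 64-bit constant
    h ^= (h >>> 32);                           // fold high bits into low bits
    return (int) (h & (shardCount - 1));       // power-of-2 mask as before
}
```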
3. Cache Line Padding
Each shard has padding fields (p01-p07, etc.) to ensure that hot fields of adjacent shards don't share cache lines:
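A sketch of the manual-padding idiom follows. The class hierarchy matters: the JVM may reorder fields within a class, but superclass fields are laid out before subclass fields, so inheritance pins the padding in place. (The JDK-internal @Contended annotation is an alternative, enabled with -XX:-RestrictContended.)

```java
// 7 longs = 56 bytes of padding on each side of the hot field, so head
// sits alone on its 64-byte cache line (illustrative names).
class PadBefore       { long p01, p02, p03, p04, p05, p06, p07; }
class HotField extends PadBefore { volatile long head; }
class PaddedCounter extends HotField { long p11, p12, p13, p14, p15, p16, p17; }
```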
Without padding, updates to Shard[0]'s head would invalidate Shard[1]'s cached data - defeating the purpose of sharding.
4. Per-Slot Sequence Numbers
The shard implementation uses the same per-slot sequence pattern we developed in earlier articles. This enables:
- Lock-free producer path (CAS only for slot claiming)
- Safe publication (sequence update after data write)
- Consumer doesn't need synchronization
5. Round-Robin Consumption
The consumer maintains state (consumerShardIndex) to remember where it left off. This ensures fairness - no shard gets starved even if another shard is very active.
Part 6: Technical Deep Dive - Why Sharding Works
Let's analyze exactly why sharding provides such dramatic performance improvements by examining the CPU-level behavior.
Memory Access Patterns
Single Shared Buffer:
With 64 cores, each write causes 63 cache line invalidations. The CPU's coherency protocol becomes the bottleneck.
Sharded Buffer:
With proper shard count (one per thread), cache lines never bounce. Each core operates on its own data in its own cache.
CAS Retry Analysis
Single Buffer with 64 Threads:
When all 64 threads CAS the same head variable:
- Expected successful CAS operations: 64 (one per item)
- Actual CAS operations (with retries): 64 + 63 + 62 + ... + 1 = 64 × 65 / 2 = 2,080
That's roughly 32x more atomic operations than necessary!
Sharded Buffer with 64 Shards:
Each thread CAS-es its own shard's head:
- Expected successful CAS operations: 64
- Actual CAS operations: 64 (no retries when one thread per shard)
Throughput Scaling Model
For a single shared resource with N threads:
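A simple closed form consistent with that description (a sketch - the constants are workload- and hardware-dependent):

```
Throughput(N) ≈ (N × SingleThreadThroughput) / (1 + ContentionFactor × (N − 1))
```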
Where ContentionFactor represents the overhead from contention (cache misses, retries, context switches).
For sharded resources:
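Under the same model, with one shard per thread the ContentionFactor term effectively vanishes:

```
Throughput(N) ≈ N × SingleThreadThroughput
```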
This is the fundamental difference. Shared resources follow Amdahl's Law - performance is limited by the serial portion (contention). Sharded resources follow Gustafson's Law - performance scales with parallelism.
Experimental Validation
I ran experiments comparing throughput vs. thread count (the full numbers appear in the throughput table in Part 7):
Notice:
- Single buffer performance actually decreases beyond 4 threads due to contention
- Sharded buffer scales nearly linearly up to 64 threads
- At 64 threads, sharding provides 51.57x better throughput
This is the power of eliminating contention. The sharded buffer lets all 64 cores work at full speed, while the single buffer serializes them through a chokepoint.
Part 7: Benchmarks and Results
Benchmark Setup
Hardware:
- CPU: AMD EPYC 7742 (64 cores, 128 threads, 2.25 GHz base)
- RAM: 512 GB DDR4-3200
- OS: Ubuntu 22.04, Linux 5.15
- JVM: OpenJDK 21, ZGC
Benchmark configuration:
- Buffer capacity: 1024 per shard
- Shard count: 64 (one per core)
- Producer threads: Variable (1-64)
- Consumer: Single thread, continuous drain
- Duration: 60 seconds per configuration
- Warmup: 30 seconds
Latency Results
Offer Latency (64 Producer Threads):
| Metric | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| Mean | 512ns | 27ns | 19.0x |
| p50 | 298ns | 22ns | 13.5x |
| p90 | 756ns | 38ns | 19.9x |
| p99 | 2,890ns | 67ns | 43.1x |
| p99.9 | 18,234ns | 134ns | 136.1x |
| p99.99 | 67,234ns | 287ns | 234.2x |
The tail latency improvement is dramatic. At p99.99, sharding is 234x better - turning 67 microsecond worst-case latency into sub-microsecond latency.
Throughput Results
| Producers | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| 1 | 2.1M/s | 2.0M/s | 0.95x |
| 4 | 3.1M/s | 7.8M/s | 2.5x |
| 8 | 3.0M/s | 15.2M/s | 5.1x |
| 16 | 2.8M/s | 29.8M/s | 10.6x |
| 32 | 2.5M/s | 57.1M/s | 22.8x |
| 64 | 2.1M/s | 108.3M/s | 51.6x |
At 64 threads, the sharded buffer achieves over 100 million operations per second. The single buffer is throttled to 2.1 million - no better than what it achieved with a single thread!
Latency Distribution
The sharded buffer's distribution is tightly clustered in the tens-of-nanoseconds range (p50 at 22ns, p90 at 38ns), while the single buffer has a long tail extending into tens of microseconds.
CPU Utilization
Single Buffer (64 threads):
- User CPU: 8%
- System CPU: 12%
- Idle: 80%
Sharded Buffer (64 threads):
- User CPU: 89%
- System CPU: 3%
- Idle: 8%
The single buffer wastes 80% of CPU capacity on contention overhead. The sharded buffer actually utilizes the available compute resources.
Cache Analysis (via perf)
The sharded buffer has:
- 89% fewer L1 cache misses
- 93% fewer LLC misses
- 13x fewer cycles per operation
This directly maps to the performance improvement - fewer cache misses means faster execution.
GC Behavior
Both implementations have similar allocation patterns (same data, same buffer sizes), so GC behavior is comparable. The key difference is that the sharded buffer doesn't create lock wait queue nodes, eliminating ~3MB/sec of allocation pressure that the locked implementation generates.
Part 8: Trade-offs and When to Use
When Per-Core Sharding Excels
1. High-Core-Count Servers
Modern servers have 32, 64, or even 128+ cores. Traditional synchronization patterns that worked fine on 4-8 core machines fall apart at this scale. Sharding is designed for this environment.
2. Many-Producer, Single-Consumer Patterns
Examples:
- Log aggregation (many app threads write logs, one thread flushes to disk)
- Metrics collection (many threads emit metrics, one thread aggregates)
- Event sourcing (many threads emit events, one thread persists)
The MPSC (Multi-Producer Single-Consumer) pattern is a natural fit for sharding.
3. Bursty Workloads
When work arrives in bursts (market open, flash sales, etc.), contention spikes dramatically. Sharding maintains consistent performance regardless of load pattern.
4. Latency-Sensitive Systems
For trading, gaming, or real-time systems where tail latency matters, the 234x improvement at p99.99 is transformative.
When to Avoid Sharding
1. Low Thread Counts
With 1-4 threads, the overhead of managing multiple shards may exceed the contention cost. The single buffer is simpler and nearly as fast.
2. Strict Ordering Requirements
Sharding relaxes FIFO ordering - items in different shards may be consumed out of arrival order. If strict ordering is required, sharding won't work without additional coordination (which reintroduces contention).
3. Memory-Constrained Environments
64 shards each with 1024 slots means 65,536 buffer slots instead of 1,024. For embedded systems or containers with tight memory limits, this overhead may be unacceptable.
4. Simple Applications
If you're not hitting performance limits, the added complexity of sharding isn't justified. Premature optimization is the root of all evil.
Choosing Shard Count
The optimal shard count depends on your workload:
Shard Count = Number of Producer Threads:
- Zero contention (ideal)
- Maximum memory usage
- Best for performance-critical paths
Shard Count = Number of Cores:
- Near-zero contention for most workloads
- Reasonable memory usage
- Good default choice
Shard Count = Number of NUMA Nodes:
- Minimizes cross-NUMA traffic
- Lower memory usage
- Good for memory-constrained systems
Monitoring Recommendations
Track these metrics in production:
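A sketch of what that instrumentation might look like (the metric names here are illustrative, not from the original system):

```java
import java.util.concurrent.atomic.LongAdder;

// Per-shard counters, cheap enough to leave on in production.
final class ShardMetrics {
    final LongAdder offers = new LongAdder();          // "shard.offers"   - load per shard
    final LongAdder fullRejections = new LongAdder();  // "shard.full"     - backpressure events
    final LongAdder casRetries = new LongAdder();      // "cas.retries"    - contention signal
    final LongAdder emptyPolls = new LongAdder();      // "consumer.empty" - consumer idle passes
}
```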
If cas.retries is consistently high, you may need more shards. If shard utilization is very uneven, your thread-to-shard mapping may need adjustment.
Part 9: Advanced Optimizations
Optimization 1: NUMA-Aware Shard Placement
On multi-socket systems, accessing memory from a remote NUMA node costs 2-3x more than local access. We can optimize by aligning shards with NUMA topology:
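Java has no portable NUMA API, so the sketch below assumes a thread's node affinity is established externally (e.g., numactl or a JNI binding to libnuma); everything here is illustrative:

```java
// Group shards by NUMA node; a thread only ever touches shards on its node.
int shardFor(long threadId, int numaNode, int numaNodes, int shardCount) {
    int shardsPerNode = shardCount / numaNodes;             // assumes divisibility
    int base = numaNode * shardsPerNode;                    // this node's shard range
    return base + (int) (threadId & (shardsPerNode - 1));   // power-of-2 per-node count
}
```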
This keeps threads accessing local memory, reducing cross-node traffic.
Optimization 2: Adaptive Shard Selection
If thread IDs cluster badly (e.g., all map to the same few shards), use adaptive selection:
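One illustrative approach is the classic "power of two choices": probe two candidate shards and pick the less loaded one (approxSize() is an assumed helper, not part of the earlier sketch):

```java
// Adaptive selection - trades a couple of counter reads for better balance.
int selectShard(long threadId) {
    int a = (int) (threadId & (shards.length - 1));  // primary candidate
    int b = (int) ((threadId * 0x9E3779B97F4A7C15L >>> 32) & (shards.length - 1)); // hashed alternate
    return shards[a].approxSize() <= shards[b].approxSize() ? a : b;
}
```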
This adds overhead (reading load counters) but ensures better distribution under pathological thread ID patterns.
Optimization 3: Batch Operations
For very high throughput, batch multiple items per shard access:
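On the consumer side, this can be as simple as draining up to a fixed batch from each shard before moving on (a sketch, with Shard and poll() as in the Part 5 sketch):

```java
import java.util.function.Consumer;

// Drain up to maxBatch items from one shard in a single visit.
<T> int drainBatch(Shard<T> shard, Consumer<T> sink, int maxBatch) {
    int drained = 0;
    T item;
    while (drained < maxBatch && (item = shard.poll()) != null) {
        sink.accept(item);   // hand off to processing
        drained++;
    }
    return drained;          // 0 means the shard was empty
}
```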
Batching amortizes the overhead of shard selection and cache-line access across multiple items.
Optimization 4: Work Stealing for Uneven Loads
If some shards fill faster than others, the consumer can use work stealing:
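With a single consumer, "stealing" amounts to biasing toward fuller shards when the round-robin pick comes up empty. A sketch, written as a method on the ShardedBuffer sketch above (approxSize() is again an assumed helper):

```java
// Fall back to the fullest shard instead of idling on an empty pick.
T pollWithStealing() {
    T item = shards[consumerShardIndex].poll();
    consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
    if (item != null) return item;

    int fullest = -1, fullestSize = 0;
    for (int i = 0; i < shards.length; i++) {
        int size = shards[i].approxSize();       // approximate is fine here
        if (size > fullestSize) { fullestSize = size; fullest = i; }
    }
    return fullest >= 0 ? shards[fullest].poll() : null;
}
```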
This ensures the consumer always works on available data, reducing idle time.
Part 10: Real-World Application
Let me share how we applied these concepts to solve our original problem - the transaction processing system that was failing at 3,000 TPS when we needed 60,000.
The Architecture Before
All 64 handler threads fought for the single buffer. Lock convoys, cache-line bouncing, and CAS retry storms killed our performance.
The Architecture After
Each handler gets its own shard. Zero contention, maximum parallelism.
The Results
We exceeded our 60,000 TPS target with room to spare. More importantly, our tail latencies dropped from milliseconds to microseconds, meeting our SLA requirements with margin.
Lessons Learned
1. Profile Before Optimizing
We could have guessed that "locks are slow" and tried many optimizations. Instead, we profiled and discovered exactly where time was going (73% in lock wait). This directed us to the right solution.
2. Understand Hardware
The fix wasn't algorithmic cleverness - it was understanding CPU caches, coherency protocols, and NUMA topology. Software engineering is hardware engineering at this level.
3. Measure After Optimizing
We validated every change with benchmarks. Some "optimizations" (like adding more spin iterations) actually made things worse. Data beats intuition.
4. Simple Solutions Often Best
Sharding isn't complex. It's essentially "have more buffers instead of one." The insight was recognizing that our problem was contention, not algorithm efficiency.
Part 11: Conclusion
That Thursday night crisis taught me something fundamental: at scale, coordination is the enemy of performance. We had 64 cores capable of processing 100 million operations per second, reduced to 3 million by a single lock.
The journey from 3,000 TPS to 62,000 TPS wasn't about clever algorithms or exotic data structures. It was about one key insight: eliminate contention by eliminating sharing.
Per-core sharding embodies this principle:
- Instead of one buffer, many buffers
- Instead of threads competing, threads cooperating (by staying out of each other's way)
- Instead of cache lines bouncing, cache lines staying put
The results speak for themselves:
- 20x throughput improvement
- 136x tail latency improvement
- 70x reduction in context switches
- Linear scalability with core count
But sharding isn't magic. It requires:
- Power-of-2 shard counts for efficient selection
- Careful cache-line padding to prevent false sharing
- Thread-to-shard affinity for stable assignment
- Single-consumer design for simple consumption
When you find your high-core-count system underperforming despite apparent CPU headroom, look for contention. Profile for lock wait time, cache misses, and context switches. If you find a hot lock or CAS variable being hammered by many threads, consider sharding.
The pattern applies beyond ring buffers:
- Connection pools can be sharded
- Statistics counters can be sharded (see LongAdder)
- Thread-local storage is extreme sharding
- Database sharding follows the same principle
Remember: the fastest synchronization is no synchronization. The best lock is no lock. When you can partition your problem so threads never need to coordinate, you unlock the full parallelism potential of modern hardware.
And remember - measure, understand, optimize. In that order.
