Lock-Free in Java: Scenario 07 - Per-Core Sharded Processing
Part 1: The 2AM Crisis That Changed Everything
Sharding is what you reach for when contention is not an implementation detail anymore, but the defining cost of the system.
A single shared buffer can look reasonable in design diagrams and still fail badly on modern multi-core hardware. Once dozens of threads fight over one coordination point, throughput stops scaling, cache lines bounce between cores, and the scheduler starts doing more work than the application.
Per-core sharding addresses that by changing the question entirely. Instead of asking how to coordinate many threads around one hot structure, it asks how little coordination the workload can tolerate. This article walks through that shift: the baseline shared buffer, the per-core alternative, and the practical trade-offs that appear once you isolate work by shard instead of centralizing it.
Part 2: Why Contention Kills Performance
Before diving into the solution, we need to understand the problem at a fundamental level. Contention isn't just "threads waiting" - it's a cascade of performance-destroying effects that compound under load.
The Anatomy of Lock Contention
When multiple threads compete for a shared resource protected by a lock, several things happen:
1. Thread Parking and Context Switches
When a thread tries to acquire a lock that another thread holds, it eventually "parks" - tells the operating system to stop scheduling it until the lock is available. This parking operation isn't free:
The actual lock operation (compare-and-swap) takes about 50 nanoseconds. But the context switch overhead is 2,000-6,000 nanoseconds - 40-120x more expensive than the operation we're trying to protect.
2. Lock Convoy Effect
Here's where it gets really insidious. Consider four threads all trying to acquire the same lock. Thread-1 holds it while Threads 2-4 park; each release wakes exactly one waiter, which runs briefly, and the cycle repeats.
Each thread pays the full context switch penalty. They form a "convoy" - processing sequentially with the worst possible overhead. Four threads that should provide 4x parallelism instead deliver less throughput than a single thread, because they spend most of their time in scheduling overhead.
3. Cache Line Bouncing
Modern CPUs have per-core caches organized in a hierarchy (L1, L2, L3). When data is modified, the cache line containing that data must be invalidated in all other cores. This invalidation is coordinated by the cache-coherency protocol, typically MESI (Modified, Exclusive, Shared, Invalid).
A lock's internal state (whether it's held, by whom, who's waiting) lives in memory. When Thread-1 acquires the lock, it modifies this state. That modification invalidates the cache line in all other cores. When Thread-2 tries to acquire the lock, it must fetch the cache line from Thread-1's cache (or main memory) - an operation costing 40-100+ nanoseconds depending on CPU topology.
Each bounce costs 40-100ns. With 64 cores competing, the lock state might bounce dozens of times per lock acquisition. The CPU's interconnect becomes saturated with coherency traffic rather than useful work.
The CAS Retry Storm
Lock-free algorithms use Compare-And-Swap (CAS) operations instead of locks. But they're not immune to contention:
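The canonical shape is a read-modify-CAS loop; here is a minimal sketch of a lock-free counter increment:

```java
import java.util.concurrent.atomic.AtomicLong;

final AtomicLong counter = new AtomicLong();

void increment() {
    for (;;) {
        long current = counter.get();          // read
        long next = current + 1;               // modify
        if (counter.compareAndSet(current, next)) {
            return;                            // success - progress made
        }
        // CAS failed: another thread won the race; retry with a fresh read
    }
}
```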
Under low contention, this works beautifully - one CAS, done. Under high contention, something nasty happens: all N threads read the same value, all attempt the CAS, exactly one succeeds, and the remaining N-1 loop and retry, wave after wave.
This is a CAS retry storm. Instead of O(N) operations for N increments, we get O(N^2). Worse, each failed CAS still bounces cache lines, generating massive memory traffic with zero progress.
Quantifying the Damage
In our transaction processing system, I measured the following before we fixed the problem:
The system was spending 73% of its cycles on locking overhead and only 27% on actual work. The 8% CPU utilization showed that most cores were parked, waiting. The L3 cache miss rate of 38% indicated massive cache-line bouncing.
This is what contention looks like at scale. And the only way to fix it is to eliminate the contention itself.
Part 3: The Insight - Eliminating Contention Through Sharding
The night after our emergency meeting, I couldn't sleep. I kept turning the problem over in my mind. We had tried:
- Reducing lock hold time (already minimized)
- Using StampedLock for optimistic reads (didn't help - we were write-heavy)
- Using lock striping (helped somewhat, but not enough)
- Going lock-free with CAS (CAS retry storms under high contention)
None of these approaches addressed the fundamental issue: all threads were fighting over the same resource. Whether that resource was protected by a lock, a CAS variable, or anything else, the contention remained.
Then it hit me. What if we didn't have one shared resource? What if we had 64 shared resources - one per core?
The idea was simple: instead of a single buffer that all threads write to, create multiple buffers. Assign each thread to a specific buffer based on some deterministic mapping. Now threads assigned to different shards never compete with each other.
Before: Single Shared Buffer (contention!)
After: Per-Core Sharded Buffers (zero contention!)
This is sharding - partitioning a shared resource into independent pieces that can be accessed without coordination. It's the same principle that makes distributed databases scale: instead of one big lock, many small locks (or better, no locks at all).
The Key Insight: Thread Affinity
For sharding to work well, we need stable thread-to-shard assignments. If threads randomly pick shards, we're back to contention. The insight is to use the thread's identity to deterministically select a shard.
In Java, every thread has a unique ID accessible via Thread.currentThread().getId(). This ID is stable for the thread's lifetime. We can use it to map threads to shards:
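A minimal sketch of that mapping (shardCount is assumed to be a power of 2):

```java
// Deterministic thread-to-shard mapping: same thread, same shard, every time.
int shardIndex = (int) (Thread.currentThread().getId() & (shardCount - 1));
```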
The bitwise AND with (shardCount - 1) works because shardCount is a power of 2 (we'll ensure this). It's equivalent to modulo but much faster.
This mapping ensures:
- The same thread always writes to the same shard
- Different threads write to potentially different shards
- The mapping is O(1) with no memory access required
Choosing the Shard Count
How many shards should we have? There's a trade-off:
Too few shards: Some shards will be shared by multiple threads, reintroducing contention.
Too many shards: Memory overhead increases, and the consumer has more work round-robining between shards.
The sweet spot is typically:
- Minimum: Number of cores (so each core can have its own shard)
- Maximum: Number of producer threads (so each thread has its own shard)
For a 64-core machine with 64 producer threads, 64 shards is ideal. Each thread gets exclusive access to its shard - zero contention.
But what if we have 64 threads and only 8 shards? Then on average, 8 threads share each shard. Contention is reduced 8x compared to a single buffer, but not eliminated. We call this the "contention factor."
Round-Robin Consumption
With multiple shards, the consumer must check all of them. A simple approach is round-robin:
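A sketch of what that loop can look like (shards, poll(), and the generic type T are illustrative names, not the original source):

```java
// Round-robin over shards; returns null if every shard is empty this pass.
private int consumerShardIndex = 0;

T pollNext() {
    for (int i = 0; i < shards.length; i++) {
        T item = shards[consumerShardIndex].poll();   // non-blocking poll
        consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
        if (item != null) {
            return item;
        }
    }
    return null;  // all shards empty
}
```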
This has two nice properties:
- Fairness: No shard gets starved
- Batching opportunity: The consumer can drain multiple items from a shard before moving on
The trade-off is that consumption isn't strictly FIFO across the entire system - items might be consumed out of arrival order if they landed in different shards. For most systems, this is acceptable.
Part 4: The Single Shared Buffer - Our Baseline
Before building the sharded solution, let's examine exactly what we're replacing. Understanding the baseline in detail reveals the specific costs we're trying to eliminate.
Implementation
Here's the single shared buffer implementation we started with:
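What follows is a representative sketch rather than the production listing; the names are illustrative, but the structure matches the analysis below:

```java
import java.util.concurrent.locks.ReentrantLock;

public final class SharedBuffer<T> {
    private final Object[] items;
    private final ReentrantLock lock = new ReentrantLock();
    private int head;   // next slot to write
    private int tail;   // next slot to read
    private int count;  // number of occupied slots

    public SharedBuffer(int capacity) {
        this.items = new Object[capacity];
    }

    public boolean offer(T item) {
        lock.lock();
        try {
            if (count == items.length) {
                return false;                   // buffer full
            }
            items[head] = item;
            head = (head + 1) % items.length;   // FIFO ring
            count++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        lock.lock();
        try {
            if (count == 0) {
                return null;                    // buffer empty
            }
            T item = (T) items[tail];
            items[tail] = null;                 // help the GC
            tail = (tail + 1) % items.length;
            count--;
            return item;
        } finally {
            lock.unlock();
        }
    }
}
```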
Analysis
This implementation is correct. It provides thread safety through a ReentrantLock, prevents buffer overflows, maintains FIFO ordering, and helps the garbage collector by nulling consumed slots.
But let's trace through what happens when 4 threads call offer() simultaneously: one thread does ~50ns of useful work inside the lock while each of the other three burns thousands of nanoseconds parking and context switching.
That works out to roughly 3.4% efficiency. The other 96.6% was overhead.
Memory Layout Analysis
Let's examine the object layout using JOL (Java Object Layout):
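A dump like the one discussed below can be produced with JOL roughly like this (the exact output varies by JVM and version):

```java
// Requires the org.openjdk.jol:jol-core dependency.
import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {
    public static void main(String[] args) {
        // Prints field offsets and padding for the buffer class.
        System.out.println(ClassLayout.parseClass(SharedBuffer.class).toPrintable());
    }
}
```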
Notice that head, tail, count, and lock are all within 32 bytes of each other - they fit on a single 64-byte cache line. This means:
- When a producer updates head, it invalidates the consumer's cached tail
- When the consumer updates tail, it invalidates the producer's cached head
- Every lock acquisition invalidates the cache lines for all competing threads
This is false sharing at its worst. Fields that should be independent are sharing cache lines, causing unnecessary invalidation traffic.
Benchmark Results
Using JMH with 64 producer threads on a 64-core machine, the single buffer's median offer() latency came out at 298ns - acceptable on its own. But look at the tail:
- p99: 2.8 microseconds (10x median)
- p99.9: 18 microseconds (60x median)
- p99.99: 67 microseconds (225x median)
This tail latency is the lock convoy effect manifesting. Some unlucky threads wait through multiple context switch cycles before they can proceed.
Part 5: The Per-Core Sharded Buffer - Our Solution
Now let's build the sharded solution. The design goals are:
- Eliminate contention between producer threads
- Maintain lock-free (or very-low-contention) access paths
- Keep the implementation simple and debuggable
- Provide O(1) shard selection
Architecture
Implementation
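The production listing is longer than fits here; the following is a condensed sketch of the structure the rest of this part discusses - per-slot sequence numbers, a CAS-claimed head per shard, a consumer-only tail, and power-of-2 masks throughout. Cache-line padding is omitted for brevity (see decision 3 below), and all names are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

public final class ShardedBuffer<T> {

    // One shard: a small ring buffer with per-slot sequence numbers.
    // capacityPerShard is assumed to be a power of 2.
    static final class Shard<T> {
        final AtomicReferenceArray<T> slots;
        final AtomicLongArray sequences;            // per-slot publication sequence
        final AtomicLong head = new AtomicLong();   // producer claim counter
        long tail;                                  // consumer-only; no sync needed
        final int mask;

        Shard(int capacity) {
            slots = new AtomicReferenceArray<>(capacity);
            sequences = new AtomicLongArray(capacity);
            for (int i = 0; i < capacity; i++) sequences.set(i, i);
            mask = capacity - 1;
        }

        boolean offer(T item) {
            for (;;) {
                long h = head.get();
                int idx = (int) (h & mask);
                long seq = sequences.get(idx);
                if (seq == h) {
                    if (head.compareAndSet(h, h + 1)) {  // claim the slot
                        slots.set(idx, item);            // write data
                        sequences.set(idx, h + 1);       // publish to the consumer
                        return true;
                    }
                } else if (seq < h) {
                    return false;                        // shard full
                }
                // seq > h: our head was stale; loop and re-read
            }
        }

        T poll() {
            int idx = (int) (tail & mask);
            if (sequences.get(idx) != tail + 1) return null;  // nothing published yet
            T item = slots.get(idx);
            slots.set(idx, null);                   // help the GC
            sequences.set(idx, tail + mask + 1);    // recycle slot for the next lap
            tail++;
            return item;
        }
    }

    private final Shard<T>[] shards;
    private int consumerShardIndex;                 // consumer-only state

    @SuppressWarnings("unchecked")
    public ShardedBuffer(int shardCount, int capacityPerShard) {
        if (Integer.bitCount(shardCount) != 1)
            throw new IllegalArgumentException("shardCount must be a power of 2");
        shards = new Shard[shardCount];
        for (int i = 0; i < shardCount; i++) shards[i] = new Shard<>(capacityPerShard);
    }

    // O(1) shard selection: the same thread always lands on the same shard.
    public boolean offer(T item) {
        int idx = (int) (Thread.currentThread().getId() & (shards.length - 1));
        return shards[idx].offer(item);
    }

    // Round-robin drain across shards; null if everything is empty this pass.
    public T poll() {
        for (int i = 0; i < shards.length; i++) {
            T item = shards[consumerShardIndex].poll();
            consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
            if (item != null) return item;
        }
        return null;
    }
}
```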
Key Design Decisions
1. Power-of-2 Shard Count
This enables fast shard selection using bitwise AND instead of modulo:
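Side by side, the two forms (threadId is assumed non-negative, shardCount a power of 2):

```java
int viaModulo = (int) (threadId % shardCount);        // compiles to IDIV
int viaMask   = (int) (threadId & (shardCount - 1));  // single AND
```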
On x86-64, modulo by a variable divisor compiles to an IDIV instruction (~40 cycles), while the bitwise AND is a single AND instruction (~1 cycle).
2. Thread ID Hashing
Thread IDs in Java are sequential starting from 1. For most applications, consecutive threads will map to consecutive shards, providing good distribution. If you need better distribution (e.g., thread pool recycling), consider using a hash:
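One illustrative option is a 64-bit mixing step (the constant below is the golden-ratio multiplier used by several common mixers; any function with good avalanche behavior works):

```java
// Hash-based shard selection - spreads clustered thread IDs (illustrative).
static int shardFor(long threadId, int shardCount) {
    long h = threadId * 0x9E3779B97F4A7C15L;   // golden-ratio 64-bit constant
    h ^= (h >>> 32);                           // fold high bits into low bits
    return (int) (h & (shardCount - 1));       // power-of-2 mask as before
}
```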
3. Cache Line Padding
Each shard has padding fields (p01-p07, etc.) to ensure that hot fields of adjacent shards don't share cache lines:
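A sketch of the manual-padding idiom follows. The class hierarchy matters: the JVM may reorder fields within a class, but superclass fields are laid out before subclass fields, so inheritance pins the padding in place. (The JDK-internal @Contended annotation is an alternative, enabled with -XX:-RestrictContended.)

```java
// 7 longs = 56 bytes of padding on each side of the hot field, so head
// sits alone on its 64-byte cache line (illustrative names).
class PadBefore       { long p01, p02, p03, p04, p05, p06, p07; }
class HotField extends PadBefore { volatile long head; }
class PaddedCounter extends HotField { long p11, p12, p13, p14, p15, p16, p17; }
```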
Without padding, updates to Shard[0]'s head would invalidate Shard[1]'s cached data - defeating the purpose of sharding.
4. Per-Slot Sequence Numbers
The shard implementation uses the same per-slot sequence pattern we developed in earlier articles. This enables:
- Lock-free producer path (CAS only for slot claiming)
- Safe publication (sequence update after data write)
- Consumer doesn't need synchronization
5. Round-Robin Consumption
The consumer maintains state (consumerShardIndex) to remember where it left off. This ensures fairness - no shard gets starved even if another shard is very active.
Part 6: Technical Deep Dive - Why Sharding Works
Let's analyze exactly why sharding provides such dramatic performance improvements by examining the CPU-level behavior.
Memory Access Patterns
Single Shared Buffer:
With 64 cores, each write causes 63 cache line invalidations. The CPU's coherency protocol becomes the bottleneck.
Sharded Buffer:
With proper shard count (one per thread), cache lines never bounce. Each core operates on its own data in its own cache.
CAS Retry Analysis
Single Buffer with 64 Threads:
When all 64 threads CAS the same head variable:
- Expected successful CAS operations: 64 (one per item)
- Actual CAS operations (with retries): 64 + 63 + 62 + ... + 1 = 64 × 65 / 2 = 2,080
That's roughly 32x more atomic operations than necessary!
Sharded Buffer with 64 Shards:
Each thread CAS-es its own shard's head:
- Expected successful CAS operations: 64
- Actual CAS operations: 64 (no retries when one thread per shard)
Throughput Scaling Model
For a single shared resource with N threads:
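A simple closed form consistent with that description (a sketch - the constants are workload- and hardware-dependent):

```
Throughput(N) ≈ (N × SingleThreadThroughput) / (1 + ContentionFactor × (N − 1))
```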
Where ContentionFactor represents the overhead from contention (cache misses, retries, context switches).
For sharded resources:
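Under the same model, with one shard per thread the ContentionFactor term effectively vanishes:

```
Throughput(N) ≈ N × SingleThreadThroughput
```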
This is the fundamental difference. Shared resources follow Amdahl's Law - performance is limited by the serial portion (contention). Sharded resources follow Gustafson's Law - performance scales with parallelism.
Experimental Validation
I ran experiments comparing throughput vs. thread count (the full numbers appear in the throughput table in Part 7):
Notice:
- Single buffer performance actually decreases beyond 4 threads due to contention
- Sharded buffer scales nearly linearly up to 64 threads
- At 64 threads, sharding provides 51.57x better throughput
This is the power of eliminating contention. The sharded buffer lets all 64 cores work at full speed, while the single buffer serializes them through a chokepoint.
Part 7: Benchmarks and Results
Benchmark Setup
Hardware:
- CPU: AMD EPYC 7742 (64 cores, 128 threads, 2.25 GHz base)
- RAM: 512 GB DDR4-3200
- OS: Ubuntu 22.04, Linux 5.15
- JVM: OpenJDK 21, ZGC
Benchmark configuration:
- Buffer capacity: 1024 per shard
- Shard count: 64 (one per core)
- Producer threads: Variable (1-64)
- Consumer: Single thread, continuous drain
- Duration: 60 seconds per configuration
- Warmup: 30 seconds
Latency Results
Offer Latency (64 Producer Threads):
| Metric | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| Mean | 512ns | 27ns | 19.0x |
| p50 | 298ns | 22ns | 13.5x |
| p90 | 756ns | 38ns | 19.9x |
| p99 | 2,890ns | 67ns | 43.1x |
| p99.9 | 18,234ns | 134ns | 136.1x |
| p99.99 | 67,234ns | 287ns | 234.2x |
The tail latency improvement is dramatic. At p99.99, sharding is 234x better - turning 67 microsecond worst-case latency into sub-microsecond latency.
Throughput Results
| Producers | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| 1 | 2.1M/s | 2.0M/s | 0.95x |
| 4 | 3.1M/s | 7.8M/s | 2.5x |
| 8 | 3.0M/s | 15.2M/s | 5.1x |
| 16 | 2.8M/s | 29.8M/s | 10.6x |
| 32 | 2.5M/s | 57.1M/s | 22.8x |
| 64 | 2.1M/s | 108.3M/s | 51.6x |
At 64 threads, the sharded buffer achieves over 100 million operations per second. The single buffer is throttled to 2.1 million - no better than what it achieved with a single thread!
Latency Distribution
The sharded buffer's distribution is tightly clustered in the tens-of-nanoseconds range (p50 at 22ns, p90 at 38ns), while the single buffer has a long tail extending into tens of microseconds.
CPU Utilization
Single Buffer (64 threads):
- User CPU: 8%
- System CPU: 12%
- Idle: 80%
Sharded Buffer (64 threads):
- User CPU: 89%
- System CPU: 3%
- Idle: 8%
The single buffer wastes 80% of CPU capacity on contention overhead. The sharded buffer actually utilizes the available compute resources.
Cache Analysis (via perf)
The sharded buffer has:
- 89% fewer L1 cache misses
- 93% fewer LLC misses
- 13x fewer cycles per operation
This directly maps to the performance improvement - fewer cache misses means faster execution.
GC Behavior
Both implementations have similar allocation patterns (same data, same buffer sizes), so GC behavior is comparable. The key difference is that the sharded buffer doesn't create lock wait queue nodes, eliminating ~3MB/sec of allocation pressure that the locked implementation generates.
Part 8: Trade-offs and When to Use
When Per-Core Sharding Excels
1. High-Core-Count Servers
Modern servers have 32, 64, or even 128+ cores. Traditional synchronization patterns that worked fine on 4-8 core machines fall apart at this scale. Sharding is designed for this environment.
2. Many-Producer, Single-Consumer Patterns
Examples:
- Log aggregation (many app threads write logs, one thread flushes to disk)
- Metrics collection (many threads emit metrics, one thread aggregates)
- Event sourcing (many threads emit events, one thread persists)
The MPSC (Multi-Producer Single-Consumer) pattern is a natural fit for sharding.
3. Bursty Workloads
When work arrives in bursts (market open, flash sales, etc.), contention spikes dramatically. Sharding maintains consistent performance regardless of load pattern.
4. Latency-Sensitive Systems
For trading, gaming, or real-time systems where tail latency matters, the 234x improvement at p99.99 is transformative.
When to Avoid Sharding
1. Low Thread Counts
With 1-4 threads, the overhead of managing multiple shards may exceed the contention cost. The single buffer is simpler and nearly as fast.
2. Strict Ordering Requirements
Sharding relaxes FIFO ordering - items in different shards may be consumed out of arrival order. If strict ordering is required, sharding won't work without additional coordination (which reintroduces contention).
3. Memory-Constrained Environments
64 shards each with 1024 slots means 65,536 buffer slots instead of 1,024. For embedded systems or containers with tight memory limits, this overhead may be unacceptable.
4. Simple Applications
If you're not hitting performance limits, the added complexity of sharding isn't justified. Premature optimization is the root of all evil.
Choosing Shard Count
The optimal shard count depends on your workload:
Shard Count = Number of Producer Threads:
- Zero contention (ideal)
- Maximum memory usage
- Best for performance-critical paths
Shard Count = Number of Cores:
- Near-zero contention for most workloads
- Reasonable memory usage
- Good default choice
Shard Count = Number of NUMA Nodes:
- Minimizes cross-NUMA traffic
- Lower memory usage
- Good for memory-constrained systems
Monitoring Recommendations
Track these metrics in production:
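A sketch of what that instrumentation might look like (the metric names here are illustrative, not from the original system):

```java
import java.util.concurrent.atomic.LongAdder;

// Per-shard counters, cheap enough to leave on in production.
final class ShardMetrics {
    final LongAdder offers = new LongAdder();          // "shard.offers"   - load per shard
    final LongAdder fullRejections = new LongAdder();  // "shard.full"     - backpressure events
    final LongAdder casRetries = new LongAdder();      // "cas.retries"    - contention signal
    final LongAdder emptyPolls = new LongAdder();      // "consumer.empty" - consumer idle passes
}
```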
If cas.retries is consistently high, you may need more shards. If shard utilization is very uneven, your thread-to-shard mapping may need adjustment.
Part 9: Advanced Optimizations
Optimization 1: NUMA-Aware Shard Placement
On multi-socket systems, accessing memory from a remote NUMA node costs 2-3x more than local access. We can optimize by aligning shards with NUMA topology:
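Java has no portable NUMA API, so the sketch below assumes a thread's node affinity is established externally (e.g., numactl or a JNI binding to libnuma); everything here is illustrative:

```java
// Group shards by NUMA node; a thread only ever touches shards on its node.
int shardFor(long threadId, int numaNode, int numaNodes, int shardCount) {
    int shardsPerNode = shardCount / numaNodes;             // assumes divisibility
    int base = numaNode * shardsPerNode;                    // this node's shard range
    return base + (int) (threadId & (shardsPerNode - 1));   // power-of-2 per-node count
}
```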
This keeps threads accessing local memory, reducing cross-node traffic.
Optimization 2: Adaptive Shard Selection
If thread IDs cluster badly (e.g., all map to the same few shards), use adaptive selection:
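One illustrative approach is the classic "power of two choices": probe two candidate shards and pick the less loaded one (approxSize() is an assumed helper, not part of the earlier sketch):

```java
// Adaptive selection - trades a couple of counter reads for better balance.
int selectShard(long threadId) {
    int a = (int) (threadId & (shards.length - 1));  // primary candidate
    int b = (int) ((threadId * 0x9E3779B97F4A7C15L >>> 32) & (shards.length - 1)); // hashed alternate
    return shards[a].approxSize() <= shards[b].approxSize() ? a : b;
}
```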
This adds overhead (reading load counters) but ensures better distribution under pathological thread ID patterns.
Optimization 3: Batch Operations
For very high throughput, batch multiple items per shard access:
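On the consumer side, this can be as simple as draining up to a fixed batch from each shard before moving on (a sketch, with Shard and poll() as in the Part 5 sketch):

```java
import java.util.function.Consumer;

// Drain up to maxBatch items from one shard in a single visit.
<T> int drainBatch(Shard<T> shard, Consumer<T> sink, int maxBatch) {
    int drained = 0;
    T item;
    while (drained < maxBatch && (item = shard.poll()) != null) {
        sink.accept(item);   // hand off to processing
        drained++;
    }
    return drained;          // 0 means the shard was empty
}
```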
Batching amortizes the overhead of shard selection and cache-line access across multiple items.
Optimization 4: Work Stealing for Uneven Loads
If some shards fill faster than others, the consumer can use work stealing:
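With a single consumer, "stealing" amounts to biasing toward fuller shards when the round-robin pick comes up empty. A sketch, written as a method on the ShardedBuffer sketch above (approxSize() is again an assumed helper):

```java
// Fall back to the fullest shard instead of idling on an empty pick.
T pollWithStealing() {
    T item = shards[consumerShardIndex].poll();
    consumerShardIndex = (consumerShardIndex + 1) & (shards.length - 1);
    if (item != null) return item;

    int fullest = -1, fullestSize = 0;
    for (int i = 0; i < shards.length; i++) {
        int size = shards[i].approxSize();       // approximate is fine here
        if (size > fullestSize) { fullestSize = size; fullest = i; }
    }
    return fullest >= 0 ? shards[fullest].poll() : null;
}
```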
This ensures the consumer always works on available data, reducing idle time.
Part 10: Real-World Application
Let me share how we applied these concepts to solve our original problem - the transaction processing system that was failing at 3,000 TPS when we needed 60,000.
The Architecture Before
All 64 handler threads fought for the single buffer. Lock convoys, cache-line bouncing, and CAS retry storms killed our performance.
The Architecture After
Each handler gets its own shard. Zero contention, maximum parallelism.
The Results
We exceeded our 60,000 TPS target with room to spare. More importantly, our tail latencies dropped from milliseconds to microseconds, meeting our SLA requirements with margin.
Lessons Learned
1. Profile Before Optimizing
We could have guessed that "locks are slow" and tried many optimizations. Instead, we profiled and discovered exactly where time was going (73% in lock wait). This directed us to the right solution.
2. Understand Hardware
The fix wasn't algorithmic cleverness - it was understanding CPU caches, coherency protocols, and NUMA topology. Software engineering is hardware engineering at this level.
3. Measure After Optimizing
We validated every change with benchmarks. Some "optimizations" (like adding more spin iterations) actually made things worse. Data beats intuition.
4. Simple Solutions Often Best
Sharding isn't complex. It's essentially "have more buffers instead of one." The insight was recognizing that our problem was contention, not algorithm efficiency.
Part 11: Conclusion
That Thursday night crisis taught me something fundamental: at scale, coordination is the enemy of performance. We had 64 cores capable of processing 100 million operations per second, reduced to 3 million by a single lock.
The journey from 3,000 TPS to 62,000 TPS wasn't about clever algorithms or exotic data structures. It was about one key insight: eliminate contention by eliminating sharing.
Per-core sharding embodies this principle:
- Instead of one buffer, many buffers
- Instead of threads competing, threads cooperating (by staying out of each other's way)
- Instead of cache lines bouncing, cache lines staying put
The results speak for themselves:
- 20x throughput improvement
- 136x tail latency improvement
- 70x reduction in context switches
- Linear scalability with core count
But sharding isn't magic. It requires:
- Power-of-2 shard counts for efficient selection
- Careful cache-line padding to prevent false sharing
- Thread-to-shard affinity for stable assignment
- Single-consumer design for simple consumption
When you find your high-core-count system underperforming despite apparent CPU headroom, look for contention. Profile for lock wait time, cache misses, and context switches. If you find a hot lock or CAS variable being hammered by many threads, consider sharding.
The pattern applies beyond ring buffers:
- Connection pools can be sharded
- Statistics counters can be sharded (see LongAdder)
- Thread-local storage is extreme sharding
- Database sharding follows the same principle
Remember: the fastest synchronization is no synchronization. The best lock is no lock. When you can partition your problem so threads never need to coordinate, you unlock the full parallelism potential of modern hardware.
And remember - measure, understand, optimize. In that order.
