
Measuring Ethereum's Execution Limits: The Gas Benchmarking Framework

Ethereum

November 6, 2025


Intro

The shared methodology Ethereum now uses to measure readiness, not speed, as the network scales.

What began as a Nethermind experiment under a Worldchain grant has become a shared standard across clients and Ethereum Foundation research groups. The Gas Benchmarks suite fills blocks with a single repeated opcode or precompile to push execution to its computational limits, measuring throughput in mega gas per second (MGas/s). It answers the question that defines Ethereum’s scalability: how much more load can the network safely handle?

These benchmarks are not competitive leaderboards. They are a shared reproducibility framework to validate whether every client can handle heavier blocks under identical conditions.

As Ethereum moves toward raising the gas limit to 60 million, that question now guides every conversation among researchers and client teams. The benchmark data distinguishes between aspiration and readiness, indicating whether all clients can process heavier blocks safely.

The tool’s origin traces back to an internal performance experiment led by Nethermind engineers Marcin, Marek, and Marcos. Their early findings revealed that Ethereum lacked a reproducible method for comparing client performance. The Ethereum Foundation quickly recognized its potential, turning a local test harness into a community-wide framework for measuring readiness.

What began as one client team’s experiment now underpins Ethereum’s process for safe scaling. The methodology remains simple: consistent test conditions and coordinated validation before protocol changes.

How the Benchmarks Work

Figure 1: Sample Scenario Comparing All Execution Layer Clients' Performance

Each test constructs a block that isolates one operation. Instead of a typical transaction mix, the block repeats a single opcode or precompile call thousands of times, creating an artificial but revealing stress test.

The suite runs in Docker using bash scripts triggered via CI, allowing any team to reproduce results locally. All benchmarks run on identical hardware to eliminate environmental differences. After a bottleneck is identified and the client is optimized, the same suite reruns on the same machine for a clean before-and-after comparison. Results are never compared across hardware or configurations. All numbers are relative within a single setup and are used only to measure before-and-after effects.
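The metric itself is simple enough to sketch. The snippet below uses illustrative figures, not real measurements, to show how MGas/s is computed and why results are only meaningful as before-and-after comparisons on the same setup:

```python
# Illustrative sketch of the core metric: throughput in mega gas per
# second (MGas/s). The numbers are made up; only the relative,
# same-hardware comparison matters, never cross-setup comparisons.

def mgas_per_second(gas_used: int, seconds: float) -> float:
    """Throughput for one benchmark block: gas processed per second, in millions."""
    return gas_used / seconds / 1_000_000

# The same 36M-gas block on the same machine, before and after an optimization.
before = mgas_per_second(36_000_000, seconds=0.9)  # ~40 MGas/s
after = mgas_per_second(36_000_000, seconds=0.6)   # ~60 MGas/s

# Rerunning the identical suite on the identical machine gives a clean
# before-and-after ratio, free of environmental differences.
speedup = after / before
print(f"{before:.1f} -> {after:.1f} MGas/s ({speedup:.2f}x)")
```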

The framework started with custom test definitions; however, it has been reworked to reuse the Ethereum Execution Specs (EELS), aligning gas benchmarking with the Ethereum Foundation’s broader testing infrastructure. This integration ensures that test definitions are stored in one location (EELS) and can be reused by all clients and EF testing pipelines.

Since Ethereum has a diverse client base, language differences must also be considered. Clients written in C# or Java benefit from warming up the underlying runtime before measurement. To accommodate this, each run begins with warmup cycles that stabilize CPU caches, JIT compilation, and I/O layers, keeping test runs consistent. Ten repeated runs of the same client should produce almost identical results.
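The warmup-then-measure pattern can be sketched as follows; the workload, run counts, and variance threshold are stand-ins, not the framework's actual values:

```python
# Sketch of warmup-then-measure: discard the first few runs so CPU
# caches and JIT compilation stabilize, then check that the measured
# runs are almost identical. The workload is a placeholder, not a real
# client executing a block.
import statistics
import time

def run_block() -> float:
    """Stand-in for executing one benchmark block; returns seconds elapsed."""
    start = time.perf_counter()
    sum(i * i for i in range(200_000))  # placeholder work
    return time.perf_counter() - start

WARMUP_RUNS, MEASURED_RUNS = 3, 10

for _ in range(WARMUP_RUNS):   # stabilize caches / JIT / I/O; results discarded
    run_block()

timings = [run_block() for _ in range(MEASURED_RUNS)]

# Repeated runs of the same client should be nearly identical; a large
# relative spread signals an unstable environment, not a real regression.
spread = statistics.stdev(timings) / statistics.mean(timings)
print(f"mean={statistics.mean(timings) * 1e3:.2f} ms, relative stdev={spread:.1%}")
```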

The ModExp Turning Point

The repricing of the ModExp precompile was the first decisive outcome of gas benchmarking: the first time benchmark data directly triggered both a protocol-level repricing proposal and coordinated cross-client optimizations.

Early tests revealed that specific edge-case ModExp scenarios were significantly slower than their gas cost suggested, exposing a bottleneck shared by all major clients.

These included:

  • The known Guido4Even bottleneck case
  • Cases with small base and modulo (8 bytes) but large exponent (648 bits)
  • Cases with large base and modulo (192 bytes) but small exponent (3 bits)

These scenarios filled blocks with thousands of repeated modular exponentiation calls using large operands and varying exponent sizes. The patterns stressed big-integer arithmetic, memory reuse, and caching strategies inside each client. Execution time in these blocks grew faster than the predicted gas costs, revealing that ModExp’s pricing underestimated its actual computational weight under worst-case conditions.
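To see why these cases were underpriced, the pre-Fusaka ModExp pricing (EIP-2565) can be transcribed as a sketch. Here `exp_head` stands for the exponent's leading 32 bytes interpreted as an integer; the two calls mirror the edge cases listed above:

```python
# Sketch of EIP-2565 ModExp gas pricing (the formula in force before the
# EIP-7883 repricing). Simplified transcription, not consensus code.
import math

def modexp_gas_eip2565(base_len: int, mod_len: int, exp_len: int, exp_head: int) -> int:
    # Multiplication cost scales with the square of the operand word count.
    words = math.ceil(max(base_len, mod_len) / 8)
    multiplication_complexity = words ** 2
    # Iteration count tracks the exponent's bit length; for exponents
    # longer than 32 bytes, only the leading 32 bytes contribute bits.
    if exp_len <= 32:
        iteration_count = max(exp_head.bit_length() - 1, 0)
    else:
        iteration_count = 8 * (exp_len - 32) + max(exp_head.bit_length() - 1, 0)
    iteration_count = max(iteration_count, 1)
    return max(200, multiplication_complexity * iteration_count // 3)

# Small base/modulus (8 bytes) but a 648-bit (81-byte) exponent: the word
# count is 1, so the price lands near the 200-gas floor even though the
# client performs hundreds of squaring iterations.
print(modexp_gas_eip2565(8, 8, 81, (1 << 256) - 1))   # 215 gas

# Large base/modulus (192 bytes) but a 3-bit exponent: only two
# iterations, so the heavy 192-byte big-integer arithmetic is barely charged.
print(modexp_gas_eip2565(192, 192, 1, 0b101))         # 384 gas
```

In both edge cases the formula charges only a few hundred gas, while the actual big-integer work dominates block execution time, which is exactly the mispricing the benchmarks exposed.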

What began as a local performance investigation quickly evolved into something more critical. With the Fusaka deadline approaching, the decision was made to convert these findings into EIP-7883, which proposes repricing ModExp’s gas cost. This prompted coordinated optimizations across Nethermind, Geth, Besu, Erigon, and Reth.

Benchmark data provided the evidence needed to justify the repricing and showed that once ModExp was improved, the network could safely support higher throughput. While the repricing itself will only ship with the Fusaka fork, related optimizations have already landed in several client implementations.

Benchmarks in Action: The Berlinterop

In June 2025, all the client teams met for a Fusaka interop event called Berlinterop. One of the work streams at the event focused on performance optimization. We used the benchmarking infrastructure to create a friendly competition among the clients: a dashboard was published, and each team attempted to maximize its worst-case performance. During Berlinterop, client teams also participated in coordinated gas-limit stress tests under shared parameters, guided by EF testing contributors.

This challenge surfaced a slew of bottlenecks, allowing us to catalog the key topics to address for the remainder of the year. The benchmarking infrastructure, combined with Berlinterop, gave us the confidence to raise the gas limit from 36M to 45M after the event, with plans for much higher increases throughout the following year.

To show the progress made during Berlinterop, the first chart was recorded at the beginning of the week:

Figure 2a. Gas Processing Rate (normalized MGas/s per client) at Berlinterop Start


Five days later, the same benchmark showed significant improvement across all clients:

Figure 2b. Gas Processing Rate (normalized MGas/s per client) After 5 Days of Optimization

The improvement in the slowest-performing scenarios was significant, and every execution layer client gained performance. During this period, developers also identified and addressed additional bottlenecks beyond those shown in the charts.

Discoveries Along the Way

Among the follow-up findings from gas benchmarking were patterns that, although smaller in impact than ModExp, still influenced client performance and execution stability.

Early test cases were cacheable, allowing teams to experiment with pre-caching and pre-warming strategies. As benchmarks evolved into uncacheable versions, these effects became less pronounced; even so, they demonstrated that preparing key data paths before execution improved runtime stability and made it harder to construct pathological blocks, reducing the number of scenarios that could stress the network. Several clients, including Reth, later applied similar strategies to improve consistency under load.

Another area of insight came from precompile caching. When identical operations appeared repeatedly within a block, caching the first result and reusing it eliminated redundant computation. Benchmarking highlighted how this optimization improved performance on repetition-heavy blocks — a pattern familiar in real-world mainnet conditions. As test design matured, teams refined these cases to ensure comparability while preserving realistic execution behavior.
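A minimal sketch of that caching idea follows. The names (`BlockExecutor`, the stand-in precompile) are illustrative, not any client's real API:

```python
# Sketch of block-scoped precompile result caching: when the same
# precompile is called with identical input within one block, reuse the
# first result instead of recomputing it.

def modexp_precompile(base: int, exp: int, mod: int) -> int:
    """Stand-in for an expensive precompile computation."""
    return pow(base, exp, mod)

class BlockExecutor:
    def __init__(self) -> None:
        # Cache is scoped to the block: a new block starts with an empty cache.
        self._cache: dict[tuple[int, int, int], int] = {}

    def call_modexp(self, base: int, exp: int, mod: int) -> int:
        key = (base, exp, mod)
        if key not in self._cache:            # first occurrence: compute
            self._cache[key] = modexp_precompile(base, exp, mod)
        return self._cache[key]               # repeats: cache hit

executor = BlockExecutor()
# A repetition-heavy block: a thousand identical calls, one real computation.
results = [executor.call_modexp(3, 2**255 - 19, 2**255) for _ in range(1000)]
print(f"{len(results)} calls, {len(executor._cache)} distinct computation(s)")
```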

Subsequent runs standardized these cases to ensure comparability, and teams agreed to remove cache-only wins from official datasets. These lessons began as local experiments and gradually became shared best practices across all major clients.

Gas Benchmarks Measure Network-Wide Readiness

Figure 3. Normalized Throughput Range (Minimum MGas/s Across Clients)

Gas benchmarks represent a shared commitment to Ethereum’s long-term performance. They provide every client team with the same conditions and data to evaluate how their implementation behaves under maximum stress. Before any gas limit increase, the framework validates that every client can process pathological blocks without compromising consensus stability.

The focus is on raising the minimum performance across all clients. Ethereum scales safely only when every implementation sustains worst-case loads.
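That criterion can be stated in one line: the network's safe throughput is the minimum worst-case throughput across client implementations, not the average. A sketch with purely illustrative numbers:

```python
# The readiness floor is set by the slowest client under worst-case
# load, not by the average. All figures below are made up.
worst_case_mgas = {
    "Nethermind": 62.0,
    "Geth": 58.5,
    "Besu": 41.0,
    "Erigon": 70.2,
    "Reth": 88.0,
}

network_floor = min(worst_case_mgas.values())
slowest = min(worst_case_mgas, key=worst_case_mgas.get)
print(f"Network-wide floor: {network_floor} MGas/s (set by {slowest})")
```

Raising the gas limit safely means raising this floor, which is why optimizations in the slowest client matter more than records set by the fastest.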

Each team has contributed to this process: running tests, validating results, and improving their clients in response. Besu used local benchmark runs to confirm performance fixes before merging. Geth validated its ModExp improvements with the shared dataset. Erigon and Reth analyzed database and execution behaviors using the same scenarios.

Through this collaboration, performance work has shifted from competition to coordination. Every optimization made in one client helps define what’s possible for all. The benchmark framework ensures that as Ethereum scales, every client implementation sustains the throughput required for higher gas limits without compromising consensus stability.

For Nethermind, maintaining and expanding this infrastructure is part of our ongoing commitment to the ecosystem: ensuring that performance progress is transparent, reproducible, and shared.

This collective effort is what enables Ethereum to move forward safely. Network-wide readiness depends on validated performance across all client implementations.

What This Means for End Users

As the benchmarks become more comprehensive and give us greater confidence, we can expect faster gas limit increases, benefiting all users of Ethereum as well as L2s.

  • For validators, better worst-case execution means smoother operation. Heavy blocks process faster, reducing missed attestations and timing variance during peak activity. Execution becomes predictable, which translates into more consistent rewards and fewer performance surprises during fork rehearsals.
  • For end users, verified gas limit increases expand block space and raise throughput capacity. A network that can safely process more transactions per block sees less congestion and steadier gas prices. Each validated increase adds headroom for activity without sacrificing security or consensus stability.
  • For infrastructure teams and institutions, benchmark data provides confidence that client diversity will hold under load. Reproducible testing lowers operational risk and helps node operators anticipate the effects of protocol changes before they reach mainnet.
  • And for Layer 2s, the clients they rely on will be able to handle significantly more load, allowing them to scale further.

Each of these outcomes stems from the same foundation: measuring and improving the worst-case scenario. Tightening worst-case performance reduces the surface area for attacks that depend on vulnerable execution paths, thereby improving overall network resilience.

The broader impact is predictability: execution paths stabilize, operators can plan upgrades with data, and protocol discussions move from intuition to evidence.

Figure 4. Normalized Throughput Range (MGas/s) Across Benchmark Scenarios

Working With the Ethereum Foundation

The gas benchmarking framework has evolved into a shared component of Ethereum’s infrastructure, co-developed by client teams and the Ethereum Foundation’s research and testing groups. It now serves as both a research tool and a verification layer for safe scaling.

EF contributors, including Parithosh Jayanthi, Jochem Brouwer, Louis Tsai, and Carlos Perez, work alongside Nethermind engineers to design new scenarios, automate test execution, and integrate results into the Ethereum Execution Specs (EELS).

Through that integration, benchmark data now feeds directly into EF’s broader testing pipelines. Performance validation runs in parallel with functional and consensus testing, creating a single, ecosystem-wide framework for verifying readiness before upgrades or changes to the gas limit.

Together, we are developing automated systems that can assess client performance under the most demanding conditions, including loops of the heaviest and most resource-intensive contracts, to ensure consistent behavior and stable execution under stress. What once required manual setup is now a repeatable process that continuously validates the network’s readiness to scale.

This collaboration blurs the line between research and operations. Benchmark data informs testing priorities, directly feeds into Ethereum’s test infrastructure, and helps shape upcoming discussions on fork readiness. The result is a living performance framework that evolves in tandem with the protocol itself.

What's Next

Gas benchmarking continues to evolve in tandem with Ethereum's scaling roadmap. The framework that revealed ModExp bottlenecks and guided the path to 60M gas is now expanding to address the next generation of performance challenges.

  • State growth and parallelization (next focus): The benchmark infrastructure is being updated to include reproducible tests for state-related degradation - the bottleneck that will define Ethereum's ability to scale beyond 60M gas. With Block-Level Access Lists coming in the Glamsterdam fork, parallelization testing will become essential for validating performance gains across all clients.
  • Continuous, stateful testing and automation: Integrating benchmarks into full EVM execution scenarios provides better visibility into real block performance, while standardized automation enables continuous validation across clients without manual coordination.
  • L2 benchmarks & sequencer validation: After switching tests to the Ethereum Execution Specs (EELS), the infrastructure can now support L2-specific benchmarks, allowing Layer 2 teams to pool resources and validate their clients using the same reproducible methodology.
  • Gas repricing pipeline (ModExp -> next): As benchmark data matures, it will continue to inform EIP proposals for more accurate gas costs - following the path ModExp established from performance data to protocol change.

This work moves forward through shared infrastructure and coordinated effort. The gas benchmark framework is open source and reproducible, designed for any team to validate their client's performance under the same conditions that guide Ethereum's scaling decisions.

What began as a local experiment is now a shared standard. As Ethereum scales, the benchmarks scale with it: measuring, validating, and proving readiness at every step.

As the network targets 60M gas and beyond, this shared framework defines how Ethereum proves its readiness to scale: empirically, safely, and together.
