In late February 2023, Solana’s Mainnet Beta cluster experienced an outage, leading to long block finalization times. The root cause of the outage was determined to be congestion within the Turbine protocol due to abnormal network traffic.
To mitigate any further issues, block leaders entered into vote-only mode. Validator operators attempted to live downgrade the cluster, but this was unsuccessful. Ultimately, a manual restart with a downgrade to the last known stable validator software version was necessary. The network was restored on February 26th without any rolled-back transactions of economic value.
Investigation into the event revealed that the irregular Turbine traffic was caused by block forwarding services that malfunctioned when they encountered an unexpectedly large block. The large block overwhelmed validator deduplication filters, and its data was continuously reforwarded between validators. As new blocks were produced, the issue became worse, eventually saturating the protocol. While Turbine should be resilient to blocks of this size, the block forwarding services reduced its efficacy.
The Solana team has taken steps to prevent similar incidents in the future, including adding more rigorous testing of block propagation scenarios to their test suite and implementing a mechanism to identify and alert validators when the network is at risk of reaching saturation. The team has also encouraged validators to upgrade their hardware and optimize their nodes to better handle block propagation.