Why Achieving a Low Failure Rate in NAND Flash Storage Systems Takes More Than a Strong ECC Code

The industry attaches great importance to the strength of the ECC code alone. What is often overlooked is the strength of error prevention, which comes into play well before error correction is ever needed.

How can you achieve the lowest possible failure rate in a NAND flash-based system? You may already have had this discussion within your engineering team or with your storage system vendor. What measures are you taking to ensure that the solution you get not only corrects the errors that will inevitably occur, but is also built robustly enough to prevent errors from occurring in the first place?

As the process geometry of NAND flash memory shrinks, the bit error rate continues to increase, which in turn raises the risk of system failure. Anyone who understands the basics of SD cards, USB flash drives and other NAND flash-based solutions knows that the key component for minimizing failure rates is the NAND flash controller. You may be familiar with this component and may have discussed the strength of its error-correcting code (ECC). But have you ever wondered what exactly is inside this small package? What does the flash memory controller do to prevent failures? ECC is only one unit in a set of different building blocks. In a well-designed system, reliability and error-prevention functions are layered throughout the whole processing chain, including ECC for the bit errors that cannot be avoided. If you want to impress your boss and bring more value to your project, read on: we explain the most powerful functions of the flash memory controller.

Even before system assembly, whether in-house or through a system integrator, one important planning criterion is flash qualification. In other words, the flash controller should be paired with the right flash memory. So what does qualification mean? It does not merely mean that the controller works with the selected flash memory. Most importantly, it means testing, and a great deal of it. At Shanghai WorldCom, we make sure that each combination has been thoroughly tested. The first step is to characterize the flash memory itself. Characterization is done by extensively testing the NAND flash across all stages of its life cycle and under different use cases. This knowledge helps to design the error correction unit correctly, to extract the log-likelihood ratio (LLR) tables used for soft-decision decoding, and to implement the most effective overall error recovery flow.

When planning and designing, most companies discuss flash memory in terms of overall cost, but many forget to consider how the flash behaves given its architecture, its environment, and the use cases it is exposed to. Each scenario requires its own handling, correction, and recovery options to achieve the best results. The characterization work matters because the collected data is what allows the design to be validated accurately and efficiently. A thorough, well-thought-out qualification is the foundation of a robust and stable system. For demanding systems, it is worth questioning and discussing the qualification process with the system integrator. Or, if you design a solution in-house for greater flexibility, consult the controller company directly. While a solid qualification sets the system up for success, calibration and controller functions such as read disturb management, wear leveling, and dynamic data refresh are what prevent errors directly.

An effective calibration process keeps the bit error rate low over the entire life of the device by dynamically adapting to shifts in the threshold voltage of the memory cells. Many effects disturb a cell's threshold voltage: program/erase cycles, read disturb, data retention, temperature changes, and so on. The flash memory does not track threshold changes on its own. Instead, the flash memory controller determines when calibration is needed and executes the appropriate sequence of operations.

Calibration adjusts the reference voltage used to read the cells. Since different blocks or pages may be exposed to different disturbances, the best calibration for one page may not be suitable for another page.
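The per-page nature of calibration can be illustrated with a minimal sketch. It assumes a hypothetical hook, `count_errors(offset)`, that re-reads one page with the read reference shifted by a number of steps and returns the bit errors reported by the decoder. The quadratic error model and the offset range are toy assumptions for illustration only, not how any particular controller works.

```python
from typing import Callable, Iterable


def calibrate_read_offset(
    count_errors: Callable[[int], int],
    candidate_offsets: Iterable[int],
) -> int:
    """Return the read-reference offset that yields the fewest bit errors.

    count_errors(offset) is a hypothetical hook: it re-reads one page with the
    reference voltage shifted by `offset` steps and returns the bit error count.
    """
    return min(candidate_offsets, key=count_errors)


def make_page(ideal_offset: int) -> Callable[[int], int]:
    """Toy page model (assumption): errors grow with the squared distance
    between the applied offset and the page's ideal offset."""
    return lambda offset: 3 + (offset - ideal_offset) ** 2


# Two pages whose threshold voltages have drifted by different amounts.
fresh_page = make_page(ideal_offset=0)
worn_page = make_page(ideal_offset=-4)
offsets = range(-8, 9)

best_fresh = calibrate_read_offset(fresh_page, offsets)
best_worn = calibrate_read_offset(worn_page, offsets)

print("best offset, fresh page:", best_fresh)
print("best offset, worn page: ", best_worn)
print("errors on fresh page with the worn page's offset:", fresh_page(best_worn))
```

In this toy model, the offset that is best for the worn page produces noticeably more errors on the fresh page, which is why a single calibration value for the whole device is usually not enough.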

In addition, error prevention mechanisms such as wear leveling (WL), read disturb management (RDM), near-miss ECC, and dynamic data refresh (DDR) work together to keep data transfers to and from the flash efficient and reliable. Wear leveling ensures that all blocks in the flash memory or storage system approach their specified erase cycle budget at about the same time, instead of some blocks reaching it much earlier than others. Read disturb management counts all read operations to the flash memory; when a certain threshold is reached, the surrounding area is refreshed. Near-miss ECC refreshes any data that, when read by the application, shows more errors than a configured threshold, while dynamic data refresh scans all data as a background operation and determines the error status of every block. If the error threshold for a block or ECC unit is exceeded during this scan, a refresh operation is triggered. Different controller companies usually name these functions differently, but the logic and algorithms behind them pursue the same goal in different ways. It is therefore worth building a close relationship with your controller company in order to understand how these functions work with the qualified flash.
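A rough sketch of how these triggers might fit together is shown below. The class name, thresholds, and block-level granularity are all assumptions made for illustration. Real controllers use their own data structures and limits, and may refresh neighboring pages rather than whole blocks.

```python
from collections import defaultdict


def pick_least_worn(erase_counts: dict) -> int:
    """Wear leveling in its simplest form: write to the block erased least often."""
    return min(erase_counts, key=erase_counts.get)


class RefreshPolicy:
    """Hypothetical refresh bookkeeping; names and thresholds are illustrative."""

    def __init__(self, read_limit: int = 100_000, near_miss_threshold: int = 40):
        self.read_limit = read_limit                    # assumed reads per block before refresh
        self.near_miss_threshold = near_miss_threshold  # assumed corrected-bit limit
        self.read_counts = defaultdict(int)
        self.refresh_queue = []                         # list of (block, reason)

    def _schedule(self, block: int, reason: str) -> None:
        # Queue each block at most once, keeping the first reason seen.
        if all(queued != block for queued, _ in self.refresh_queue):
            self.refresh_queue.append((block, reason))

    def on_read(self, block: int, corrected_bits: int) -> None:
        """Call after every host read or background scan read of a block."""
        self.read_counts[block] += 1
        if self.read_counts[block] >= self.read_limit:
            self._schedule(block, "read_disturb")
        if corrected_bits >= self.near_miss_threshold:
            self._schedule(block, "near_miss_ecc")

    def on_refresh_done(self, block: int) -> None:
        """Data was rewritten elsewhere: reset the counter and dequeue the block."""
        self.read_counts[block] = 0
        self.refresh_queue = [(b, r) for b, r in self.refresh_queue if b != block]


# Small demo with deliberately tiny limits.
policy = RefreshPolicy(read_limit=3, near_miss_threshold=40)
for _ in range(3):
    policy.on_read(block=7, corrected_bits=2)   # many reads, few errors
policy.on_read(block=9, corrected_bits=45)      # one read, close to the ECC limit
print(policy.refresh_queue)
print("next block to write:", pick_least_worn({0: 120, 1: 95, 2: 300}))
```

A background scan in the spirit of dynamic data refresh could feed the same `on_read` path, so that rarely read data is also checked against the error threshold.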

Finally, although error correction has become one of the best-known and most important tasks of a flash memory controller, error prevention deserves more weight than it usually gets. Still, the complexity and strength of error correction ultimately make it the most valuable mechanism of the controller. Under area and power constraints, error correction coding becomes increasingly difficult. As the demand for correction capability keeps growing, older codes can no longer deliver the required correction performance within the limited spare area available in the latest flash memories.

To provide the best solution, Shanghai WorldCom has developed its own error correction engine, a hard-decision and soft-decision error correction module based on generalized concatenated codes. The great advantage of this code structure is that the number of correctable errors per codeword can be determined analytically. This means that a certain level of correction performance can be guaranteed for every codeword. For every supported flash memory, a guaranteed bit error rate is specified to ensure reliable operation within the specified parameters.
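To see why a known number of correctable errors per codeword is useful, consider a simplified model: if a decoder is guaranteed to correct up to t bit errors in an n-bit codeword and bit errors are independent with a given raw bit error rate, the probability of an uncorrectable codeword is a binomial tail. The sketch below is a generic textbook calculation under those assumptions, not a model of the generalized concatenated code described above. The function name and the example parameters are illustrative.

```python
import math


def codeword_failure_probability(n_bits: int, t_correctable: int, raw_ber: float) -> float:
    """P(more than t bit errors in n bits), assuming independent bit errors.

    Terms are evaluated in the log domain so that large binomial coefficients
    do not overflow and very small tail probabilities keep their precision.
    """
    if not 0.0 < raw_ber < 1.0:
        raise ValueError("raw_ber must be strictly between 0 and 1")
    log_p = math.log(raw_ber)
    log_q = math.log1p(-raw_ber)
    total = 0.0
    for k in range(t_correctable + 1, n_bits + 1):
        log_term = (
            math.lgamma(n_bits + 1) - math.lgamma(k + 1) - math.lgamma(n_bits - k + 1)
            + k * log_p + (n_bits - k) * log_q
        )
        total += math.exp(log_term)
    return total


# Illustrative comparison (assumed values): same codeword length and raw BER,
# two different guaranteed correction capabilities.
weak = codeword_failure_probability(n_bits=1000, t_correctable=4, raw_ber=1e-3)
strong = codeword_failure_probability(n_bits=1000, t_correctable=8, raw_ber=1e-3)
print(f"t=4: {weak:.3e}   t=8: {strong:.3e}")
```

Under this model, a specified raw bit error rate for the flash translates directly into a bound on the codeword failure rate, which is the kind of guarantee an analytically known correction capability makes possible.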

When data is read back from the flash memory and passed to the error correction module, a hard-decision decoder decides which bits are in error based solely on the redundant information added to the codeword. With only this information, every bit is treated as equally likely to be in error. So-called soft information takes probability into account: it indicates how likely it is that a received bit really has the received value rather than the other one (a bit can be "zero" or "one"). These probabilities are taken from log-likelihood ratio (LLR) tables, which are generated beforehand and stored as look-up tables in the controller. With this information, error correction has more input: for each individual bit, the soft information gives the probability that the received value is correct, for example that a received zero was originally a zero with 74% confidence. The error correction therefore has a clear indication of which bits are likely to be wrong and which are not. This additional information significantly increases the correction capability.
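The relationship between a confidence value and an LLR can be made concrete with a short sketch. It uses the common convention LLR = ln(P(bit = 0) / P(bit = 1)), so positive values favor "zero" and negative values favor "one". The sign convention and the sample LLR values are assumptions for illustration, since controllers may define and quantize their tables differently.

```python
import math


def llr_from_p_zero(p_zero: float) -> float:
    """LLR = ln(P(bit = 0) / P(bit = 1)) for a probability strictly between 0 and 1."""
    return math.log(p_zero / (1.0 - p_zero))


def p_zero_from_llr(llr: float) -> float:
    """Inverse mapping: probability that the stored bit is 0, given its LLR."""
    return 1.0 / (1.0 + math.exp(-llr))


# The example from the text: a zero was read with 74% confidence.
llr_74 = llr_from_p_zero(0.74)
print(f"LLR for 74% confidence in 'zero': {llr_74:.3f}")

# Hypothetical LLRs for four bits of a codeword, as a soft decoder might see them.
llrs = [4.2, -0.3, llr_74, -3.8]

# The hard decision is just the sign; the magnitude is the reliability.
hard_bits = [0 if llr >= 0 else 1 for llr in llrs]
least_reliable = min(range(len(llrs)), key=lambda i: abs(llrs[i]))

print("hard decisions:", hard_bits)
print("least reliable bit index:", least_reliable)
```

A hard-decision decoder sees only `hard_bits`, whereas a soft-decision decoder also knows that the bit at `least_reliable` is the first candidate to flip when the codeword does not check out.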

The flash memory controller is the key component for handling flash memory reliably and safely. It performs a range of functions designed to manage data transfers to and from the flash effectively, and it can not only correct errors but also prevent them. However, these functions are implemented in different ways, and depending on a company's business model and focus, a controller's feature set can be minimal. At Shanghai WorldCom, we group our high-quality functions, mechanisms, and processes under the name FlashXE® (eXtended Endurance), an ecosystem designed to improve the endurance and thereby the reliability of flash memory.