# Reducing MLC Flash Memory Retention Errors through Programming Initial Step Only

Wei Wang<sup>1</sup>, Tao Xie<sup>2</sup>, Antoine Khoueir<sup>3</sup>, Youngpil Kim<sup>3</sup>

<sup>1</sup>Computational Science Research Center, San Diego State University

<sup>2</sup>Computer Science Department, San Diego State University

<sup>3</sup>Seagate Technology

June 4th, 2015





#### Outline

- Background
- The PISO Approach
  - A PISO Operation
  - The Safe Threshold Voltage
  - PISO on LSB/MSB Pages
- An Analytical Model
- Evaluation and Discussions
- Conclusions

#### Flash Memory



(a) A memory cell; (b) MLC threshold voltage distribution.

- The floating-gate of a memory cell stores a number of electrons, which affects the cell's threshold voltage.
- A retention error is caused by electron leakage over time.



Reducing bit errors is critical to improving reliability.

#### Retention Error Reduction Schemes

- Dynamic threshold scheme [1]:
  - It changes the read reference voltages along with the $V_{th}$.
  - Finding a suitable reference voltage typically requires a series of read retry operations.
- Flash correct-and-refresh (FCR) [2]:
  - Re-programming data in-place;
  - Re-mapping data to a new place to avoid the over-programming issue.
- Read disturb scheme [3]:
  - Inject electrons into already programmed cells to reduce retention errors.
- [1] F. Sala et al., "Dynamic threshold schemes for multi-level non-volatile memories," IEEE TC, 2013.
- [2] Y. Cai, G. Yalcin et al., "Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime," in IEEE ICCD'12, 2012.
- [3] S. Tanakamaru et al., "Highly reliable solid-state drives (SSDs) with error-prediction LDPC (EP-LDPC) architecture and error-recovery scheme," in IEEE ASPDAC, 2013.

#### ISPP (incremental step pulse programming)

Flash memory cell programming process



Programming an MLC cell



"Data can be only programmed to an erased cell."

What if we program data to an already programmed cell?

#### A PISO operation

PISO: Programming Initial Step Only

• If the data corresponding to the *lowest threshold voltage* (the *safe threshold voltage*) is deliberately programmed into an already programmed cell, only one programming operation (the initial step) will be carried out.



(a) A PISO operation; (b) before PISO; (c) after PISO.

• The injected electrons can partially compensate charge loss over time so that retention errors can be mitigated.
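A toy model can make the one-pulse behaviour concrete. The sketch below assumes a simplified ISPP loop (apply a pulse, then verify against the target level) in which every pulse raises the threshold voltage by a fixed amount; real ISPP increases the pulse voltage step by step. The names `ispp_program` and `step_gain` and all voltages are hypothetical and not taken from any chip.

```python
def ispp_program(vth, target, step_gain=0.05, max_pulses=64):
    """Toy ISPP loop: pulse, then verify, until vth reaches target.

    Returns (final_vth, pulses_used). Each pulse raises the cell's
    threshold voltage by a fixed, hypothetical amount `step_gain`.
    """
    pulses = 0
    while pulses < max_pulses:
        vth += step_gain          # program pulse injects electrons
        pulses += 1
        if vth >= target:         # verify step passes: stop programming
            break
    return vth, pulses


# Normal programming: an erased cell needs many pulses to reach its target.
erased_vth, state_target = 0.0, 2.0
_, normal_pulses = ispp_program(erased_vth, state_target)

# PISO: the cell is already programmed (and has leaked a little charge).
# The requested target is the lowest ("safe") level, so the first verify passes.
leaked_vth, safe_target = 1.9, 0.0
after_piso, piso_pulses = ispp_program(leaked_vth, safe_target)
```

In this model the PISO call stops after a single pulse, and that pulse nudges the leaked cell back toward its original level.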

## The Safe Threshold Voltage

The table below shows an example page layout of an MLC block, which consists of 128 rows of 2<sup>17</sup> cells.

| Row Index | LSB of the 2 <sup>17</sup> cells | MSB of the 2 <sup>17</sup> cells |
|-----------|----------------------------------|----------------------------------|
| 0         | page 0                           | page 2                           |
| 1         | page 1                           | page 4                           |
|           | •••                              | •••                              |
| 127       | page 253                         | page 255                         |

• To program an LSB page, the threshold voltage that represents data '1' is the safe threshold voltage.

• To program an MSB page, the data stored in its associated LSB page represent each individual cell's safe threshold voltage.

(Figure: threshold voltage distributions after LSB programming and MSB programming; 'E' represents the erase state.)
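For an LSB page, the PISO payload is therefore a page in which every bit is '1'. A minimal sketch, assuming the 2<sup>17</sup>-cell page from the example layout; `lsb_piso_payload` is a hypothetical helper name, and how the buffer is handed to the flash controller is outside the scope of this sketch. The MSB payload depends on the LSB data and on the chip's bit-to-state mapping, so it is not sketched here.

```python
CELLS_PER_PAGE = 2 ** 17          # one bit per cell in an LSB page


def lsb_piso_payload(cells=CELLS_PER_PAGE):
    """Return a page buffer in which every bit is '1' (the safe LSB data)."""
    return bytes([0xFF]) * (cells // 8)


payload = lsb_piso_payload()
```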

#### PISO on LSB/MSB Pages

- To correct a cell's retention error, a PISO operation can be applied on either its associated LSB page or its MSB page.
- The overhead of performing an MSB page PISO operation is much higher than performing an LSB page PISO:
  - An MSB page PISO operation requires an extra page read;
  - Programming an MSB page demands more clock cycles.



#### An appropriate number of PISOs

Too few PISO operations may not sufficiently reduce retention errors.



Too many PISO operations cause over-programming and introduce more errors.



#### An Analytical Model

• The threshold voltage distribution of flash memory follows a sum of Gaussian distributions, where $\mu_s$ and $\delta_s$ are the mean and standard deviation of state $s$:

$$f(x) = \sum_{s=0}^{3} \frac{1}{4\sqrt{2\pi}\,\delta_s} \exp\left\{\frac{-(x-\mu_s)^2}{2\delta_s^2}\right\}$$



• A higher threshold voltage results in a higher stress-induced leakage current (SILC), which leads to a larger loss of electrons.

$$\Delta V_{th,S}^L = \alpha(t) \cdot V_{th,S}$$
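The two formulas above can be combined in a few lines of code. This is a sketch under stated assumptions: the state means and standard deviations are hypothetical, and retention loss is applied by moving each state's mean from $\mu_s$ to $(1-\alpha(t))\mu_s$, which follows from $\Delta V_{th,S}^L = \alpha(t) \cdot V_{th,S}$.

```python
import math


def vth_pdf(x, mus, deltas, alpha=0.0):
    """Sum-of-Gaussians density; each of the four states has weight 1/4.

    alpha is the fractional retention loss alpha(t): the mean of state s
    moves from mu_s to (1 - alpha) * mu_s.
    """
    total = 0.0
    for mu, delta in zip(mus, deltas):
        mean = (1.0 - alpha) * mu
        total += math.exp(-(x - mean) ** 2 / (2 * delta ** 2)) / (
            4 * math.sqrt(2 * math.pi) * delta)
    return total


MUS = [0.5, 2.0, 3.0, 4.0]      # hypothetical state means (V)
DELTAS = [0.2, 0.1, 0.1, 0.1]   # hypothetical standard deviations (V)

# Higher states lose more voltage: with alpha = 0.05 the top state drops
# by 0.05 * 4.0 = 0.2 V, the lowest by only 0.05 * 0.5 = 0.025 V.
fresh_peak = vth_pdf(4.0, MUS, DELTAS)
aged_peak = vth_pdf(3.8, MUS, DELTAS, alpha=0.05)
```

The peak of the highest state keeps its height but moves from 4.0 V to 3.8 V, which is the left shift that PISO is meant to counteract.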



## An Analytical Model

- Assume that each PISO operation shifts the threshold voltage by $\Delta V_{th,S}^R$ (i.e., the right shift amount of a cell's threshold voltage in state S).
- After m PISO operations, the mean of state $s$ becomes $(1-\alpha(t))\mu_s + m\Delta V_{th,S}^R$, so the threshold voltage distribution can be modified as:

$$f(x) = \sum_{s=0}^{3} \frac{1}{4\sqrt{2\pi}\,\delta_s} \exp\left\{\frac{-[x - m\Delta V_{th,S}^R - (1-\alpha(t))\mu_s]^2}{2\delta_s^2}\right\}$$

Here $(1-\alpha(t))\mu_s$ captures the voltage change due to charge loss, and $m\Delta V_{th,S}^R$ the voltage recovery due to PISO.

#### An Analytical Model

- The tail probability function of each state is used to compute errors existing in this distribution model [1].
- The appropriate number of PISO operations can be calculated by solving a minimization problem:

$$\begin{aligned}
\min \Big[ &\tfrac{1}{4}Q_{S_0}\Big(\tfrac{|\Delta_0|}{\delta_0}\Big) + \tfrac{1}{4}Q_{S_1}\Big(\tfrac{|\Delta_1|}{\delta_1}\Big) + \tfrac{1}{4}Q_{S_1}\Big(\tfrac{|\Delta_2|}{\delta_1}\Big) \\
&+ \tfrac{1}{4}Q_{S_2}\Big(\tfrac{|\Delta_3|}{\delta_2}\Big) + \tfrac{1}{4}Q_{S_2}\Big(\tfrac{|\Delta_4|}{\delta_2}\Big) + \tfrac{1}{4}Q_{S_3}\Big(\tfrac{|\Delta_5|}{\delta_3}\Big) \Big]
\end{aligned}$$

(Figure: reference voltages $V_0^{\text{ref}}$, $V_1^{\text{ref}}$, $V_2^{\text{ref}}$.)
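The minimization can be explored numerically with a brute-force search over m. The sketch below rests on several assumptions that are not stated on the slides: each $\Delta_i$ is taken to be the distance between a state's shifted mean and an adjacent reference voltage, every PISO moves each state's mean to the right by the same amount, $Q$ is the standard Gaussian tail function, and all numeric values are hypothetical.

```python
import math


def q(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def error_rate(m, mus, deltas, refs, alpha, dv_r):
    """Estimated error probability after m PISO operations (six tail terms)."""
    means = [(1 - alpha) * mu + m * dv_r for mu in mus]
    total = 0.0
    for s, (mean, delta) in enumerate(zip(means, deltas)):
        if s > 0:                       # tail below the lower reference
            total += q((mean - refs[s - 1]) / delta)
        if s < len(refs):               # tail above the upper reference
            total += q((refs[s] - mean) / delta)
    return total / 4


def best_piso_count(max_m, **model):
    """Return the m in [0, max_m] with the smallest estimated error rate."""
    return min(range(max_m + 1), key=lambda m: error_rate(m, **model))


MODEL = dict(
    mus=[0.5, 2.0, 3.0, 4.0],       # hypothetical state means (V)
    deltas=[0.2, 0.1, 0.1, 0.1],    # hypothetical standard deviations (V)
    refs=[1.25, 2.5, 3.5],          # hypothetical reference voltages (V)
    alpha=0.05,                     # hypothetical retention loss factor
    dv_r=0.02,                      # hypothetical per-PISO right shift (V)
)
best_m = best_piso_count(20, **MODEL)
```

The signed distances used here equal $|\Delta_i|$ as long as each mean stays on its own side of the reference voltage. With these toy parameters the estimated error first falls and then rises again as m grows, so the search settles on an intermediate m.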

[1] W. Wang, T. Xie, and D. Zhou, "Understanding the impact of threshold voltage on MLC flash memory performance and reliability," in ACM ICS'14, 2014.

## **Experimental Setting**

Two types of 1y-nm technology MLC flash chips.

|                                  | Flash A | Flash B |
|----------------------------------|---------|---------|
| Page size                        | 16 KB   | 16 KB   |
| Pages per block                  | 512     | 256     |
| Blocks per plane                 | 2,048   | 2,048   |
| Planes per die                   | 2       | 1       |
| Dies per package                 | 4       | 2       |
| Read latency $(\mu s)$           | 47      | 47      |
| LSB page write latency $(\mu s)$ | 471     | 566     |
| MSB page write latency $(\mu s)$ | 1,353   | 1,870   |

• Experiments are carried out on a TRIAD NAND flash memory tester.

## Testing Methodology

- Variable Relaxation Aging
  - All chips are assumed to be used for 3 years at 45°C.
  - Cycling:

| Group | P/Es |
|-------|------|
| A     | 1 K  |
| B     | 2 K  |
| C     | 4 K  |
| D     | 6 K  |
| E     | 12 K |
| F     | 20 K |

Cycling runs for 1,000 loops. In each loop, Group A receives 1 P/E, Group B 2 P/Es, ..., Group E 12 P/Es, and Group F 20 P/Es.

Endurance bake (Arrhenius equation): 70.6 hours @ 100°C

- Retention Acceleration
  - 3 months @ 40°C -> 63 hours @ 70°C
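The bake times can be sanity-checked with the Arrhenius acceleration factor. This is a sketch only: the activation energy of 1.1 eV is an assumption (the slides do not state it), and `acceleration_factor` and `bake_hours` are hypothetical helper names. With that assumed value, both results land within a couple of hours of the figures quoted above.

```python
import math

BOLTZMANN_EV = 8.617333e-5   # Boltzmann constant in eV/K


def acceleration_factor(t_use_c, t_stress_c, ea_ev=1.1):
    """Arrhenius acceleration factor between a use and a stress temperature."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV * (1 / t_use - 1 / t_stress))


def bake_hours(use_hours, t_use_c, t_stress_c, ea_ev=1.1):
    """Hours at the stress temperature equivalent to use_hours in the field."""
    return use_hours / acceleration_factor(t_use_c, t_stress_c, ea_ev)


endurance_bake = bake_hours(3 * 365 * 24, 45, 100)   # 3 years at 45 C
retention_bake = bake_hours(3 * 30 * 24, 40, 70)     # about 3 months at 40 C
```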

#### **Experimental Results**

- The effectiveness of PISO
  - Flash A



Number of errors under (a), (b) PISO on small cycles; (c), (d) PISO on large cycles.

- (1) The number of bit errors on all blocks decreases rapidly within the first 10 PISO operations.
- (2) After 10 PISO operations, the number of bit errors decreases at a lower rate.
- (3) Further increasing the number of PISOs increases the number of bit errors.

#### **Experimental Results**

- Cost comparisons with read disturb
  - The read disturb scheme demands a much larger number of operations in order to reduce a similar number of retention errors.

Reducing 17% of errors on 6K-cycled blocks:

- Time: 5 PISOs -> 5 × 1,353 μs = 6.8 ms; 700 reads -> 700 × 47 μs = 32.9 ms
- Energy: 5 PISOs -> 5 × 30 μJ = 150 μJ; 700 reads -> 700 × 1 μJ = 700 μJ
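The comparison is plain arithmetic and can be reproduced as follows. The latencies come from the Flash A column of the settings table and the per-operation energies from the figures quoted above; `total_cost` is a hypothetical helper name.

```python
MSB_WRITE_US = 1353   # Flash A MSB page write latency (us)
READ_US = 47          # page read latency (us)
PISO_ENERGY_UJ = 30   # energy per PISO operation (uJ), as quoted
READ_ENERGY_UJ = 1    # energy per read (uJ), as quoted


def total_cost(ops, latency_us, energy_uj):
    """Return (time in ms, energy in uJ) for `ops` identical operations."""
    return ops * latency_us / 1000, ops * energy_uj


piso_ms, piso_uj = total_cost(5, MSB_WRITE_US, PISO_ENERGY_UJ)
read_ms, read_uj = total_cost(700, READ_US, READ_ENERGY_UJ)
time_ratio = read_ms / piso_ms       # about 4.9x less time for PISO
energy_ratio = read_uj / piso_uj     # about 4.7x less energy for PISO
```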



#### **Experimental Results**

- Discussions
  - What's the best time to launch PISO operations?
  - How many PISO operations have to be applied?



Flash B

Applying PISO operations 10 times each month reduces the most retention errors among the four groups.

A dynamic retention error detection mechanism, which periodically samples retention errors, could guide when PISO operations are launched.

#### Conclusions

- PISO is efficient and effective compared to other types of retention error reduction methods.
- It can be readily implemented in either the FTL of an SSD or in a flash file system.
- It is simple and does not require prior knowledge of the original stored data.
- How to apply it in real applications is still an open question.

Thanks!

Questions?