Can an SSD die after a period of time?

leman · Feb 2, 2022

mr_roboto said:
Documented by Apple? Not as far as I know.

Ahaha, no way Apple will publicly document bugs like these.

I meant documented by the community (e.g. by looking into the XNU kernel code).

mr_roboto said:
Data actually written? Yes, as I said I tested the NVMe SMART reporting personally and it accurately reported the quantity of data I wrote, and most (all?) people reporting excessive writes were relying on NVMe SMART.

As a disclaimer: I don't want to appear polemical, I am merely curious to understand what happened, so please bear with me... but is it possible that there might indeed have been a reporting bug related to the virtual memory management in the kernel? I mean, doesn't Apple use their own NVMe extensions with some custom things here and there? Is it possible that the alleged "reporting bug" might only have been triggered in the context of the swap management routines, e.g. pages being erroneously marked as written by the SSD coprocessor without actually being written? A bug like that would not affect the usual NVMe SMART bookkeeping on the client side.

The reason why I am so curious about this "reporting bug" theory is that we had a lot of people reporting excessive writes, but virtually no actual failure reports even now, almost a year later. I would have thought that the first M1 Macs should have started failing by now after writing the crazy amounts of TBs we have seen. Then there was this mention of a mysterious Apple source about the "reporting issue" (who started it actually? was it someone on Twitter?). I know that Hector says that the data has indeed been written, but it is not clear to me what evidence he has for this statement. I mean, there has to be some sort of trace in the kernel code responsible for page swap...

darngooddesign · Feb 2, 2022

Isn't the first step to determine whether Activity Monitor or the SMART tools are reporting the writes correctly?

jdb8167 · Feb 2, 2022

There were no reports of the TBW values going back down. Given the scrutiny this problem got, I’m sure that if the reporting had changed we would have seen an example. It isn’t impossible that the reporting bug was internal to the controller and that once the TBW values were written out they couldn’t be adjusted, but Occam’s razor says that the reporting did its job and surfaced a bug in the OS.

As far as I’m aware, the original report came from an anonymous source cited by AppleInsider. They never followed up, probably because it became clear that there was no reporting bug and they didn’t want the backlash for amateur journalism.

mr_roboto · Feb 2, 2022

leman said:
As a disclaimer: I don't want to appear polemical, I am merely curious to understand what happened, so please bear with me... but is it possible that there might indeed have been a reporting bug related to the virtual memory management in the kernel?

No. Virtual memory management isn't involved in the reporting. smartmontools uses an IOKit API (IONVMeSMARTInterface->GetLogPage()) to send a Get Log Page command to the drive. The returned log page contains all the SMART data. See this (hopefully the link works): 599

The OS doesn't do anything to the log page, it just hands out whatever bytes the drive gave it. This is a very low level API. It's up to smartmontools to parse and interpret the log page, including the reporting of "Data Units Written" and "Data Units Read". Both are integers where 1 = 512000 bytes. Fairly strange units, but that's what's documented in the NVMe spec, and it's what I observed when I did my tests.
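To make the units concrete, here is a minimal sketch of the conversion a tool has to do on the raw counter. The function names and the sample reading are made up for illustration; only the 512,000-bytes-per-unit factor comes from the description above.

```python
# One SMART "data unit" as described above: 1 unit = 512000 bytes
# (1000 blocks of 512 bytes each).
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_bytes(data_units: int) -> int:
    """Convert a raw Data Units Written/Read counter to bytes."""
    return data_units * BYTES_PER_DATA_UNIT


def data_units_to_terabytes(data_units: int) -> float:
    """Convert a raw counter to decimal terabytes (1 TB = 10**12 bytes)."""
    return data_units_to_bytes(data_units) / 1e12


# Hypothetical counter value, not taken from any real drive.
sample_units = 30_000_000
print(f"{data_units_to_terabytes(sample_units):.2f} TB written")  # 15.36 TB written
```

So a drive reporting 30,000,000 data units has written a bit over 15 TB, which is the sort of number people were reading off smartctl.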

leman said:
I mean, doesn't Apple use their own NVMe extensions with some custom things here and there?

The big customization (and it's a substantial one) is that the main NVMe command queue has been changed to support on-device encryption/decryption features tailored to Apple's security architecture. However, I've never heard that they did anything to the SMART log page.

leman said:
Is it possible that the alleged "reporting bug" might only have been triggered in the context of the swap management routines, e.g. pages being erroneously marked as written by the SSD coprocessor without actually being written? A bug like that would not affect the usual NVMe SMART bookkeeping on the client side.

I think that you may have one or more wrong ideas about how all this fits together? The SSD coprocessor does all the SMART bookkeeping on the "server" side (client side being the OS running on the application cores). The SSD has no clue about whether writes came from VM or not. If macOS is so deeply confused that its VM manager wants to write, but doesn't manage to do so, no write command ever gets put in the NVMe device's command queue, and therefore it neither performs the write nor increments the counters reported through the SMART log page.
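If it helps, here is a toy model of that split. It is not real NVMe or XNU code; the class and method names (ToyDrive, ToyHost, swap_out, and so on) are invented purely to show where the counter lives.

```python
from collections import deque


class ToyDrive:
    """Toy stand-in for the SSD side. The write counter lives here."""

    def __init__(self) -> None:
        self.command_queue = deque()
        self.bytes_written = 0  # drive-side bookkeeping, invisible to the host

    def submit_write(self, nbytes: int) -> None:
        # The drive only sees a size; it has no idea who asked for the write.
        self.command_queue.append(nbytes)

    def process_queue(self) -> None:
        # Counters only move when a queued write command is actually handled.
        while self.command_queue:
            self.bytes_written += self.command_queue.popleft()

    def get_log_page(self) -> dict:
        # The host just receives whatever the drive reports.
        return {"bytes_written": self.bytes_written}


class ToyHost:
    """Toy stand-in for the OS side, including a possibly confused VM manager."""

    def __init__(self, drive: ToyDrive) -> None:
        self.drive = drive

    def swap_out(self, nbytes: int, buggy: bool = False) -> None:
        # Hypothetical bug: the VM manager "wants" to write but never
        # issues a command, so nothing reaches the drive's queue.
        if not buggy:
            self.drive.submit_write(nbytes)


drive = ToyDrive()
host = ToyHost(drive)
host.swap_out(16384)               # reaches the queue, gets counted
host.swap_out(16384, buggy=True)   # never reaches the drive
drive.process_queue()
print(drive.get_log_page())        # {'bytes_written': 16384}
```

The host has no path to bump the counter except by queueing a write, which is the whole point.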

leman said:
The reason why I am so curious about this "reporting bug" theory is that we had a lot of people reporting excessive writes, but virtually no actual failure reports even now, almost a year later. I would have thought that the first M1 Macs should have started failing by now after writing the crazy amounts of TBs we have seen.

None of the numbers I ever saw reported were so bad they seemed likely to cause failure in the short term. Keep in mind that TLC 3D NAND should be good for somewhere around 3000 Program/Erase cycles IIRC, and that when people have attempted to torture test SSDs, they usually outlive their nominal P/E cycle limit by a large factor - there's some safety margin built into the ratings.
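For a rough feel of what ~3000 P/E cycles means, here is a back-of-the-envelope sketch. It assumes nominal endurance is roughly capacity times P/E cycles divided by write amplification; the write amplification values, the capacities, and the daily write rate are illustrative assumptions, not measurements from any Mac.

```python
def estimated_endurance_tb(capacity_gb: float,
                           pe_cycles: int = 3000,
                           write_amplification: float = 1.0) -> float:
    """Rough nominal host-write endurance in decimal TB.

    Assumes every cell can take `pe_cycles` program/erase cycles and that
    each host byte costs `write_amplification` bytes of NAND writes.
    """
    return capacity_gb * pe_cycles / write_amplification / 1000


def years_until_limit(endurance_tb: float, tb_written_per_day: float) -> float:
    """Years to reach the nominal limit at a constant daily write rate."""
    return endurance_tb / tb_written_per_day / 365


# Illustrative capacities and a pessimistic write amplification of 2.
for capacity_gb in (256, 512, 1024):
    ideal = estimated_endurance_tb(capacity_gb)
    pessimistic = estimated_endurance_tb(capacity_gb, write_amplification=2.0)
    print(f"{capacity_gb} GB: ~{ideal:.0f} TB ideal, ~{pessimistic:.0f} TB at WA=2")

# Hypothetical heavy swapper writing 0.5 TB every day to a 256 GB drive.
print(f"{years_until_limit(estimated_endurance_tb(256), 0.5):.1f} years")
```

Even the smallest configuration comes out at hundreds of TB before it reaches the nominal rating, and as noted above, drives in torture tests usually go well past that.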

leman · Feb 2, 2022

Thanks for elaborating on all this! I never looked at the kernel code of the Apple Silicon swapper; I was simply wondering whether it might be using some other internal API/protocol to do its work (e.g. some sort of latency-optimised path to the controller that is not part of normal NVMe operation). But if it uses the same protocol as any other filesystem operation, then the notion of a "reporting bug" is indeed nonsensical.
