I don't understand how it can be totally anonymous
Well, it can't. Given enough data, it is always possible to identify the person (there will always be references to places, activities and people related to you). Also, using methods of forensic phonetics, it is often possible to identify the speaker. The latter could be circumvented by modifying the signal on the client (e.g. shifting the fundamental frequency and the formant frequencies of the voice) so that the server does not know how the signal has been modified; such obfuscation is usually non-reversible. However, I have no idea whether it will affect the recognition performance (it might), and of course it does not solve the first problem.
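To make the client-side idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions, not how any particular service does it: the function names `random_warp_factor` and `resample` are made up, the range 0.8 to 1.25 is arbitrary, and plain resampling is about the crudest possible approach (it scales the fundamental frequency and the formants by the same factor and also changes the duration). A real implementation would more likely use a proper pitch/formant shifter, and I make no claim that this alone defeats forensic speaker identification.

```python
import secrets


def random_warp_factor(low=0.8, high=1.25):
    """Pick a secret warp factor on the client (hypothetical helper).

    The factor is drawn from the OS random source and is never sent
    to the server, so the server cannot simply undo the modification.
    """
    rng = secrets.SystemRandom()
    return rng.uniform(low, high)


def resample(samples, factor):
    """Resample a mono signal by linear interpolation (hypothetical helper).

    When the result is played back at the original sample rate, every
    frequency in the signal is multiplied by `factor` and the duration
    is divided by `factor`.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor            # fractional read position in the input
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, last)]
        out.append((1.0 - frac) * samples[j] + frac * nxt)
    return out


def obfuscate(samples):
    """Warp the signal with a fresh secret factor and discard the factor."""
    return resample(samples, random_warp_factor())
```

The client would call `obfuscate(samples)` on the recorded samples before uploading them. Note that a factor close to 1.0 changes very little, so in practice you would probably want to exclude a band around 1.0, and the recognition accuracy on the warped signal would have to be measured.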