A few days ago, I switched to the MLX models supported by LMStudio and noticed a nice speed improvement over GGUF models. Recently, with the addition of speculative decoding in the LMStudio beta, things have gotten even better: I'm seeing a 20 to 50% speed gain!
If you're not familiar, speculative decoding is a technique that speeds up inference for a larger model by pairing it with a faster, smaller model called the "draft." The draft model proposes several tokens cheaply, and the main model verifies them in a single pass, keeping only the ones it agrees with. Because every accepted token is one the main model would have produced anyway, quality isn't compromised compared to running the main model alone. Pretty cool, right?
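To make the idea concrete, here's a toy sketch of the greedy variant of speculative decoding. The "models" are just stand-in functions I made up for illustration (real implementations like LMStudio's work with actual LLM logits, and the full algorithm uses probabilistic acceptance rather than exact matching), but the structure is the same: the draft proposes a run of tokens, the target checks them, and the output is guaranteed to match what the target would have generated on its own.

```python
# Toy sketch of greedy speculative decoding. The two "models" below are
# hypothetical deterministic functions, not real LLMs.

def target_next(ctx):
    # Stand-in for the large, high-quality model's greedy next token.
    return (sum(ctx) + len(ctx)) % 7

def draft_next(ctx):
    # Stand-in for the small, fast draft model: usually agrees with the
    # target, but is occasionally wrong.
    s = sum(ctx) + len(ctx)
    return s % 7 if s % 5 else (s + 1) % 7

def speculative_step(ctx, k=4):
    """Draft proposes k tokens; the target verifies them.

    Accepted tokens are exactly the target's own greedy choices, so the
    final output is identical to running the target alone -- the speedup
    comes from verifying a whole run of drafted tokens in one target pass.
    """
    # 1. Draft model speculates k tokens ahead (cheap).
    proposed, dctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(dctx)
        proposed.append(t)
        dctx.append(t)
    # 2. Target model verifies the proposals (one pass in a real system).
    accepted, tctx = [], list(ctx)
    for t in proposed:
        correct = target_next(tctx)
        if t == correct:
            accepted.append(t)      # draft guessed right: keep it
            tctx.append(t)
        else:
            accepted.append(correct)  # first mismatch: use the target's
            tctx.append(correct)      # token instead, then stop
            break
    return accepted

def speculative_decode(prompt, n, k=4):
    """Generate n tokens after the prompt using speculative steps."""
    ctx = list(prompt)
    while len(ctx) < len(prompt) + n:
        ctx.extend(speculative_step(ctx, k))
    return ctx[:len(prompt) + n]
```

By construction, `speculative_decode` produces exactly the same tokens as greedily decoding with `target_next` alone; the only thing that changes is how many tokens the "expensive" model has to verify per step, which is where the 20 to 50% speedup comes from when the draft agrees often.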