
VitoBotta (original poster)
I'm running some Qwen 2.5 models on my M4 Pro, and the inference speed is okay at around 11 tokens per second. However, prompt evaluation slows way down with longer prompts, and it gets even slower as the number of messages in a thread increases.

I built a Ruby proxy to tackle the second issue by enabling prompt caching, which is supported by Llama.cpp, the tool I'm using to run the models. But the initial prompt evaluation, and the evaluation of longer prompts in general, still take forever.
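For anyone curious, here's a minimal sketch of the idea (not my actual proxy): forward each request to a local llama.cpp server and set the `cache_prompt` field so the server reuses the KV cache for a shared prefix, e.g. a growing chat thread. The endpoint URL and model assumptions here are illustrative; check your llama.cpp build for the exact server options it supports.

```ruby
# Minimal sketch, assuming a local llama-server on port 8080 whose
# /completion endpoint accepts the "cache_prompt" field.
require "net/http"
require "json"

LLAMA_URL = URI("http://127.0.0.1:8080/completion") # hypothetical local endpoint

def complete(prompt, n_predict: 256)
  body = {
    prompt: prompt,
    n_predict: n_predict,
    # Ask the server to keep the evaluated prompt in the KV cache so the next
    # request that shares the same prefix skips re-evaluating those tokens.
    cache_prompt: true
  }
  res = Net::HTTP.post(LLAMA_URL, body.to_json, "Content-Type" => "application/json")
  JSON.parse(res.body)["content"]
end

# Each call appends to the same conversation prefix, so after the first
# request only the newly added messages need to be evaluated.
history = "You are a helpful assistant.\nUser: Hello!\nAssistant:"
puts complete(history)
```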

I've read that Apple Silicon chips, like the M4 Pro, are much slower than NVIDIA GPUs when it comes to processing prompts. In one case, someone mentioned the difference could be as high as 15 times slower. Does anyone know why Apple Silicon chips are slower for this task?
 
Prompt evaluation is more compute bound than bandwidth bound, and Apple silicon has relatively little matrix compute throughput compared to its huge memory bandwidth.
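A rough back-of-envelope illustration of that point, with assumed ballpark figures rather than measured specs (the FP16 throughput numbers in particular are guesses for illustration):

```ruby
# Back-of-envelope sketch: prompt processing is limited by matrix-multiply
# throughput (FLOPs), while token generation is limited by how fast the
# weights can be streamed from memory. All figures below are rough,
# assumed numbers for illustration only.
PARAMS       = 14e9   # e.g. a Qwen 2.5 14B model (assumed)
WEIGHT_BYTES = 8e9    # ~4-bit quantized weights, rough

M4_PRO   = { flops: 7e12,   bandwidth: 273e9 }  # FP16 FLOPs assumed; 273 GB/s per Apple
RTX_4090 = { flops: 165e12, bandwidth: 1.0e12 } # rough published figures

def limits(chip)
  {
    prompt_tok_s:   chip[:flops] / (2 * PARAMS),    # ~2 FLOPs per parameter per token
    generate_tok_s: chip[:bandwidth] / WEIGHT_BYTES # one full pass over the weights per token
  }
end

p limits(M4_PRO)    # prompt ceiling in the low hundreds of tok/s; generation ceiling ~34 tok/s
p limits(RTX_4090)  # prompt ceiling is >20x higher, while the generation gap is much smaller
```

The gap in the prompt-processing column is why long prompts hurt so much more on Apple silicon than short-answer generation does.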
 