I'm running some Qwen 2.5 models on my M4 Pro, and token generation speed is okay at around 11 tokens per second. However, prompt evaluation slows way down with longer prompts, and it gets even slower as the number of messages in a thread grows.
I built a Ruby proxy to tackle the second issue by enabling prompt caching, which is supported by llama.cpp, the tool I'm using to run the models. But the initial prompt evaluation, and evaluation of longer prompts in general, still take forever.
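In case it helps frame the question, here's a minimal sketch of the kind of request the proxy ends up forwarding. It assumes llama.cpp's llama-server is listening locally and that its /completion endpoint accepts a cache_prompt field (which is my understanding of the API); the actual proxy does more (threading message history, streaming, etc.):

```ruby
# Sketch only: forward a completion request to a local llama.cpp server
# with prompt caching turned on, so a shared prompt prefix can reuse the
# KV cache across requests. Port, endpoint path, and cache_prompt are
# assumptions about llama-server's HTTP API, not verified against a
# specific version.
require "net/http"
require "json"

LLAMA_SERVER = URI("http://127.0.0.1:8080/completion")

def complete(prompt)
  req = Net::HTTP::Post.new(LLAMA_SERVER, "Content-Type" => "application/json")
  req.body = {
    prompt: prompt,
    cache_prompt: true, # ask the server to reuse the cached prompt prefix
    n_predict: 256
  }.to_json

  res = Net::HTTP.start(LLAMA_SERVER.hostname, LLAMA_SERVER.port) do |http|
    http.request(req)
  end
  JSON.parse(res.body)["content"]
end
```

Even with that in place, any part of the prompt that isn't a cached prefix still has to be evaluated from scratch, which is where the slowdown I'm asking about shows up.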
I've read that Apple Silicon chips, like the M4 Pro, are much slower than NVIDIA GPUs when it comes to processing prompts; one comment I saw put the gap at as much as 15x. Does anyone know why Apple Silicon is so much slower for this particular task?