I built an open-source LLM inference telemetry suite that measures Tokens Per Joule — the energy efficiency metric, not just raw speed.
Current baseline is on an M1 Pro (32GB UMA):
- 2.42 Tokens/Joule on Qwen-3B Q4_K_M
- 22 t/s on Llama-3.1-8B Q8_0 at 8192 context (13.7 GB workload), drawing 35 W whole-SoC power measured via powermetrics
- Zero thermal throttling across 10+ minute sustained loads
Process memory is tracked via psutil per-PID RSS. Each config runs 10 times, and results are reported with 95% confidence intervals.
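For anyone curious what the headline metric means in practice, here is a minimal sketch of how Tokens/Joule and a 95% CI over 10 runs can be computed. The run data below is made up for illustration, the t-critical value assumes exactly n=10, and this is not the suite's actual implementation:

```python
import statistics

def tokens_per_joule(tokens: int, avg_power_w: float, elapsed_s: float) -> float:
    # Energy in joules = mean power (W) * wall-clock time (s)
    return tokens / (avg_power_w * elapsed_s)

def mean_ci95(samples: list[float]) -> tuple[float, float]:
    # Mean and 95% CI half-width; t-critical 2.262 assumes n=10 (df=9)
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5
    return mean, 2.262 * sem

# Hypothetical per-run measurements: (tokens generated, mean watts, seconds)
runs = [(512, 35.0 + 0.1 * i, 23.0 + 0.2 * i) for i in range(10)]
tj_samples = [tokens_per_joule(t, w, s) for t, w, s in runs]
mean, ci = mean_ci95(tj_samples)
print(f"{mean:.3f} T/J +/- {ci:.3f} (95% CI, n=10)")
```

The key point is that the denominator is whole-SoC energy over the full generation window, so idle draw and prefill cost count against the score, which is what makes it an honest efficiency number.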
The M5 Max/Ultra should have significantly higher memory bandwidth and GPU throughput. The open question is whether T/J scales
proportionally — or whether the higher power envelope eats into the efficiency gains. Only real telemetry will answer that.
Setup is straightforward (model download is ~25GB, so budget time for that):
Bash:
git clone https://github.com/dilberx/universal-llm-telemetry-suite
cd universal-llm-telemetry-suite
python3 -m venv venv && source venv/bin/activate
pip install -r requirements-apple-silicon.txt
sudo ./venv/bin/python src/orchestrator.py
Would be great to get M5 data points to map how T/J evolves across the Apple Silicon generations.
Repo: https://github.com/dilberx/universal-llm-telemetry-suite