
dilber (original poster)
I built an open-source LLM inference telemetry suite that measures Tokens Per Joule — the energy efficiency metric, not just raw speed.

Current baseline is on an M1 Pro (32GB UMA):
  • 2.42 Tokens/Joule on Qwen-3B Q4_K_M
  • 22 t/s on Llama-3.1-8B Q8_0 at 8192 context (13.7GB workload), drawing 35W whole-SoC power via powermetrics
  • Zero thermal throttling across 10+ minute sustained loads
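For anyone sanity-checking the numbers above: tokens per joule is just sustained throughput divided by average power, since 1 W = 1 J/s. A one-liner shows the arithmetic for the Llama-3.1-8B figure:

```python
def tokens_per_joule(tokens_per_second: float, avg_watts: float) -> float:
    """Energy efficiency: tokens generated per joule of whole-SoC energy.
    1 W = 1 J/s, so (tokens/s) / (J/s) = tokens/J."""
    return tokens_per_second / avg_watts

# 22 t/s at 35W whole-SoC:
print(round(tokens_per_joule(22.0, 35.0), 2))  # → 0.63
```

So the 8B Q8_0 run lands around 0.63 T/J, versus 2.42 T/J for the much smaller Qwen-3B Q4_K_M.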
Power methodology: the suite uses sudo powermetrics for whole-SoC power telemetry (CPU + GPU + memory controller). Memory is
tracked via psutil per-PID RSS. Each config runs 10 times, and results are reported with 95% confidence intervals.
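For the statistics side, here is a minimal sketch of how a 10-run mean with a 95% CI can be computed (the suite's actual implementation may differ; the t critical value for 9 degrees of freedom is hard-coded here to keep it stdlib-only):

```python
import statistics
from math import sqrt

def mean_ci95(samples: list[float]) -> tuple[float, float]:
    """Mean and 95% confidence half-width for a 10-run sample.
    Uses the Student's t critical value t(0.975, df=9) ≈ 2.262."""
    n = len(samples)
    assert n == 10, "this sketch assumes 10 runs per config"
    m = statistics.mean(samples)
    s = statistics.stdev(samples)       # sample std dev (ddof=1)
    half_width = 2.262 * s / sqrt(n)    # t * s / sqrt(n)
    return m, half_width

# e.g. ten T/J measurements -> report as "mean ± half_width"
```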

The M5 Max/Ultra should have significantly higher memory bandwidth and GPU throughput. The open question is whether T/J scales
proportionally — or whether the higher power envelope eats into the efficiency gains. Only real telemetry will answer that.

Setup is straightforward (model download is ~25GB, so budget time for that):
  1. Bash:
    git clone https://github.com/dilbersha/universal-llm-telemetry-suite
  2. Bash:
    cd universal-llm-telemetry-suite
  3. Bash:
    python3 -m venv venv && source venv/bin/activate
  4. Bash:
    pip install -r requirements-apple-silicon.txt
  5. Bash:
    sudo ./venv/bin/python src/orchestrator.py
Results land in results/<your-chip>/production_benchmarks.csv with 95% CI. Runtime: ~30-45 minutes.

Would be great to get M5 data points to map how T/J evolves across the Apple Silicon generations.

Repo: https://github.com/dilberx/universal-llm-telemetry-suite