1 comment

  • JThomas-CoE 7 hours ago

    The method: collect 3D activation histograms (layer × expert × rank) over domain-specific corpora, rank experts by a rank-weighted utilisation score, remove the bottom 128 of 256 per layer directly in the GGUF — no finetuning, no safetensors loading, struct-level I/O only.
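    The scoring-and-pruning step can be sketched as follows. This is a minimal illustration, not the author's exact pipeline: the histogram shapes, the 1/(rank+1) weighting, and all variable names are assumptions; the real work operates on GGUF tensors at the struct level rather than NumPy arrays.

    ```python
    import numpy as np

    # Hypothetical shapes: 48 layers, 256 experts, top-8 routing ranks.
    # H[l, e, r] = how often expert e was selected at router rank r in layer l
    # while running the domain-specific corpus through the model.
    L, E, R = 48, 256, 8
    rng = np.random.default_rng(0)
    H = rng.integers(0, 1000, size=(L, E, R))

    # Rank-weighted utilisation: earlier router ranks count more.
    # The 1/(r+1) weighting is an illustrative choice, not the paper's formula.
    weights = 1.0 / (np.arange(R) + 1)
    scores = (H * weights).sum(axis=2)          # shape (L, E)

    # Keep the top 128 experts per layer; the bottom 128 would be
    # dropped directly from the GGUF expert tensors.
    keep = np.argsort(scores, axis=1)[:, -128:]
    mask = np.zeros((L, E), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)

    # Every layer retains exactly half its experts.
    assert mask.sum(axis=1).tolist() == [128] * L
    ```

    The resulting per-layer boolean mask is what would drive the struct-level GGUF rewrite, with no safetensors loading or finetuning involved.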

    The earlier work established the pattern: a Python coding specialist with a 50% reduction in parameter count performed at near parity with the full model on Python tasks while losing essentially all capacity to code in HTML. A similarly constructed Web coding specialist showed the mirror image, matching the full model on Web tasks but failing significantly on Python.

    This new work substantially completes the proof of concept: a general MoE model can be histographically indexed, and specialist models can be extracted at a significant reduction in expert count and memory footprint. The practical payoff is that an end user with constrained VRAM gets access to models, within a given domain, that would otherwise have been out of reach.

    Next will be a full decomposition of the new Gemma4-25b-a4b model, to show the idea transfers to a different base architecture, along with further development of the CoE (College of Experts) orchestration framework. CoE is designed to integrate a set of disk-resident specialist models into a functional system where the collective intelligence of serially-invoked specialists exceeds that of any single general model runnable within the same VRAM budget.
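    The orchestration idea described above, in its simplest form, can be sketched as a router that keeps only one specialist resident at a time and dispatches to it serially. This is an assumption-laden illustration of the stated design goal, not the actual CoE framework: the specialist functions, the keyword routing, and all names here are hypothetical.

    ```python
    from typing import Callable, Dict

    # Hypothetical stand-ins for disk-resident specialist models.
    # In the real system these would be separate GGUF files loaded on demand.
    def python_expert(prompt: str) -> str:
        return f"[python-specialist] {prompt}"

    def web_expert(prompt: str) -> str:
        return f"[web-specialist] {prompt}"

    # Illustrative keyword -> specialist mapping (routing policy assumed).
    SPECIALISTS: Dict[str, Callable[[str], str]] = {
        "python": python_expert,
        "html": web_expert,
        "css": web_expert,
    }

    def route(prompt: str) -> str:
        # Serial invocation: only the chosen specialist occupies VRAM;
        # the others remain on disk until they are needed.
        for keyword, expert in SPECIALISTS.items():
            if keyword in prompt.lower():
                return expert(prompt)
        return python_expert(prompt)  # arbitrary fallback for this sketch

    print(route("Write a Python function to parse CSV"))
    ```

    The point of the sketch is the memory model, not the routing heuristic: at any moment the VRAM cost is one specialist, while the system's overall coverage is the union of all specialists on disk.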

    All 8 models (Q4_K_M GGUF, ~18B params, Ollama-ready), plus histograms, masks, corpora, and pipeline scripts:
    GitHub: https://github.com/JThomas-CoE/College-of-Experts-AI
    HF: https://huggingface.co/JThomas-CoE