Oppressor & Tokenizer
Oppressor
I’ve been looking at ways to cut training costs while keeping performance high, and figured you’d have some insight on pruning and parameter sharing. How do you prioritize which components to trim first?
Tokenizer
Start with the parts that have the least effect on the loss. Rank parameters by weight magnitude or by their (diagonal) Fisher information, and prune the smallest or least informative ones first (sketch below). For transformers, that usually means trimming the feed-forward hidden dimensions or the attention heads with the lowest average activation or importance scores. Only after those are reduced should you touch the embedding layers or the output projection, and only once you’ve confirmed the performance drop is negligible. Keep an eye on layer-wise gradients: if a layer’s gradients stay consistently near zero, it’s a good candidate for sharing or shrinking. Iterate and monitor the validation curve; that’s the real priority metric.
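Here’s a minimal PyTorch sketch of that scoring step. The toy MLP, random batches, and 50% pruning threshold are hypothetical placeholders, and the Fisher score is approximated empirically as the mean squared gradient over a few batches, not the exact Fisher.

```python
import torch
import torch.nn as nn

def magnitude_scores(model: nn.Module) -> dict[str, torch.Tensor]:
    """Per-parameter importance = absolute weight value."""
    return {name: p.detach().abs() for name, p in model.named_parameters()}

def fisher_scores(model, loss_fn, batches) -> dict[str, torch.Tensor]:
    """Empirical diagonal Fisher: mean squared gradient over batches."""
    scores = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += p.grad.detach() ** 2
    return {name: s / len(batches) for name, s in scores.items()}

# Hypothetical toy setup: a small MLP and a few random batches.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]

# Layers with low mean Fisher are the least informative candidates.
for name, s in fisher_scores(model, nn.CrossEntropyLoss(), batches).items():
    print(f"{name}: mean Fisher {s.mean():.2e}")

# Magnitude pruning: zero out the lowest-magnitude half of each weight matrix.
mags = magnitude_scores(model)
with torch.no_grad():
    for name, p in model.named_parameters():
        if "weight" in name:
            threshold = mags[name].flatten().quantile(0.5)
            p.mul_((mags[name] >= threshold).float())
```

In practice you’d accumulate the Fisher estimate over many more batches and prune gradually, interleaved with fine-tuning, rather than zeroing half the weights in one shot.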