Digital & RustWolf
Hey RustWolf, I’ve been looking at this old 1970s microprocessor, and I’m thinking about building a tiny neural net on it. Imagine a basic AI running on hardware that’s been forgotten for decades. Thought that might pique your nostalgia‑and‑innovation combo—what do you think?
A neural net on a 1970s microprocessor? That’s like fitting a rocket engine in a pocket watch. The chip will choke on a single multiply, let alone backpropagation. If you can’t find a spare memory chip and a power supply that won’t fry, just stick to the old‑school “logic gates” for the AI. I can help you tweak the timing, but don’t expect it to run TensorFlow in a tin box.
I hear you—those chips aren’t exactly built for deep learning, but that’s part of the fun. Maybe we can start with a very small, fixed‑point network, like a few perceptrons, and see how far we can push the cycle time. If it’s too slow, we’ll cut it down to a logic‑gate style inference engine. Either way, I’ll dig up the schematics and we can figure out if the hardware can handle a single forward pass. Let me know what resources you’ve got on hand.
Sounds like a good proof‑of‑concept. I’ve got a few old manuals buried in the basement: the CPU’s datasheet, a handful of assembly samples, and a copy of the “Low‑Power Fixed‑Point DSP” text from the ’80s. I can pull up a quick routine that does a single forward pass in under a millisecond if we keep the network to like five weights. If the clock keeps choking, we can drop the multiplications and just use a lookup table. Let me dig through the archive and we’ll sketch a timing diagram. No need to over‑engineer—just enough to prove the idea before the silicon dies.
Nice, that’s a solid plan. Keep it lean—maybe a 3‑layer MLP with 5 weights total—and let me know what the cycle counts look like. I’ll sketch out a simple forward‑pass loop and we can tweak the multiplier if the timing slips. If it’s still too tight, a lookup table will be our safety net. Let’s make the old silicon do something that feels almost modern.
Okay, give me the layout you have. With a 3‑layer net that only uses five scalar weights, you’ll be doing 5 multiply–add pairs (one per weight) plus a couple of adds for the biases. One caveat: a stock 8080 has no hardware multiply, so each MUL is really a short shift‑and‑add routine. On an idealized 4 MHz core with single‑cycle multiplies you’d see roughly 30–35 cycles per forward pass, about 8–9 µs; with software multiplies, budget a few hundred cycles instead, still comfortably under a millisecond. If you add an activation function that’s more than a simple sign test, you’ll need to replace it with a lookup table or an approximate ReLU. Keep the routine tight—load all the weights into registers once, then loop over the input‑to‑hidden multiplications, accumulate, then do the hidden‑to‑output pass. That should keep the cycle count low. If you hit a bottleneck, just strip the activation and stick to the linear part for the demo. Let's get those registers lined up.
Here’s a rough register plan for a 4‑MHz 8080‑style core. (Idealized: a real 8080 only has seven 8‑bit registers, so on actual silicon some of these would live in scratch memory.)
**Registers**
R0–R4 – hold the five weights (W0‑W4)
R5 – bias for hidden node (B1)
R6 – bias for output node (B2)
R7 – accumulator for hidden sum
R8 – accumulator for output sum
R9 – temporary multiplier result
R10–R11 – input values (X0‑X1)
**Pseudocode**
```
; convention: each MUL leaves its result in R9
LOAD R0, W0
LOAD R1, W1
LOAD R2, W2
LOAD R3, W3
LOAD R4, W4
LOAD R5, B1
LOAD R6, B2
LOAD R10, X0
LOAD R11, X1
; hidden node
CLEAR R7
MUL R0, R10 ; R9 = W0*X0
ADD R7, R9
MUL R1, R11 ; R9 = W1*X1
ADD R7, R9
ADD R7, R5 ; + bias
; output node
CLEAR R8
MUL R2, R10 ; R9 = W2*X0
ADD R8, R9
MUL R3, R11 ; R9 = W3*X1
ADD R8, R9
MUL R4, R7 ; R9 = W4*hidden
ADD R8, R9
ADD R8, R6 ; + bias
; result in R8
```
That’s about 30–35 cycles with single‑cycle multiplies; with a software MUL routine, call it a few hundred. If you hit a hiccup, drop the hidden‑to‑output multiplier (W4) and just output the linear sum. Let me know what the actual cycle counts look like when you run it.