Not sure, but the code is quite dense and lacking in comments. Are `nanoeuler` and `nanoeuler_check` themselves binaries checked straight into git, along with the `.log` file? All of the commit messages are "Add files via upload", and the commits happened in quick succession.
I suspect this is LLM-generated, which is cool, but then it shouldn't carry the claim "forward and backward passes are written and verified by hand" unless that is true.
Regarding the data, old texts from Gutenberg probably lower the performance, especially as many of the texts are whimsical on purpose. Shakespeare, for example, made up words to be theatrical. The corpus mixes different old English styles, which is a terrible way to learn modern English. I had some success using .ZIM data archives from Kiwix as a source; you should get more stable output using that data.
Hi, the uploads come one after the other because this was a long, step-by-step research project, and I tested the code on another machine. I admit that I'm slowly catching up on the commits across all the projects. As for Gutenberg and Shakespeare, I admit they were the best tests I could do, but I'll keep improving!