by fngjdflmdflg 10 hours ago
These OCR improvements will almost certainly be brought to Google Books, which is great. Long term, it could enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what that would cost, both in raw compute to run it themselves and via a paid API.
This is a really interesting "data flywheel" -- better model >> more usable data >> even better model
Surely there's an upper limit to this, though, with models literally eating themselves.
When a human student learns to read more carefully, we don't consider that a negative.
More Data for the Data Gods!