xkcd #743: Infrastructures

Jakylla@sh.itjust.works · 1 year ago

xkcd #743: Infrastructures

greenskye@lemm.ee · 1 year ago

Really? I must have had a particularly troublesome PDF. It was almost like running it through OCR, generating hundreds of weird typos and formatting errors when I tried to convert with calibre.

oatscoop@midwest.social · 1 year ago

OCR struggles with some PDFs for whatever reason: font, formatting, etc.

There are third-party PDF OCR websites/programs that work better. If I’m having issues, I run it through one of those first.

greenskye@lemm.ee · 1 year ago

Any suggestions? Even the good ones had error rates that might not matter for a couple of pages, but when scaled to a 500-page book, even a 1% error rate results in an annoying level of typos.
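
To put a rough number on that, here is a back-of-the-envelope sketch in Python. The 1,800 characters per page is an assumed figure for illustration, not something from the thread, and the 1% is treated as a per-character error rate with errors spread evenly.

```python
def expected_typos(pages, chars_per_page, error_rate):
    """Expected count of wrong characters, assuming a uniform per-character error rate."""
    return pages * chars_per_page * error_rate


# Assumption: about 1,800 characters per page (hypothetical, varies by book).
PAGES = 500
CHARS_PER_PAGE = 1800
ERROR_RATE = 0.01  # the 1% figure, read as a per-character rate

total = expected_typos(PAGES, CHARS_PER_PAGE, ERROR_RATE)
print(f"about {total:.0f} errors in total, about {total / PAGES:.0f} per page")
```

Under those assumptions that works out to roughly 9,000 wrong characters, or about 18 per page, which is why a rate that looks fine on a short document becomes painful on a whole book.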

oatscoop@midwest.social · 1 year ago

I use gImageReader + Tesseract, but that probably doesn’t meet your criteria. Unfortunately, OCR is very rarely perfect unless the input is perfectly clear and uses an “OCR-friendly” font/formatting. There are “AI-powered” OCR tools out there, but I can’t speak to how well they work and I don’t know of any free ones.
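
For anyone who wants to skip the GUI, a minimal sketch of calling the Tesseract command line from Python might look like the following. It assumes the `tesseract` binary is installed and on the PATH, and that each PDF page has already been exported as an image, since Tesseract reads images rather than PDFs. The helper names and file names are made up for illustration.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Build the command `tesseract <image> <outputbase> -l <lang>`.

    Tesseract writes the recognized text to <outputbase>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the path of the text file it writes."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    return output_base + ".txt"


# Example (hypothetical file names; requires Tesseract to be installed):
# text_file = run_ocr("page-001.png", "page-001")
```

This will not fix the accuracy problem on its own; it just makes it easier to batch pages and compare the results against other tools.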