An open-source optical character recognition engine
Tesseract is an open-source optical character recognition engine. Trained character data sets are available for various scripts and languages, and the engine can be trained on additional (custom) data sets.
Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels; any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be removed with a high-pass filter, or Tesseract's binarization stage will destroy much of the page; and dark borders must be removed manually, or they will be misinterpreted as characters.
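Two of the preprocessing steps above, upscaling and border removal, can be illustrated with a minimal sketch in plain Python. It is not part of Tesseract or any binding. The function names, the list-of-rows image representation, and the darkness threshold of 50 are assumptions made for illustration; only the 20-pixel x-height minimum comes from the text. Deskewing and high-pass filtering are omitted because they are normally done with an image-processing library.

```python
MIN_X_HEIGHT = 20  # pixels; the minimum x-height named in the text


def upscale_factor(x_height_px, target=MIN_X_HEIGHT):
    """Return the enlargement factor needed for the x-height to reach target.

    Images whose text is already large enough are left at scale 1.0.
    """
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, target / x_height_px)


def crop_dark_border(gray, threshold=50):
    """Strip outer rows and columns in which every pixel is darker than threshold.

    gray is a list of equal-length rows of 0-255 intensities (hypothetical
    representation; the threshold is an assumed value, not a Tesseract setting).
    """
    top, bottom = 0, len(gray)
    # Shrink the window from the top and bottom while the edge row is all dark.
    while top < bottom and all(p < threshold for p in gray[top]):
        top += 1
    while bottom > top and all(p < threshold for p in gray[bottom - 1]):
        bottom -= 1

    left, right = 0, len(gray[0]) if gray else 0
    # Then shrink from the left and right, looking only at the remaining rows.
    while left < right and all(gray[y][left] < threshold for y in range(top, bottom)):
        left += 1
    while right > left and all(gray[y][right - 1] < threshold for y in range(top, bottom)):
        right -= 1

    return [row[left:right] for row in gray[top:bottom]]


if __name__ == "__main__":
    # A 5x5 "page": a one-pixel black frame around a white 3x3 interior.
    page = [
        [0, 0, 0, 0, 0],
        [0, 255, 255, 255, 0],
        [0, 255, 255, 255, 0],
        [0, 255, 255, 255, 0],
        [0, 0, 0, 0, 0],
    ]
    print(upscale_factor(8))       # 2.5: an 8-pixel x-height needs 2.5x scaling
    print(crop_dark_border(page))  # the 3x3 white interior
```

In practice the resulting factor would be passed to an image library's resize routine, and the crop would run before the image is handed to Tesseract.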