Processing Scans

Turning paper into woodpulp while not losing any information on the way

Created by: Lester Caine, Last modification: 28 October 2024

A number of useful tools allow me to convert a raw PDF with 17 or so A3 pages into a coherent A4 one with a searchable text layer. It is not as small as the fully OCRed content I was producing 5 years ago, but it is a start.

Some of you may recall that at one time there were extra A4 pages interleaved with the A3 ones. I know there was an explanation as to why, and I think it was to do with having more material before the centre fold than after, but perhaps someone can fill in some detail? For the first few magazines I scanned, I pulled out the A4 sheets and did them separately, but the document feeder is actually quite clever, and it will scan them in place in the pile. I just have to take care when they come out to shuffle them to the right side of the pile. While I can leave the document scanner to its own devices, a little supervision helps where pages are stuck together or do not pick up properly.

Fanning the pages out is something of an art that I seem to have mastered now. It helps, and it breaks apart the places where the staples have caused two pages to stick together and feed as one. Actually the second page tends not to go through properly at all and can end up folded in half, which does not help. A quick grab sorts that out and scanning is not interrupted. Another problem with the larger magazines is that the printer can only handle documents of around 120MB and fails through lack of memory. It does not remember what it has scanned, so one starts again with a few pages set aside to scan on a second pass.

PDF Arranger ... takes A3 portrait scans, rotates them, splits the pages back to A4 and then allows the second half of the magazine to be shuffled into the correct order.

This process is very simple to implement, and even having to manually select the pages to move to the second half of the new A4 file does not take long. The first step is to assemble all of the A3 pages into the correct order. On the whole the initial scans are already assembled in a single raw PDF, but occasionally, when the scanner has had a problem, two files need to be combined, or pages that have not been scanned properly need to be replaced. One major advantage of PDF Arranger is being able to view all the pages of the magazine as thumbnails, so it is easy to spot sheets that have twisted noticeably while being scanned or been folded by the scanner. The rotate function is the first step, to orient the pages properly. Then one can just rescan the problem pages, import them into the working set and swap out the bad ones. While this was more of a problem when I was first getting used to the scanner, after 14000 A3 pages it does not happen often.

Once we have a set of clean A3 pages, the next step is to split the A3 pages into A4. While most of the time this will be all of the pages, in a few cases it's appropriate to deselect the centre page spread to leave it as the full A3 spread. In addition, the extra A4 pages that appear in a few issues of the magazine need to be left alone. Once split, the 'back' pages need to be selected. The last page of the magazine is always the first one in the list, so start with that, then skip pages 1 and 2 of the magazine and select the second half of A3 page 2 and the first half of A3 page 3. Work through the whole list, skipping every other pair of pages. This will result in an easily recognizable pattern of selected pages, and both the selected string and the unselected one have a consistent page count.
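The selection pattern can be sketched in a few lines of Python. This is only an illustration, not part of PDF Arranger: the function names are made up, and it assumes a plain stapled magazine whose page count is a multiple of 4, scanned double-sided from the outermost sheet inwards, with every A3 page split and no A4 inserts.

```python
def split_order(n_pages):
    """Magazine page numbers in the order they appear after splitting.

    Assumption: each A3 sheet is scanned front then back, outermost
    sheet first, and every A3 page is split into a left and right A4.
    """
    if n_pages % 4:
        raise ValueError("expected a multiple of 4 pages")
    order = []
    for sheet in range(n_pages // 4):
        first = 2 * sheet + 1            # lowest page number on this sheet
        last = n_pages - 2 * sheet       # highest page number on this sheet
        order += [last, first]           # front of the sheet, left then right
        order += [first + 1, last - 1]   # back of the sheet, left then right
    return order


def back_page_indices(n_split):
    """0-based list positions to select: the first, then every other pair."""
    return [i for i in range(n_split) if i % 4 in (0, 3)]


pages = split_order(8)
selected = back_page_indices(len(pages))
print(pages)                         # [8, 1, 2, 7, 6, 3, 4, 5]
print([pages[i] for i in selected])  # [8, 7, 6, 5]
```

For an 8-page magazine the selected and unselected runs each hold four pages, which is the consistent count mentioned above.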

The final steps are easy. Just drag the selected pages to the end of the list and, taking care not to lose the selection, select 'Reverse Order', which will result in the first half of A3 page 1, the last page of the magazine, now being the very last page in the file. At this point I save the file tagged as '_a4', where the original scans were tagged '_a3'. If the original scans had to be merged, then it has been useful to save a clean '_a3' as well, leaving a clean set of raw scans. I do need another large disk to store all the work in progress, as I've clocked up 125GB in the last few weeks.
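As a rough check of why the drag and 'Reverse Order' steps give the right result, here is a small Python sketch. The function name and the 8-page example list are illustrative assumptions, not anything taken from PDF Arranger itself.

```python
def move_and_reverse(pages, selected):
    """Drag the selected pages to the end, then reverse only that selection."""
    chosen = set(selected)
    front = [p for i, p in enumerate(pages) if i not in chosen]
    back = [pages[i] for i in sorted(chosen)]
    return front + back[::-1]


# Hypothetical 8-page magazine as it looks after splitting the A3 scans,
# with the 'back' pages at list positions 0, 3, 4 and 7 selected.
split_pages = [8, 1, 2, 7, 6, 3, 4, 5]
selected = [0, 3, 4, 7]
print(move_and_reverse(split_pages, selected))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The unselected pages are already in reading order, and the selected ones are in exactly the opposite order, so a single reverse of the selection is all that is needed.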

The process is much the same when there are inserted A4 pages in the magazine. Rather than splitting the middle sheet when the content flows across both halves of the page, it can be left as an A3 spread. This does cause a problem with the page display in PDF Arranger, which, as the screenshots show, may need a change of scale to see all the pages, but everything just works independently of that different page in the middle.

PDFArrange Mag with a4 inserts a3 middle Scans

PDFArrange Mag with a4 inserts a3 middle Split

PDFArrange Mag with a4 inserts a3 middle Post Split

PDFArrange Mag with a4 inserts a3 middle Post Split Select

PDFArrange Mag with a4 inserts a3 middle Pre Reverse

PDFArrange Mag with a4 inserts a3 middle Finished

OCRmyPDF ... produces a text layer that allows both searching and displaying the entry in the document. As yet I haven't investigated whether the text layer can be accessed outside of the PDF. The bitweaver framework that I use for the websites does have a content search facility, and that should be able to use the text layer if it is loaded into the website. Initially I was having to run ocrmypdf from the command line, but a recent addition to Dolphin has added another layer of functions, including 'OCR PDF-File (English)'. The problem with this way of activating it is that it does not provide any display of where it has reached in the process, something that running from the command line provides a little excessively. I have been running the System Monitor to keep an eye on when the process has started and when it has finished. This works well, as I can start a second and third conversion after the first one has finished running its gs and tesseract processes. This new machine will run 12 pages in parallel, but once they are processed just a single process is used, so 11 more pages can then be processed on a second file.
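For anyone wanting to script the command line route, something along these lines should work. It is a sketch under assumptions rather than my actual setup: the '_ocr' output name and the helper functions are hypothetical, while --language, --jobs and --sidecar are standard ocrmypdf options. The sidecar option writes the recognized text to a separate file, which may be one way to get at the text layer outside the PDF.

```python
import subprocess
from pathlib import Path


def build_ocr_command(source, jobs=12, language="eng", sidecar=False):
    """Build an ocrmypdf command line for one '_a4' file.

    The '_ocr' output naming is a hypothetical convention for this sketch.
    """
    source = Path(source)
    stem = source.stem
    if stem.endswith("_a4"):
        stem = stem[:-3]
    target = source.with_name(stem + "_ocr.pdf")
    cmd = ["ocrmypdf", "--language", language, "--jobs", str(jobs)]
    if sidecar:
        # Also write the recognized text to a plain text file
        cmd += ["--sidecar", str(target.with_suffix(".txt"))]
    cmd += [str(source), str(target)]
    return cmd


def run_ocr(source, **options):
    """Run ocrmypdf and raise if it exits with an error."""
    subprocess.run(build_ocr_command(source, **options), check=True)


print(" ".join(build_ocr_command("magazine_a4.pdf")))
```

Calling run_ocr("magazine_a4.pdf") would then start the conversion, with the usual progress output on the terminal.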

Next Steps
I am now moving on to copies of the magazines that are not A3 size pages when deconstructed. I am trying to sort this out and will be discussing this on the blog.