Scanning smaller format Model Engineer Magazines
This is possibly a process I should have adopted for some of the older magazines, but it is working well for the current 1960s and 70s mags. The bulk of the magazine is black and white, and while the paper is a little coloured, setting the scan parameters to clean up the extra grey elements that this creates produces a much cleaner scan, which also helps the OCR stages. This is achieved by selecting the 'Medium' setting for 'Remove Background Colour' and setting 'Contrast' to +25, so I have made those the default settings for the FTP scan process via the web interface to the printer. This is a bit of fun, as the printer seems to have been programmed by completely different people who have not got a design guide. Each operating mode is completely different in how it is configured and has a different set of pages to set parts of the setup. I am learning all the time, and now have two profiles set up which I can select on the printer, one for the colour page and the second for the grey scale. This would benefit from another crib sheet now that I have worked out what to do ...
Scans are identified by their sequence number, but this counts up page by page, so the numbers are of the order of 24 apart for the body of the scan, while the cover scan is 2 pages. So these need to be renumbered to make it easier to marry the two parts together. Another job that I need to do is to tidy up the main Linux computer as it's a bit short on disk space, so for now I am scanning to an old Windows machine that has a lot more spare space, and copying the renamed files to the Linux box. I use an application called Beyond Compare to make sure that the material on one file system matches that on another. This is essential for managing website mirrors, but is also useful to ensure that mirror copies of the scans are tidy. I have an 8TB disk on the server which has a complete copy of the work done so far, and a second copy is on an external USB disk, in addition to the original raw scans on the Windows machine. Overkill perhaps, but having multiple copies is something I have got used to ensuring happens.
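The idea of checking that one copy of the scans matches another can be sketched in a few lines of Python. This is only an illustration built on the standard library's filecmp module and says nothing about how Beyond Compare works internally. The function name compare_copies and the choice to look only at the top level of each folder are assumptions made for the example.

```python
import filecmp
from pathlib import Path


def compare_copies(source_dir, mirror_dir):
    """Report files missing from the mirror and files whose contents differ.

    Hypothetical helper: only the top level of each folder is examined.
    """
    source_names = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    mirror_names = {p.name for p in Path(mirror_dir).iterdir() if p.is_file()}

    common = sorted(source_names & mirror_names)
    # shallow=False compares file contents instead of trusting size and date
    _same, different, errors = filecmp.cmpfiles(
        source_dir, mirror_dir, common, shallow=False
    )
    return {
        "missing": sorted(source_names - mirror_names),
        # Files that could not be read are listed with the differing ones
        "different": sorted(different + errors),
    }
```

Calling compare_copies on the working folder and its mirror returns a dictionary with two lists. Two empty lists suggest the mirror is complete, at least for that one folder.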
So starting with a stack of magazines that have had the staples removed, I scan the cover, and follow up with the rest of the pages using the correct profile. Then on the Windows machine I can renumber them to the issue number with an extra c or m as appropriate. So we end up with ME3385c_a3.pdf and ME3385m_a3.pdf to pass on. These are then ordered correctly so that when loaded into PDF Arranger the cover pages are first. In the archive, the raw scans are sectioned up in their own directories, so this issue is in the folder 'ME Raw 3350-3399', which is where all of the processing takes place. While a little overkill, the first step in PDF Arranger is to save to ME3385_a3.pdf, which combines both original files. Then with all pages selected, the next step is 'Crop Margins', as while the scanner crops the width of the pages to the smaller size, the height is a couple of centimetres too tall using the 'Ledger' page size. 'Crop Margins' opens a pop-up window which is a little small on my high resolution monitor, so I hit full screen, which gives a much more usable display, and it is then easy to adjust the 'sides' of the page to just touch the top and bottom of the front cover. On a few occasions the front cover has not scanned cleanly and has not been central on the larger page, so I have rescanned the cover to solve the problem and replaced the 'c' file with a new one. This is where Beyond Compare comes into its own, ensuring that the right version is archived. I just had fun when I realised I'd selected -25 rather than +25 for the Contrast on the colour profile. At least I had only been working with a couple of mags, and so only two pages needed redoing.
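The renumbering step could be scripted along these lines. This is a minimal sketch: the ME3385c_a3.pdf and ME3385m_a3.pdf pattern comes from the naming above, while the function name rename_plan and the raw scan names in the usage note are hypothetical, since the scanner's own file names will differ.

```python
from pathlib import Path


def rename_plan(issue, cover_scan, main_scan):
    """Pair each raw scan with its issue-based name, without touching any files.

    Hypothetical helper: the caller decides which raw scan is the cover
    and which is the body of the magazine.
    """
    cover_scan, main_scan = Path(cover_scan), Path(main_scan)
    return [
        # 'c' sorts before 'm', so the cover is listed ahead of the main pages
        (cover_scan, cover_scan.with_name(f"ME{issue}c_a3.pdf")),
        (main_scan, main_scan.with_name(f"ME{issue}m_a3.pdf")),
    ]
```

With made-up raw names, rename_plan(3385, "scan_0101.pdf", "scan_0103.pdf") gives the two old and new paths, and looping over the result calling old.rename(new) would carry out the renaming. Keeping the plan separate from the renaming makes it easy to check the list before anything is changed.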
OK ... While processing the latest batch, some of the front pages were being 'enhanced' and were not looking right. The processing to lose the grey elements of the page is not suitable for the only page of the cover that is printed in colour, so I've added a third profile and run the cover through twice. The first time takes a clean scan of the front side of the cover, and the second suppresses the paper colour for the other three pages of the cover. A few covers do have two colours on the other pages, but these benefit from the processing and look better, so it is only the first page that benefits from NOT being processed. The extra page is tagged with an A so it appears first in the list to select in PDF Arranger, and when looking at the two copies of that A3 scan in that app the difference is quite obvious. The extra steps needed to handle this scan have to take place after the A3 pages have been split into A4 pages. The first four pages of the A4 version need to be reduced to two by deleting the first and fourth pages and, just to be tidy, swapping the other two. This gives a set of pages that can then be shuffled to the right order, pulling all the 'back' pages to the second half of the page set. The grey scale images do benefit from the change in contrast and sharpen up nicely, whereas the same processing leaves the front cover somewhat distorted.
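The cover tidy-up is easy to get wrong by hand, so here it is as a small Python model working on a list of page labels rather than a real PDF. The steps (delete the first and fourth pages, swap the other two) are as described above. The function name tidy_cover and the assumption about which label sits where after splitting are illustrative only.

```python
def tidy_cover(a4_pages):
    """Reduce the doubled-up cover pages at the start of the list to two.

    Assumed layout after splitting, with the untreated 'A' scan first:
    [A back, A front, treated back, treated front, ...everything else].
    """
    if len(a4_pages) < 4:
        raise ValueError("expected at least the four cover pages")
    # Deleting the first and fourth pages leaves the second and third
    second, third = a4_pages[1], a4_pages[2]
    # Swapping the survivors puts the back cover ahead of the front cover
    return [third, second] + list(a4_pages[4:])
```

Under that assumed layout the result starts with the treated back cover followed by the untreated front cover, which is the same back-then-front pattern as every other sheet in the scan.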
Once one has the correctly ordered set of A3 pages, then the procedure to shuffle them to A4 is as before. Occasionally, if a magazine messes up when being scanned the right way around, I may rescan it the wrong way around, but normally it is just a matter of 'Rotate Right', deselect the last page if it's an A3 spread, then select 'Split pages'. This defaults to 50:50, so the fact that the actual page is not A3 wide is compensated for. Once the pages have been split to A4, it is a matter of selecting all the 'back' pages: select the first page on the screen, skip two and select the next two. Repeat until there are only 3 pages left, or two with one being A3. If this is not the case then there will be a mistake in the selections. The pattern of brown squares varies depending on the image size, but it's easy to scan to check that the pages selected are in numerical order. Once selected, they are dragged and dropped after the middle pages, taking care not to lose the selection, and then one can select 'Reverse Order' so that the first page becomes the last, as it is in the magazine. One problem in using the button when selecting multiple pages is that if you move the mouse while clicking it duplicates all the selected pages. Easily undone, but annoying, and one has to start again. I need to keep the speed of selecting down a little and halt at each selection, something that I've managed to master now, but early on I kept speeding up and having to reset more than once in a document.
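The select, move and reverse routine can be checked with a small Python model that works on page numbers only. It is a sketch under one assumption: that each sheet comes out of the split as back page, front page, next front page, next back page, starting with the outermost sheet. That assumption matches the selection pattern described above but has not been checked against the scanner itself. The function names are made up for the example.

```python
def scanned_order(total_pages):
    """Page numbers in the assumed order they appear after splitting to A4."""
    if total_pages <= 0 or total_pages % 4:
        raise ValueError("a stapled magazine has a multiple of four pages")
    order = []
    for sheet in range(total_pages // 4):
        low = 2 * sheet + 1              # first front-half page on this sheet
        high = total_pages - 2 * sheet   # last back-half page on this sheet
        order += [high, low, low + 1, high - 1]
    return order


def shuffle_to_reading_order(pages):
    """Select the 'back' pages, move them to the end, then reverse them."""
    count = len(pages)
    if count < 4 or count % 4:
        raise ValueError("expected a multiple of four A4 pages")
    # Select the first page, skip two, select two, and leave the last three alone
    picked = [i for i in range(count - 3) if i % 4 in (0, 3)]
    picked_set = set(picked)
    front_half = [pages[i] for i in range(count) if i not in picked_set]
    back_half = [pages[i] for i in picked]
    back_half.reverse()  # the 'Reverse Order' step
    return front_half + back_half
```

For an 8 page example, scanned_order(8) gives [8, 1, 2, 7, 6, 3, 4, 5] and shuffle_to_reading_order turns that into [1, 2, 3, 4, 5, 6, 7, 8]. The model splits every sheet, so the case where the centre spread is left as a single A3 page is not covered.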
I now have a shortcut to ocrmypdf and can select up to three A4 versions of the magazines to run at once. Sometimes they were not starting, but I have nailed that down to needing a configuration file in the folder I am working in. If I copy the tesseract_opencl_profile_devices.dat file from another already processed folder then things fire up straight away, but if I forget, then it's just a matter of clicking the shortcut again. Processing does take a few minutes, but eventually the OCRed version will appear as ME3385_a4_ocr.pdf, and it's at this point that I rename it to the longer format 'Model Engineer No3385' and move the finished copy to the archive directory. I've been working in blocks of 50, so this one will end up in the folder 'ME 3350-3399'. Once a block of magazines has been completed, they can be loaded into calibre and have their proper metadata added. I have not yet found how to add that to the PDF file, which would be a nice next step. At the same time as I add an issue to the library, I also update the spreadsheet with the ones I have completed. I've been taking the time to record the publisher and the price on that spreadsheet, and I should probably have made a note of the number of pages as well. Perhaps something for the future. In the short term I have a little list of issue numbers I want to return to in order to read articles that have caught my eye while doing the processing.
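The OCR and archive naming could be wrapped up as below. This is a rough sketch, not the actual shortcut: the input name ME3385_a4.pdf is assumed by analogy with the other file names, the .pdf extension on the final name is assumed, and both function names are made up. The command uses only the basic ocrmypdf form of input file followed by output file.

```python
from pathlib import Path


def ocr_command(issue, folder="."):
    """Build the ocrmypdf command line for one issue (assumed file names)."""
    source = Path(folder) / f"ME{issue}_a4.pdf"      # assumed input name
    target = Path(folder) / f"ME{issue}_a4_ocr.pdf"  # output name from the text
    return ["ocrmypdf", str(source), str(target)]


def archive_path(issue, archive_root, block_size=50):
    """Where the finished file ends up, using blocks of 50 issues."""
    start = issue // block_size * block_size
    block = f"ME {start}-{start + block_size - 1}"
    # The .pdf extension on the long name is an assumption
    return Path(archive_root) / block / f"Model Engineer No{issue}.pdf"
```

Passing the result of ocr_command(3385) to subprocess.run with check=True would start the OCR run, and archive_path(3385, root) gives a destination inside 'ME 3350-3399' for whatever archive root is in use.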