Shiva's PDF ebook tutorial using ABBYY FineReader
This tutorial is not a replacement for the ABBYY FineReader Help File - you will find most of what you need to know there. But as there are many ways to create an OCRed PDF, I will show one way to do it fast and with good results.
Contents
1 Quick and dirty: main steps
  Start screen
  Scan
  Windows
  Save results as PDF
  Results in Acrobat
2 Options and Settings
  Scan, Read and Font Options
  Save and View
3 Image editing - levels
4 OCR in depth
  Areas and tools (image window)
  Text areas with tables
  Background images
  Background images II
  Proofreading and spell checking
5 Finding Fonts
6 Additional Software
  Using Pitstop Part I
1 Quick and dirty: main steps
This is what it looks like when you start FineReader. In this chapter we use the main buttons: “Scan” and “Read”. If you want to work on already scanned images or PDFs with images, you can import these with “Open”.
1 Quick and dirty: main steps - Scan
After we click “Scan”, we get the preview window. In the settings we can choose “ABBYY FineReader Interface” or “native interface”. I always use “native interface” because I have more options there.
Scanner Settings:
Resolution: Even though 300 DPI or more is often recommended, I get good results at 150/200 DPI.
Scanning Mode: Greyscale is optimal for OCR. On colored pages I switch the scanning mode.
Brightness: Manual. No changes here. If needed, adjust it later on pages with images (using curves).
Paper Settings: Draw a rectangle in the preview window - a bit smaller, because this area will be used for all pages. Use a corner of the scanner so that the book is always in the same place.
Image Processing: Check all checkboxes - this is also done in the options/settings - we come to that later.
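To get a feeling for what the resolution setting means, here is a small sketch of the arithmetic. The page size of 6 x 9 inches is only an assumed example, and the function name `scan_size` is made up for this illustration; it has nothing to do with FineReader itself. It assumes greyscale scans with one byte per pixel and ignores compression.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel=1):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Hypothetical book page of 6 x 9 inches, greyscale (1 byte per pixel)
for dpi in (150, 200, 300):
    w, h, size = scan_size(6, 9, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Doubling the resolution quadruples the amount of image data per page, which is one reason why lower resolutions are faster to scan and give smaller files.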
Bend the book open at different pages before you start with page 1. Use one corner of the scanner. Below is the screen of my “native interface” (it looks different depending on the scanner/scan software you use; I couldn't switch to an English menu here). There is just one thing that I regularly use: Descreening on pages with images. The scanning process then takes a bit longer.
1 Quick and dirty: main steps - windows
Back from the preview window (click “close” in the preview window after all pages are scanned), we see the scanned pages.
Left window: Icons of all scanned pages.
Center: The Image window displays an image of the current page. You can edit image areas, page images, and text properties in this window. But more on this later - sometimes, as in this example, the automatic layout analysis gives good results.
Right window: We will see the recognized text after the next step.
So, what we do next is press the “Read” button.
Optional: Normally I save the project at this step for the first time. Depending on the stability of your computer/system, you could close the preview window every 100 pages (check whether your interface keeps the scan area, so that all pages get the same size).
1 Quick and dirty: main steps - Save results as PDF
Now we see the results in the right window.
Red underlined: words not found in the dictionary.
Blue background: FineReader is not sure about these characters.
In the Image window you see the recognized text areas (green rectangles).
Both are very helpful for checking the spelling and making manual corrections (the last step before saving to PDF). If you don't correct errors here, they will show up in the PDF. In that case it is better to save jpg (jpg-PDF) only.
Press the save to PDF button after everything is corrected:
1 Quick and dirty: main steps - Results in Acrobat
The result opened in Acrobat (two pages):
2 Options and Settings - Scan, Read and Font Options
Scan/Open
General: I work with “Do not read and analyze acquired page images automatically” selected. Sometimes it's more work to correct wrongly analyzed pages. If you have to edit the contrast, you have to analyze the layout again.
Read
Training: I tried once (6 hours) to train a user pattern on a difficult scan that I found -> waste of time. Built-in patterns are better. Correcting errors manually takes less time. -> “Use only built-in patterns”
Image processing: all boxes checked (more on exceptions later).
If you click “Fonts” you can set the fonts used in recognized text (screenshot to the right).
Scanner: here you find the selection between the interfaces that I mentioned earlier.
Font Matching
FineReader isn't really good at assigning the right font. I always use just one font. If there are different fonts in headlines etc., I edit that manually later (how-to in the next chapter). How to find out which font is used, where to get it and how to use it, I will explain in the font chapter.
2 Options and Settings - Save and View
Save:
Default paper size: Use original image size (I like the original look).
Save mode: text and pictures only (no jpg text needed - we use a nice font and get a small PDF).
Image Settings: I mostly work with 150 DPI. There are several places to set the final resolution: first at scanning, here, or when optimizing in Acrobat.
Font settings: It's very important to embed fonts. You never know what fonts are installed on the readers' computers. A good layout can be destroyed if another font is used by the reader.
View:
Text window: Highlight uncertain characters and non-dictionary words (important for spell checking later).
3 Image editing - Working on levels
Working on levels on pages with images (copied from the FineReader help file):
Levels allows you to adjust the tonal values of the image by selecting the levels for shadows, highlights, and midtones on a histogram. To increase image contrast, move the right and left sliders on the input levels histogram. The tone corresponding to the position of the left slider will be assumed to be the blackest part of the image, and the tone corresponding to the position of the right slider will be assumed to be the whitest part of the image. The remaining levels between the sliders will be distributed between level 0 and level 255. Moving the central slider to the right or to the left will make the image darker or brighter respectively. To decrease image contrast, adjust the sliders for the output levels.
Grey areas in the background: Move the white slider to a point where about 90% of that curve is “whitened”. Moving the black slider to the beginning of the curve will look best.
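What the input levels do to a single grey value can be sketched in a few lines. This is only an illustration of the general idea of a levels adjustment, not FineReader's actual code: the function name `apply_levels` is made up, and using a gamma value for the central slider is an assumption about how such a slider is usually modelled.

```python
def apply_levels(value, black=0, white=255, gamma=1.0):
    """Map one 8-bit grey value through input levels.

    black and white are the positions of the left and right sliders;
    gamma stands in for the central (midtone) slider (an assumption).
    """
    if white <= black:
        raise ValueError("white point must be above black point")
    # Everything at or below the black slider becomes 0,
    # everything at or above the white slider becomes 255.
    clipped = min(max(value, black), white)
    normalized = (clipped - black) / (white - black)
    # gamma > 1 brightens the midtones, gamma < 1 darkens them
    return round(255 * normalized ** (1.0 / gamma))


# Hypothetical scan line: dark text (about 40) on a grey background (about 200)
row = [40, 45, 120, 200, 205, 230]
print([apply_levels(v, black=40, white=200) for v in row])
```

With the white slider at 200, the grey background and everything brighter ends up pure white, which is the effect described above for grey areas in the background.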
4 OCR in depth - Areas and tools (image window)
Remember that you have to edit the pages before you analyze the layout and read them (sometimes you don't need that).
Preparing the OCR process: This is done in the image window. You define the areas - mainly text or image areas. I will explain the tools/buttons:
Text: This is the main tool to define text areas (green). Don't leave too much space on the left and right; otherwise errors in layout recognition may occur, or dirt on the pages may be recognized as characters. Different font styles or text areas (headline, page number etc.) can be marked with one rectangle.
Picture: With this tool you define picture areas (red). As you see in my example to the left, you can save time by defining a single picture area where text and graphics are mixed. FineReader does a bad job of separating them automatically (sometimes I do that manually).
Table: I use the table tool very often - not only for tables. Examples (contents/index) later.
Background Picture: I rarely use this. One example later.
Edit Image: Mostly used on greyscale images to optimize contrast. In a clean OCR PDF, white areas of an image should be white - how-to later. Also often used for cropping pages - not needed if you scan yourself, but useful if you OCR a scan from someone else that has too much wasted space.
Analyze: This is the automatic layout analysis of the current page. With time you get a feeling for whether it is more effective on a particular page to analyze automatically and then correct the result, or to do it manually only. FineReader sometimes has problems with mixed pages (text/images), tables, and text in columns.
Read: This function will OCR the analyzed areas - if there is no analyzed area, FineReader will analyze the page first. If an area was missing - often the page number - you have to add it manually and read the whole page again.
Select: With the Select tool you can work on the analyzed areas - change size, define rows and columns in tables (how-to on the next page) etc.
4 OCR in depth - text areas with tables
On the contents pages I often work with tables for a clean layout. Once the table area is defined, you get more tools with the Select tool:
Add horizontal separator
Add vertical separator
Merge table cells
1) In this example I start by drawing the rectangle with the table tool
2) Define columns with vertical separators
3) Define rows with horizontal separators
4) Select table cells to combine
5) No more separation needed for good results
4 OCR in depth - background images
Using the background image area: Sometimes I like to have clear characters in schemes and diagrams. Usually you can place image and text areas side by side - sometimes you have to add and cut area parts. When they overlap you can still use background images - first draw the background image area and then overlay the text areas. Both are seen in the screenshot to the right.
Final page (PDF):
[Screenshot: example diagram “Fig. 1. The Great Chain of Being”, with text areas overlaid on a background image area]
4 OCR in depth - background images II
This example shows the high capability of the background images tool. Sometimes there is a background image under the text. You can decide to discard it, but if you want to keep it, FineReader does a good job: Parts of the background image where the original font is seen will be replaced by a mix of the surrounding pixels, so the new text/font can be overlaid. See the close-up to the right.
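The idea of replacing pixels by a mix of the surrounding pixels can be sketched like this. It is a deliberately simple neighbour-averaging fill and only an assumption about the general principle; how FineReader really does it is not documented here, and the name `fill_from_neighbours` is made up for the illustration.

```python
def fill_from_neighbours(image, mask, max_passes=100):
    """Replace masked pixels with the mean of their already known 4-neighbours.

    image: list of rows of grey values.
    mask:  same shape, True where the old printed text sits.
    Returns a new image; the input is left unchanged.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    unknown = {(y, x) for y in range(height) for x in range(width) if mask[y][x]}
    for _ in range(max_passes):
        if not unknown:
            break
        filled = {}
        for y, x in unknown:
            neighbours = [
                result[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width and (ny, nx) not in unknown
            ]
            if neighbours:
                filled[(y, x)] = round(sum(neighbours) / len(neighbours))
        if not filled:
            break  # no known pixel left to borrow from
        for (y, x), value in filled.items():
            result[y][x] = value
        unknown.difference_update(filled)  # these pixels are known now
    return result


# Hypothetical 3 x 3 background (grey 200) with one dark text pixel in the middle
background = [[200, 200, 200], [200, 0, 200], [200, 200, 200]]
text_mask = [[False, False, False], [False, True, False], [False, False, False]]
print(fill_from_neighbours(background, text_mask))
```

The fill works from the edge of the masked region inwards, so wider strokes are covered pass by pass with values taken from the background around them.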
In a close-up we see the comparison of the original scan (above) and the resulting PDF (below).
4 OCR in depth - Proofreading and spell checking
This is the most important part, and it takes 80% of the time of a project. As an example I took a very difficult scan that I found on the net. Sometimes scans have manually underlined words. To delete the underlining, select all and click Underline twice (Ctrl+U).
Setting the sizes of the windows:
At the left I have the icons of the pages to know “where I am” (not necessary). The Image window is also not needed (not seen in the screenshot). The Text window is as big as possible, to see as much text as possible at once, with the font big enough to identify the recognized characters. Below it are about 3 lines of the original scan.
Going through the text:
I start at page 1, click into the text window and jump forward with the PgDn key (back with PgUp). The current cursor position is shown in the window below by a yellow rectangle with a blue outline. If there's a blue-marked word or character, move the cursor there and compare the content of the two windows.
The whole process may take from 5 to 15 hours. It depends on the quality of the scan and the number of pages.
5 Finding the fonts used in the book
Extracting jpg from scans:
You can see this as a step for advanced users and simply use a font you like and already have. You can decide to take a similar (often serif) font or a non-serif font, which is better for screen reading.
When there are different fonts on different pages (usually one for headlines and one for the main text), I right-click the icons of the pages in the left window (select more than one page by holding “Ctrl”) and choose “Save selected images”. I open these pages in Photoshop (freeware alternative: Gimp), crop a sentence (minimum two words) and save it as jpg.
I upload this jpg to http://www.myfonts.com/WhatTheFont
In most cases the correct font will be identified - sometimes you get suggestions for similar fonts. You can use the font names to continue the search here, where similar fonts are shown: http://www.identifont.com/find-font.html
In most cases you find the font with Google - there are a lot of torrents with font packages or single font downloads. There are collections of many GB of sorted fonts. Never install too many fonts - it will slow down your computer. You can use a font manager, but that's not needed - you can search the font archive folder for the font name.
FineReader has some problems with otf fonts. You have to convert them to ttf first. You can do that online here: http://www.freefontconverter.com/ or here: http://onlinefontconverter.com/
results page at whatthefont:
6 Additional software - Using Pitstop Part I
Pitstop is a great plugin for Acrobat Pro. In the last step you can edit the final PDF. You can do everything you need: delete and resize objects, add lines, change colors, copy & paste objects between pages and different PDFs, etc. Just one example where I use it: when I OCR the back cover.
Result in Acrobat Pro:
Analyzed page in Image window
Recognized text
6 Additional software - Using Pitstop Part II
With Pitstop you get many more toolboxes in Acrobat Pro - one is “Pitstop Edit”:
Select text with TouchUp tool
Final page in Acrobat.
If you want to delete, move or scale an object (text line, image or background), you have to select it with this tool. The object will be marked with blue corners or outlines. (Screenshot below)
Right-click - Properties - change font color from white to black
You can move objects with this tool. Sometimes FineReader has layout errors that can't be corrected there. Sometimes you work on scans where text blocks are too close to one side - you can center them with this tool.
Text editing is possible with this tool from another toolbox. Don't change text when you have embedded fonts - do that in FineReader. I use this tool only for text color (example to the right).
Select the black background rectangle(s) and delete.