reveal.js

## Textual Document digitization process
- Original document (book, manuscript, newspaper, microfilm, etc.)
- Scan or photograph the original document
- Transcribe the images to text, by machine (OCR) or by hand
- Optionally encode the transcription for advanced search, display, and analysis
---
## Textual Document digitization process

![Text digitization process](images/text_process.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_1.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_2.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_3.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_4.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_5.png)
---
## Textual Document digitization process

![Text digitization process](images/text_process_6.png)
---
## Images to Text:
### Machine Transcription, or
### Optical Character Recognition (OCR)

- [ABBYY FineReader](http://finereader.abbyy.com)
- [Tesseract](http://code.google.com/p/tesseract-ocr/)
- [Prime Recognition](http://www.primerecognition.com)
- [Adobe Acrobat](https://helpx.adobe.com/document-cloud/help/using-ocr-exportpdf.html#)

---
## Online OCR Services

- [Adobe Acrobat](https://helpx.adobe.com/document-cloud/help/using-ocr-exportpdf.html#)
- <https://www.newocr.com>
- <https://onlineocr.net>
- <https://ocr.space>
---
## Images to Text:
### Human Transcription
- Human transcription is usually done by: 
	- individuals or small teams (e.g., [Chymistry of Isaac Newton](http://chymistry.org/) or [VWWP](http://.indiana.edu/collections/vwwp/))
	- commercial vendors (e.g., [Aptara](http://www.aptaracorp.com/)) that transcribe massive amounts of documents using double-keying and triple-keying methods.
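
Double-keying means two people type the same page independently and the two results are compared; places where they disagree are flagged for review. A minimal sketch of that comparison step, assuming plain-text transcriptions and a word-level comparison (the sample sentences and the function name `keying_disagreements` are made up for illustration, not taken from any vendor's workflow):

```python
import difflib


def keying_disagreements(first_keying, second_keying):
    """Return (first, second) word runs where two transcriptions differ."""
    a = first_keying.split()
    b = second_keying.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a reviewer can check them against the image.
            disagreements.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return disagreements


# Hypothetical output of two typists keying the same line.
typist_one = "The philosophers stone is made of sulphur and mercury"
typist_two = "The philosophers stone is made of sulphur and mercnry"

print(keying_disagreements(typist_one, typist_two))
```

Agreement between two independent keyings is taken as evidence that a word is correct; triple-keying adds a third keying that can break ties.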
---
## Unicode, and other character encodings
- A character encoding is a system that maps characters to some other representation, e.g., code points (numbers), that the computer uses to represent them.
- ASCII, a common text encoding for “plain text” documents, has only 128 code points. Thus, it cannot possibly accommodate the thousands of characters used by historical, current, and fictional scripts and symbol systems.
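
The 128-code-point limit can be checked directly. The sketch below, using an invented sample string, lists the characters that fall outside ASCII and shows that encoding them as ASCII fails:

```python
def non_ascii_chars(text):
    """Return the characters in text whose code point is outside ASCII (0-127)."""
    return [ch for ch in text if ord(ch) >= 128]


sample = "æther ☿"  # illustrative string: a ligature and the mercury symbol

print(non_ascii_chars(sample))

try:
    sample.encode("ascii")
except UnicodeEncodeError as err:
    # ASCII has no code point for these characters, so encoding fails.
    print("cannot encode as ASCII:", err.reason)
```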
---
## Unicode, and other character encodings

[Unicode 15](https://home.unicode.org/announcing-the-unicode-standard-version-15-0/), the current version as of 2022-09-13, has:

- 1,112,064 assignable code points 
- 149,186 characters from 161 of the world’s scripts
- 4,193 newly added CJK (Chinese, Japanese, and Korean) ideographs

See:
- <https://en.wikipedia.org/wiki/Unicode#Codespace_and_Code_Points>
- <https://home.unicode.org/announcing-the-unicode-standard-version-15-0/>
- <https://www.babelstone.co.uk/Unicode/HowMany.html>

---
## Characters vs. glyphs
![Many m glyphs](images/glyphs_1.png)

---
## Characters vs. glyphs

![Many a glyphs](images/glyphs_2.png)

All the same Unicode code point: U+0061

---
## Characters vs. glyphs

![v and nu glyphs](images/glyphs_3.png)

Not the same code point: U+0076 (Latin v) and U+03BD (Greek nu)
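
One way to check which character a glyph stands for is to ask for its code point and official Unicode name. A small sketch using Python's standard `unicodedata` module (the helper name `describe` is only for illustration):

```python
import unicodedata


def describe(ch):
    """Format a character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin a, Latin v, and Greek nu (written as an escape so the two
# look-alike letters cannot be confused in the source).
for ch in ["a", "v", "\u03bd"]:
    print(ch, describe(ch))
```

However a font draws them, the Latin v and the Greek nu remain different characters to the computer, so a search for one will not find the other.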
---
## Unicode encodings
- The mapping between code points (numbers) and characters is fixed; however, there are several ways for computers to store those code point numbers as bytes:
	- UTF-8
	- UTF-16
	- UTF-32
- UTF-8 is the most common Unicode encoding, and it is “ASCII-compatible”.

For more information on Unicode encodings see: <http://www.unicode.org/faq/utf_bom.html>
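
To see how the three encodings differ, the sketch below counts the bytes each one uses for a few sample characters. The big-endian variants are chosen here only so that no byte order mark is added to the count:

```python
def encoded_lengths(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16, and UTF-32."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# Latin a, Greek nu, mercury symbol, and an alchemical symbol (U+1F700).
for ch in ["a", "\u03bd", "\u263f", "\U0001F700"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# "ASCII-compatible": an ASCII character has the same single byte in UTF-8.
print("a".encode("utf-8") == "a".encode("ascii"))
```

UTF-8 uses one to four bytes per code point, UTF-16 uses two or four, and UTF-32 always uses four.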
---
## ASCII
### Character encoding
![ascii chart](images/ascii.png)
---
## Alchemical symbols in Unicode
<https://en.wikipedia.org/wiki/Alchemical_Symbols_(Unicode_block)>
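
The block occupies the range U+1F700 to U+1F77F. As a rough sketch, the assigned characters can be listed with Python's `unicodedata` module; how many are found depends on the Unicode version bundled with the Python installation, and the function name `assigned_in_block` is only illustrative:

```python
import unicodedata

BLOCK_START, BLOCK_END = 0x1F700, 0x1F77F  # Alchemical Symbols block


def assigned_in_block(start, end):
    """Return (code point label, character, name) for each named character."""
    found = []
    for cp in range(start, end + 1):
        # name() returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            found.append((f"U+{cp:04X}", chr(cp), name))
    return found


symbols = assigned_in_block(BLOCK_START, BLOCK_END)
for label, ch, name in symbols[:5]:
    print(label, ch, name)
```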
---
![alchemical symbol in Unicode app](images/alchsymbol_1.png)