What is Alfresco Tesseract OCR?
Alfresco Tesseract OCR is a full-page OCR addon developed by Skytizens. It incorporates the Tesseract Optical Character Recognition engine into the Alfresco Document Content Management system. Combined with the Leptonica Image Processing Library, it can read a wide variety of image formats, recognize the text content, and convert the documents into real text. OCR creates a digital text file from a hard copy, and this module does so in conjunction with the Alfresco system.
Skytizens not only integrated Tesseract with Alfresco but also went a step further by developing additional features for the module. Enhanced features include an audit log of OCR activity with dynamic search capabilities, batch processing, and pre- and post-processing to increase the quality of the OCR output.
Why is Alfresco Tesseract OCR so important?
- Save time – no more double-work; use this module to scan text content from hard copy paper files and make a huge dent in work hours spent typing
- Quick digital searches – OCR software converts scanned text into a word-processing file, giving you the opportunity to search document content using a keyword or phrase
- Increased Accessibility – OCR software is great for increasing your company’s accessibility level; employees with sight issues can benefit throughout the workday from using this technology
- Save money with integration – no need to purchase separate OCR software—this module is effective enough to take care of all your business-related OCR needs
The Benefits of Alfresco Tesseract OCR
Skytizens developed the Alfresco OCR addon because OCR technology makes easy work of converting scanned hard copies into text. Once documents are in text format, everybody can get their work done more easily. For Alfresco users with visual impairments or other accessibility issues, this module facilitates document-related work that might otherwise be prohibitively difficult. The module was developed for multinational companies based in Asia. Thus far, it supports documents written in English, Thai, Japanese, Chinese (Traditional), and Chinese (Simplified).
How Does It Work?
Optical Character Recognition, or OCR, is a method of converting a scanned image into text. People can read a scanned image by recognizing the characters of a language they know, but a computer reads the scan as a series of black and white dots, much as a copy machine does. To the computer, the page of text is just a picture of a page.
OCR is technology that helps a computer function more like human eyes by identifying letters based on their shape and reading them as text. Once a printed page is machine-readable you can do all the things that make word-processing on a computer so much more convenient than handwritten text on paper. You can search by keyword, copy and paste, use the text to index or post in other locations, and compress the text in other file formats. Machine-readable text also becomes accessible as it can be decoded by screen readers, which are tools that use speech synthesizers to read out text for blind and visually-impaired users.
Alfresco OCR is a full-page OCR application integrated with the Alfresco Document Management System. The first step is to scan files into the system. Next, users can send scanned files to the full-page OCR module for processing into text. Once the file is processed, the text content from the file is available for the user to access. Using this module, the text is also indexed and becomes searchable by keyword in Alfresco’s Live Search.
Alfresco users can choose to send their files for OCR processing in three ways:
- Single Files
- Batch Mode
- Auto OCR Conversion (by Folder Rule)
Basic Alfresco OCR settings are located in the Admin Tools menu. Administrators can choose the module language and set the default path for document processing. The module supports five language settings: English, Thai, Japanese, and Chinese in both Traditional and Simplified script.
Users can also designate the destination path where the Alfresco OCR module will store output files after processing.
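The five supported languages correspond to standard Tesseract language codes. How the addon itself calls Tesseract is not described here, so the following is only a minimal sketch, assuming the standard `tesseract` command-line tool is installed. The function name `build_tesseract_command` and the file names are hypothetical.

```python
# Tesseract language codes (traineddata names) for the five supported languages.
TESSERACT_LANG_CODES = {
    "English": "eng",
    "Thai": "tha",
    "Japanese": "jpn",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
}


def build_tesseract_command(image_path, output_base, language="English"):
    """Return the argument list for the tesseract CLI (hypothetical helper).

    Raises KeyError if the language is not one of the five listed above.
    """
    code = TESSERACT_LANG_CODES[language]
    # CLI form: tesseract <image> <output base> -l <language code>
    # Tesseract appends ".txt" to the output base for plain-text output.
    return ["tesseract", image_path, output_base, "-l", code]
```

To actually run the conversion, the returned list could be passed to `subprocess.run(cmd, check=True)`, which requires Tesseract and the matching language data on the machine.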
Main Features
Language Settings – Language settings can be set in two locations. Alfresco administrators can set the default language for the entire system. Individual users can also set the OCR language for scanned files coming into the Alfresco system via their personal profile. A user's OCR language setting can differ from the system default, depending on their scope of work.
- Language – Select the language to recognize on incoming documents:
  - English
  - Thai
  - Japanese
  - Chinese (Simplified)
  - Chinese (Traditional)
Administrative Settings – Save your files to the same directory after OCR processing. If this box is ticked, the system asks for a default destination folder once, and all OCR output files end up in that folder. If the box is left unticked, the Alfresco system asks for a destination folder each time the OCR module is used.
- Admin – Alfresco administrators can set a default destination folder for the entire system. This is located under Admin Tools in the OCR menu.
- Users – Other users can also set a destination path for documents they process individually. This setting is located in the user Profile, under the Skytizens Features menu. The OCR settings appear at the bottom of the options list.
OCR Processing (Single File) – For day-to-day work, users will most often use this feature to process image files one at a time into text format.
Batch Mode (Multiple Files) – When a user needs to process a large volume of files at once, this option is best. The files are selected in the Document Library and sent as a group using the Skytizens Features menu.
Auto OCR Conversion (by Folder Rule) – Users can set up one or more folders that automatically convert incoming documents with the Alfresco OCR addon. The rule is created from the folder’s action menu, under Manage Rules.
- Perform Action: Send to OCR – A user can specify that all files entering this folder will be sent to OCR.
- Relative Folder Path – Designate a destination other than the root folder for OCR-processed files coming from this folder. If no path is chosen, Alfresco saves the completed files in this same root folder. The system uses the Java date format.
- Languages – Choose the detection language for the OCR activities related to this folder.
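How a date pattern is embedded in the Relative Folder Path is not spelled out here. As a rough illustration only, assuming a path made of the common Java pattern letters `yyyy`, `MM`, and `dd`, the sketch below shows how such a pattern would expand into a dated folder. The helper `expand_relative_path` is hypothetical and emulates just those three tokens with their Python `strftime` equivalents. It does not handle quoted literal text or any other Java pattern letters.

```python
from datetime import date

# Subset of Java date-pattern tokens and their strftime equivalents.
# Only these three tokens are emulated in this sketch.
JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def expand_relative_path(pattern, day):
    """Expand a Java-style date pattern (yyyy, MM, dd only) for a given date."""
    fmt = pattern.replace("%", "%%")  # keep any literal percent signs intact
    for java_token, strftime_token in JAVA_TO_STRFTIME.items():
        fmt = fmt.replace(java_token, strftime_token)
    return day.strftime(fmt)


# Example: a file processed on 5 March 2024 with the pattern "yyyy/MM/dd"
print(expand_relative_path("yyyy/MM/dd", date(2024, 3, 5)))  # 2024/03/05
```

Under this assumption, a date-based pattern sorts each day's OCR output into its own subfolder without any manual filing.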
OCR to Live Search – After processing documents using the OCR module, the Alfresco system takes everything one step further. All text converted from images gets indexed. This means users can use the live search function to search documents by content—even scanned hard copies.
OCR Log – Skytizens enhanced the OCR module with a feature that allows users to check OCR-related activity in the system. Under the Tool menu, users will see Skytizens Features. This area of the system shows a dynamic log of files that have undergone OCR processing in the Alfresco system.
OCR Log Search – The log is dynamic because the table can be searched and viewed in many convenient ways.
- Search – Users can search each heading by keyword or phrase to find specific documents
- Calendar – Users can designate a beginning and end date to display only a range of documents processed during that time
- Search Magnifying Glass – Users can click this button to run a search using the keywords and calendar settings they have entered
- Manual Clear – Users can click this button to clear the search bars and calendar settings and start a fresh search of the OCR audit log
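The search behaviour above amounts to filtering log rows by keyword and by a date range. The following is a minimal sketch of that idea, not the addon's actual implementation. The `OcrLogEntry` record, its field names, and the `filter_log` function are hypothetical and loosely modelled on the log headings.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OcrLogEntry:
    """One row of a hypothetical OCR audit log."""
    source_file: str
    pages: int
    requester: str
    timestamp: datetime
    status: str


def filter_log(entries, keyword="", start=None, end=None):
    """Return entries matching the keyword and falling inside [start, end]."""
    keyword = keyword.lower()
    matches = []
    for entry in entries:
        # Case-insensitive keyword match against the text columns.
        text = " ".join([entry.source_file, entry.requester, entry.status]).lower()
        if keyword and keyword not in text:
            continue
        # Date-range check; a missing bound means "no limit" on that side.
        if start is not None and entry.timestamp < start:
            continue
        if end is not None and entry.timestamp > end:
            continue
        matches.append(entry)
    return matches
```

Calling `filter_log(entries)` with no arguments returns every row, which mirrors what the Manual Clear button does to the search form.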
OCR Log Sort – The OCR audit log appears as a dynamic table with the following sortable headings:
- Source File
- Pages
- Modified by – Alfresco username
- Requester – Alfresco username
- Timestamp – date and time of day
- Status – Shows the current stage in the OCR process:
  - Completed – files available for text search, etc.
  - In Progress – files currently being processed
  - In Queue – for batch files, the module will automatically spool and queue OCR jobs
  - Cancelled – files that were cancelled before completion
- Actions
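The four status values describe a simple job lifecycle. As an illustration only, the sketch below models a first-in, first-out batch queue that moves each file through those states. The class and method names are hypothetical, and for simplicity only jobs still waiting in the queue can be cancelled.

```python
from collections import deque
from enum import Enum


class OcrStatus(Enum):
    IN_QUEUE = "In Queue"
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    CANCELLED = "Cancelled"


class OcrJobQueue:
    """Hypothetical FIFO queue that tracks the status of each OCR job."""

    def __init__(self):
        self._pending = deque()
        self.status = {}

    def submit_batch(self, file_names):
        # Every file in a batch is spooled and starts as "In Queue".
        for name in file_names:
            self.status[name] = OcrStatus.IN_QUEUE
            self._pending.append(name)

    def cancel(self, name):
        # In this sketch, only jobs still waiting can be cancelled.
        if self.status.get(name) is OcrStatus.IN_QUEUE:
            self.status[name] = OcrStatus.CANCELLED

    def process_next(self, run_ocr):
        """Run the next non-cancelled job; return its name, or None if idle."""
        while self._pending:
            name = self._pending.popleft()
            if self.status[name] is OcrStatus.CANCELLED:
                continue  # skip jobs cancelled while waiting
            self.status[name] = OcrStatus.IN_PROGRESS
            run_ocr(name)
            self.status[name] = OcrStatus.COMPLETED
            return name
        return None
```

Here `run_ocr` stands in for whatever function performs the conversion of a single file.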
The OCR Audit Log can be exported to Excel file format for easy browsing.
Permissions Control – Access to the feature is managed by Group and Role.
- Group Access – Permission to use the feature is given by the client’s administrator by designating members of a group.
- Role Access – Permission to use this feature on certain files is given by file managers based on role access in the Alfresco system.
Conclusions
From hard copies to searchable text in one effortless step. Cutting-edge technology means your system does all the reading and typing for you.
The Alfresco OCR addon was developed by Skytizens to give users the ability to convert hard copy scans into text content directly within the Document Library. The module not only saves time but makes work easier for every employee, especially those with accessibility needs.
This conversion module gives users options for how, when, and where their document conversions are handled, so it is convenient for all Alfresco users. The enhanced module also supports multiple languages within one system, which is vital for multinational businesses. Once the document conversions are done, users can search through the text content to find specific information. This module makes digitizing documents a breeze.