Alfresco Document Data Capture Pro (Zonal Template Detection)

Zonal OCR and data-field extraction give companies the ability to electronically identify common documents, extract their data, and sort them without manual data entry by a staffer.

The Alfresco Document Data Capture Pro addon gives clients everything that Lite provides, plus advanced functionality. OCR-assisted smart templates and integrated document input cut down on slow data entry. Template auto-detection combined with zonal OCR and data-field extraction improves process efficiency. Custom search saves time. Put simply, this Alfresco addon was developed by Skytizens as a solution specifically for enterprise clients with high-volume OCR needs.

Call for Price

What is Alfresco Document Data Capture Pro?

The Alfresco Document Data Capture Pro addon is the extended version of Alfresco Data Capture Lite. It adds the ability to execute zonal optical character recognition (OCR) and data-field extraction for smarter document processing. This module allows enterprise companies to “read” scanned documents and images, identify the document using smart templates, zone in on important data, recognize the text content, convert the document’s data into real text, and assign that data as document properties in the system. While the Lite version can OCR and handle automated forms processing, the Pro version can also assign extracted data to properties, sort and name files, and more.

Skytizens has not only introduced zonal OCR and data-field extraction into the Alfresco Document Content Management (DCM) system with this addon, but has also developed additional features for sorting and searching the data from these documents after processing.

Why is Alfresco Document Data Capture Pro so important?

  • OCR better – zonal OCR and data-field extraction offer advanced OCR capabilities
  • Work smarter – use intelligent templates to increase the quality of OCR output
  • Increase efficiency – reduce skipping errors beyond Lite and make searching faster
  • Tighter security – avoid data risk by keeping documents in your company system
  • Versatile support – OCR documents in non-PDF file formats with built-in automatic conversion

The Benefits of Alfresco Document Data Capture Pro

The Alfresco Document Data Capture Pro addon is designed to take automated forms processing to the next level without exposing sensitive data to a public cloud. Beyond basic OCR capabilities, this module offers zonal OCR and data-field extraction, which further reduces costs and streamlines data input jobs. Alfresco Document Data Capture Pro has advanced detection capabilities that recognize documents using intelligent templates. It can even handle OCR on TIFF files, which have no text fields and cannot be read by traditional OCR processing. For big OCR jobs, this module is safer, better organized, and more independent than anything else available on the market for enterprise companies.

Skytizens developed zonal OCR and data-field extraction for companies concerned about cyber security. A comparable service for big enterprise OCR jobs is available via API from tech giant Google. To use Google Vision, however, clients must send scans of sensitive documents to the Google cloud, which represents a data security risk. Alfresco Document Data Capture Pro handles everything in-house, so sensitive data never leaves your company cloud. Skytizens is currently developing related machine learning (ML) capabilities to fully match Google Vision’s feature list.

How Does It Work?

The Alfresco Document Data Capture Pro addon is integrated with the Alfresco system. For clients with this module, it appears as an option under Folder Rule. It is also located in the Tools console under Skytizens Features.

This module works in two ways. First, it allows Alfresco users to create document templates for OCR use by extracting data manually. This means a person highlights the parent PDF document to tell the system where the important text is located on the page. Users can add additional parent documents to a single template, empowering the system to process many documents for the same output.

Second, it can auto-extract using zonal OCR and data-field extraction. That means capturing full documents from an input point, usually a designated folder within the Alfresco Document Library. The folder is set up with the Alfresco Document Data Capture Pro rule so that every new document entering the folder gets processed. The module uses an intelligent template, configured using parent documents, to begin by identifying what type of document it is.

Let’s use a factory packing list as an example. The template configuration can house two or more similar parent documents. In this case, multiple formats of the factory packing list are provided as parent documents to be used in building a template. From these documents, an XML document is created that pinpoints the (x,y) coordinates of each text field. Those text field locations are used to make a template map. This map is what Alfresco Document Data Capture Pro uses to compare incoming files and match them with their proper template. After initial template configuration, a template can always be expanded using additional parent documents. If a document does not match any template, it gets moved to an error folder meant to catch exceptions. System administrators can add these unprocessed documents to an existing template to expand it, or use them to create a new template.
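The matching step can be pictured with a minimal sketch. It assumes a template map is simply a list of named zones with (x,y) positions and that an incoming page yields the positions of its detected text blocks. The names `Zone`, `match_score`, and `best_template`, along with the tolerance and threshold values, are hypothetical and are not taken from the Skytizens implementation.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Zone:
    """One text field location in a (hypothetical) template map."""
    name: str
    x: float
    y: float


def match_score(template, detected, tolerance=10.0):
    """Fraction of template zones with a detected text block within `tolerance`."""
    if not template:
        return 0.0
    hits = sum(
        1
        for zone in template
        if any(hypot(zone.x - dx, zone.y - dy) <= tolerance for dx, dy in detected)
    )
    return hits / len(template)


def best_template(templates, detected, threshold=0.8):
    """Return the best-matching template name, or None for the error folder."""
    best_name, best = None, 0.0
    for name, zones in templates.items():
        score = match_score(zones, detected)
        if score > best:
            best_name, best = name, score
    return best_name if best >= threshold else None
```

A document whose text blocks line up with most zones of the factory packing list map would be routed to that template, while one that falls below the threshold for every template would return `None` and be treated as a non-match.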

TIFF files have no text fields or identifiable text objects in the image, so traditional OCR processing cannot read them. Alfresco Document Data Capture Pro handles non-PDF file formats as special cases by setting them aside for manual OCR zone identification. Using a PDF conversion process that adds pixels to a newly created file, this module allows companies to “read” TIFF files using OCR.

For non-PDF files such as TIFF, this module converts the file and adds pixels at the text field locations in order to run OCR and complete the automated forms processing. This OCR-assisted template building, or “zonal OCR”, is the main functionality that differentiates the Lite and Pro Alfresco OCR modules from Skytizens. Lite works by assigning templates to separate upload folders, while Pro auto-detects the template from a single upload folder.

Once the document is identified as a factory packing list (FPL), the FPL template tells the system where to find important objects on the page. Objects are found by location, and the text at these locations gets the OCR treatment. The full text is extracted from each object. Next, the text is trimmed to the required value using regular expressions. The edited text is then used to assign document properties such as Title, Description, Invoice number, Booking number, Plant number, P.O. number, etc. This is called data-field extraction.
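As a rough illustration of this step, the sketch below applies one regular expression per zone and stores the first capture group under a property name. The zone names, sample OCR text, and rule layout are invented for the example; only the property names come from the list above, and the actual addon configuration may look quite different.

```python
import re

# Hypothetical OCR output, keyed by zone name.
zone_text = {
    "invoice_zone": "Invoice No: INV-2023-00417",
    "po_zone": "P.O. 4500012345 / item 10",
}

# Hypothetical rules: (zone name, regular expression, target property).
rules = [
    ("invoice_zone", r"Invoice No:\s*(\S+)", "Invoice number"),
    ("po_zone", r"P\.O\.\s*(\d+)", "P.O. number"),
]


def extract_properties(zone_text, rules):
    """Apply each rule to its zone and keep the first capture group on a match."""
    properties = {}
    for zone, pattern, prop in rules:
        match = re.search(pattern, zone_text.get(zone, ""))
        if match:
            properties[prop] = match.group(1)
    return properties
```

Zones with no match are simply left out of the result, which leaves room for a separate validation or error-handling step.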

Finally, this document gets automatically named and sorted. If a landing folder doesn’t exist, a file path and nesting files are created according to configurations using folder rules or Alfresco SkyArea Sort. Once the newly processed files are in the Alfresco Document Library, they

The custom search feature of Alfresco Document Data Capture Pro is integrated into the search toolbar. You can find a target document using any number or combination of properties. In the case of the Factory Packing List example, this can include filters by document TYPE (FPL), Booking Number, Cartons, Gross Kgs, Invoice Number, Material, Plant, PO, PO item, and more. Custom searches can be saved for later use. They can also be used to pull reporting information using extracted data without running a full system report.
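For readers who want to picture a property-based search outside the toolbar, here is a hedged sketch that builds an Alfresco Full Text Search (AFTS) query and a request body in the shape used by the Alfresco Search REST API. Whether the addon's custom search uses this API is not stated in the source. The type and property names (`sky:fpl`, `sky:bookingNumber`, `sky:plant`) are placeholders, not the real content model.

```python
import json


def build_afts_query(doc_type, filters):
    """Combine a TYPE clause and property clauses into one AFTS query string."""
    clauses = [f'TYPE:"{doc_type}"']
    for prop, value in filters.items():
        # Escape backslashes and double quotes inside the quoted value.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{prop}:"{escaped}"')
    return " AND ".join(clauses)


def search_request_body(query, max_items=50):
    """Serialize the query as a JSON body for a search request."""
    return json.dumps({
        "query": {"language": "afts", "query": query},
        "paging": {"maxItems": max_items, "skipCount": 0},
    })
```

Building the query from a dictionary of properties mirrors the idea of combining any number of filters, and the resulting string could be stored to imitate a saved search.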

Steps for Alfresco Document Data Capture Pro functionality:

  • Create & choose model manager for template setup
  • Template setup using parent document(s) for system layout recognition
  • Link edited data to assigned Alfresco properties
  • Create site folder for input repository
  • Set up file feeder pathways using ERP integration and APIs (drag & drop also enabled)
  • Set rule by folder – SkyArea Pro Rule
  • Execute auto-convert for non-PDF files, OCR, data extraction, and auto-sort on all incoming files
  • Advanced search – SkyArea Pro repository

Alfresco Document Data Capture Pro is made for big OCR jobs and can handle upwards of 1,000 documents per hour. Developed by Skytizens, this module currently reads four languages with a total of five scripts:

  • English
  • Thai
  • Japanese
  • Chinese (Simplified) and Chinese (Traditional)

Main Features

Template Building – Templates for common company documents are created using Alfresco Model Manager.

  1. Auto-detect new documents using OCR
  2. Number of templates is unlimited.
    • Toggle to enable/disable
  3. Number of parent documents per template is unlimited.
    • Add
    • Upload
    • Delete
    • Toggle to enable/disable parent documents

Non-Match Reserve – Documents that don’t match a template are set aside in an error folder. Non-matches can be used as parent documents to expand an existing template or incorporate a new template. Skytizens offers training for clients to do this.

System Integration – Alfresco Document Data Capture Pro uses APIs and other methods of ERP system integration for document input. Drag & drop is also supported.

Batch Upload – Up to 1,000 files/hour.

Auto Convert – Automatically converts non-PDF file formats into PDF with searchable, readable text fields that are OCR compliant. Supported file formats:

  1. TIFF – original saved, new file created

Zonal OCR – Optical character recognition (OCR) for selected zones in PDF and converted files.

  1. Template matching
  2. Manual identification of text fields

Data-field Extraction – Employs zonal OCR to configure parameters for data extraction and assignment. Data is assigned to a property chosen by the user (e.g. File Name, Description, Invoice number, Booking number, Plant number, P.O. number, etc.)

  1. Template setup via linking OCR zones and regular expressions.
    • Title – text field to input
    • Text – highlight text
    • Regular Expression – text field
    • Value – by group it is possible to link 2+ rows (e.g. {group2}{group3})
    • Test result – Double-check correct data extraction.
    • Assign – set extracted data as shown for the selected property
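One way to read the Value field is as a small template in which `{groupN}` placeholders are replaced by parts of the regular expression match. The sketch below assumes the placeholders refer to regex capture groups, which the source does not confirm. The function name `build_value` and the sample text are hypothetical.

```python
import re


def build_value(text, pattern, value_template):
    """Fill {groupN} placeholders in value_template from the regex match on text."""
    match = re.search(pattern, text)
    if match is None:
        return None  # nothing to assign; the test result would flag this

    def substitute(placeholder):
        # Groups that did not participate in the match become empty strings.
        return match.group(int(placeholder.group(1))) or ""

    return re.sub(r"\{group(\d+)\}", substitute, value_template)


pattern = r"(\w+):\s*(\w+)\s+(\d+)\s+(\d+)"
sample = "Booking: BK 2023 0099"
```

With these inputs, `build_value(sample, pattern, "{group2}{group3}")` joins the second and third groups into `BK2023`. A placeholder that names a group the pattern does not define raises `IndexError` in this sketch, so a real configuration screen would want to validate that first.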

Organizational Management – Manage organization using Alfresco rules setup.

  1. Folder rule – sets input folder to trigger Alfresco Document Data Capture Pro processing on all incoming files
  2. Sort rule – employs automatic sort by task

Auto-Sort – Automatically sets file path, automatically creates nested files when necessary, and automatically sorts output documents by folder.

  1. Create folders by Rule
  2. Sort by document type
    • Create folders by individual document type rule
    • Secondary assignment of properties by regular expression (Blank fields will inherit properties from the native file)
    • Secondary permissions – set by User/Group, Role, or inherited from destination folder
    • Validation – set Value and Equals
    • Destination Folder – set Template, Path in template, Set property, Value
    • Options
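The destination folder settings can be sketched as a path template filled in from extracted properties, followed by a check for which folders still need to be created. This is only an illustration under assumed names. The `{type}` and `{plant}` placeholders, the `Processed` root, and both functions are invented here and do not reflect the real rule configuration.

```python
from pathlib import PurePosixPath


def destination_path(path_template, properties, root="Processed"):
    """Fill the path template with property values and place it under root."""
    relative = path_template.format_map(properties)
    return PurePosixPath(root) / relative


def missing_folders(path, existing):
    """List the folders along path that are not in existing, outermost first."""
    needed = []
    current = PurePosixPath()
    for part in path.parts:
        current = current / part
        if str(current) not in existing:
            needed.append(str(current))
    return needed
```

For example, the template `{type}/Plant-{plant}` with type `FPL` and plant `12` resolves to `Processed/FPL/Plant-12`. If only `Processed` exists, the two nested folders would be created before the file is moved.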

Custom Search – Advanced search of documents processed by Alfresco Document Data Capture Pro module. Find your target document using a full view of properties.

  1. Search parameters – can be saved
    • document template
    • keyword
    • assigned properties
  2. Search Criteria – can be saved
    • Query Text
    • Insert Field
  3. Search Results – can be filtered and saved
    • Metadata
    • Order
    • Max Results
    • Set Default

Permissions Control – Access to this addon is managed by Group and Role.

  1. Group Access – Permission to use the module is given by the client’s administrator by designating members of a group.
  2. Role Access – Permission to use this module on certain files is given by file managers based on role access in the system.

Conclusions

The Alfresco Document Data Capture Pro addon gives clients everything that Lite provides, plus advanced functionality. OCR-assisted smart templates and integrated document input cut down on slow data entry. Template auto-detection combined with zonal OCR and data-field extraction improves process efficiency. Custom search saves time. Put simply, this Alfresco addon was developed by Skytizens as a solution specifically for enterprise clients with high-volume OCR needs.
