Advanced OCR Techniques For Extracting Data From Tables

November 11, 2024

13 min read

Extracting data from tables in images or PDFs is a common challenge in document processing. While several free OCR (Optical Character Recognition) tools exist, choosing the right approach depends heavily on your specific use case. This post explores various techniques for achieving high-accuracy data extraction from complex table formats.

Common OCR Solutions and Their Limitations

Popular OCR libraries like TesseractOCR and EasyOCR work well for simple use cases such as:

  • Searchable PDFs
  • High-quality photographs
  • Clean, printed documents

However, these tools often fall short when dealing with:

  • Multiple tables in a single image
  • Handwritten content
  • Poor quality photographs
  • Domain-specific formats

Challenges of Custom Model Training

While training your own OCR model can improve accuracy for specific use cases, it comes with significant drawbacks:

  • Requires substantial computational resources
  • Demands extensive training data
  • Takes considerable development time
  • May not be feasible for rapid prototyping or proof-of-concept (POC) development

Improving OCR Results Without Custom Training

1. Image Pre-processing

Optimize your input images for better OCR accuracy:

  • Remove excess white space
  • Isolate table boundaries
  • Increase contrast between text and background
  • Focus on table extraction before text recognition

The helpers below crop each page down to its content before OCR. The first finds the bounding box of all non-white pixels:

import os

import numpy as np
from PIL import Image
from pdf2image import convert_from_path


def find_content_bounds(image):
    """Find the bounding box of non-white content on a page."""
    # Convert the PIL image to a NumPy array
    img_array = np.array(image)

    # Convert to grayscale if the image has color channels
    if len(img_array.shape) == 3:
        img_gray = np.mean(img_array, axis=2)
    else:
        img_gray = img_array

    # Mark pixels that are not white (threshold: 250)
    non_white = img_gray < 250

    # Find rows and columns that contain any content
    rows = np.any(non_white, axis=1)
    cols = np.any(non_white, axis=0)

    # Take the first and last indices that contain content
    top = np.argmax(rows)
    bottom = len(rows) - np.argmax(rows[::-1])
    left = np.argmax(cols)
    right = len(cols) - np.argmax(cols[::-1])

    # Add padding around the content, clamped to the image size
    padding = 10
    top = max(0, top - padding)
    bottom = min(img_gray.shape[0], bottom + padding)
    left = max(0, left - padding)
    right = min(img_gray.shape[1], right + padding)

    return left, top, right, bottom

def convert_pdf_to_images(pdf_path, output_dir, dpi=600):
    """Convert each PDF page to a cropped, resized PNG and return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    image_paths = []

    try:
        images = convert_from_path(
            pdf_path,
            dpi=dpi,
            fmt='png',
            grayscale=False,
            size=None,
            transparent=False,
            use_pdftocairo=True,
            thread_count=4
        )

        for i, image in enumerate(images):
            # Crop the page down to its content
            left, top, right, bottom = find_content_bounds(image)
            cropped_image = image.crop((left, top, right, bottom))

            # Resize to a fixed width, preserving the aspect ratio
            target_width = 2000
            aspect_ratio = cropped_image.size[1] / cropped_image.size[0]
            target_height = int(target_width * aspect_ratio)

            resized_image = cropped_image.resize(
                (target_width, target_height),
                Image.Resampling.LANCZOS
            )

            # PNG is lossless, so no quality setting is needed
            output_file = os.path.join(output_dir, f'page_{i+1}.png')
            resized_image.save(output_file, 'PNG', optimize=True)

            image_paths.append(output_file)

        return image_paths
    except Exception as e:
        print(f'Failed to convert {pdf_path}: {e}')
        return []

Result:

[Figure: image pre-processing result]

2. Content Segmentation

Break down large documents into manageable chunks:

  • Process content page by page
  • Extract individual tables separately
  • Reduce input size to improve processing efficiency
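
As a rough illustration, the sketch below splits a page image into vertical blocks wherever a long run of blank rows appears. The gap size and whiteness threshold are assumptions that would need tuning per document type:

import numpy as np
from PIL import Image

def split_into_blocks(image_path, min_gap=40, threshold=250):
    """Split a page image into vertical content blocks separated by blank rows."""
    page = Image.open(image_path)
    img_gray = np.array(page.convert('L'))
    content_rows = np.any(img_gray < threshold, axis=1)

    blocks, start, gap = [], None, 0
    for y, has_content in enumerate(content_rows):
        if has_content:
            if start is None:
                start = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # A long blank run ends the current block
                blocks.append((start, y - gap + 1))
                start, gap = None, 0
    if start is not None:
        blocks.append((start, len(content_rows)))

    # Crop one sub-image per block, e.g. one per table
    return [page.crop((0, top, page.width, bottom)) for top, bottom in blocks]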

3. Tool Selection

Choose the right OCR tool for your needs:

  • TesseractOCR and EasyOCR for simple cases
  • docTR for higher accuracy requirements (95-99%)
  • Consider specialized tools for specific document types
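
A minimal docTR pass over the pre-processed pages might look like this; the file path is a placeholder for the PNGs produced earlier:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load a pretrained detection + recognition pipeline
model = ocr_predictor(pretrained=True)

# Run OCR over a pre-processed page image
doc = DocumentFile.from_images(['output/page_1.png'])
result = model(doc)

# result.render() returns the recognized text; result.export() returns a dict
print(result.render())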

Structuring Extracted Data

Traditional Approach: Regular Expressions

  • Requires writing multiple complex patterns
  • Time-consuming to develop and maintain
  • Prone to errors with typos or unsupported languages
  • May need constant updates for new formats
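
To make the brittleness concrete, here is the kind of pattern such a parser accumulates. The invoice-style row format is a made-up example; a single typo, extra column, or locale change breaks the match:

import re

# Matches lines like "2  Widget A  3 x 4.50  13.50" -- one of many
# patterns needed, each tied to one exact layout
ROW_PATTERN = re.compile(
    r'^(?P<qty>\d+)\s+(?P<name>.+?)\s+'
    r'(?P<count>\d+)\s*x\s*(?P<unit>\d+\.\d{2})\s+'
    r'(?P<total>\d+\.\d{2})$'
)

match = ROW_PATTERN.match('2  Widget A  3 x 4.50  13.50')
if match:
    print(match.groupdict())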

Modern Approach: Large Language Models (LLMs)

  • More flexible and adaptable
  • Faster to implement than regex patterns
  • Better handling of variations and errors
  • Leverages OpenAI's API for structured data extraction
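
A minimal sketch of that extraction step with the OpenAI Python client; the model name and output schema here are illustrative assumptions, not a prescribed setup:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_table_text(ocr_text):
    """Ask the model to restructure raw OCR text as JSON rows."""
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # illustrative; any JSON-capable model works
        response_format={'type': 'json_object'},
        messages=[
            {'role': 'system',
             'content': 'Extract the table from the text as JSON with a '
                        '"rows" array of objects, one per table row.'},
            {'role': 'user', 'content': ocr_text},
        ],
    )
    return json.loads(response.choices[0].message.content)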

Hybrid Approach: Combining OCR and LLMs

While OpenAI's models aren't optimized for direct OCR processing, combining traditional OCR tools with LLMs offers the best of both worlds:

  1. Use specialized OCR tools for initial text extraction
  2. Process the extracted text using LLMs for structured data parsing
  3. Achieve higher accuracy and more reliable results
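
Tying the pieces together, an end-to-end pass could look like the sketch below, reusing the hypothetical helpers from the earlier snippets:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)

def extract_tables_from_pdf(pdf_path):
    """OCR each pre-processed page, then let the LLM structure the text."""
    structured = []
    for page_path in convert_pdf_to_images(pdf_path, 'output'):
        ocr_text = model(DocumentFile.from_images([page_path])).render()
        structured.append(parse_table_text(ocr_text))
    return structured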

This hybrid approach provides:

  • Better accuracy than either method alone
  • More flexible processing capabilities
  • Faster development time
  • Improved handling of complex documents

Note: If you are using the OpenAI API for extraction, be aware of the token limits of the model you are using; they are listed in the API documentation.
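
One way to check the input size before sending a request is the tiktoken library; the model name and the chunking strategy below are assumptions:

import tiktoken

def count_tokens(text, model='gpt-4o'):
    """Count the tokens the text will consume for the given model."""
    # Requires a recent tiktoken release that knows this model's encoding
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# If the OCR output approaches the context limit, split it into chunks,
# e.g. one table or one page per request
print(count_tokens('quantity  description  unit price  total'))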

Conclusion

For complex table data extraction, combining traditional OCR tools with modern LLM processing provides the most effective solution. This approach balances accuracy, development time, and processing efficiency while avoiding the need for custom model training.