document-processing

Here are 1,133 public repositories matching this topic...

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated Jul 2, 2026
Rust

OpenSenseNova / SenseNova-Skills

Modular SenseNova skills for building AI-powered office assistants and productivity workflows

agent data-analysis ai-agents document-processing presentation-slides office-automation ai-assistant agent-skills

Updated Jul 1, 2026
Python

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Jun 26, 2026
Python

wxyhgk / retain-pdf

PDF translation that preserves layout, formulas, and structure, suited to scientific and technical documents

pdf ocr translation document-processing scientific-papers typst document-ai layout-preserving

Updated Jul 2, 2026
Python

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

Topdu / OpenOCR

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

ocr document-analysis document-processing scene-text-recognition scene-text-detection ocr-pytorch chineseocr document-parsing

Updated May 20, 2026
Python

eclaire-labs / eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

open-source automation privacy ocr ai rest-api bookmarks self-hosted data-extraction note-taking web-archiving bookmark-manager task-management document-processing on-device-ai local-first personal-knowledge-management ai-assistant llm

Updated May 14, 2026
TypeScript

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Jul 2, 2026
Rust

SylphxAI / pdf-reader-mcp

📄 The PDF intelligence layer for AI agents — Agent Document Twin, evidence-first extraction, visual crops, OCR provenance, trust reports, and benchmark-gated releases. MCP server for Claude, Cursor, VS Code, and any MCP client.

Updated Jul 1, 2026
TypeScript

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

vorojar / Folio-OCR

Open-source batch OCR workbench, a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. Runs entirely locally and is suited to digitizing books and documents.

privacy ocr offline book-digitization document-processing document-ocr layout-detection markdown-export pdf-ocr local-ai ollama batch-ocr glm-ocr abbyy-alternative

Updated Jun 18, 2026
Python

ShapeCrawler / ShapeCrawler

PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office

csharp dotnet presentation slides powerpoint openxml pptx document-processing office-open-xml

Updated Jun 11, 2026
C#

dhlab-epfl / dhSegment

Generic framework for historical document processing

tensorflow python3 segmentation historical-data document-processing

Updated Jul 9, 2021
Python

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the visual template shared across documents.

document-processing document-data-extraction

Updated May 10, 2026
Python

awslabs / project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

aws machine-learning natural-language-processing computer-vision serverless hacktoberfest document-processing aws-cdk generative-ai retrieval-augmented-generation

Updated Jan 22, 2026
TypeScript

iamarunbrahma / pdf-to-markdown

Turn PDFs into clean, structured Markdown

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Jul 2, 2026
Python

bzsanti / oxidizePdf

Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.

Updated Jul 2, 2026
Rust

docling-project / docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

ai convert knowledge-graph document-processing docling

Updated Jul 2, 2026
Python

stevereiner / flexible-graphrag

Python, LlamaIndex, LangChain, Docker Compose: 15 Property Graph, 4 RDF, 10 Vector, OpenSearch, Elasticsearch, Alfresco DBs. 13 data sources (9 auto-sync), KG auto-building, Ontologies, LLMs, Docling or LlamaParse doc processing, GraphRAG, RAG only, Hybrid Search, AI Chat. TypeScript React, Vue, Angular frontends, FastAPI REST backend, MCP Server.

Updated Jun 20, 2026
Python

formkiq / formkiq-core

Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.

aws ocr serverless headless cloud-storage document-database amazon-web-services dms document-management optical-character-recognition document-processing document-management-system document-api document-apis intelligent-document-processing document-layer

Updated Jul 1, 2026
Java
