A fast, helpful, and open-source document parser
Modular SenseNova skills for building AI-powered office assistants and productivity workflows
A system for agentic LLM-powered data processing and ETL
PDF translation that preserves layout, formulas, and structure, suited to scientific and technical documents
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
📄 The PDF intelligence layer for AI agents — Agent Document Twin, evidence-first extraction, visual crops, OCR provenance, trust reports, and benchmark-gated releases. MCP server for Claude, Cursor, VS Code, and any MCP client.
Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.
Open-source batch OCR workbench: a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. Batch OCR workbench that runs entirely locally, a free substitute for ABBYY, suited to digitizing books and documents.
PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office
Generic framework for historical document processing
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
Turn PDFs into clean, structured Markdown
Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.
Transform unstructured documents into validated, rich and queryable knowledge graphs.
Python, LlamaIndex, LangChain, Docker Compose: 15 Property Graph, 4 RDF, 10 Vector, OpenSearch, Elasticsearch, Alfresco DBs. 13 data sources (9 auto-sync), KG auto-building, Ontologies, LLMs, Docling or LlamaParse doc processing, GraphRAG, RAG only, Hybrid Search, AI Chat. TypeScript React, Vue, Angular frontends, FastAPI REST backend, MCP Server.
Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.