Complete Project Guide

Khmer OCR

A high-performance Optical Character Recognition engine for Khmer script — trained on 3 million text lines, powered by deep learning, runs everywhere.

3M
Training Lines
800+
Khmer Fonts
98
Characters
6
Font Styles
5
Platforms
🪟 Windows
🍎 macOS
🐧 Linux
📱 iOS
🤖 Android
What is it?

Reading Khmer Text From Images

KhmerOCR is like Google Lens, but built specifically for Cambodia's language. Give it a photo of any Khmer document — it gives you back editable digital text.

ព្រះរាជាណាចក្រកម្ពុជា → "Kingdom of Cambodia"
📸
Image In, Text Out
Give it a scanned letter, newspaper photo, or government document — it reads all the Khmer text and returns it as editable content.
🧠
Two AI Models Working Together
One model finds WHERE the text is. Another model reads WHAT it says. Together they form a complete OCR pipeline.
🔤
Font Style Detection
Not just the text — it also detects the font style: Regular, Bold, Italic, BoldItalic, Moul, or MoulLight. Preserves document structure.
📄
Multiple Export Formats
Export results as plain .txt, Markdown .md, styled .html, or formatted Word .docx with proper Khmer fonts applied automatically.

Why KhmerOCR exists

Khmer script is uniquely complex — stacked consonants, vowels above and below letters, 800+ font variations. Generic OCR tools like Tesseract fail badly on Khmer. This model was trained from scratch on 3 million real Khmer text lines to solve this properly.

Capabilities

Everything It Can Do

🎯
Khmer Text Recognition
Reads Khmer characters from photos, scans, PDFs — even low-quality images.
⚡
ONNX Runtime Speed
Runs on CPU or GPU via ONNX Runtime — optimized for inference speed on any device.
📐
Layout Detection
Separates text regions from embedded figures and images in complex document layouts.
🔡
Font Style Awareness
Detects Moul, Bold, Regular, Italic and applies correct fonts to .docx output automatically.
📑
PDF Support
Processes multi-page PDFs via PyMuPDF, rendering each page at 2x resolution for accuracy.
🍎
Apple CoreML
Convert models to .mlpackage for native iOS/macOS inference on Apple Neural Engine.
🔧
C++ Native Engine
Full C++ implementation with C API for FFI — embed in any language: Swift, Kotlin, Dart, Rust.
🐍
Python Package
Install with pip, use detect() and recognize() in two lines of Python code.
🎯
Confidence Scores
Every result includes text confidence and font confidence — uncertain text marked red in HTML/docx.
How it works

The OCR Pipeline Step by Step

From raw image to clean Khmer text — 10 steps, two AI models, one clean output.

1
Load the Image or PDF
Accept a JPG, PNG, or PDF file. For PDFs, PyMuPDF renders each page as a 2× resolution RGB image for maximum detail.
INPUT
2
Resize & Normalize for Detection
Resize image to 1024×1024 with black padding (preserving aspect ratio). Convert pixel values from 0–255 to 0.0–1.0. Rearrange axes to CHW format (channels first).
PREPROCESS
3
Detection Model — det.onnx
A YOLO-style neural network scans the entire image and outputs hundreds of candidate bounding boxes — each with x, y, width, height, and class probabilities (text vs. figure).
det.onnx
4
Non-Maximum Suppression (NMS)
Multiple boxes often overlap around the same word. NMS keeps only the highest-confidence box for each region by removing any box that overlaps more than 45% with a better one (IoU > 0.45).
FILTER
5
Sort Boxes into Reading Lines
Boxes are grouped into text lines by comparing their Y positions. If two boxes' Y values differ by less than 50% of a box height, they're on the same line. Lines are then sorted top-to-bottom, words left-to-right.
SORT
6
Crop Each Text Region
For every text bounding box, cut out that small piece from the original image using the detected coordinates. Skip figure boxes (class_id = 0) — those are saved as embedded images.
CROP
7
Preprocess Each Crop for Recognition
Convert to grayscale (using luminosity formula: 0.299R + 0.587G + 0.114B). Resize to exactly 32px tall, keeping aspect ratio for width. Normalize to 0–1 range.
PREPROCESS
8
Recognition Model — rec.onnx
A CTC-based sequence model processes the 32px-tall grayscale image and outputs two things: (1) text_logits — character probabilities at each time step, and (2) font_logits — probabilities for each of 6 font styles.
rec.onnx
9
CTC Decode → Final Text + Font
Apply softmax to get probabilities. Take argmax at each time step. Remove consecutive duplicate characters. Remove blank tokens. Apply softmax to font_logits and pick the highest — giving font style + confidence.
DECODE
10
Assemble & Export
All recognized text lines are assembled in reading order. Exported as .txt (plain), .md (Moul = **bold**), .html (CSS styled, red = uncertain), or .docx (Khmer/Moul fonts, embedded figures).
OUTPUT
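Step 10's font-aware Markdown export can be sketched in a few lines of Python (a minimal illustration, assuming recognize() results are dicts with 'text' and 'font' keys as in the Python usage example later in this guide; the real CLI's formatting may differ):

```python
def to_markdown(lines):
    """Render recognized lines as Markdown, bolding Moul-style text.

    `lines` is a list of text lines; each line is a list of dicts with
    'text' and 'font' keys (the shape returned by recognize()).
    Mapping Moul/MoulLight to **bold** is an assumption from the docs.
    """
    out = []
    for line in lines:
        words = [f"**{r['text']}**" if r["font"].startswith("Moul") else r["text"]
                 for r in line]
        out.append(" ".join(words))
    return "\n\n".join(out)  # one paragraph per recognized line
```

The same loop generalizes to the .html and .docx exporters: only the per-word formatting rule changes.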
AI Models

Two Brains, One Engine

Both models are in ONNX format — run anywhere, on any device, without framework dependencies.

det.onnx
11 MB · Detection Model
Finds WHERE text is in the image. Draws bounding boxes around every text line and figure.
  • Architecture: YOLO-style detector
  • Input Shape: 1 × 3 × 1024 × 1024
  • Input Type: RGB, normalized 0→1
  • Output: Bounding boxes + probs
  • Classes: text (1) · figure (0)
  • Conf Threshold: 0.25
  • NMS IoU: 0.45
rec.onnx
14 MB · Recognition Model
Reads WHAT the text says. Also detects font style — Regular, Bold, Italic, Moul, and more.
  • Architecture: CTC sequence model
  • Input Shape: 1 × 1 × 32 × W
  • Input Type: Grayscale, 32px tall
  • Output 1: text_logits (chars)
  • Output 2: font_logits (6 styles)
  • Vocabulary: 98 Khmer characters
  • Font Styles: Regular · Bold · Italic · BoldItalic · Moul · MoulLight
Project Structure

Every File, Explained

seanghay / KhmerOCR
📁 khmerocr/ Main Python package folder
🐍 __init__.py Core engine — loads models, implements detect() and recognize(), NMS, CTC decoding, softmax
🐍 cli.py CLI tool — accepts image/PDF input, exports to .txt / .md / .html / .docx with font-aware formatting
🧠 det.onnx 11MB detection model — YOLO-style, finds text and figure bounding boxes in 1024×1024 image
🧠 rec.onnx 14MB recognition model — CTC decoder, reads 98 Khmer chars + detects 6 font styles
📁 cpp/ Native C++ engine — same pipeline, maximum performance
📁 src/ C++ source files
⚙️ khmerocr.cpp Main engine — combines Detector + Recognizer, exports C API for FFI
⚙️ detector.cpp Detection pipeline — ONNX Runtime C++ API, parses model output
⚙️ recognizer.cpp Recognition pipeline — CTC decoding, UTF-8 Khmer token lookup
⚙️ image_utils.cpp Image processing — grayscale, bilinear resize, crop, CHW normalize
⚙️ nms.cpp Non-Maximum Suppression + reading-order line sorting
📁 cli/ C++ command-line tool
⚙️ main.cpp Standalone binary with --detect-only, --recognize-only, --json flags
📄 stb_image.h Single-header image loader — loads JPG/PNG with zero extra dependencies
📁 include/khmerocr/ C++ headers — types.h, detector.h, recognizer.h, image_utils.h, khmerocr.h
📁 cmake/ Build helpers
🔨 FindONNXRuntime.cmake Locates ONNX Runtime on macOS, Linux, Windows
🔨 Platform.cmake Platform flags for iOS, Android, Windows, macOS, Linux
🔨 CMakeLists.txt Build config — defines shared library, CLI binary, tests, install targets
📁 scripts/ Utility scripts
🐍 convert_to_coreml.py Converts det.onnx + rec.onnx → Apple CoreML .mlpackage for iOS/macOS Neural Engine
📄 pyproject.toml Python package config — name, version, dependencies, CLI entry point (khmerocr command)
📄 README.md Project overview, installation, usage examples, milestones
Code Examples

Using It in Python

python — install
# Install from GitHub (includes models automatically)
pip install git+https://github.com/seanghay/KhmerOCR
python — basic usage
from PIL import Image
from khmerocr import detect, recognize

# Load your image
img = Image.open('khmer_document.jpg')

# Step 1: find text regions
lines = detect(img)

# Step 2: read each box
for line in lines:
  for box in line:
    x1, y1, x2, y2 = box[:4]
    class_id = int(box[4]) # 0=figure, 1=text
    if class_id == 1:
      result = recognize(img.crop((x1, y1, x2, y2)))
      print(result['text'], '|', result['font'])
shell — CLI commands
# Image → plain text
khmerocr document.jpg

# Image → Word document
khmerocr document.jpg --format docx

# PDF → Markdown
khmerocr report.pdf --format md

# Custom output path
khmerocr scan.png -o result/output.html --format html
c — C API (for any language)
// Create engine
khmerocr_t ocr = khmerocr_create("/path/to/models");

// Recognize text from image data
float confidence;
char* text = khmerocr_recognize(ocr, data, w, h, 3, &confidence);

printf("Text: %s (%.0f%%)\n", text, confidence * 100);

khmerocr_free_string(text);
khmerocr_destroy(ocr);
Under the hood

Key Algorithms Explained Simply

Non-Maximum Suppression (NMS) — After detection, many overlapping boxes cover the same word. NMS keeps only the best one.

  1. Sort all detected boxes by confidence score, highest first
  2. Take the top box — it's the most confident — and add it to the keep list
  3. Calculate the IoU (Intersection over Union) between this box and every remaining box
  4. Remove any box that overlaps more than 45% with the kept box (IoU > 0.45)
  5. Move to the next remaining box and repeat until all boxes are processed
  6. Result: one clean box per word, no duplicates
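In plain NumPy, the greedy loop above looks roughly like this (a minimal sketch, not the package's actual implementation; boxes are assumed to be [x1, y1, x2, y2] arrays):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the most confident box, drop anything overlapping it too much."""
    order = np.argsort(scores)[::-1]   # step 1: highest confidence first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)              # step 2: keep the top box
        # steps 3-5: keep only boxes whose overlap with `best` is below the threshold
        order = order[1:][[iou(boxes[best], boxes[i]) <= iou_thresh
                           for i in order[1:]]]
    return keep
```

With the default threshold, two boxes covering the same word (IoU ≈ 0.68) collapse to the single higher-scoring one.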

CTC Decoding — The recognition model outputs character probabilities at every time step. CTC turns this noisy sequence into clean text.

  1. The model outputs a probability distribution over 98 Khmer characters + blank token at each time step
  2. Apply softmax to convert raw logits to probabilities (values that sum to 1)
  3. At each step, pick the argmax — the character with highest probability
  4. Remove consecutive duplicate characters — e.g. k, k, k, k → k
  5. Remove all blank tokens (special symbol used during training to handle gaps)
  6. Remaining characters form the final recognized word with average confidence score
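A greedy CTC decoder following these six steps fits in a few lines of NumPy (illustrative only; the blank token sitting at index 0 and the charset layout are assumptions, not the model's documented layout):

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def ctc_greedy_decode(logits, charset):
    """Greedy CTC decode of a (time_steps, num_classes) logits matrix."""
    probs = softmax(logits)              # step 2: logits -> probabilities
    ids = probs.argmax(axis=-1)          # step 3: best class per time step
    conf = probs.max(axis=-1)
    chars, scores, prev = [], [], None
    for i, c in zip(ids, conf):
        if i != prev and i != BLANK:     # steps 4-5: collapse repeats, drop blanks
            chars.append(charset[i - 1]) # charset holds the characters; index 0 is blank
            scores.append(c)
        prev = i
    return "".join(chars), float(np.mean(scores)) if scores else 0.0
```

For example, the time-step sequence k, k, blank, a decodes to "ka" with the mean per-character confidence as the score.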

Line Sorting — After NMS, boxes must be grouped into reading lines and sorted in natural reading order.

  1. Sort all boxes by their Y position (top of image = first)
  2. Start the first line with the topmost box
  3. For each next box, check if its Y value differs from the current line's last box by less than 50% of box height
  4. If yes → same line, append to current line
  5. If no → start a new line with this box
  6. Within each line, sort boxes by X position (left → right) for natural reading order
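The grouping-then-sorting logic can be sketched in pure Python (a minimal illustration with [x1, y1, x2, y2] boxes; the real implementation may compare Y positions slightly differently):

```python
def sort_into_lines(boxes, y_tol=0.5):
    """Group boxes into reading lines: top-to-bottom, then left-to-right."""
    boxes = sorted(boxes, key=lambda b: b[1])        # step 1: sort by top Y
    lines = []
    for box in boxes:
        height = box[3] - box[1]
        # steps 3-5: same line if Y differs by less than 50% of box height
        if lines and abs(box[1] - lines[-1][-1][1]) < y_tol * height:
            lines[-1].append(box)
        else:
            lines.append([box])
    # step 6: within each line, sort left -> right
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```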

Image Preprocessing — Raw images must be transformed before feeding into the neural network.

  1. Resize to 1024×1024 — detection model expects fixed input; black-pad sides to preserve aspect ratio
  2. Normalize pixels — divide all values by 255 to map 0–255 range to 0.0–1.0
  3. CHW format — rearrange from Height×Width×Channel to Channel×Height×Width (PyTorch convention)
  4. Grayscale conversion for recognition: gray = 0.299R + 0.587G + 0.114B (luminosity formula)
  5. Resize to 32px tall — recognition model was trained on 32px height; width scales proportionally
  6. Bilinear interpolation — smooth, artifact-free resizing for both upscaling and downscaling
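Both preprocessing paths can be sketched with Pillow and NumPy (a minimal illustration; the padding placement, rounding, and use of Pillow's built-in conversions are assumptions about the real pipeline):

```python
import numpy as np
from PIL import Image

def preprocess_for_detection(img, size=1024):
    """Letterbox to size x size with black padding, normalize to 0-1, CHW."""
    scale = size / max(img.width, img.height)
    resized = img.convert("RGB").resize(
        (round(img.width * scale), round(img.height * scale)), Image.BILINEAR)
    canvas = Image.new("RGB", (size, size))   # new RGB canvas is black = padding
    canvas.paste(resized, (0, 0))             # top-left placement is an assumption
    arr = np.asarray(canvas, dtype=np.float32) / 255.0  # 0-255 -> 0.0-1.0
    return arr.transpose(2, 0, 1)[None]       # HWC -> CHW, plus batch dim

def preprocess_for_recognition(img, height=32):
    """Grayscale via luminosity, resize to 32 px tall, normalize to 0-1."""
    gray = img.convert("L")  # Pillow's L mode applies 0.299R + 0.587G + 0.114B
    width = max(1, round(img.width * height / img.height))
    gray = gray.resize((width, height), Image.BILINEAR)
    return (np.asarray(gray, dtype=np.float32) / 255.0)[None, None]  # 1x1x32xW
```

The outputs match the models' expected input shapes: 1 × 3 × 1024 × 1024 for det.onnx and 1 × 1 × 32 × W for rec.onnx.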
Architecture

Complete Data Flow

How data moves through the entire system from input to output.

📸
INPUT
Image (.jpg/.png) or PDF file
⚙️
PREPROCESS
Resize → 1024×1024 · Normalize 0→1 · CHW format
🧠
det.onnx — DETECTION
YOLO model outputs bounding boxes with class probabilities
🎯
NMS + LINE SORT
Remove duplicate boxes · Group into reading lines · Sort left→right
✂️
CROP EACH TEXT BOX
Cut out each word/line region from original image
🔲
PREPROCESS CROP
Grayscale · Resize to 32px tall · Normalize
🧠
rec.onnx — RECOGNITION
CTC model → text_logits + font_logits
🔤
CTC DECODE
Softmax → argmax → remove blanks/duplicates → Khmer text + font + confidence
📤
OUTPUT
Assembled text → .txt · .md · .html · .docx
Get Started

Install & Run It

python — pip install
# Install directly from GitHub
pip install git+https://github.com/seanghay/KhmerOCR

# This automatically installs:
# numpy · pillow · onnxruntime · click · python-docx · PyMuPDF
c++ — build from source
# Requirements: cmake 3.16+, C++17 compiler, ONNX Runtime 1.14+
cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Then run:
./khmerocr image.png
./khmerocr -j image.png # JSON output
python — coreml conversion (macOS only)
pip install coremltools onnx numpy

# Convert both models to .mlpackage
python scripts/convert_to_coreml.py

# Float16 for smaller size (~50% smaller)
python scripts/convert_to_coreml.py --fp16