Complete Project Guide

Khmer OCR

A high-performance Optical Character Recognition engine for Khmer script — trained on 3 million text lines, powered by deep learning, runs everywhere.

3M
Training Lines
800+
Khmer Fonts
98
Characters
6
Font Styles
5
Platforms
🪟 Windows
🍎 macOS
🐧 Linux
📱 iOS
🤖 Android
What is it?

Reading Khmer Text From Images

KhmerOCR is like Google Lens, but built specifically for Cambodia's language. Give it a photo of any Khmer document — it gives you back editable digital text.

ព្រះរាជាណាចក្រកម្ពុជា → "Kingdom of Cambodia"
📸
Image In, Text Out
Give it a scanned letter, newspaper photo, or government document — it reads all the Khmer text and returns it as editable content.
🧠
Two AI Models Working Together
One model finds WHERE the text is. Another model reads WHAT it says. Together they form a complete OCR pipeline.
🔤
Font Style Detection
Not just the text — it also detects the font style: Regular, Bold, Italic, BoldItalic, Moul, or MoulLight. Preserves document structure.
📄
Multiple Export Formats
Export results as plain .txt, Markdown .md, styled .html, or formatted Word .docx with proper Khmer fonts applied automatically.

Why KhmerOCR exists

Khmer script is uniquely complex — stacked consonants, vowels above and below letters, 800+ font variations. Generic OCR tools like Tesseract fail badly on Khmer. This model was trained from scratch on 3 million real Khmer text lines to solve this properly.

Capabilities

Everything It Can Do

🎯
Khmer Text Recognition
Reads Khmer characters from photos, scans, PDFs — even low-quality images.
⚡
ONNX Runtime Speed
Runs on CPU or GPU via ONNX Runtime — optimized for inference speed on any device.
📐
Layout Detection
Separates text regions from embedded figures and images in complex document layouts.
🔡
Font Style Awareness
Detects Moul, Bold, Regular, Italic and applies correct fonts to .docx output automatically.
📑
PDF Support
Processes multi-page PDFs via PyMuPDF, rendering each page at 2x resolution for accuracy.
🍎
Apple CoreML
Convert models to .mlpackage for native iOS/macOS inference on Apple Neural Engine.
🔧
C++ Native Engine
Full C++ implementation with C API for FFI — embed in any language: Swift, Kotlin, Dart, Rust.
🐍
Python Package
Install with pip, use detect() and recognize() in two lines of Python code.
🎯
Confidence Scores
Every result includes text confidence and font confidence — uncertain text marked red in HTML/docx.
How it works

The OCR Pipeline Step by Step

From raw image to clean Khmer text — 10 steps, two AI models, one clean output.

1
Load the Image or PDF
Accept a JPG, PNG, or PDF file. For PDFs, PyMuPDF renders each page as a 2× resolution RGB image for maximum detail.
INPUT
2
Resize & Normalize for Detection
Resize image to 1024×1024 with black padding (preserving aspect ratio). Convert pixel values from 0–255 to 0.0–1.0. Rearrange axes to CHW format (channels first).
PREPROCESS
3
Detection Model — det.onnx
A YOLO-style neural network scans the entire image and outputs hundreds of candidate bounding boxes — each with x, y, width, height, and class probabilities (text vs. figure).
det.onnx
4
Non-Maximum Suppression (NMS)
Multiple boxes often overlap around the same word. NMS keeps only the highest-confidence box for each region by removing any box that overlaps more than 45% with a better one (IoU > 0.45).
FILTER
5
Sort Boxes into Reading Lines
Boxes are grouped into text lines by comparing their Y positions. If two boxes' Y values differ by less than 50% of a box height, they're on the same line. Lines are then sorted top-to-bottom, words left-to-right.
SORT
6
Crop Each Text Region
For every text bounding box, cut out that small piece from the original image using the detected coordinates. Skip figure boxes (class_id = 0) — those are saved as embedded images.
CROP
7
Preprocess Each Crop for Recognition
Convert to grayscale (using luminosity formula: 0.299R + 0.587G + 0.114B). Resize to exactly 32px tall, keeping aspect ratio for width. Normalize to 0–1 range.
PREPROCESS
8
Recognition Model — rec.onnx
A CTC-based sequence model processes the 32px-tall grayscale image and outputs two things: (1) text_logits — character probabilities at each time step, and (2) font_logits — probabilities for each of 6 font styles.
rec.onnx
9
CTC Decode → Final Text + Font
Apply softmax to get probabilities. Take argmax at each time step. Remove consecutive duplicate characters. Remove blank tokens. Apply softmax to font_logits and pick the highest — giving font style + confidence.
DECODE
10
Assemble & Export
All recognized text lines are assembled in reading order. Exported as .txt (plain), .md (Moul = **bold**), .html (CSS styled, red = uncertain), or .docx (Khmer/Moul fonts, embedded figures).
OUTPUT
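Step 10's font-aware Markdown export can be sketched in a few lines of Python (a minimal illustration, assuming recognize() results are dicts with 'text' and 'font' keys as in the Python usage example later in this guide; the real CLI's formatting may differ):

```python
def to_markdown(lines):
    """Render recognized lines as Markdown, bolding Moul-style text.

    `lines` is a list of text lines; each line is a list of dicts with
    'text' and 'font' keys (the shape returned by recognize()).
    Mapping Moul/MoulLight to **bold** is an assumption from the docs.
    """
    out = []
    for line in lines:
        words = [f"**{r['text']}**" if r["font"].startswith("Moul") else r["text"]
                 for r in line]
        out.append(" ".join(words))
    return "\n\n".join(out)  # one paragraph per recognized line
```

The same loop generalizes to the .html and .docx exporters: only the per-word formatting rule changes.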
AI Models

Two Brains, One Engine

Both models are in ONNX format — run anywhere, on any device, without framework dependencies.

det.onnx
11 MB · Detection Model
Finds WHERE text is in the image. Draws bounding boxes around every text line and figure.
  • Architecture: YOLO-style detector
  • Input Shape: 1 × 3 × 1024 × 1024
  • Input Type: RGB, normalized 0→1
  • Output: Bounding boxes + probs
  • Classes: text (1) · figure (0)
  • Conf Threshold: 0.25
  • NMS IoU: 0.45
rec.onnx
14 MB · Recognition Model
Reads WHAT the text says. Also detects font style — Regular, Bold, Italic, Moul, and more.
  • Architecture: CTC sequence model
  • Input Shape: 1 × 1 × 32 × W
  • Input Type: Grayscale, 32px tall
  • Output 1: text_logits (chars)
  • Output 2: font_logits (6 styles)
  • Vocabulary: 98 Khmer characters
  • Font Styles: Regular · Bold · Italic · BoldItalic · Moul · MoulLight
Project Structure

Every File, Explained

seanghay / KhmerOCR
📁 khmerocr/ Main Python package folder
🐍 __init__.py Core engine — loads models, implements detect() and recognize(), NMS, CTC decoding, softmax
🐍 cli.py CLI tool — accepts image/PDF input, exports to .txt / .md / .html / .docx with font-aware formatting
🧠 det.onnx 11MB detection model — YOLO-style, finds text and figure bounding boxes in 1024×1024 image
🧠 rec.onnx 14MB recognition model — CTC decoder, reads 98 Khmer chars + detects 6 font styles
📁 cpp/ Native C++ engine — same pipeline, maximum performance
📁 src/ C++ source files
⚙️ khmerocr.cpp Main engine — combines Detector + Recognizer, exports C API for FFI
⚙️ detector.cpp Detection pipeline — ONNX Runtime C++ API, parses model output
⚙️ recognizer.cpp Recognition pipeline — CTC decoding, UTF-8 Khmer token lookup
⚙️ image_utils.cpp Image processing — grayscale, bilinear resize, crop, CHW normalize
⚙️ nms.cpp Non-Maximum Suppression + reading-order line sorting
📁 cli/ C++ command-line tool
⚙️ main.cpp Standalone binary with --detect-only, --recognize-only, --json flags
📄 stb_image.h Single-header image loader — loads JPG/PNG with zero extra dependencies
📁 include/khmerocr/ C++ headers — types.h, detector.h, recognizer.h, image_utils.h, khmerocr.h
📁 cmake/ Build helpers
🔨 FindONNXRuntime.cmake Locates ONNX Runtime on macOS, Linux, Windows
🔨 Platform.cmake Platform flags for iOS, Android, Windows, macOS, Linux
🔨 CMakeLists.txt Build config — defines shared library, CLI binary, tests, install targets
📁 scripts/ Utility scripts
🐍 convert_to_coreml.py Converts det.onnx + rec.onnx → Apple CoreML .mlpackage for iOS/macOS Neural Engine
📄 pyproject.toml Python package config — name, version, dependencies, CLI entry point (khmerocr command)
📄 README.md Project overview, installation, usage examples, milestones
Code Examples

Using It in Python

python — install
# Install from GitHub (includes models automatically)
pip install git+https://github.com/seanghay/KhmerOCR
python — basic usage
from PIL import Image
from khmerocr import detect, recognize

# Load your image
img = Image.open('khmer_document.jpg')

# Step 1: find text regions
lines = detect(img)

# Step 2: read each box
for line in lines:
  for box in line:
    x1, y1, x2, y2 = box[:4]
    class_id = int(box[4]) # 0=figure, 1=text
    if class_id == 1:
      result = recognize(img.crop((x1, y1, x2, y2)))
      print(result['text'], '|', result['font'])
shell — CLI commands
# Image → plain text
khmerocr document.jpg

# Image → Word document
khmerocr document.jpg --format docx

# PDF → Markdown
khmerocr report.pdf --format md

# Custom output path
khmerocr scan.png -o result/output.html --format html
c — C API (for any language)
// Create engine
khmerocr_t ocr = khmerocr_create("/path/to/models");

// Recognize text from image data
float confidence;
char* text = khmerocr_recognize(ocr, data, w, h, 3, &confidence);

printf("Text: %s (%.0f%%)\n", text, confidence * 100);

khmerocr_free_string(text);
khmerocr_destroy(ocr);
Under the hood

Key Algorithms Explained Simply

Non-Maximum Suppression (NMS) — After detection, many overlapping boxes cover the same word. NMS keeps only the best one.

  1. Sort all detected boxes by confidence score, highest first
  2. Take the top box — it's the most confident — and add it to the keep list
  3. Calculate the IoU (Intersection over Union) between this box and every remaining box
  4. Remove any box that overlaps more than 45% with the kept box (IoU > 0.45)
  5. Move to the next remaining box and repeat until all boxes are processed
  6. Result: one clean box per word, no duplicates
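In plain NumPy, the greedy loop above looks roughly like this (a minimal sketch, not the package's actual implementation; boxes are assumed to be [x1, y1, x2, y2] arrays):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the most confident box, drop anything overlapping it too much."""
    order = np.argsort(scores)[::-1]   # step 1: highest confidence first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)              # step 2: keep the top box
        # steps 3-5: keep only boxes whose overlap with `best` is below the threshold
        order = order[1:][[iou(boxes[best], boxes[i]) <= iou_thresh
                           for i in order[1:]]]
    return keep
```

With the default threshold, two boxes covering the same word (IoU ≈ 0.68) collapse to the single higher-scoring one.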

CTC Decoding — The recognition model outputs character probabilities at every time step. CTC turns this noisy sequence into clean text.

  1. The model outputs a probability distribution over 98 Khmer characters + blank token at each time step
  2. Apply softmax to convert raw logits to probabilities (values that sum to 1)
  3. At each step, pick the argmax — the character with highest probability
  4. Remove consecutive duplicate characters — e.g. k, k, k, k → k
  5. Remove all blank tokens (special symbol used during training to handle gaps)
  6. Remaining characters form the final recognized word with average confidence score
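A greedy CTC decoder following these six steps fits in a few lines of NumPy (illustrative only; the blank token sitting at index 0 and the charset layout are assumptions, not the model's documented layout):

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def ctc_greedy_decode(logits, charset):
    """Greedy CTC decode of a (time_steps, num_classes) logits matrix."""
    probs = softmax(logits)              # step 2: logits -> probabilities
    ids = probs.argmax(axis=-1)          # step 3: best class per time step
    conf = probs.max(axis=-1)
    chars, scores, prev = [], [], None
    for i, c in zip(ids, conf):
        if i != prev and i != BLANK:     # steps 4-5: collapse repeats, drop blanks
            chars.append(charset[i - 1]) # charset holds the characters; index 0 is blank
            scores.append(c)
        prev = i
    return "".join(chars), float(np.mean(scores)) if scores else 0.0
```

For example, the time-step sequence k, k, blank, a decodes to "ka" with the mean per-character confidence as the score.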

Line Sorting — After NMS, boxes must be grouped into reading lines and sorted in natural reading order.

  1. Sort all boxes by their Y position (top of image = first)
  2. Start the first line with the topmost box
  3. For each next box, check if its Y value differs from the current line's last box by less than 50% of box height
  4. If yes → same line, append to current line
  5. If no → start a new line with this box
  6. Within each line, sort boxes by X position (left → right) for natural reading order
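The grouping-then-sorting logic can be sketched in pure Python (a minimal illustration with [x1, y1, x2, y2] boxes; the real implementation may compare Y positions slightly differently):

```python
def sort_into_lines(boxes, y_tol=0.5):
    """Group boxes into reading lines: top-to-bottom, then left-to-right."""
    boxes = sorted(boxes, key=lambda b: b[1])        # step 1: sort by top Y
    lines = []
    for box in boxes:
        height = box[3] - box[1]
        # steps 3-5: same line if Y differs by less than 50% of box height
        if lines and abs(box[1] - lines[-1][-1][1]) < y_tol * height:
            lines[-1].append(box)
        else:
            lines.append([box])
    # step 6: within each line, sort left -> right
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```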

Image Preprocessing — Raw images must be transformed before feeding into the neural network.

  1. Resize to 1024×1024 — detection model expects fixed input; black-pad sides to preserve aspect ratio
  2. Normalize pixels — divide all values by 255 to map 0–255 range to 0.0–1.0
  3. CHW format — rearrange from Height×Width×Channel to Channel×Height×Width (PyTorch convention)
  4. Grayscale conversion for recognition: gray = 0.299R + 0.587G + 0.114B (luminosity formula)
  5. Resize to 32px tall — recognition model was trained on 32px height; width scales proportionally
  6. Bilinear interpolation — smooth, artifact-free resizing for both upscaling and downscaling
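Both preprocessing paths can be sketched with Pillow and NumPy (a minimal illustration; the padding placement, rounding, and use of Pillow's built-in conversions are assumptions about the real pipeline):

```python
import numpy as np
from PIL import Image

def preprocess_for_detection(img, size=1024):
    """Letterbox to size x size with black padding, normalize to 0-1, CHW."""
    scale = size / max(img.width, img.height)
    resized = img.convert("RGB").resize(
        (round(img.width * scale), round(img.height * scale)), Image.BILINEAR)
    canvas = Image.new("RGB", (size, size))   # new RGB canvas is black = padding
    canvas.paste(resized, (0, 0))             # top-left placement is an assumption
    arr = np.asarray(canvas, dtype=np.float32) / 255.0  # 0-255 -> 0.0-1.0
    return arr.transpose(2, 0, 1)[None]       # HWC -> CHW, plus batch dim

def preprocess_for_recognition(img, height=32):
    """Grayscale via luminosity, resize to 32 px tall, normalize to 0-1."""
    gray = img.convert("L")  # Pillow's L mode applies 0.299R + 0.587G + 0.114B
    width = max(1, round(img.width * height / img.height))
    gray = gray.resize((width, height), Image.BILINEAR)
    return (np.asarray(gray, dtype=np.float32) / 255.0)[None, None]  # 1x1x32xW
```

The outputs match the models' expected input shapes: 1 × 3 × 1024 × 1024 for det.onnx and 1 × 1 × 32 × W for rec.onnx.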
Architecture

Complete Data Flow

How data moves through the entire system from input to output.

📸
INPUT
Image (.jpg/.png) or PDF file
⚙️
PREPROCESS
Resize → 1024×1024 · Normalize 0→1 · CHW format
🧠
det.onnx — DETECTION
YOLO model outputs bounding boxes with class probabilities
🎯
NMS + LINE SORT
Remove duplicate boxes · Group into reading lines · Sort left→right
✂️
CROP EACH TEXT BOX
Cut out each word/line region from original image
🔲
PREPROCESS CROP
Grayscale · Resize to 32px tall · Normalize
🧠
rec.onnx — RECOGNITION
CTC model → text_logits + font_logits
🔤
CTC DECODE
Softmax → argmax → remove blanks/duplicates → Khmer text + font + confidence
📤
OUTPUT
Assembled text → .txt · .md · .html · .docx
Get Started

Install & Run It

python — pip install
# Install directly from GitHub
pip install git+https://github.com/seanghay/KhmerOCR

# This automatically installs:
# numpy · pillow · onnxruntime · click · python-docx · PyMuPDF
c++ — build from source
# Requirements: cmake 3.16+, C++17 compiler, ONNX Runtime 1.14+
cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Then run:
./khmerocr image.png
./khmerocr -j image.png # JSON output
python — coreml conversion (macOS only)
pip install coremltools onnx numpy

# Convert both models to .mlpackage
python scripts/convert_to_coreml.py

# Float16 for smaller size (~50% smaller)
python scripts/convert_to_coreml.py --fp16