Reproducible Synthetic OCR Labels With A Constrained Layout Diff For Donut

Why I went down this rabbit hole

I wanted to generate synthetic training data for an OCR model, but every time I used a “normal” image + text rendering pipeline, the labels drifted. Characters didn’t line up with the generated text, punctuation got mangled, and the model learned artifacts instead of structure.

The key problem wasn’t just generating text. It was generating images whose layout is deterministic and text whose rendering matches exactly what the model should output. After a lot of weekend tinkering, I landed on a workflow that creates reproducible synthetic OCR pairs by:

Rendering text into a constrained layout grid (deterministic positions, fonts, spacing).
Producing the target transcript from the same layout spec.
Enforcing reproducibility with a seed, and verifying it by rendering the same spec twice and comparing the results.
Persisting the exact layout spec alongside the image.

Below is a working implementation that generates line-level OCR training samples for training pipelines that expect image + transcript pairs. It works well for models like Donut, a vision-to-text model whose output is structured text, often modeled as a “document transcription” string.

The concept: “Layout spec first” (not “pixels first”)

Instead of “draw text randomly onto an image,” I generate a LayoutSpec that contains:

the seed
page size (width and height)
a list of rows
per-row: baseline y-position, x-start, font size, and the exact text string

The grid itself is not stored in the spec; it only limits which x and y values the generator may pick. The transcript is derived directly from the spec: I join the row strings with newlines, in the intended reading order.

Then I render the image from that exact same spec. If I regenerate with the same seed and spec, I get the same pixels and the same transcript.
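
Here is a minimal sketch of that claim, using only the standard library. TinySpec, tiny_spec_from_seed, and tiny_transcript are hypothetical stand-ins for the real classes defined later in this post; the only point is that a private random.Random(seed) makes the spec, and therefore the transcript, a pure function of the seed.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TinySpec:
    """Hypothetical stand-in for a layout spec: a seed plus row strings."""
    seed: int
    rows: Tuple[str, ...]


def tiny_spec_from_seed(seed: int) -> TinySpec:
    # A private Random instance means other code that touches the global
    # `random` module cannot shift the sequence of draws.
    rng = random.Random(seed)
    rows = tuple(f"REF-{rng.randint(1000, 9999)}" for _ in range(3))
    return TinySpec(seed=seed, rows=rows)


def tiny_transcript(spec: TinySpec) -> str:
    # Reading order is simply the order of the rows in the spec.
    return "\n".join(spec.rows)


a = tiny_spec_from_seed(7)
random.seed(123)  # disturbing the global RNG does not affect the spec
b = tiny_spec_from_seed(7)
print(a == b, tiny_transcript(a) == tiny_transcript(b))  # True True
```

Pixel equality additionally depends on the font file, which is why the renderer takes the font path as an explicit input.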

Install dependencies

pip install pillow python-slugify

(You only need Pillow for the renderer. I use python-slugify to make filesystem-safe filenames.)

Working code: deterministic synthetic OCR dataset generator

1) The renderer and transcript generator

import json
import random
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List

from PIL import Image, ImageDraw, ImageFont
from slugify import slugify


@dataclass(frozen=True)
class RowSpec:
    text: str
    x: int
    y: int  # baseline-ish in pixels
    font_size: int


@dataclass(frozen=True)
class LayoutSpec:
    seed: int
    width: int
    height: int
    rows: List[RowSpec]


def load_font(font_path: str, font_size: int) -> ImageFont.FreeTypeFont:
    return ImageFont.truetype(font_path, font_size)


def render_layout(spec: LayoutSpec, font_path: str) -> Image.Image:
    """
    Deterministically render a white page with black text lines at fixed positions.
    """
    img = Image.new("RGB", (spec.width, spec.height), "white")
    draw = ImageDraw.Draw(img)

    for row in spec.rows:
        font = load_font(font_path, row.font_size)

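        # Pillow's default anchor is the top-left of the text, so approximate a
        # baseline by shifting up by the rendered height. The height depends on the
        # glyphs in row.text, so this is repeatable per string and font file, but it
        # is not a true typographic baseline.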
        bbox = draw.textbbox((0, 0), row.text, font=font)
        text_height = bbox[3] - bbox[1]
        top_y = row.y - text_height

        draw.text((row.x, top_y), row.text, fill="black", font=font)

    return img


def transcript_from_spec(spec: LayoutSpec) -> str:
    """
    The transcript is derived from layout order: top-to-bottom by rows list.
    """
    return "\n".join(row.text for row in spec.rows)

What each block does (and why)

LayoutSpec and RowSpec: I treat layout as data. This is the “single source of truth.”
render_layout: It draws text at exact (x, y) positions using a real font file.
transcript_from_spec: It generates the ground-truth OCR text directly from the same spec.
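The baseline-to-top conversion (top_y = row.y - text_height): Pillow positions text from its top-left corner by default, so this shifts each line up by its rendered height. The result is repeatable for a given font file and string, but it only approximates a baseline; passing anchor="ls" to draw.text would place the text on the true baseline instead.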

2) “Constrained layout diff”: generating variants without breaking alignment

Here’s the trick: I generate controlled variations while keeping the layout stable.

I vary the text content, the number of rows (3 to 5), and the font size (24, 26, or 28),
while every position comes from a fixed set (two x columns and five fixed y positions).

def generate_rows(rng: random.Random, width: int) -> List[RowSpec]:
    """
    Generate 3–5 lines of synthetic, OCR-friendly text with deterministic placement.
    """
    # Fixed reading order and stable geometry
    margin_left = 40
    grid_x_step = 220

    # Fixed y positions for stable layout
    base_ys = [90, 165, 240, 315, 390]  # choose as many as needed
    n_rows = rng.randint(3, 5)

    # Simple OCR-friendly characters:
    # - include digits, uppercase, and punctuation minimally
    # - avoid weird unicode combos for deterministic rendering
    prefixes = ["INV", "PO", "REF", "DOC", "ID"]
    domains = ["ACME", "NOVA", "ZENITH", "ORBIT", "KITE"]
    suffixes = ["R", "S", "T", "Q"]

    rows = []
    for i in range(n_rows):
        prefix = rng.choice(prefixes)
        dom = rng.choice(domains)
        suffix = rng.choice(suffixes)

        number = rng.randint(1000, 9999)
        code = f"{prefix}-{dom}-{number}{suffix}"

        # Constrain x so the same layout "shape" is used
        x = margin_left + rng.choice([0, grid_x_step])

        # Constrain font sizes slightly but deterministically
        font_size = rng.choice([24, 26, 28])

        rows.append(RowSpec(text=code, x=x, y=base_ys[i], font_size=font_size))

    return rows


def make_layout_spec(seed: int, width: int = 700, height: int = 500) -> LayoutSpec:
    rng = random.Random(seed)
    rows = generate_rows(rng, width=width)
    return LayoutSpec(seed=seed, width=width, height=height, rows=rows)

This is the “constrained layout diff” part: the randomness is allowed in the text, but positions come from a constrained set.
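
To make the "diff" idea concrete, here is a small sketch that compares two layouts field by field and checks that every row stays inside the allowed geometry. The Row class mirrors RowSpec so the snippet runs on its own, and constraint_violations, layout_diff, and the ALLOWED_* sets are hypothetical names I made up for illustration; the values are copied from generate_rows above.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Row:
    """Mirror of RowSpec, repeated so the sketch is self-contained."""
    text: str
    x: int
    y: int
    font_size: int


# Assumed constraint sets, copied from the values used in generate_rows.
ALLOWED_X = {40, 260}  # margin_left, margin_left + grid_x_step
ALLOWED_Y = {90, 165, 240, 315, 390}
ALLOWED_FONT_SIZES = {24, 26, 28}


def constraint_violations(rows: List[Row]) -> List[str]:
    """Return one message per field that leaves the allowed geometry."""
    problems = []
    for i, row in enumerate(rows):
        if row.x not in ALLOWED_X:
            problems.append(f"row {i}: x={row.x} not allowed")
        if row.y not in ALLOWED_Y:
            problems.append(f"row {i}: y={row.y} not allowed")
        if row.font_size not in ALLOWED_FONT_SIZES:
            problems.append(f"row {i}: font_size={row.font_size} not allowed")
    return problems


def layout_diff(a: List[Row], b: List[Row]) -> Dict[int, Tuple[str, ...]]:
    """Map row index -> names of the fields that differ between two layouts.

    Rows beyond the length of the shorter layout are ignored.
    """
    diff = {}
    for i, (ra, rb) in enumerate(zip(a, b)):
        changed = tuple(
            f.name for f in fields(Row) if getattr(ra, f.name) != getattr(rb, f.name)
        )
        if changed:
            diff[i] = changed
    return diff


a = [Row("INV-ACME-1234Q", 40, 90, 24), Row("PO-NOVA-5678R", 260, 165, 26)]
b = [Row("DOC-KITE-4321S", 40, 90, 24), Row("PO-NOVA-5678R", 260, 165, 28)]
print(layout_diff(a, b))         # {0: ('text',), 1: ('font_size',)}
print(constraint_violations(b))  # []
```

A diff that only ever reports text and font_size, with an empty violations list, is what "constrained" means here in practice.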

3) Dataset generation with a reproducibility check

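This check matters: I render twice from the same spec and compare the results, because a sample is only trustworthy if the same spec always gives the same transcript and the same pixels. Then I persist the spec JSON next to the image, so later training can always reproduce the ground truth.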

def generate_dataset(
    out_dir: str,
    font_path: str,
    n_samples: int = 50,
    start_seed: int = 1000,
) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    img_dir = out / "images"
    img_dir.mkdir(exist_ok=True)

    meta_dir = out / "meta"
    meta_dir.mkdir(exist_ok=True)

    pairs = []  # lines: image_path \t transcript

    for i in range(n_samples):
        seed = start_seed + i
        spec = make_layout_spec(seed)

        transcript = transcript_from_spec(spec)
        img = render_layout(spec, font_path)

        # Determinism check: the transcript is a pure function of the spec, and a
        # second render of the same spec (same font file) should match pixel for pixel.
        assert transcript == transcript_from_spec(spec)
        assert img.tobytes() == render_layout(spec, font_path).tobytes()

        # slugify already replaces path separators and newlines, so no os.sep handling is needed.
        fname = f"sample_{seed}_{slugify(transcript)[:40]}.png"
        img_path = img_dir / fname
        meta_path = meta_dir / f"{fname}.json"

        img.save(img_path)
        meta_path.write_text(json.dumps(asdict(spec), indent=2), encoding="utf-8")

        pairs.append(f"{img_path}\t{transcript}")

    # Write a manifest file for training scripts
    (out / "manifest.tsv").write_text("\n".join(pairs), encoding="utf-8")


if __name__ == "__main__":
    # IMPORTANT: set this to a real font on your machine.
    # The same font file is required for pixel-perfect reproducibility.
    FONT_PATH = "DejaVuSans.ttf"

    generate_dataset(
        out_dir="synthetic_ocr_dataset",
        font_path=FONT_PATH,
        n_samples=30,
        start_seed=4242,
    )

Notes about fonts

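For perfect reproducibility, the same font file must be present at FONT_PATH. On many Linux systems, DejaVuSans.ttf exists by default; on macOS/Windows you may need to point to another font. Pixel-identical output also assumes the same Pillow and FreeType versions, since text rasterization can change between releases.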
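
One way to make the "same font file" requirement checkable is to record a hash of the font next to the spec. This is a sketch under my own assumptions, not part of the generator above: file_sha256 and font_matches are hypothetical helper names, and where you store the digest (for example under a "font_sha256" key in the meta JSON) is up to you.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 16) -> str:
    """Hex SHA-256 of a file, read in chunks so large fonts are fine."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_matches(path: str, expected_sha256: str) -> bool:
    """True when the font at `path` is byte-identical to the recorded one."""
    return file_sha256(path) == expected_sha256
```

Note that hashing needs a path that actually resolves on disk. A bare name like "DejaVuSans.ttf" may load in Pillow because it searches the system font directories, but Path("DejaVuSans.ttf").open() only looks in the current working directory.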

What the output looks like

After running the script, you’ll get:

synthetic_ocr_dataset/images/sample_....png
synthetic_ocr_dataset/meta/sample_....json (the layout spec)
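synthetic_ocr_dataset/manifest.tsv, where each record starts with the image path and a tab, and the remaining rows of a multi-row transcript continue on the following lines:

synthetic_ocr_dataset/images/sample_4242_inv...png    INV-ACME-1234Q
DOC-ORBIT-5678S
ID-NOVA-2222T

Each line in the image corresponds to a line in the transcript. Because transcripts contain newlines, anything that reads the manifest has to treat a line without a tab as a continuation of the previous record. (The file name is lowercase because slugify lowercases its input by default.)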

How this plugs into Donut-style training (practical mapping)

Many Donut training setups expect the dataset entry to contain:

the image
the target text (often a structured string)

Even if your exact Donut training code uses a different target format, the same approach applies: the transcript is deterministic and aligned with the rendered image.

A minimal “dataset wrapper” might look like this:

from pathlib import Path
from PIL import Image

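class OCRPairDataset:
    def __init__(self, manifest_path: str):
        self.items = []
        for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
            if "\t" in line:
                # A tab marks the start of a new record: image path, then first row.
                img_path, first_row = line.split("\t", 1)
                self.items.append((img_path, first_row))
            elif self.items:
                # No tab: a further row of the previous multi-line transcript.
                img_path, transcript = self.items[-1]
                self.items[-1] = (img_path, transcript + "\n" + line)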

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx: int):
        img_path, transcript = self.items[idx]
        img = Image.open(img_path).convert("RGB")
        return img, transcript


ds = OCRPairDataset("synthetic_ocr_dataset/manifest.tsv")
img, text = ds[0]
print(text)

The important part is that text is derived from the same LayoutSpec that produced the image pixels.
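
If your training code reads JSON Lines rather than TSV, the same pairs can be written as one JSON object per line, which also sidesteps newlines inside transcripts. Treat the field names in this sketch as an assumption: file_name, ground_truth, gt_parse, and text_sequence follow the metadata.jsonl layout of the Donut reference repository as I understand it, so check them against whatever your training code actually expects. to_jsonl and from_jsonl are hypothetical helper names.

```python
import json
from typing import Iterable, List, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (image file name, transcript) pairs as JSON Lines."""
    lines: List[str] = []
    for file_name, transcript in pairs:
        # ground_truth is itself a JSON string nested inside the record.
        # json.dumps escapes newlines, so each record stays on one line.
        ground_truth = json.dumps({"gt_parse": {"text_sequence": transcript}})
        lines.append(json.dumps({"file_name": file_name, "ground_truth": ground_truth}))
    return "\n".join(lines)


def from_jsonl(text: str) -> List[Tuple[str, str]]:
    """Inverse of to_jsonl: recover (file name, transcript) pairs."""
    pairs = []
    for line in text.splitlines():
        record = json.loads(line)
        parsed = json.loads(record["ground_truth"])
        pairs.append((record["file_name"], parsed["gt_parse"]["text_sequence"]))
    return pairs


pairs = [
    ("sample_1.png", "INV-ACME-1234Q\nDOC-ORBIT-5678S"),
    ("sample_2.png", "ID-NOVA-2222T"),
]
print(from_jsonl(to_jsonl(pairs)) == pairs)  # True
```

The round trip is the property that matters: whatever container format you pick, the transcript that comes out must be byte-for-byte the one derived from the spec.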

A quick sanity check: verify transcript/image alignment visually

Here’s a tiny script to open one sample and print its transcript:

from PIL import Image
from pathlib import Path

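# Records can span several lines (one per transcript row), so collect the rows
# that follow the first record until the next line that contains a tab.
lines = Path("synthetic_ocr_dataset/manifest.tsv").read_text(encoding="utf-8").splitlines()
img_path, first_row = lines[0].split("\t", 1)
rows = [first_row]
for line in lines[1:]:
    if "\t" in line:
        break
    rows.append(line)
transcript = "\n".join(rows)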

img = Image.open(img_path).convert("RGB")
img.show()
print(transcript)

When the transcript matches the drawn lines exactly, you avoid a whole class of “model learns mismatch noise” bugs.
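
The visual check can be complemented by a non-visual one: rebuild the transcript from the persisted spec JSON and compare it with the transcript you are about to train on. This sketch assumes the meta files have the shape produced by json.dumps(asdict(spec)) above, that is, a "rows" list whose entries have a "text" key; transcript_from_meta and find_mismatches are hypothetical helper names.

```python
import json
from typing import Iterable, List, Tuple


def transcript_from_meta(meta_json: str) -> str:
    """Rebuild the transcript from a persisted spec: row texts joined by newlines."""
    spec = json.loads(meta_json)
    return "\n".join(row["text"] for row in spec["rows"])


def find_mismatches(samples: Iterable[Tuple[str, str, str]]) -> List[str]:
    """samples holds (name, transcript, meta_json); return names that disagree."""
    return [
        name
        for name, transcript, meta_json in samples
        if transcript != transcript_from_meta(meta_json)
    ]


meta = json.dumps({
    "seed": 1, "width": 700, "height": 500,
    "rows": [
        {"text": "INV-ACME-1234Q", "x": 40, "y": 90, "font_size": 24},
        {"text": "ID-NOVA-2222T", "x": 260, "y": 165, "font_size": 26},
    ],
})
samples = [
    ("good", "INV-ACME-1234Q\nID-NOVA-2222T", meta),
    ("bad", "INV-ACME-1234Q", meta),  # second row is missing
]
print(find_mismatches(samples))  # ['bad']
```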

What I learned (and what actually made it work)

I used to treat synthetic OCR as “render text randomly and hope it lines up.” The breakthrough was switching to layout-spec-first generation with constrained randomness and persisted spec JSON. Once I did that, I could regenerate the exact same training pairs reliably, and Donut-style vision-to-text training became dramatically less flaky.

In short: deterministic geometry + transcript derived from the same source-of-truth spec is the difference between synthetic data that’s merely plausible and synthetic data that’s actually learnable.