📜 Farhangan OCR Workflow

This repository contains the high-performance OCR pipeline for farhangan.com, specifically optimized for Persian scholarly manuscripts and Naskh-style typography.

🚀 1. Running OCR on New Images

To transcribe a new image using the custom farhangan_v1_best.mlmodel, use the following command. This pipeline includes binarization (cleaning the image) and layout analysis (segmentation).
Bash

# kraken's -i option takes two arguments: the input image, then the output file.
~/kraken_env/bin/kraken -i input_image.png output_text.txt \
  binarize segment -bl \
  ocr -m farhangan_v1_best.mlmodel

Batch Processing

If you have a folder full of images, use the provided batch script:

Bash

./ocr_batch.sh
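
The contents of ocr_batch.sh are not shown here, so the following is only a Python sketch of what a batch loop over a folder might look like. It assumes kraken's `-i INPUT OUTPUT` form, writes each result next to its image as `<name>.txt`, and treats the listed image extensions as the inputs; the function names and the extension set are illustrative, not part of the repository.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Path and model name are taken from the single-image command in this README.
KRAKEN = Path.home() / "kraken_env" / "bin" / "kraken"
MODEL = "farhangan_v1_best.mlmodel"
# Assumed set of input formats; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_command(image: Path, model: str = MODEL) -> list[str]:
    """Return the kraken argument list for one image (output: same name, .txt)."""
    output = image.with_suffix(".txt")
    return [str(KRAKEN), "-i", str(image), str(output),
            "binarize", "segment", "-bl", "ocr", "-m", model]


def run_batch(folder: Path, dry_run: bool = True) -> list[list[str]]:
    """Build one command per image in `folder`; execute them unless dry_run."""
    commands = [build_command(p) for p in sorted(folder.iterdir())
                if p.suffix.lower() in IMAGE_SUFFIXES]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch on the first failing page.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_batch(Path("scans"))` (with `scans` as a hypothetical folder) only returns the commands, which makes it easy to inspect them before passing `dry_run=False`.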

🛠️ 2. The Fine-Tuning Loop (The 99% Strategy)

When the model makes a mistake, follow these three steps to retrain and improve accuracy.

Step A: Generate New Training Data

Extract line-snippets from a page to create "Ground Truth" pairs (image + text).
Bash

python3 extract_lines.py --image new_manuscript.png --xml results.xml

This creates .png and .gt.txt pairs in the ground_truth/ folder.
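
Before moving on, it can help to confirm that every line image has a transcription and vice versa. The sketch below assumes the pairs share a base name (for example `line_001.png` and `line_001.gt.txt`, a hypothetical naming); the helper name `find_unpaired` is illustrative.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (images without a transcription, transcriptions without an image)."""
    root = Path(ground_truth_dir)
    # Strip the full suffix so "x.png" and "x.gt.txt" both reduce to "x".
    image_stems = {p.name[:-len(".png")] for p in root.glob("*.png")}
    text_stems = {p.name[:-len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)
```

Both returned lists should be empty before training; anything listed is a line that would be skipped or would fail to load.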

Step B: Run Scholarly Corrections

Never correct common errors manually. Run the automated correction script to fix "Persian vs. Arabic" character confusion and common hallucinations.
Bash

bash clean_gt.sh
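
The script itself is not reproduced in this README. As a rough Python illustration of the kind of substitution it performs, the sketch below handles only the two Arabic/Persian pairs named in the tips section (ك → ک and ي → ی) and applies NFC normalization first; the real script may cover more cases.

```python
import unicodedata

# Arabic code points and the Persian letters that replace them.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u0643": "\u06a9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH (ک)
    "\u064a": "\u06cc",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH (ی)
})


def normalize_line(text):
    """NFC-normalize a transcription line, then swap Arabic kaf/yeh for Persian."""
    return unicodedata.normalize("NFC", text).translate(ARABIC_TO_PERSIAN)
```

Normalizing before translating means a yeh followed by a combining hamza is first composed into ئ and is then left alone, rather than being turned into a Persian yeh with a stray combining mark.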

Step C: Execute Fine-Tuning

Train the next version of the model using your existing "Best" model as a base.
Bash

~/kraken_env/bin/ketos train \
  -i farhangan_v1_best.mlmodel \
  -o farhangan_v2 \
  ground_truth/*.png

📂 3. Directory Structure & File Inventory
File	Purpose
farhangan_v1_best.mlmodel	Primary Engine. Use this for all production OCR.
OpenITI_AOCP.mlmodel	Base model for starting new training branches.
clean_gt.sh	Master sed script for automated text normalization.
ground_truth/	Contains .png (line images) and .gt.txt (correct text).
normalize_persian.py	Python utility for NFC/NFKC Unicode normalization.

💡 Important Tips for Scholarly Accuracy
Character Normalization

Persian scholars distinguish between Arabic ك/ي and Persian ک/ی. The clean_gt.sh script enforces this. Always run it before training to ensure the model doesn't "hallucinate" Arabic characters in a Persian context.
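
A quick audit can confirm that no Arabic ك or ي survived in the ground truth before training starts. This is a minimal sketch that assumes UTF-8 encoded `.gt.txt` files; the helper name and the two-character watch list are illustrative.

```python
from pathlib import Path

# Characters that should not appear in Persian ground truth.
ARABIC_ONLY = {"\u0643": "ARABIC KAF", "\u064a": "ARABIC YEH"}


def find_arabic_leftovers(ground_truth_dir):
    """Return {file name: [names of Arabic characters still present]}."""
    report = {}
    for path in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        text = path.read_text(encoding="utf-8")
        found = sorted(name for ch, name in ARABIC_ONLY.items() if ch in text)
        if found:
            report[path.name] = found
    return report
```

An empty dictionary means the folder is clean with respect to these two characters.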

Image Quality

For best results, images should be at least 300 DPI. If an image is blurry, the model may confuse «ش» (Shin) with «س» (Sin) or add extra teeth (e.g., reading «نقش» as «نقیش»).
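
DPI metadata in image files is often missing or unreliable, so one way to check the 300 DPI guideline is to derive the resolution from the pixel width and the physical width of the page. The arithmetic below is a small sketch; the function names are illustrative.

```python
import math

CM_PER_INCH = 2.54


def effective_dpi(pixel_width, page_width_cm):
    """Pixels per inch implied by a scan's pixel width and the page's physical width."""
    return pixel_width / (page_width_cm / CM_PER_INCH)


def min_pixel_width(page_width_cm, dpi=300):
    """Smallest pixel width that reaches `dpi` for a page of the given width."""
    return math.ceil(page_width_cm / CM_PER_INCH * dpi)
```

For a page 21.0 cm wide, `min_pixel_width(21.0)` gives 2481 pixels; a 1240-pixel scan of the same page works out to just under 150 DPI.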

Overfitting Protection

The training process uses Early Stopping. If the "Validation Accuracy" stops improving for 10 stages, the process will end automatically. This ensures the model remains flexible enough to read different pages and doesn't just "memorize" your current samples.
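
The stopping rule can be illustrated in a few lines of Python. This is a sketch of the general patience-based idea, not the trainer's actual code; the function name is hypothetical and `patience=10` mirrors the figure quoted above.

```python
def round_when_stopped(val_accuracies, patience=10):
    """Return the 1-based evaluation round at which training stops, or None."""
    best = float("-inf")
    rounds_without_improvement = 0
    for round_number, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            # New best score: remember it and reset the counter.
            best = accuracy
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_number
    return None
```

With `patience=3`, the sequence `[0.90, 0.92, 0.91, 0.91, 0.92]` stops at round 5, because the last three rounds fail to beat the best score of 0.92.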

🛡️ Cleanup

To keep the workspace clean, move old training snapshots (v1_0, v1_1, etc.) to a backup folder:
Bash

mkdir -p backups
mv farhangan_v1_[0-9]*.mlmodel backups/
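
The `[0-9]` in the pattern matters: it requires a digit right after `farhangan_v1_`, so farhangan_v1_best.mlmodel is left in place. For a preview before anything is moved, the same pattern can be used from Python; this is a sketch with an illustrative function name and a dry-run default.

```python
import shutil
from pathlib import Path


def archive_snapshots(workdir, backup_name="backups", dry_run=True):
    """List numbered v1 snapshots in `workdir`; move them to backups unless dry_run."""
    root = Path(workdir)
    # Same pattern as the shell command: a digit must follow "farhangan_v1_".
    snapshots = sorted(root.glob("farhangan_v1_[0-9]*.mlmodel"))
    if not dry_run:
        backup = root / backup_name
        backup.mkdir(exist_ok=True)
        for path in snapshots:
            shutil.move(str(path), str(backup / path.name))
    return [p.name for p in snapshots]
```

Running it once with the default `dry_run=True` shows exactly which files a real run would move.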

