Tutorial

Handling Thai-Script Addresses in Freight Documents

Thai freight documents frequently mix Thai and English text. This tutorial explains how KabyTech detects, parses, and normalizes Thai addresses for downstream systems.

Overview

Thai addresses on air waybills and shipping documents present unique OCR challenges. Thai script is an abugida written without spaces between words. It places combining vowel and tone marks above and below the baseline, and it has 44 consonants plus numerous vowel forms. OCR engines trained mainly on Latin text struggle with these characteristics.
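To make the combining-mark point concrete, the short Python sketch below counts how many code points in a Thai string are non-spacing marks (Unicode category Mn) rather than base characters. It uses only the standard library; the function name count_marks is illustrative and is not part of any KabyTech API.

```python
import unicodedata


def count_marks(text: str) -> tuple[int, int]:
    """Return (base_characters, combining_marks) for a string."""
    # Thai above/below vowels and tone marks carry the Unicode category "Mn".
    combining = sum(1 for ch in text if unicodedata.category(ch) == "Mn")
    return len(text) - combining, combining


if __name__ == "__main__":
    # The vowel mark under the second consonant is a separate code point.
    print(count_marks("กรุงเทพ"))
```

A single stacked syllable can carry two marks on one consonant, which is why low-resolution scans tend to lose them first.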

KabyTech uses a dedicated Thai OCR model trained on freight-specific documents — AWBs, house bills, customs declarations, and packing lists. The model handles mixed Thai-English text, which is common because IATA codes, flight numbers, and weight units remain in English even on Thai-language documents.

Beyond raw OCR, the address parsing pipeline understands Thailand's administrative hierarchy: จังหวัด (province), อำเภอ (district), and ตำบล (sub-district). This structure is critical for address matching and delivery routing in Thai logistics operations.

Step 1 — Thai Character Detection

The first stage identifies which regions of the document contain Thai script. The API uses a script-detection classifier that runs before full OCR, flagging bounding boxes as Thai, Latin, numeric, or mixed. This lets the system route each region to the appropriate OCR model.
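The routing idea can be approximated on already-extracted text with a simple Unicode range check. This is only a stand-in for illustration: the classifier described above works on image regions, and the function name, the lowercase labels, and the "unknown" fallback here are assumptions.

```python
def classify_script(text: str) -> str:
    """Label text as 'thai', 'latin', 'numeric', 'mixed', or 'unknown'."""
    kinds = set()
    for ch in text:
        if "\u0e00" <= ch <= "\u0e7f":        # Thai Unicode block
            kinds.add("thai")
        elif ch.isascii() and ch.isalpha():   # A-Z, a-z
            kinds.add("latin")
        elif ch.isascii() and ch.isdigit():   # 0-9
            kinds.add("numeric")
        # Punctuation and whitespace do not affect the label.
    if not kinds:
        return "unknown"
    if len(kinds) == 1:
        return kinds.pop()
    return "mixed"


if __name__ == "__main__":
    for sample in ["กทม.", "BKK", "10110", "TG 661 กรุงเทพ"]:
        print(sample, "->", classify_script(sample))
```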

You can see detection results in the regions array of the API response. Each region includes a script field and a bounding box. For debugging, enable debug_regions=true in your request to receive annotated page images.

curl -X POST https://api.kabytech.com/v1/parse \
  -H "Authorization: Bearer $KABY_API_KEY" \
  -F "file=@thai-awb.pdf" \
  -F "debug_regions=true"
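Once the response arrives, the regions array can be summarized by script. The sketch below works on a hand-written sample: the regions and script names come from the description above, while the bbox key, its coordinate layout, and the lowercase script values are assumptions about the response shape.

```python
from collections import Counter

# Hypothetical response fragment; real field names and values may differ.
sample_response = {
    "regions": [
        {"script": "latin", "bbox": [40, 60, 300, 90]},
        {"script": "thai", "bbox": [40, 120, 520, 160]},
        {"script": "mixed", "bbox": [40, 200, 520, 240]},
        {"script": "thai", "bbox": [40, 260, 520, 300]},
    ]
}


def count_scripts(response: dict) -> Counter:
    """Count detected regions per script label."""
    return Counter(r.get("script", "unknown") for r in response.get("regions", []))


def regions_with_script(response: dict, script: str) -> list[dict]:
    """Return only the regions flagged with the given script."""
    return [r for r in response.get("regions", []) if r.get("script") == script]


if __name__ == "__main__":
    print(count_scripts(sample_response))
    print(regions_with_script(sample_response, "thai"))
```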

Step 2 — Address Parsing (จังหวัด / อำเภอ / ตำบล)

Once Thai text is extracted, the address parser segments it into structured components. Thai addresses are typically written from the most specific element to the most general: house number, street, ตำบล (sub-district), อำเภอ (district), จังหวัด (province), and postal code. The parser validates and corrects extracted text against a gazetteer of all 77 provinces, 928 districts, and 7,255 sub-districts.
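As a rough illustration of segmentation, the sketch below splits a space-separated address on its prefix markers. It is a simplification, not KabyTech's parser: it assumes components are separated by spaces, does no gazetteer validation, and returns names with the prefix stripped. The marker table (including แขวง and เขต, the Bangkok equivalents of ตำบล and อำเภอ) and the function name are illustrative.

```python
import re

# Prefix markers per component; the full word is listed before its abbreviation.
MARKERS = {
    "street": ("ถนน", "ถ."),
    "sub_district": ("ตำบล", "ต.", "แขวง"),
    "district": ("อำเภอ", "อ.", "เขต"),
    "province": ("จังหวัด", "จ."),
}


def segment_address(raw: str) -> dict:
    """Split a space-separated Thai address into labeled components."""
    result = {}
    for token in raw.split():
        if re.fullmatch(r"\d{5}", token):          # Thai postal codes have 5 digits
            result["postal_code"] = token
            continue
        for field, prefixes in MARKERS.items():
            prefix = next((p for p in prefixes if token.startswith(p)), None)
            if prefix:
                result[field] = token[len(prefix):]
                break
        else:
            # No marker matched: treat the first digit-led token as the house number.
            if "house_number" not in result and token[0].isdigit():
                result["house_number"] = token
    return result


if __name__ == "__main__":
    print(segment_address("99/1 ถ.ห้วยแก้ว ต.สุเทพ อ.เมือง จ.เชียงใหม่ 50200"))
```

Tokens without a marker, such as กทม., are ignored by this sketch; resolving them needs the abbreviation handling and gazetteer lookup described in this step.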

The parser also handles common abbreviations and variations. For example, กรุงเทพมหานคร (Bangkok) is often written as กทม., and อำเภอเมือง (Mueang district) is frequently shortened to อ.เมือง. These normalizations ensure consistent output regardless of how the original document was written.
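A minimal sketch of this kind of normalization is shown below. The abbreviation table is a small illustrative subset, and the rule that an abbreviation must begin a whitespace-separated token is an assumption made to avoid rewriting text in the middle of a word.

```python
import re

# Illustrative subset of abbreviations and their full forms.
ABBREVIATIONS = {
    "กทม.": "กรุงเทพมหานคร",
    "จ.": "จังหวัด",
    "อ.": "อำเภอ",
    "ต.": "ตำบล",
    "ถ.": "ถนน",
}

# Longest abbreviations first; match only at the start of a token.
_PATTERN = re.compile(
    r"(?<!\S)("
    + "|".join(re.escape(k) for k in sorted(ABBREVIATIONS, key=len, reverse=True))
    + ")"
)


def expand_abbreviations(text: str) -> str:
    """Replace known address abbreviations with their full forms."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


if __name__ == "__main__":
    print(expand_abbreviations("ถ.สุขุมวิท กทม."))
    print(expand_abbreviations("อ.เมือง"))
```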

# Parsed Thai address output
{
  "shipper_address": {
    "raw": "123/45 ถ.สุขุมวิท แขวงคลองเตย เขตคลองเตย กทม. 10110",
    "house_number": "123/45",
    "street": "ถนนสุขุมวิท",
    "sub_district": "คลองเตย",
    "district": "คลองเตย",
    "province": "กรุงเทพมหานคร",
    "postal_code": "10110"
  }
}
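On the client side, the parsed object maps naturally onto a small typed record. The sketch below assumes the response has exactly the fields shown in the sample above; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThaiAddress:
    raw: str
    house_number: str
    street: str
    sub_district: str
    district: str
    province: str
    postal_code: str

    @classmethod
    def from_response(cls, payload: dict, key: str = "shipper_address") -> "ThaiAddress":
        """Build an address from a parsed response; missing fields become ''."""
        fields = payload[key]
        return cls(**{name: fields.get(name, "") for name in cls.__dataclass_fields__})

    def hierarchy(self) -> tuple[str, str, str]:
        """Return (province, district, sub_district), largest unit first."""
        return (self.province, self.district, self.sub_district)


if __name__ == "__main__":
    payload = {
        "shipper_address": {
            "house_number": "123/45",
            "street": "ถนนสุขุมวิท",
            "sub_district": "คลองเตย",
            "district": "คลองเตย",
            "province": "กรุงเทพมหานคร",
            "postal_code": "10110",
        }
    }
    print(ThaiAddress.from_response(payload).hierarchy())
```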

Step 3 — Transliteration and Matching

Many downstream systems, including TMS platforms, customs brokers, and airline cargo systems, require Latin-script addresses. KabyTech automatically romanizes Thai text using the Royal Thai General System of Transcription (RTGS), which is the standard used by Thai government agencies and IATA.
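For administrative names, romanization can be as simple as a table lookup, because the set of provinces, districts, and sub-districts is finite. The sketch below shows that idea with a three-entry table; it is not a rule-based RTGS implementation and is not claimed to be how KabyTech does it. The table and function name are illustrative.

```python
# Tiny illustrative table of Thai names and their RTGS romanizations.
ROMANIZED = {
    "กรุงเทพมหานคร": "Krung Thep Maha Nakhon",
    "คลองเตย": "Khlong Toei",
    "เชียงใหม่": "Chiang Mai",
}


def romanize_fields(address: dict) -> dict:
    """Romanize values found in the table; leave everything else unchanged."""
    return {field: ROMANIZED.get(value, value) for field, value in address.items()}


if __name__ == "__main__":
    print(romanize_fields({
        "sub_district": "คลองเตย",
        "province": "กรุงเทพมหานคร",
        "postal_code": "10110",
    }))
```

Street names and other free text fall outside a fixed table, so they still need rule-based transcription.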

Transliteration is paired with fuzzy matching against a reference database of known addresses. This corrects OCR errors that produce valid Thai characters but wrong words — for example, confusing ก (ko kai) with ถ (tho thung) in degraded scans. The matcher cross-references postal codes and province/district relationships to select the most likely correct address.
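The sketch below illustrates postal-code-constrained fuzzy matching with Python's standard difflib module. It is a simplified stand-in for the matcher described above: the three-entry gazetteer, the 0.6 similarity cutoff, and the function name are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Tiny illustrative gazetteer; a real one would cover every sub-district.
GAZETTEER = [
    {"sub_district": "คลองเตย", "province": "กรุงเทพมหานคร", "postal_code": "10110"},
    {"sub_district": "คลองตัน", "province": "กรุงเทพมหานคร", "postal_code": "10110"},
    {"sub_district": "สุเทพ", "province": "เชียงใหม่", "postal_code": "50200"},
]


def correct_sub_district(ocr_text: str, postal_code: str, cutoff: float = 0.6):
    """Return the closest sub-district sharing the postal code, or None."""
    # The postal code narrows the candidates before any string comparison.
    candidates = [e["sub_district"] for e in GAZETTEER if e["postal_code"] == postal_code]
    matches = get_close_matches(ocr_text, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # One consonant misread by OCR; the postal code is read correctly.
    print(correct_sub_district("คลองเถย", "10110"))
```

Filtering by postal code first keeps similar names from other provinces out of the comparison, which is the same cross-referencing idea in miniature.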

# Transliterated output
{
  "shipper_address_latin": {
    "street": "Thanon Sukhumwit",
    "sub_district": "Khlong Toei",
    "district": "Khlong Toei",
    "province": "Krung Thep Maha Nakhon",
    "postal_code": "10110"
  },
  "match_confidence": 0.96
}
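A common way to use match_confidence is to send low-scoring addresses to manual review. The sketch below assumes a threshold of 0.90, which is an arbitrary illustrative value rather than a KabyTech recommendation, and treats a missing score as needing review.

```python
def needs_review(result: dict, threshold: float = 0.90) -> bool:
    """Return True when the match confidence is missing or below the threshold."""
    confidence = result.get("match_confidence")
    return confidence is None or confidence < threshold


if __name__ == "__main__":
    print(needs_review({"match_confidence": 0.96}))
    print(needs_review({"match_confidence": 0.72}))
```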

Summary

Handling Thai addresses in freight documents requires specialized OCR, administrative-hierarchy parsing, and transliteration. KabyTech's pipeline covers all three steps, producing structured Thai addresses with Latin transliterations ready for integration with international cargo systems.

For best results, scan documents at 200 DPI or higher. The Thai OCR model tolerates degraded scans, but resolution below 150 DPI can reduce accuracy on the vowel and tone marks written above and below consonants, which distinguish otherwise identical words.
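A scan's effective resolution can be estimated from its pixel dimensions and the physical page size before uploading. The sketch below applies the two thresholds mentioned above; the "marginal" label for the band between them, the function names, and the A4 default are assumptions.

```python
A4_INCHES = (8.27, 11.69)  # width, height of an A4 page in inches


def effective_dpi(pixel_width: int, pixel_height: int,
                  page_inches: tuple[float, float] = A4_INCHES) -> float:
    """Estimate scan resolution as the lower of the two per-axis values."""
    return min(pixel_width / page_inches[0], pixel_height / page_inches[1])


def resolution_status(dpi: float) -> str:
    """Classify a resolution against the 200 DPI and 150 DPI guidance."""
    if dpi >= 200:
        return "ok"
    if dpi >= 150:
        return "marginal"
    return "too_low"


if __name__ == "__main__":
    for width, height in [(2480, 3508), (1500, 2100), (1000, 1400)]:
        dpi = effective_dpi(width, height)
        print(f"{width}x{height}: {dpi:.0f} DPI, {resolution_status(dpi)}")
```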
