KabyTech can now extract structured data from AWBs that contain Thai script, mixed Thai-English text, and transliterated Thai names, with overall field accuracy within one percentage point of English-only documents.
Air Waybills are, by IATA convention, English-language documents. The standard Cargo-IMP message format uses ASCII text, and the printed AWB form has English field labels. But in practice, Thai air freight documents frequently contain Thai script. Shipper names are written in Thai. Consignee addresses mix Thai and English. Nature-of-goods descriptions include Thai product names. And supplementary documents attached to AWBs — packing lists, invoices, phytosanitary certificates — are often entirely in Thai.
Until now, AWB parsing tools have treated Thai text as noise: skipping it, replacing it with garbled characters, or failing to process the document entirely. Today we are announcing native Thai language support in KabyTech's AWB Intelligence API. Thai characters are now recognized, extracted, and returned as properly encoded UTF-8 text in the JSON response, with field-level accuracy close to what we achieve on English-only documents.
Despite the IATA convention, Thai text appears on air cargo documents for several practical reasons:
Parsing a document that contains both Thai and English text is significantly harder than parsing either language alone. Here is why:
Thai and English use completely different writing systems. English uses the Latin alphabet with clear word boundaries (spaces). Thai uses the Thai script, which is an abugida — consonants carry inherent vowels that are modified by diacritical marks placed above, below, before, or after the base consonant. Crucially, Thai does not use spaces between words. Spaces in Thai text indicate clause or sentence boundaries, not word boundaries.
When Thai and English text appear on the same line — for example, a shipper name like "บริษัท Thai Silk Export จำกัด" — the parser must switch between two entirely different recognition models at the character level. A model trained only on English will attempt to interpret Thai characters as distorted Latin characters, producing nonsensical output.
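To make the character-level switching concrete, here is a minimal Python sketch that splits an already-decoded mixed line into script runs using Unicode code point ranges. It is an illustration only, not KabyTech's recognition pipeline (which has to make this decision on pixels, before any text exists); the names `script_of` and `script_runs` are hypothetical.

```python
THAI_START, THAI_END = 0x0E00, 0x0E7F  # Unicode Thai block


def script_of(ch: str) -> str:
    """Classify one character as 'thai', 'latin', or 'neutral'."""
    if THAI_START <= ord(ch) <= THAI_END:
        return "thai"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "neutral"  # spaces, digits, punctuation


def script_runs(text: str) -> list:
    """Split text into (script, substring) runs.

    Neutral characters stay attached to the run they appear in, so a
    new run starts only when the script actually changes.
    """
    runs = []
    script = None  # script of the run being built; None until a letter is seen
    buf = []
    for ch in text:
        s = script_of(ch)
        if s != "neutral" and script is not None and s != script:
            runs.append((script, "".join(buf).strip()))
            buf = []
        if s != "neutral":
            script = s
        buf.append(ch)
    if buf:
        runs.append((script or "neutral", "".join(buf).strip()))
    return runs


shipper = "บริษัท Thai Silk Export จำกัด"
for script, chunk in script_runs(shipper):
    print(script, chunk)
```

For the shipper name above, this yields three runs: Thai, Latin, Thai. Each run could then be handed to a script-specific recognizer or post-processor.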
Thai script includes diacritical marks (tone marks, vowel marks) that are positioned above or below the base consonant. In scanned documents, especially at lower resolutions, these marks can be confused with document noise (dust, print artifacts, fax degradation). A generic OCR engine might strip these marks as noise, fundamentally changing the meaning of the word. For example, the Thai words for "rice" (ข้าว) and "news" (ข่าว) differ only in which tone mark they carry.
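The effect of losing a tone mark can be shown with Python's standard `unicodedata` module. This is a small sketch, not part of the KabyTech pipeline: it compares the words for "rice" and "news" code point by code point, then simulates an over-eager denoiser by dropping every non-spacing mark. The helper name `strip_nonspacing_marks` is made up for the example.

```python
import unicodedata

RICE = "\u0e02\u0e49\u0e32\u0e27"  # ข้าว, "rice"
NEWS = "\u0e02\u0e48\u0e32\u0e27"  # ข่าว, "news"


def strip_nonspacing_marks(text: str) -> str:
    """Drop every non-spacing mark (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Code points at which the two words differ, by Unicode name.
differing = [
    (unicodedata.name(a), unicodedata.name(b))
    for a, b in zip(RICE, NEWS)
    if a != b
]
print(differing)

# Once the marks are gone, the two words are indistinguishable.
print(strip_nonspacing_marks(RICE) == strip_nonspacing_marks(NEWS))
```

The two words differ at exactly one code point (MAI THO versus MAI EK), and stripping the marks collapses both into the same string, ขาว, which is a different word again ("white").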
Thai addresses on AWBs frequently code-switch between languages within a single field. An address might read: "123/45 ซอยสุขุมวิท 55 Sukhumvit Rd, Watthana, Bangkok 10110". The parser must handle the transition from Thai soi name to English road name within a single address line, maintaining the correct reading order and field segmentation.
Our approach uses a three-stage pipeline specifically designed for multilingual document processing:
Before attempting text recognition, we analyze the document layout to identify text regions and their probable language. This uses visual features: Thai script has a distinctive vertical profile (tall ascenders from vowel marks, descenders from certain consonants) that differs from Latin text. We classify each text region as Thai-primary, English-primary, or mixed, and route it to the appropriate recognition model.
We run two specialized OCR models in parallel: one optimized for Thai script (including all 44 consonants, 32 vowel forms, 4 tone marks, and Thai numerals) and one optimized for English text and Latin numerals. For mixed regions, we use a fusion model that handles character-level language switching. The fusion model was trained on a corpus of 50,000+ real Thai air cargo documents, so it understands the specific patterns of Thai-English mixing that occur in AWB contexts.
After text recognition, we apply field-level normalization rules. For example, AWB numbers are always numeric regardless of whether surrounding text is Thai or English. IATA airport codes are always three Latin characters. Weight values always use Latin numerals. Thai text typically appears only in name, address, and nature-of-goods fields. By applying these field-level constraints, we can correct recognition errors that would be ambiguous at the character level.
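The field-level constraints described above can be sketched as small normalizers. This is a hedged illustration under stated assumptions, not KabyTech's actual rules: the function names are hypothetical, the AWB number is assumed to be the usual 3-digit airline prefix plus 8-digit serial, and Thai digits (U+0E50 to U+0E59) are mapped to Latin digits before validation.

```python
import re
from typing import Optional

# Thai digits occupy U+0E50..U+0E59, in numeric order.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))
TO_LATIN_DIGITS = str.maketrans(THAI_DIGITS, "0123456789")


def normalize_awb_number(raw: str) -> Optional[str]:
    """Return 'PPP-SSSSSSSS' if raw holds exactly 11 digits, else None."""
    digits = re.sub(r"[\s-]", "", raw.translate(TO_LATIN_DIGITS))
    if not re.fullmatch(r"[0-9]{11}", digits):
        return None
    return f"{digits[:3]}-{digits[3:]}"


def normalize_airport_code(raw: str) -> Optional[str]:
    """Return the code upper-cased if it is three Latin letters, else None."""
    code = raw.strip().upper()
    return code if re.fullmatch(r"[A-Z]{3}", code) else None


def normalize_weight(raw: str) -> Optional[float]:
    """Parse a weight written with Thai or Latin digits, else None."""
    text = raw.translate(TO_LATIN_DIGITS).strip()
    if not re.fullmatch(r"[0-9]+(?:\.[0-9]+)?", text):
        return None
    return float(text)


print(normalize_awb_number("\u0e52\u0e51\u0e57-1234 5675"))  # Thai digits in the prefix
print(normalize_airport_code("bkk"))
print(normalize_weight("\u0e51\u0e52.\u0e55"))  # Thai digits for 12.5
```

A value that fails its constraint returns `None`, which is the point at which a real system would fall back to an alternative recognition candidate or flag the field for review.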
We benchmarked our Thai language support against a test set of 500 AWBs: 250 English-only documents and 250 documents containing Thai text (ranging from a few Thai characters to fully bilingual documents).
| Metric | English-only AWBs | Thai/Mixed AWBs |
|---|---|---|
| Overall field accuracy | 97.8% | 96.9% |
| AWB number accuracy | 99.9% | 99.9% |
| Shipper name accuracy | 96.2% | 95.1% |
| Consignee name accuracy | 96.5% | 95.4% |
| Address accuracy | 94.8% | 93.2% |
| Routing accuracy | 99.7% | 99.7% |
| Weight/pieces accuracy | 99.4% | 99.3% |
| Rate description accuracy | 97.1% | 96.5% |
| Nature of goods accuracy | 95.8% | 94.6% |
| Processing time (median) | 1.4s | 1.7s |
The overall accuracy gap between English-only and Thai/mixed documents is 0.9 percentage points. Structured fields like AWB numbers, routing, and weight values show virtually identical accuracy regardless of the document's language (gaps of 0.1 points or less), and rate descriptions differ by 0.6 points. Free-text fields show larger gaps of 1.1 to 1.6 points. The largest gap is in address accuracy (1.6 percentage points), which reflects the inherent complexity of Thai address formatting rather than a language recognition limitation.
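The per-field gaps can be recomputed directly from the table. The short Python sketch below does only that arithmetic; the dictionary simply restates the published figures.

```python
# (English-only %, Thai/mixed %) per field, copied from the benchmark table.
RESULTS = {
    "AWB number": (99.9, 99.9),
    "Shipper name": (96.2, 95.1),
    "Consignee name": (96.5, 95.4),
    "Address": (94.8, 93.2),
    "Routing": (99.7, 99.7),
    "Weight/pieces": (99.4, 99.3),
    "Rate description": (97.1, 96.5),
    "Nature of goods": (95.8, 94.6),
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {field: round(en - th, 1) for field, (en, th) in RESULTS.items()}

for field, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{field:<18} {gap:.1f}")

largest = max(gaps, key=gaps.get)
over_one_point = [field for field, gap in gaps.items() if gap > 1.0]
print("Largest gap:", largest)
print("Fields with a gap above 1 point:", over_one_point)
```

Sorting by gap makes the pattern visible at a glance: the fields that carry free text sit at the top, and the strictly structured fields sit at the bottom.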
Median processing time increases by approximately 300 milliseconds for Thai/mixed documents due to the dual-model recognition pipeline. At a median of 1.7 seconds, processing remains within the sub-2-second SLA.
Thai language support has several practical implications for our customers:
Thai language support is enabled by default for all KabyTech accounts. No configuration change is needed. When you submit a document containing Thai text, the API automatically detects the language mix and applies the appropriate recognition pipeline. The JSON response includes a `language_detected` field that indicates whether Thai text was found in the document.
If you want to force English-only processing (for example, for benchmarking), pass `language_hint: "en"` in your API request. For production use, we recommend leaving automatic detection enabled.
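A client-side sketch of these two options might look like the following. Only the names `language_hint` and `language_detected` come from this post; everything else is an assumption made for illustration, including the helper names, how the options are attached to the request, and the shape of the `language_detected` value (assumed here to contain the code `"th"` when Thai is present). Check the API reference for the real request and response formats.

```python
import json
from typing import Optional


def build_options(force_english: bool = False) -> dict:
    """Build request options; omit language_hint to keep auto-detection."""
    options = {}
    if force_english:
        options["language_hint"] = "en"
    return options


def thai_detected(response_text: str) -> Optional[bool]:
    """Report whether the response says Thai was found.

    Returns None when the field is absent. ASSUMPTION: the field value
    contains the code "th" when Thai text was detected.
    """
    body = json.loads(response_text)
    detected = body.get("language_detected")
    if detected is None:
        return None
    return "th" in detected


# Illustrative response body, not captured from the real API.
sample_response = '{"language_detected": ["th", "en"]}'

print(build_options())                    # production default: auto-detect
print(build_options(force_english=True))  # benchmarking: English only
print(thai_detected(sample_response))
```

Keeping the hint behind an explicit flag means the production path never sends it by accident, which matches the recommendation to leave automatic detection on.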
Thai and English are our first two fully supported languages, reflecting our focus on the Thai air freight market. We are currently developing support for Chinese characters (Simplified and Traditional), which appear frequently on AWBs for Thailand-China routes — the single largest air cargo corridor for Thai perishable exports. Chinese language support is expected in Q3 2026.
Beyond that, our roadmap includes Japanese, Korean, and Vietnamese — covering the key air cargo routes in and out of Thailand. Each language requires a dedicated recognition model trained on air cargo-specific text, so these additions take time to develop and validate to our accuracy standards.
Upload a Thai or mixed Thai-English AWB and see the results. Free 30-day trial.