DEV Community

Phasu Yeneng
Why your RAG chatbot fails in Thai — and how to fix it

A real-world walkthrough of how we built a customer service chatbot for a Thai e-commerce company — and the chunking problem nobody warns you about.


When I started building a RAG (Retrieval-Augmented Generation) chatbot for a Thai e-commerce company, I made the same mistake every developer makes: I copied the LangChain quickstart example, set chunk_size=500, and expected things to just work.

They didn't.

This is the story of why naive chunking fails for Thai text, what we built instead, and the full pipeline from PDF product manuals to chatbot answers — using Python, Qdrant, and OpenAI.


The Problem Nobody Warns You About

Most RAG tutorials are written with English in mind. The chunking logic looks like this:

# Works fine for English
chunks = text.split('. ')
# or, via LangChain:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

This works because English has clear word boundaries — spaces between every word. When you split on periods or character count, you still get coherent, searchable chunks.

Thai is completely different.

Thai has no spaces between words.

This sentence — "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ" — means "Our store has many product categories to choose from." But to a naive chunker, it looks like one enormous, unsplittable blob. There are 7 meaningful words in there, with zero whitespace between them.

Here's what happens when you embed that raw blob versus properly tokenized words:

Input to embedding model                              What it sees
ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ          One opaque token sequence
ร้านค้า | ของเรา | ... (word-segmented tokens)        Distinct semantic units

The second form produces embeddings that actually capture the meaning of each concept — "store", "product", "category" — which leads to better retrieval when a user asks "มีสินค้าหมวดหมู่ไหนบ้าง" (what product categories are available?).


The Pipeline We Built

Here's the full architecture:

PDF product manuals / FAQ documents
    |
Python (PyMuPDF) → extract raw text
    |
Sentence splitting by '. '
    |
[Stored in MongoDB as raw sentences]
    |
Python → pythainlp tokenization
    |
OpenAI text-embedding-3-small
    |
Qdrant vector database (cosine similarity, 1536 dims)
    |
User query → tokenize → embed → search → top-7 chunks
    |
GPT-4o-mini + context → answer

Let's walk through each step with real code. Here are the dependencies we'll use:

# requirements.txt
pymupdf==1.27.2.2
pythainlp==5.2.0
openai==2.32.0
qdrant-client==1.17.1
pymongo==4.10.1

Step 1 — Extract Text from PDF

We use PyMuPDF (the fitz library) instead of PyPDF2 because it handles Thai character encoding much more reliably.

# app/python/PdfToSentences.py
import pymupdf as fitz  # PyMuPDF 1.27+ (legacy: import fitz)
import re

def extract_sentences_from_pdf(pdf_path):
    pdf_file = fitz.open(pdf_path)
    text = ""
    for page in pdf_file:
        text += page.get_text("text")

    # Split on English period + space — works for mixed Thai/English documents
    sentences = [sentence.strip() for sentence in text.split('. ') if sentence.strip()]
    return sentences

def clean_text(text):
    cleaned_text = re.sub(r'\u2022', '', text)  # Remove bullet points
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text

Two things to note here:

Why PyMuPDF over PyPDF2? Thai PDF documents often use non-standard font encodings. PyMuPDF handles these much better — with PyPDF2 you'd frequently get garbled output or empty strings for Thai text blocks. Note: as of PyMuPDF 1.24+, the recommended import is import pymupdf (the old import fitz still works but is considered legacy).

Why split on . (period + space)? Our documents are mixed Thai/English — product names, SKUs, and technical specs are often in English, while descriptions are Thai. The period-space split is a pragmatic middle ground that preserves Thai paragraphs as single chunks rather than fragmenting them randomly at character 500.

⚠️ Limitation: Formal Thai text often ends paragraphs with a line break rather than a period. If your PDFs have no periods at all, text.split('. ') will return one giant chunk per page. In that case, use pythainlp's sentence tokenizer instead:

from pythainlp.tokenize import sent_tokenize
sentences = sent_tokenize(text, engine="crfcut")

Step 2 — Thai Word Tokenization Before Embedding

This is the most important step, and the one that differs most from English RAG.

Before sending any Thai text to the embedding model, we tokenize it with pythainlp:

# thai_tokenizer.py
from pythainlp.tokenize import word_tokenize

def word_cut(text: str) -> str:
    tokens = word_tokenize(text, engine="newmm")
    # Join with pipe separator so the embedding model sees distinct units
    return "|".join(tokens)

pythainlp uses a dictionary-based approach (newmm engine) to segment Thai text into individual words:

Input:  "สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี"
Output: "สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี"

Now the embedding model sees four distinct semantic units instead of one long string. The cosine similarity between "ส่งฟรี" (free shipping) and a user's query "จัดส่งฟรีไหม" (is shipping free?) will be much higher and more meaningful after proper tokenization.

We also tried attacut (a neural-network-based engine in pythainlp) but settled on newmm for its speed and dictionary coverage — important when your domain includes product jargon and Thai promotional phrases like "ลดราคา", "ส่งฟรี", "ผ่อนชำระ".
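
To build intuition for what a dictionary-based segmenter does, here's a toy greedy longest-match sketch. This is an illustration only — newmm's real algorithm is a more sophisticated maximal-matching strategy backed by a large built-in Thai lexicon, and `demo_dict` below is our own tiny stand-in:

```python
def toy_word_cut(text: str, dictionary: set) -> list:
    # Greedy longest-match: at each position, take the longest dictionary word.
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:  # unknown character: emit it as-is
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# Tiny illustrative dictionary (real newmm ships a full Thai lexicon)
demo_dict = {"สินค้า", "อิเล็กทรอนิกส์", "ราคาถูก", "ส่งฟรี"}
print("|".join(toy_word_cut("สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี", demo_dict)))
# สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี
```

The greedy version breaks down on ambiguous boundaries where a longer match at one position forces a bad split later — exactly the cases where newmm's maximal matching earns its keep.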


Step 3 — Generate and Store Embeddings

We use OpenAI's text-embedding-3-small for embeddings — the current-generation model that replaced text-embedding-ada-002. It scores 44% on the MIRACL multilingual benchmark vs 31.4% for the old model, and costs 5x less. The key is that we tokenize before embedding — not after:

# ingest_embeddings.py
from thai_tokenizer import word_cut
from openai_module import create_embedding

for item in data:
    # ✅ Tokenize Thai text FIRST
    tokenized = word_cut(item["keyword"])

    # Then embed the tokenized version
    result = create_embedding(tokenized)

    if result["status"]:
        sentence = {
            "id": item["id"],
            "sentence": item["text"],      # store original for display
            "keyword": item["keyword"],    # store original keyword
            "embeded": result["embed"],    # embedding computed from the tokenized version
        }
        sentences_collection.insert_one(sentence)

Notice we store the original text as the payload but create the embedding from the tokenized version. This way, when a match is found, the chatbot returns the human-readable original sentence — not the pipe-separated tokenized form.
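
A sketch of what that pairing looks like when the points reach Qdrant, in the same raw-REST style as qdrant_module.py. The helper names here are ours, not the article's; the endpoint and named-vector format match Qdrant's REST API:

```python
import requests  # same REST style as qdrant_module.py

QDRANT_URL = "http://localhost:6333"

def build_point(point_id, original_text, tokenized_text, embedding):
    # The vector comes from the TOKENIZED text; the payload keeps the
    # human-readable original so search results display naturally.
    return {
        "id": point_id,
        "vector": {"chatgpt_vector": embedding},  # named vector, matching the collection config
        "payload": {"sentence": original_text, "keyword": tokenized_text},
    }

def upsert_points(collection_name, points):
    # PUT /collections/{name}/points is Qdrant's batch upsert endpoint
    return requests.put(
        f"{QDRANT_URL}/collections/{collection_name}/points",
        json={"points": points},
    )
```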

The embedding function itself:

# openai_module.py
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MAX_INPUT_LENGTH = 10000

def create_embedding(text: str) -> dict:
    if len(text) > MAX_INPUT_LENGTH:
        return {"status": False, "message": "Text too long"}

    response = client.embeddings.create(
        model="text-embedding-3-small",  # replaces text-embedding-ada-002
        input=text,
        dimensions=1536,                 # if you change this, update Qdrant collection size too!
    )

    return {
        "status": True,
        "embed": response.data[0].embedding,
    }

Step 4 — Qdrant as the Vector Store

We use Qdrant running in Docker as our vector database. It's fast, lightweight, and the REST API is straightforward to call with Python's requests:

# qdrant_module.py
import os
import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")

def create_rag_collection(collection_name: str, vector_size: int):
    requests.put(
        f"{QDRANT_URL}/collections/{collection_name}",
        json={
            "vectors": {
                "chatgpt_vector": {
                    "size": vector_size,  # 1536 for text-embedding-3-small (default)
                    "distance": "Cosine",
                }
            }
        },
    )

def search(collection_name: str, vector: dict, limit: int = 5) -> dict:
    response = requests.post(
        f"{QDRANT_URL}/collections/{collection_name}/points/search",
        json={
            "vector": vector,
            "limit": limit,
            "with_payload": True,
        },
    )
    return response.json()

Start Qdrant locally with one Docker command:

docker run -dt --name VectorDB \
  -p 6333:6333 \
  -v /your/path/storage:/qdrant/storage \
  qdrant/qdrant:latest

We use Cosine similarity rather than Euclidean distance. For semantic search in Thai, cosine similarity performs better because it measures the angle between vectors (meaning similarity) rather than the absolute distance, which is sensitive to text length differences.
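
A minimal numeric illustration with toy 2-D vectors (real embeddings are 1536-dimensional) of why cosine ignores magnitude while Euclidean distance does not:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0]   # hypothetical embedding of a short text
long_doc  = [2.0, 4.0]   # same direction, larger magnitude
other     = [2.0, -1.0]  # different meaning (orthogonal direction)

print(cosine(short_doc, long_doc))     # 1.0  - identical meaning despite magnitude
print(euclidean(short_doc, long_doc))  # 2.24 - penalized purely for magnitude
print(cosine(short_doc, other))        # 0.0  - unrelated
```

One caveat worth knowing: OpenAI's embedding API returns vectors normalized to unit length, so cosine similarity and Euclidean distance produce identical rankings on those; the distinction matters most when vectors are unnormalized.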


Step 5 — The RAG Query Flow

When a user asks a question, here's what happens:

# chat_module.py
from thai_tokenizer import word_cut
from openai_module import create_embedding
from qdrant_module import search

def rag(question: str, category_name: str) -> str:
    # 1. Build a context-rich search query
    search_query = "สินค้า" + category_name  # "Product [category]"

    # 2. Tokenize the query exactly as at ingest time, then embed it
    question_embed = create_embedding(word_cut(search_query))

    # 3. Search Qdrant for the top 7 most similar sentences
    gpt_vector = {"name": "chatgpt_vector", "vector": question_embed["embed"]}
    search_result = search("chatgpt", gpt_vector, limit=7)

    # 4. Assemble context from the matched payloads
    context = retrieve_relevant_context(search_result["result"])
    return context


def retrieve_relevant_context(results: list) -> str:
    context = ""
    for item in results:
        context += item["payload"]["sentence"] + "\n\n"
    return context

The assembled context is then injected into GPT-4o-mini's system prompt:

system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""

That last Thai instruction tells the model: "Reply in the same language as the user's most recent message." This handles the bilingual nature of our users — some ask in Thai, some in English, some mix both.
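
Putting the prompt assembly into one place, here is a sketch of a message-building helper (the helper name and the history-handling convention are our assumptions; the article stores chat history in MongoDB):

```python
def build_messages(context: str, history: list, user_message: str) -> list:
    """Assemble the chat payload for GPT-4o-mini. Helper name is illustrative."""
    system_content = (
        "Use the attached context to answer the user's questions.\n"
        "Answer only questions related to our company's products and services:\n\n"
        f"{context}\n\n"
        # Thai: reply in the language of the user's most recent message
        "ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"
    )
    return [
        {"role": "system", "content": system_content},
        *history,  # prior turns, e.g. loaded from MongoDB chat history
        {"role": "user", "content": user_message},
    ]
```

The returned list goes straight into `client.chat.completions.create(model="gpt-4o-mini", messages=...)`.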


Step 6 — Question Classification Before RAG

One non-obvious optimization: not every question needs a RAG lookup. We classify questions first with GPT-4o-mini to decide which path to take:

# chat_module.py
import json
from openai import OpenAI

client = OpenAI()

def question_classification(question: str) -> dict:
    prompt = """วิเคราะห์คำถามของ User ว่าเป็นคำถามประเภทไหน โดยให้ตอบเป็น JSON { "type": value }

    type 0 = ทักทาย / ไม่เกี่ยวกับสินค้าหรือบริการ
    type 1 = ถามเกี่ยวกับโปรโมชั่น / ส่วนลด / หมวดหมู่สินค้า
    type 2 = ถามเกี่ยวกับสาขา / พื้นที่จัดส่ง
    type 3 = ถามเกี่ยวกับข้อมูลสินค้าหรือบริการ  ← needs RAG
    type 4 = ถามทั่วไปเกี่ยวกับบริษัท"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Only type 3 (specific product info questions) triggers the full RAG pipeline. Promotion and branch questions (type 1-2) use structured data from a JSON catalog instead. Greetings (type 0) go straight to the LLM without any retrieval at all.

This classification step saves both latency and API cost — you're not doing a vector search for "สวัสดีครับ" (hello).
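
The routing described above can be sketched as a small dispatcher over the classifier's `{"type": n}` output. The path names are ours, and sending type 4 to the plain LLM path is an assumption in this sketch (the article only specifies types 0 through 3 explicitly):

```python
def route(question_type: int) -> str:
    # Map the classifier's JSON {"type": n} to a handling path.
    if question_type == 3:
        return "rag"         # product info: full vector-search pipeline
    if question_type in (1, 2):
        return "catalog"     # promotions / branches: structured JSON catalog
    return "direct_llm"      # greetings (0) and, in this sketch, general (4)
```

Keeping the router a pure function also makes it trivial to unit-test, independent of any OpenAI or Qdrant calls.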


What We Learned

1. Tokenize before embedding, always. The single biggest quality improvement came from running pythainlp on every piece of text before it touches the embedding model — both at ingest time and at query time. Without this, retrieval quality was noticeably worse for Thai-only queries.

2. Use PyMuPDF, not PyPDF2. For Thai PDF documents, PyMuPDF is dramatically more reliable. PyPDF2 would silently drop or garble Thai characters from complex layouts. Also note: as of v1.24+, use import pymupdf instead of the legacy import fitz.

3. Store original text, embed tokenized text. Users should see natural language in responses. Keep these as separate fields.

4. Sentence-level chunks beat character-level chunks for Thai. Because Thai sentences naturally carry complete thoughts, splitting at sentence boundaries (.) gives the model coherent context units rather than arbitrary fragments. A chunk_size=500 cut might land in the middle of a Thai word — or more precisely, in the middle of a run of characters that spans multiple words, since there's no space to safely break at.

5. Question classification as a router saves money. Not every user message needs vector search. A cheap classification step routes simple questions to a direct LLM call and complex ones to the full RAG pipeline.
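
Lesson 4 is easy to see directly: slicing the example sentence from earlier at fixed character offsets (a scaled-down stand-in for chunk_size=500) puts nearly every boundary inside a word, because there is no whitespace to break at. A toy illustration, not the splitter we shipped:

```python
text = "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ"
chunk_size = 12  # small-scale stand-in for chunk_size=500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for chunk in chunks:
    print(chunk)  # boundaries land mid-word, producing meaningless fragments
```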


The Stack at a Glance

Layer              Tool                                    Version
PDF extraction     PyMuPDF (pymupdf)                       1.27.2.2
Thai tokenization  pythainlp (newmm engine)                5.2.0
Embedding model    OpenAI text-embedding-3-small (1536d)
Vector database    Qdrant + qdrant-client                  1.17.1
LLM                OpenAI GPT-4o-mini
OpenAI SDK         openai                                  2.32.0
Backend            Python / FastAPI or Flask
Chat history       MongoDB

Final Thoughts

Building RAG for Thai taught me that most of the "standard" chunking advice assumes English. Once you work with a language that has no word boundaries, the whole pipeline has to be rethought — from how you split sentences to how you normalize text before embedding.

The good news: the fix is not complicated. A single tokenization step with pythainlp before embedding makes a significant difference. The hard part is knowing you need it in the first place.

If you're building RAG for other Asian languages — Japanese, Chinese, Korean — the same principle applies. Never assume your text has whitespace-delimited tokens. Always pre-process with a language-appropriate tokenizer before hitting your embedding model.
