Classifying data using AI: Tips and tricks

Task: Classify demands based on their names. The dataset consists of demand names in Slovak, and we need to classify them to visualize which types of demand prevail.

We chose to classify by:

  1. Product or service
  2. General category
  3. Additional details

Tip 1: Batching

The Problem: If you have 10,000 rows and you send them to the AI one by one, it will take forever. You also pay for the “overhead” tokens of the instructions every single time.

The Solution: Group your data. In the script, we process 30 items at a time. This drastically reduces network latency and cost.

# Settings
BATCH_SIZE = 30  # Process 30 queries in one API call

# Calculate how many batches we need (ceiling division)
num_batches = (total_to_process + BATCH_SIZE - 1) // BATCH_SIZE

for batch_index in range(num_batches):
    # Slice the dataframe to get just the next 30 items
    start_idx = batch_index * BATCH_SIZE
    end_idx = start_idx + BATCH_SIZE
    batch_df = df_to_process.iloc[start_idx:end_idx]
    batch_dopyty = batch_df[INPUT_COLUMN].tolist()

    # Send the list of 30 items to the AI function
    analyzed_results = analyze_dopyt_batch(batch_dopyty)

Why this works: The AI is smart enough to handle a list. You say “Here are 30 items, classify them all,” and it returns a list of 30 answers.
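To make the slicing arithmetic concrete, here is a tiny self-contained sketch of the batching math (the 95-item list is made up for illustration):

```python
# Hypothetical stand-in for the real dataframe column: 95 fake demand names.
items = [f"dopyt {i}" for i in range(95)]
BATCH_SIZE = 30

# Ceiling division: 95 items at 30 per batch -> 4 batches, the last with 5 items.
num_batches = (len(items) + BATCH_SIZE - 1) // BATCH_SIZE

batches = [items[i * BATCH_SIZE:(i + 1) * BATCH_SIZE] for i in range(num_batches)]
```

Every item lands in exactly one batch, and the last batch is simply shorter.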

Tip 2: Force JSON Output

The Problem: AI loves to chat. If you ask it to classify something, it might say: “Sure! Here is your classification: Item 1 is…” This breaks your code. You need machine-readable data.

The Solution: Use JSON Mode. We tell the AI to output strictly a JSON array and nothing else. I've heard several times that "limiting" the AI like this is not ideal, but so far I don't have a better alternative, and it works most of the time.

# Inside the prompt, we give strict instructions:
SYSTEM_PROMPT = """
...
INSTRUCTION: Return EXCLUSIVELY a JSON array ([...]), where every element is an object with keys: "dopyt", "typ", "kategoria", "detail". NO other text.
...
"""

# Inside the API call, we enforce it:
chat_completion = client.chat.completions.create(
    model=DEEPSEEK_MODEL,
    messages=[...],
    response_format={"type": "json_object"} # <--- THE MAGIC LINE
)

Why this works: response_format={"type": "json_object"} forces the model to return valid JSON with no intro or outro, so you can load the reply straight into Python with json.loads(). (Note: providers typically require the word "json" to appear somewhere in your prompt when this mode is on, and the top-level value is an object, so your array may come back wrapped under a key.)
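A minimal sketch of the parsing step. The raw string below is a made-up reply; since JSON mode may force a top-level object, the sketch accepts both a bare array and a wrapped one:

```python
import json

# Made-up model reply; with json_object mode some providers force a
# top-level object, so the array may arrive wrapped under a key.
raw = '{"results": [{"dopyt": "oprava kotla", "typ": "Služba", "kategoria": "servis kotla", "detail": ""}]}'

data = json.loads(raw)
# Accept either a bare JSON array or an object that wraps one.
rows = data if isinstance(data, list) else next(iter(data.values()))
```

From here, `rows` is a plain list of dicts you can write back into your dataframe.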

Tip 3: DeepSeek

The Problem: Using GPT-4 for simple classification is overkill and expensive.

The Solution: Use DeepSeek-V3 (or deepseek-chat). It is significantly cheaper (often 10x-20x cheaper) and for classification tasks, the logic is just as strong.
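Since DeepSeek exposes an OpenAI-compatible API, switching is mostly a base-URL change. A sketch of the client setup (the environment-variable name is an assumption; no request is made here):

```python
import os
from openai import OpenAI  # DeepSeek's API is OpenAI-compatible

DEEPSEEK_MODEL = "deepseek-chat"  # points at DeepSeek-V3

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # assumed env var name
    base_url="https://api.deepseek.com",
)
```

The rest of the script (messages, response_format, parsing) stays unchanged.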

Tip 4: Ask AI to prompt itself (and use examples)

The Problem: “Classify this” is too vague. The AI doesn’t know your business rules (e.g., is “repair” a service or a product?).

The Solution: Provide Few-Shot Prompting (giving examples) and clear definitions inside the System Prompt.

SYSTEM_PROMPT = """
Rules:
1. "type": "Service" if it involves installation, repair, transport. "Product" if it is buying material.
2. "category": Be concise (e.g., 'boiler service', not 'I need a boiler service').

EXAMPLES:
Input: "Dopytujem: zabezpečenie hudobnej produkcie na oslavu"
Output: {"dopyt": "...", "typ": "Služba", "kategoria": "hudobná produkcia", "detail": "oslava"}

Input: "Dopytujem: stavebný materiál zámkovú dlažbu, 38 m2"
Output: {"dopyt": "...", "typ": "Produkt", "kategoria": "dlažba", "detail": "zámková, 38 m2"}
""

Why this works: By showing the AI exactly what the output should look like for specific Slovak inputs, you align its logic with yours. Also, ask the AI to draft the prompt for you; it's fast and usually produces a better, more specific prompt than writing it from scratch.
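One way to wire the few-shot prompt to a batch (a sketch; `build_messages` is a hypothetical helper, and JSON-encoding the batch keeps item boundaries unambiguous for the model):

```python
import json

SYSTEM_PROMPT = "..."  # the rules + examples shown above

def build_messages(batch):
    """Pair the few-shot system prompt with one batch of demand names."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # JSON-encode the batch so the model sees clean item boundaries.
        {"role": "user", "content": json.dumps(batch, ensure_ascii=False)},
    ]
```

The returned list drops straight into the `messages=` parameter of the API call.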

Tip 5: Safety Saves (Don’t lose your work)

The Problem: Scripts crash. Internet connections fail. If you are processing 10,000 rows and the script dies halfway, you don't want to start over.

The Solution: Save the CSV incrementally.

SAVE_INTERVAL = 100  # Save to file every 100 processed queries

# Inside the processing loop:
if (processed_in_iteration - last_save_count) >= SAVE_INTERVAL:
    df_output.to_csv(output_file, index=False, sep=';', encoding='utf-8')
    print(" 💾 Saved progress...")
    last_save_count = processed_in_iteration

Why this works: This creates “checkpoints.” If the script fails, you simply restart it. The script is smart enough to see the output file exists, read it, and only process the rows that have a blank category (nan or '').
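The restart logic can be sketched like this (file and column names are placeholders; your script's names may differ):

```python
import os
import pandas as pd

OUTPUT_FILE = "classified.csv"   # placeholder checkpoint file name
CATEGORY_COLUMN = "kategoria"    # placeholder column name

def rows_to_process(df_input: pd.DataFrame) -> pd.DataFrame:
    """Resume support: keep only the rows whose category is still blank."""
    if os.path.exists(OUTPUT_FILE):
        # A previous run left a checkpoint; continue from it instead.
        df_input = pd.read_csv(OUTPUT_FILE, sep=";", encoding="utf-8")
    blank = df_input[CATEGORY_COLUMN].isna() | (df_input[CATEGORY_COLUMN] == "")
    return df_input[blank]
```

Already-classified rows are skipped, so a restart only pays for the unfinished work.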