
How to prep documents and text for ChatGPT and Claude, free and in your browser

Cleaning, counting, chunking and scrubbing text before you paste it into an LLM saves money and protects your data. Here's a free, browser-only workflow that never uploads a thing.

Most people paste raw text straight into ChatGPT or Claude: a whole PDF, an exported spreadsheet, a wall of logs. It works, but it quietly costs you on three fronts. You pay for tokens you didn't need to send. You risk blowing past the context window, at which point the input is rejected or part of your document is dropped without warning. And you hand a third party data that may contain names, card numbers, or API keys you never meant to share.

A few minutes of prep fixes all three. Here is a simple, free workflow that runs entirely in your browser, so the text never leaves your machine until you decide to send it.

1. Get clean text out of your files

LLMs want text, not layout. A .docx carries styles, a .pptx is a zip of XML, a PDF is a bag of positioned glyphs. Feeding the raw file (or a sloppy copy-paste) wastes tokens on formatting noise and confuses the model.

For Office files, drop them into DOCX / PPTX / XLSX to Text and you get clean, labelled text: slides are marked as ## Slide 1, sheets as ## Sheet. For PDFs, use PDF to AI-ready text, which strips page numbers and de-hyphenates wrapped lines so the model reads prose, not artefacts.
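To make the PDF cleanup concrete, here is a minimal sketch of the two fixes mentioned above: dropping page-number lines and rejoining words that were hyphenated across a line break. It is not the tool's actual code; the function name `clean_pdf_text` and both regular expressions are assumptions chosen for illustration.

```python
import re

# Assumption: a "page number line" is a line holding only a number,
# optionally written as "Page 3" or "Page 3 of 10".
PAGE_NUMBER_LINE = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)

# A lowercase letter, a hyphen at the end of a line, then a lowercase letter.
WRAPPED_WORD = re.compile(r"([a-z])-\n([a-z])")


def clean_pdf_text(raw: str) -> str:
    """Remove page-number lines, then rejoin words split across lines."""
    kept = [line for line in raw.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    text = "\n".join(kept)
    return WRAPPED_WORD.sub(r"\1\2", text)


sample = "The implemen-\ntation works.\n12\nNext page."
print(clean_pdf_text(sample))
```

On the sample this prints "The implementation works." followed by "Next page." on its own line. A rule this simple has blind spots: it also removes the hyphen from a real compound such as "well-known" when it wraps, and it drops any line that consists only of a number, so treat it as a starting point.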

2. Read text that's trapped in images

Screenshots, scans and photos are opaque to an LLM unless you run OCR first. Instead of uploading a screenshot to some server, pull the text out locally with Image to Text (OCR). The recognition engine and its language data both download once and run in your tab, so the image itself never gets uploaded.

3. Count tokens (and cost) before you send

Words are not tokens. "Internationalization" is one word but several tokens; punctuation and whitespace count too. Since tokens decide both the price and whether your text fits, it pays to check first. LLM Token Counter gives you the exact GPT token count and shows, model by model, what the call would cost and whether it fits the context window. A green dot means you're fine; a red one means step 4.
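If you want a feel for the arithmetic, the sketch below estimates a token count, a cost and a fit check. It does not reproduce an exact tokenizer: it uses the common rule of thumb of about four characters per token for English text, and the model name, context window and price are hypothetical placeholders, not real figures.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    context_window: int           # maximum tokens the model accepts
    usd_per_million_input: float  # input price in USD per 1M tokens


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def check_prompt(text: str, model: ModelSpec) -> dict:
    """Return the estimated token count, the input cost, and whether the text fits."""
    tokens = estimate_tokens(text)
    return {
        "tokens": tokens,
        "cost_usd": tokens / 1_000_000 * model.usd_per_million_input,
        "fits": tokens <= model.context_window,
    }


# Hypothetical model: the name, window and price are made up for illustration.
demo_model = ModelSpec(name="example-model", context_window=800, usd_per_million_input=2.0)
print(check_prompt("a" * 4000, demo_model))
```

With these made-up numbers, 4,000 characters come out at roughly 1,000 tokens, which is over the 800-token window, so `fits` is False. Real tokenizers differ by model and language, so use an exact counter for anything that matters.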

4. Trim or split text that doesn't fit

If your text is over the limit, you have two good options. To keep the gist in far fewer tokens, run it through the Text Summarizer, which extracts the most important sentences without rewriting them. To keep everything but feed it in pieces, for retrieval or embeddings, use the Text Chunker. It splits on sentence or paragraph boundaries, holds each chunk under a token budget you set, and adds overlap so ideas that straddle a boundary still appear whole in at least one chunk.
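The chunking idea can be sketched in a few lines. This is an illustration under stated assumptions, not the Text Chunker's implementation: sentences are split with a naive regular expression, tokens are estimated at about four characters each, and the overlap is measured in whole sentences. The names `chunk_text`, `max_tokens` and `overlap_sentences` are hypothetical.

```python
import math
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def chunk_text(text: str, max_tokens: int, overlap_sentences: int = 1) -> list[str]:
    """Pack sentences into chunks under max_tokens, repeating the last
    overlap_sentences sentences at the start of the next chunk."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and estimate_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk over as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Shrink the overlap if it would push the new chunk over budget.
            while current and estimate_tokens(" ".join(current + [sentence])) > max_tokens:
                current = current[1:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in chunk_text("Alpha one. Bravo two. Charlie three. Delta four.", max_tokens=7):
    print(chunk)
```

With a budget of 7 estimated tokens this prints three chunks: "Alpha one. Bravo two.", "Bravo two. Charlie three." and "Charlie three. Delta four.", so each boundary sentence appears in two chunks. A single sentence longer than the budget still becomes its own oversized chunk in this sketch; a fuller version would split it further.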

5. Scrub anything sensitive

This is the step everyone skips. Logs and spreadsheets are full of emails, phone numbers, card numbers and secrets. Before you paste, run the text through Redact PII & Secrets. It detects them and replaces each with a stable placeholder (the same email always becomes [EMAIL_1]), so the model can still follow the thread without ever seeing the real value. Card numbers are Luhn-checked to cut false positives.
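Here is a minimal sketch of how stable placeholders and a Luhn check can work together. It covers only emails and card numbers, and the patterns, the `redact` function and the [CARD_1] label are assumptions for illustration rather than the tool's actual rules.

```python
import re

# Deliberately simple patterns (assumptions); real detectors are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace emails and Luhn-valid card numbers with stable placeholders."""
    seen: dict[tuple[str, str], str] = {}  # (kind, value) -> placeholder
    counters: dict[str, int] = {}

    def placeholder(kind: str, value: str) -> str:
        key = (kind, value)
        if key not in seen:
            counters[kind] = counters.get(kind, 0) + 1
            seen[key] = f"[{kind}_{counters[kind]}]"
        return seen[key]

    def replace_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Leave digit runs that fail the checksum: likely an order or tracking ID.
        return placeholder("CARD", digits) if luhn_valid(digits) else match.group()

    text = EMAIL_RE.sub(lambda m: placeholder("EMAIL", m.group().lower()), text)
    return CARD_RE.sub(replace_card, text)


line = ("Mail ann@example.com or bob@example.com; ann@example.com paid with "
        "4111 1111 1111 1111, order 1234567890123456.")
print(redact(line))
```

The repeated address maps to [EMAIL_1] both times, the well-known test card number becomes [CARD_1], and the 16-digit order number is left alone because it fails the checksum. Passing the checksum does not prove a number is a card, so it reduces false positives rather than eliminating them.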

Why browser-only matters here

Every tool above processes your text on your own device. There is no upload step, no "we keep your data for 24 hours to improve our service," no account. You can verify it yourself: open your browser's Network tab while you use any of them. Once a tool has loaded, no new requests should appear as you work. For prep work, where the whole point is controlling what leaves your machine, that property isn't a nice-to-have. It's the entire idea.

The short version

  1. Convert files to clean text (Office, PDF).
  2. OCR any images (Image to Text).
  3. Count tokens and cost (Token Counter).
  4. Summarize or chunk to fit (Summarizer, Chunker).
  5. Redact what's sensitive (Redact PII).

Do it once and it becomes muscle memory: cheaper prompts, fewer truncations, and nothing private leaving your laptop.
