
Mapping Topics to Core Services/Products With Topical Proximity Classifier: Beyond Vector Embeddings and Cosine Similarity

If you’ve ever run an SEO content audit, you’ve probably used embeddings somewhere in the pipeline. Pull post titles, generate vectors, calculate cosine similarity between them, identify clusters, flag the duplicates. It’s the default approach in every “AI-powered” content audit tool on the market right now. It’s also not the best tool for the job.

So I built something that does a better job.

What is the Topical Proximity Classifier?

Topical Proximity Classifier is an open-source content audit classifier I built. It uses an LLM with structured output to decide what each blog post on a service business’s site actually is: its commercial role, its alignment to specific services, and what to do with it, instead of just how similar it is to other posts.

It’s free and it lives in my GitHub. I deployed it as a Python tool. Under the hood it uses GPT-5.4 with structured output or, if you prefer (as I do), Claude Sonnet 4.6, depending on how much you want to spend on API calls.

The Actual Problem it Solves

Whether a core update hit your site, or you’ve been on a nice downward glide for months, tough decisions need to be made. Sometimes these decisions include pruning, re-aligning, and a complete revamp of your content. Specifically:

  • Which pieces of content are doing real work by supporting a service the business sells with content that converts?
  • Which posts are product reviews or comparisons that don’t directly support a service but might still earn topical authority?
  • Which posts have drifted off-topic so far that they’re actively hurting the site’s topical clarity in Google’s eyes, although they bring in a few visits?
  • Which posts are duplicates or near-duplicates competing for the same query?
  • For everything in the middle, what’s the preferred action? Refresh? Merge? Redirect? Delete?

The job is making those decisions at scale, with reasoning that a stakeholder will accept, in a timeframe that doesn’t require 80 hours of manual reading. The available tools are wrong for the job:

  • Generic SEO audit tools prune by traffic decay or page age. A page with 12 impressions/month for a high-intent service query is worth keeping. A page with 2,000 impressions for an off-topic listicle is worth killing. Traffic-based pruning gets both wrong.
  • Cosine similarity finds posts that share vocabulary. It cannot distinguish between two posts that mention “fiber cement siding” but serve completely different commercial roles. It also requires arbitrary threshold tuning to be actionable.
  • Manual classification works but doesn’t scale. A consultant charging $5k for an audit can’t spend 30 hours reading 400 posts.

How is This Better Than Vector Embeddings and Cosine Similarity?

Embeddings encode semantic similarity of language. They’re optimized to put related concepts close together in vector space so search and clustering work well. They’re not optimized to encode commercial role, buyer journey stage, content format, or alignment to a specific business’s offering.

Those are reasoning tasks, not similarity tasks. An LLM with structured output can perform reasoning tasks. Given the services config plus the post content, GPT-5.4 or Claude Sonnet 4.6 produces an explicit decision about each dimension that matters for an audit.

Cosine similarity will never produce that, because these aren’t questions a similarity score can answer. Imagine a project management SaaS with a blog of 400 posts built up over four years.

Take two of them: “How to run effective sprint retrospectives” and “The 12 best sprint retrospective tools in 2026.”

Cosine similarity on these two embeddings comes back around 0.91.

A cosine-based audit flags them as cannibalization candidates and recommends merging one into the other. Upon manual review, we would usually find that these two posts are not cannibalizing each other. They target two different personas:

  • The first is a top-of-funnel educational piece: a scrum master searching for retrospective formats, not software. It supports brand awareness and topical authority, but doesn’t directly drive trials.
  • The second is a bottom-of-funnel commercial comparison: someone actively evaluating tools who’s three clicks from signing up. It probably ranks the SaaS’s own product in position one or two, and converts at a better rate than the first piece.

Cosine couldn’t see the difference because it doesn’t understand commercial role, only language similarity.

The same pattern shows up in real test data. I ran the Classifier against a 20-post sample for a hypothetical California solar installer (“Helios Solar”). Three of those posts were about the Tesla Powerwall. Here’s what the Classifier did:

| Post | Tier | post_type | commercial_intent |
| --- | --- | --- | --- |
| Tesla Powerwall 3 Review: Is It Worth It? | CORE_PRODUCT | PRODUCT_REVIEW | MEDIUM |
| Tesla Powerwall vs Enphase IQ Battery | ADJACENT_COMPARISON | COMPARISON | MEDIUM |
| Tesla Powerwall Installation Cost: Complete Breakdown | CORE_SERVICE | SERVICE_INTENT | HIGH |

When to Use Cosine Similarity?

Cosine similarity is cheap and fast. You can embed 10,000 posts for under a dollar and run pairwise comparisons in seconds. Topical Proximity costs about $2 to $3 in API calls for 500 posts and takes 5 to 10 minutes. That’s still trivial, but it’s not the same order of magnitude as embeddings.

If your job is “find near-duplicate posts”, cosine is the right tool. If your job is “decide which posts belong, which should merge, and which should die, with reasoning per URL” that’s an audit, not a similarity problem. Cosine can’t do that.
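For the near-duplicate case, the whole technique fits in a few lines. The sketch below is a minimal, standard-library illustration rather than anything taken from the tool: the three-dimensional vectors are invented toy data (real embeddings come from an embedding model and have hundreds of dimensions), and the 0.9 threshold is exactly the sort of arbitrary cutoff you end up tuning by hand.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicates(embeddings, threshold=0.9):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs


# Toy vectors, invented for illustration only.
toy = {
    "post-a": [1.0, 0.0, 0.0],
    "post-b": [1.0, 0.1, 0.0],
    "post-c": [0.0, 1.0, 0.0],
}

print(near_duplicates(toy))
```

The output is one float per pair and nothing else, which is fine for duplicate detection and not enough for a keep/merge/prune decision.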

The Technology Behind it

Topical Proximity Classifier, on the other hand, uses an LLM with structured output instead of an embedding model: specifically, GPT-5.4 via the OpenAI API, with a Pydantic-defined schema constraining the response. Anthropic’s Claude API supports the same pattern via a tool_use schema constraint. The flow:

  1. Load a services config file describing the business (services offered, in-scope vs out-of-scope examples).
  2. For each post in the audit CSV, construct a prompt that includes the services config + the post’s title + the post’s content (or extracted summary).
  3. Send the prompt to GPT-5.4 (or Claude Sonnet 4.6) with a structured-output parameter (response_format on OpenAI, tool_use on Anthropic) pointing at a Pydantic schema that defines the seven output fields (service, service_alignment, post_type, commercial_intent, tier, reasoning, overridden).
  4. The API guarantees the response is valid JSON matching the schema. The model can’t return free text or hallucinate field names.
  5. The seven fields are appended to the original CSV as new columns.
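To make the seven-field shape concrete, here is a rough sketch of what such a schema might look like. It is not the repo’s code: the tool uses a Pydantic model, while this stand-in uses only the standard library (dataclasses and enums) so it runs without extra installs. The class and function names are hypothetical, post_type is left as a plain string because the full taxonomy isn’t listed here, and the sample values are illustrative.

```python
import json
from dataclasses import dataclass, fields
from enum import Enum


class Tier(str, Enum):
    CORE_SERVICE = "CORE_SERVICE"
    CORE_PRODUCT = "CORE_PRODUCT"
    ADJACENT_COMPARISON = "ADJACENT_COMPARISON"
    ADJACENT_INFORMATIONAL = "ADJACENT_INFORMATIONAL"
    PERIPHERAL = "PERIPHERAL"
    OFF_TOPIC = "OFF_TOPIC"


class Intent(str, Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass(frozen=True)
class Classification:
    service: str              # a service name from the config, or "NONE"
    service_alignment: float  # 0.0 to 1.0
    post_type: str            # plain string: the full taxonomy is not assumed
    commercial_intent: Intent
    tier: Tier
    reasoning: str
    overridden: bool


FIELD_NAMES = tuple(f.name for f in fields(Classification))


def parse_classification(raw):
    """Validate a JSON string against the seven-field shape."""
    data = json.loads(raw)
    if set(data) != set(FIELD_NAMES):
        raise ValueError(f"field mismatch: {sorted(set(data) ^ set(FIELD_NAMES))}")
    alignment = float(data["service_alignment"])
    if not 0.0 <= alignment <= 1.0:
        raise ValueError("service_alignment must be between 0.0 and 1.0")
    return Classification(
        service=str(data["service"]),
        service_alignment=alignment,
        post_type=str(data["post_type"]),
        commercial_intent=Intent(data["commercial_intent"]),  # ValueError if unknown
        tier=Tier(data["tier"]),                              # ValueError if unknown
        reasoning=str(data["reasoning"]),
        overridden=bool(data["overridden"]),
    )


# Illustrative payload, not real tool output.
sample = json.dumps({
    "service": "Battery Backup Systems",
    "service_alignment": 0.9,
    "post_type": "SERVICE_INTENT",
    "commercial_intent": "HIGH",
    "tier": "CORE_SERVICE",
    "reasoning": "Illustrative reasoning text.",
    "overridden": False,
})
print(parse_classification(sample).tier.value)
```

With a real structured-output call, the provider enforces the schema on its side. A local check like this is only a second line of defense when you post-process the CSV yourself.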

The fundamental technological difference with cosine similarity:

| | Cosine similarity | Topical Proximity Classifier |
| --- | --- | --- |
| Core technology | Vector arithmetic on text embeddings | LLM reasoning with structured output |
| Knowledge source | Embedding model’s training corpus | Services config + post content + LLM’s reasoning capacity |
| Output | Single float per pair | Multi-dimensional typed object per post |
| Cost per 500 posts | ~$0.05 | ~$2–3 |
| Knows your business | No | Yes (via the services config) |
| Produces reasoning | No | Yes (one paragraph per classification) |

Who is This For?

This tool is for SEO consultants running blog audits for clients, in-house SEO leads at content-heavy businesses, or agencies selling fixed-scope audits. Anyone who needs to make defensible keep/prune decisions across hundreds of posts without spending 30 hours reading.

Setting It Up

Three things to have ready before you start:

  • Python 3.10 or higher installed locally
  • An OpenAI API key with access to GPT-5.4, or a Claude API key
  • Your blog posts exported to a CSV with at least three columns: URL, title, and full post content
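Before spending API credits, it can be worth checking the export. The sketch below is a hypothetical pre-flight check, not part of the tool. It assumes the three columns are headed url, title, and content (the exact header names the classifier expects are not stated here, so adjust REQUIRED to match the repo), and it reports missing columns and empty cells.

```python
import csv
import io

# Assumed header names; change these to whatever the repo documents.
REQUIRED = ("url", "title", "content")


def check_audit_csv(handle):
    """Return a list of problems found in an audit CSV; an empty list means OK."""
    reader = csv.DictReader(handle)
    header = [h.strip().lower() for h in (reader.fieldnames or [])]
    problems = [f"missing column: {col}" for col in REQUIRED if col not in header]
    if problems:
        return problems
    for row_no, row in enumerate(reader, start=1):
        # Normalize keys and treat missing values as empty strings.
        clean = {k.strip().lower(): (v or "").strip() for k, v in row.items() if k}
        for col in REQUIRED:
            if not clean.get(col):
                problems.append(f"row {row_no}: empty {col}")
    return problems


# Tiny in-memory example; the second row has no content.
example = io.StringIO(
    "URL,Title,Content\n"
    "https://example.com/a,Post A,Some body text\n"
    "https://example.com/b,Post B,\n"
)
print(check_audit_csv(example))
```

For a real file, open it with `open("blog_audit.csv", newline="", encoding="utf-8")` and pass the handle in.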

If you have those, the install is three commands:

git clone https://github.com/lumkamishi/topical-proximity.git
cd topical-proximity
pip install -e .

I’d recommend doing this inside a virtual environment so you don’t pollute your system Python (on Windows PowerShell, activate with .venv\Scripts\Activate.ps1 instead of the source line):

python -m venv .venv
source .venv/bin/activate
pip install -e .

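Then set your API key as an environment variable. If your terminal is PowerShell: $env:OPENAI_API_KEY="sk-..." or $env:ANTHROPIC_API_KEY="..." (no space after the colon). In bash or zsh, use export OPENAI_API_KEY="sk-..." or export ANTHROPIC_API_KEY="..." instead.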
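A quick way to confirm the variable is visible to Python from the shell you plan to run in is sketched below. The helper is hypothetical and says nothing about how the tool itself chooses between providers. It only reports which of the two variable names mentioned above are set.

```python
import os

# The two environment variable names mentioned above.
KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in KEY_VARS.items() if env.get(var, "").strip()]


found = configured_providers()
print("Keys found for:", ", ".join(found) if found else "none")
```

If it prints "none", the variable was probably set in a different terminal session than the one you are running from.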

How to Use it?

The Classifier needs two inputs to do its job: a description of the business’s services, and the blog audit it’s classifying.

The services file is where things are most likely to go wrong. This is the file the Classifier uses to decide what counts as on-topic, off-topic, core service, peripheral, and so on. A vague services file produces vague classifications. Spend twenty minutes on it. Here’s the format, with a project management SaaS as the example:

| service_name | description | in_scope_examples | out_of_scope_examples |
| --- | --- | --- | --- |
| Task Management | Core task tracking, assignment, and workflow features | task prioritization, workflow automation, assignment best practices | general productivity, time tracking software reviews |
| Sprint Planning | Agile sprint planning and retrospective tooling | sprint planning templates, retrospective formats, velocity tracking | scrum theory unrelated to tooling, certification advice |
| Team Collaboration | Real-time collaboration, comments, mentions, notifications | async communication workflows, team coordination patterns | Slack vs Teams comparisons, general remote work advice |
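Because the example cells contain commas, the file needs proper CSV quoting. The sketch below writes one of the rows above with Python’s csv module (which quotes automatically) and then renders it as a text block of the kind a prompt could include. It assumes the examples live comma-separated inside a single cell, as in the table. The render_for_prompt layout is a made-up illustration, not the prompt format the tool actually uses.

```python
import csv
import io

COLUMNS = ["service_name", "description", "in_scope_examples", "out_of_scope_examples"]

rows = [
    {
        "service_name": "Sprint Planning",
        "description": "Agile sprint planning and retrospective tooling",
        "in_scope_examples": "sprint planning templates, retrospective formats, velocity tracking",
        "out_of_scope_examples": "scrum theory unrelated to tooling, certification advice",
    },
]


def write_services(service_rows, handle):
    """Write service rows as CSV; cells containing commas get quoted."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(service_rows)


def render_for_prompt(handle):
    """Turn a services CSV into a plain-text block (hypothetical layout)."""
    lines = []
    for row in csv.DictReader(handle):
        lines.append(f"Service: {row['service_name']}")
        lines.append(f"  Description: {row['description']}")
        lines.append(f"  In scope: {row['in_scope_examples']}")
        lines.append(f"  Out of scope: {row['out_of_scope_examples']}")
    return "\n".join(lines)


buffer = io.StringIO()
write_services(rows, buffer)
buffer.seek(0)
print(render_for_prompt(buffer))
```

To produce an actual file, swap the StringIO buffer for `open("services.csv", "w", newline="", encoding="utf-8")`.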

Four columns per service. Be specific about both what’s in scope and what’s not: the negative examples matter as much as the positive ones for keeping drift content out of the core tier.

The blog audit file is straightforward. Three columns, one row per post.

You can pull this from a Screaming Frog crawl, a CMS export, or a script hitting your site’s RSS or sitemap. Once you have both files, run the classifier:

topical-proximity classify \
 --services services.csv \
 --audit blog_audit.csv \
 --output classified.csv

For 500 posts, expect 5–10 minutes of runtime and $2–3 in OpenAI API spend. The Classifier handles the first-pass classification, runs the verification pass to catch edge cases, and writes everything to classified.csv.
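To budget for a different blog size, the 500-post figures can be scaled. The sketch below assumes cost and runtime grow linearly with post count, which is a simplification: post length, rate limits, and how many posts get a second look in the verification pass will all move the real numbers.

```python
def estimate(posts, base_posts=500, cost_range=(2.0, 3.0), minutes_range=(5.0, 10.0)):
    """Linearly scale the quoted 500-post cost and runtime to another audit size."""
    scale = posts / base_posts
    return {
        "cost_usd": (round(cost_range[0] * scale, 2), round(cost_range[1] * scale, 2)),
        "minutes": (round(minutes_range[0] * scale, 1), round(minutes_range[1] * scale, 1)),
    }


print(estimate(2000))
# {'cost_usd': (8.0, 12.0), 'minutes': (20.0, 40.0)}
```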

Reading the output

classified.csv has your original three columns plus seven new ones appended:

| Column | What it tells you |
| --- | --- |
| service | Which service this post supports, or NONE |
| service_alignment | How directly the post supports that service (0.0–1.0) |
| post_type | Content category: SERVICE_INTENT, PRODUCT_REVIEW, COMPARISON, EDUCATIONAL, DRIFT, etc. |
| commercial_intent | HIGH / MEDIUM / LOW |
| tier | Final pruning bucket (the column you sort by) |
| reasoning | One paragraph explaining the call |
| overridden | Whether the verification pass corrected the first-pass classification |

The tier column is what drives decisions. Sort by it and act:

| Tier | Default action |
| --- | --- |
| CORE_SERVICE | Keep, optimize |
| CORE_PRODUCT | Keep, link internally to product pages |
| ADJACENT_COMPARISON | Review case-by-case |
| ADJACENT_INFORMATIONAL | Keep if performing, prune if not |
| PERIPHERAL | Strong candidate for prune or merge |
| OFF_TOPIC | Prune |
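If you would rather do the sort-and-act step in a script than in a spreadsheet, something like the following works. It is a sketch under assumptions: the tier column name comes from the output described above, while the title header is a guess at how your audit CSV names that column. The action text is copied from the table, and unknown tiers fall through to a manual review.

```python
import csv
import io

# Default actions from the table above, in priority order.
DEFAULT_ACTION = {
    "CORE_SERVICE": "Keep, optimize",
    "CORE_PRODUCT": "Keep, link internally to product pages",
    "ADJACENT_COMPARISON": "Review case-by-case",
    "ADJACENT_INFORMATIONAL": "Keep if performing, prune if not",
    "PERIPHERAL": "Strong candidate for prune or merge",
    "OFF_TOPIC": "Prune",
}
TIER_ORDER = {tier: position for position, tier in enumerate(DEFAULT_ACTION)}


def plan(handle):
    """Read classified rows, sort by tier, and attach the default action."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda r: TIER_ORDER.get(r["tier"], len(TIER_ORDER)))
    return [
        (r["title"], r["tier"], DEFAULT_ACTION.get(r["tier"], "Review manually"))
        for r in rows
    ]


# Two rows from the sample run, deliberately out of order.
classified = io.StringIO(
    "title,tier\n"
    "Smart Thermostats for Energy Savings,OFF_TOPIC\n"
    "How Much Does Solar Installation Cost in 2025?,CORE_SERVICE\n"
)
for title, tier, action in plan(classified):
    print(f"{tier:<14} {action:<16} {title}")
```

These are defaults, not verdicts: the reasoning column is still what you read before acting on anything in the middle tiers.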

Here’s a real output from a test run:

| Title | tier | service_alignment | post_type | commercial_intent |
| --- | --- | --- | --- | --- |
| How Much Does Solar Installation Cost in 2025? | CORE_SERVICE | 0.95 | SERVICE_INTENT | HIGH |
| Tesla Powerwall 3 Review: Is It Worth It? | CORE_PRODUCT | 0.70 | PRODUCT_REVIEW | MEDIUM |
| Monocrystalline vs Polycrystalline Solar Panels | ADJACENT_COMPARISON | 0.65 | COMPARISON | MEDIUM |
| How Solar Panels Actually Work: The Science | ADJACENT_INFORMATIONAL | 0.45 | EDUCATIONAL | LOW |
| 10 Beautiful Solar Roof Designs | PERIPHERAL | 0.35 | DESIGN_INSPIRATION | LOW |
| Smart Thermostats for Energy Savings | OFF_TOPIC | 0.10 | DRIFT | LOW |

Each row also includes a reasoning column (truncated here for space). Example reasoning for the Tesla Powerwall Installation Cost post:

“This post is about the cost of installing a Tesla Powerwall, covering labor, permits, and panel integration — all of which are core components of Helios Solar’s Battery Backup Systems service. The reader is a homeowner actively evaluating whether to purchase and install a Powerwall, making this close-to-hire intent. The next action would likely be requesting a quote from a local installer.”

Limitations

Every tool has limitations, and this one is no exception. Here are some caveats worth knowing about before using it:

  • The classifier is only as good as your services config. If services.csv doesn’t accurately describe what the business sells, the alignment scores will be wrong. The twenty minutes you spend on that file determines the quality of the entire output.
  • The post_type taxonomy is opinionated. It’s tuned for service businesses: renovation, legal, contracting, professional services, B2B SaaS with a service layer. It works for pure-product SaaS with some interpretation.
  • No SERP grounding (yet). The current version classifies posts based on their content and the services config alone. It doesn’t pull SERP data to verify how Google actually treats each URL and each query. Adding SERP analysis via a paid SERP API (DataForSEO, Serper, SerpAPI) or scraping would meaningfully improve the quality of the output, particularly anything sitting in ADJACENT_COMPARISON or PERIPHERAL. However, the cost trade-off is real. API calls run $0.50 to $2.00 per 1,000 queries depending on provider, which could push a 500-post audit from $2–3 to $5–15 total, depending on how many queries you run per post. If you’re using this and have thoughts on integrating SERP data, open an issue and tell me how you’d integrate it.
  • LLMs have a randomness factor built in. Run the same input twice and you’ll get ~95% identical output, with the remaining 5% being edge cases on borderline content.
  • Like every AI tool, this one doesn’t replace human judgement. The tier column tells you what the classifier thinks. You still need to review the borderline cases, particularly anything sitting in ADJACENT_COMPARISON or PERIPHERAL. The tool exists to do the 80% of the work that’s mechanical so you can spend your time on the 20% that requires actual thinking.
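The run-to-run variation is easy to measure for your own content: classify the same input twice and compare the tier column. The sketch below assumes you have already loaded each run into a dict mapping URL to tier. The function name and the example URLs are hypothetical.

```python
def tier_agreement(run_a, run_b):
    """Compare two runs (dicts of URL -> tier) and return (agreement, changed URLs)."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no URLs in common")
    changed = sorted(url for url in shared if run_a[url] != run_b[url])
    return 1 - len(changed) / len(shared), changed


# Made-up example data: one of four posts flips between runs.
first = {"/a": "CORE_SERVICE", "/b": "PERIPHERAL", "/c": "OFF_TOPIC", "/d": "CORE_PRODUCT"}
second = {"/a": "CORE_SERVICE", "/b": "ADJACENT_INFORMATIONAL", "/c": "OFF_TOPIC", "/d": "CORE_PRODUCT"}

rate, flipped = tier_agreement(first, second)
print(f"agreement: {rate:.0%}, changed: {flipped}")
```

The URLs that flip between runs are a ready-made shortlist of borderline posts to review by hand.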

Before Running

Don’t run it on your full blog the first time. Run it on a sample of 20–30 posts that you already know well — ones where you have a strong intuition about whether they should stay or go. Check the reasoning column on each one. If the calls match your intuition, you can trust the full run.
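One way to make that check systematic is to write down the tier you expect for each sample post before running the classifier, then list only the disagreements alongside the model’s reasoning. The sketch below assumes the classified rows are dicts with url, tier, and reasoning keys (url being a guess at your audit CSV’s header). The helper name and the example rows are invented for illustration.

```python
def review_sample(classified_rows, expected_tiers):
    """Return the sample posts where the classifier's tier differs from yours."""
    mismatches = []
    for row in classified_rows:
        expected = expected_tiers.get(row["url"])
        if expected is not None and expected != row["tier"]:
            mismatches.append({
                "url": row["url"],
                "expected": expected,
                "got": row["tier"],
                "reasoning": row["reasoning"],
            })
    return mismatches


# Invented example rows; in practice, load these from classified.csv.
rows = [
    {"url": "/solar-cost", "tier": "CORE_SERVICE", "reasoning": "Illustrative text."},
    {"url": "/roof-designs", "tier": "ADJACENT_INFORMATIONAL", "reasoning": "Illustrative text."},
]
my_calls = {"/solar-cost": "CORE_SERVICE", "/roof-designs": "PERIPHERAL"}

for miss in review_sample(rows, my_calls):
    print(f"{miss['url']}: expected {miss['expected']}, got {miss['got']}")
    print(f"  reasoning: {miss['reasoning']}")
```

Each mismatch is either a gap in the services config or a case where your own intuition deserves a second look, and the reasoning text usually tells you which.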

Tune the services.csv file, re-run the sample, and repeat until the output matches your judgement on the cases you already know. Then run the full audit, use it, and tell me what’s not working.

The repo is at github.com/lumkamishi/topical-proximity. Issues and PRs welcome. Specifically interested in:

  • Taxonomy proposals for verticals beyond service businesses
  • Edge cases where the classifier confidently picks the wrong tier
  • Performance optimizations for audits over 10,000 posts

If you’d rather have someone else run it against your blog with a proper services config and reviewed output, I do that as a fixed-scope audit. Two weeks. Get in touch.