
The Second Index: How LLMs are Forming a Parallel Internet 

Rachel Hernandez, February 12, 2026

Through training on web-scale data and RAG (retrieval-augmented generation), LLMs are forging a second index of the internet.

Here’s what we mean. 

Google’s massive search index (the first index) is based on syntax.

It relies on keywords, backlinks, URLs, and crawlable site structures. It’s optimized for efficient retrieval, not conceptual reasoning. 

Searches match strings, then they’re ranked by authority and relevance. 

The second index created by LLMs is built on semantics, which revolve around meaning, relationships, entities, and context. 

Instead of listing web pages, LLMs reconstruct knowledge by mapping entities (brands, organizations, people, etc.) and understanding their connections across countless documents. 

Essentially, the second index is a conceptual mirror of the web. LLMs do not index the web the way search engines or humans see it; they index the meaning itself.

This matters for brands because it’s fundamentally altered the search marketing playbook. 

Marketers aren’t competing for rankings anymore; they’re competing for representation inside AI entity graphs. 

Keep reading to learn what the second index is and how to optimize your site for maximum visibility within it. 

What Is the Second Index? Why Does It Exist?

The second index is a compressed, conceptual map containing entities, facts, and relationships distilled from training data, not live crawling. 

How was it built?

LLMs are pre-trained on web-scale datasets, and we’re not using the term ‘web-scale’ lightly. 

Think hundreds of terabytes’ worth of scraped and crawled web content, books, press releases, and all manner of other written content. 

During training, all the text gets converted into high-dimensional vector embeddings that capture:

  1. Semantic meaning 
  2. Relationships 
  3. Context 

An LLM then uses billions of parameters to encode these embeddings into a compressed, probabilistic knowledge graph that serves as its internal second index.
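As a toy illustration of how retrieval over embeddings differs from string matching: the vectors below are hand-assigned, three-dimensional stand-ins for real learned embeddings, and the document titles are made up. The point is that the top result shares no keywords with the query.

```python
import math

# Hand-assigned, 3-dimensional toy "embeddings" -- real models learn
# vectors with hundreds or thousands of dimensions during training.
documents = {
    "top APY bank offers": [0.9, 0.8, 0.1],
    "sourdough starter feeding schedule": [0.05, 0.1, 0.95],
}
# Embedding for the query "where should I park my emergency fund"
query_vector = [0.85, 0.9, 0.15]

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank documents by semantic closeness to the query. The winner
# shares zero keywords with the query text -- only meaning.
ranked = sorted(documents, key=lambda d: cosine(query_vector, documents[d]), reverse=True)
print(ranked[0])  # -> 'top APY bank offers'
```

A keyword index would find no match at all here; similarity search over embeddings is what lets the second index answer the question anyway.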

TL;DR?

LLMs created a second, meaning-based search index that runs parallel to Google’s syntax-based search index. It consists of embeddings that represent word meanings and conceptual relationships. 

Your brand lives in this index as patterns of embeddings that represent who you are, what you do, and why you matter. 

Here’s a breakdown of the differences between the first and second index:

| Aspect | First Index (Google) | Second Index (LLMs) |
| --- | --- | --- |
| Core Structure | Document-based: URLs, pages, inverted keyword index | Semantic embeddings: vector representations of meaning, entities, and relationships |
| Data Source | Live crawling of web pages | Massive static training on web-scale data and RAG |
| Organization | Syntax-driven: keywords, links, metadata | Semantics-driven: concepts, narratives, probabilistic knowledge graphs |
| Access Method | Keyword matching, ranking, 10 blue links | Query embedding, similarity search, synthesized answers |
| Entity Handling | Recognized via schema, links, and mentions (post-ranking) | Core unit: distilled across documents into interconnected nodes |
| Update Frequency | Continuous via live web crawlers | Static (training cutoffs), plus retrieval-augmented generation for real-time web data |
| Brand Visibility | Rankings, featured snippets, sitelinks | Direct mentions in generated answers, entity associations |
| Optimization Focus | Keywords, technical SEO, link-building | Entity clarity, topical depth, semantic chunking, multi-source citations |
| Analogy | A rolodex of web pages | A conceptual map in a storyteller’s mind |

Why the Second Index Matters for Maintaining Online Visibility 

If you’ve kept track of your SEO for a while now, then you know how devastating it is when crucial web pages aren’t crawled or indexed. 

If Google can’t index your content, it’s virtually invisible to your audience. 

Well, the same principle applies to the second index, just with semantics instead of syntax.

If your content lacks clear entity signals, a chunk-friendly structure, and topical depth, your brand simply won’t register cleanly in LLMs’ conceptual maps. 

The result?

Your brand won’t get cited or recommended for important AI prompts related to your business. 

To avoid drops in visibility, you must adopt the basics of GSO (generative search optimization) alongside your existing SEO best practices.

Also, according to research by Anthropic, it’s extremely easy for LLMs to form narratives about brands through consistent signals. All it takes is approximately 250 documents containing similar claims for a narrative to form inside an AI system. 

Once it does, the narrative becomes self-reinforcing and difficult to undo. 

This presents the need for marketers to own their brand’s narrative within the second index, which is only possible with clear, consistent trust signals from content and external mentions. 

How Traditional SEO Stops at Syntax (and Doesn’t Improve AI Search) 

Traditional SEO tactics are ineffective for improving your brand’s visibility on AI search platforms like ChatGPT, even if you’re ranked #1 for your most important keywords. 

How is that?

It’s because those optimizations are entirely centered around retrieval efficiency instead of reinforcing meaning.

Keyword-era SEO entailed matching queries to exact phrases, which was an entirely lexical pursuit. On-page keyword placement, link-building, and basic technical hygiene were sufficient for securing high rankings, regardless of industry. 

Also, classic SEO efforts are isolated. You target one keyword at a time, often creating hundreds or even thousands of pages to capitalize on trending industry terms. 

Therein lies the problem for AI search optimization. 

LLMs favor content that’s interconnected and cohesive. Instead of fragmented signals from siloed keyword pages, they reward topical depth achieved through interlinked content clusters. 

This structure reinforces entity recognition, ensuring that LLMs have a clear understanding of who you are and what you do. 

That’s why some brands can dominate the organic SERPs while remaining practically invisible to LLMs. Their signals are too fragmented to form a consistent narrative about their brand. 

In other words, your keyword wins won’t translate to the second index without a semantic structure.

Optimizing for the Second Index: Basics of Semantic SEO 

Semantic SEO, also called generative search optimization (GSO), shifts the focus from keyword matching to building machine-readable meaning for your brand’s entity. 

Here’s a look at the most important optimizations you need to make to cement a place for your brand in the second index.

Token chunking 

First, you need to understand how token limits work. 

LLMs don’t read words; they process tokens, which are typically words or parts of words. Each AI model has a hard token limit, like GPT-4 Turbo’s 128,000-token context window. 

Long documents can’t be embedded usefully as a single unit, so content gets chunked into smaller, overlapping segments of roughly 300–500 tokens (the exact size varies by model and pipeline). 

These tokens are what get converted into the vector embeddings that comprise the second index. 

Token limits apply both to training data and live web retrieval, meaning LLMs won’t process each piece of your content in its entirety.

Instead, they’ll break the piece into self-contained chunks of around 500 tokens. For instance, an H3 describing the ‘pros and cons of a high-yield savings account’ would get processed as its own chunk. 

Here’s where problems may occur.

If your content isn’t split along clear topical boundaries with descriptive subheadings, your chunks may not make sense to LLMs on their own. This is especially true if you tend to drift off topic or go on random tangents. 

If chunks break mid-concept or lack clear boundaries, LLMs can miss important context. 

At the same time, since each chunk is self-contained, the greater context from the rest of the article may be missed. 

Here are some optimization tips for token chunking:

  • Use subheadings (H2s and H3s) to split articles into clearly defined subtopics (like ‘What is a high-yield savings account?’ and ‘How do high-yield savings accounts work?’)
  • Do not venture off topic or go on tangents during each subsection 
  • Immediately answer questions that you pose in the very next sentence 
  • Make frequent use of comparison tables and bulleted lists (with corresponding subheadings)  

Also, try to keep each subsection under roughly 400 words so that individual chunks stay comfortably within token limits. 
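The chunking behavior described above can be approximated in a short sketch. It uses whitespace-separated words as a crude stand-in for real subword tokens, and assumes markdown-style `##`/`###` subheadings; the function name and defaults are illustrative, not any particular model’s pipeline.

```python
import re

def chunk_article(markdown_text, max_tokens=500, overlap=50):
    """Split an article at H2/H3 boundaries, then cap each section at
    roughly max_tokens whitespace-separated words (a crude stand-in for
    real subword tokenizers), overlapping consecutive chunks so context
    isn't lost at the seams. Requires overlap < max_tokens."""
    # Zero-width lookahead split keeps each heading with its section.
    sections = re.split(r"(?m)^(?=#{2,3} )", markdown_text)
    chunks = []
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        start = 0
        while start < len(tokens):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break
            start += max_tokens - overlap

    return chunks

article = (
    "## What is a high-yield savings account?\n"
    "A short, self-contained answer goes here.\n"
    "## How do high-yield savings accounts work?\n"
    "Another self-contained explanation goes here.\n"
)
chunks = chunk_article(article)
print(len(chunks))  # -> 2: one clean chunk per subheading
```

Notice that well-scoped subheadings mean each chunk starts at a topical boundary, which is exactly the property the tips above are optimizing for.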

Structured data 

Implementing sitewide structured data is how you make your content explicitly machine-readable. 

In particular, schema markup defines entities in your content like organizations, authors, places, and brands. 

Crucial schema types and properties include Organization, Person, LocalBusiness, author, and sameAs. Schema markup removes ambiguity so that LLMs can clearly understand who and what you’re talking about. 

Structured data helps create a unique node for your brand in AI knowledge graphs (i.e., the second index), so it’s crucial to include it. You can check out our guide on structured data as the new SEO cheat code to dive deeper. 
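As a minimal sketch of what an Organization node looks like, here is the JSON-LD you would embed in a `<script type="application/ld+json">` tag in your page’s head. Every name and URL below is a hypothetical placeholder; swap in your brand’s real details.

```python
import json

# Hypothetical example values -- replace with your brand's real details.
organization_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Savings Co.",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    # sameAs ties this node to the brand's other web presences,
    # disambiguating the entity for crawlers and LLMs alike.
    "sameAs": [
        "https://www.linkedin.com/company/example-savings-co",
        "https://twitter.com/examplesavings",
    ],
}

# Serialize for embedding in a <script type="application/ld+json"> tag.
print(json.dumps(organization_schema, indent=2))
```

The sameAs links are what collapse scattered mentions (your site, your social profiles, third-party coverage) into a single unambiguous entity node.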

Topical depth 

Remember, single-keyword pages lead to fragmented signals, so you should transition to interconnected content clusters instead. 

Topical authority matters to LLMs, and they prefer to cite brands that have consistent, high-quality content that fleshes out their area of expertise in as much detail as possible. 

To create content clusters, you need to designate a pillar topic and cluster topics to go along with it. Pillar topics are general, overarching deep dives into a topic, while cluster pages flesh out each subtopic. 

For example, The Ultimate Guide to Photography would be a pillar page, while White Balancing Basics and Understanding Composition would be cluster pages. 

Internal links are the glue that holds content clusters together, so interlink your pillar pages and cluster pages. 

Check out our guide on mastering content clusters to learn more. 
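One way to audit a cluster programmatically is to check that every cluster page links back to its pillar. A minimal sketch with hypothetical URLs, modeling each page as a list of the internal links it contains:

```python
# Hypothetical site map: a pillar page plus cluster pages, each mapped
# to the internal links that page contains.
pages = {
    "/photography-guide": ["/white-balance-basics", "/understanding-composition"],
    "/white-balance-basics": ["/photography-guide"],
    "/understanding-composition": [],  # missing its link back to the pillar
}

pillar = "/photography-guide"

# Flag cluster pages that break the two-way pillar/cluster linking.
orphans = [
    page for page, links in pages.items()
    if page != pillar and pillar not in links
]
print(orphans)  # -> ['/understanding-composition']
```

Running a check like this across a real sitemap surfaces the fragmented, one-way clusters that weaken entity signals.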

Essential off-page trust signals 

It’s not enough for LLMs to recognize your brand as a consistent entity; you also need to be deemed credible.

Earning credibility on AI search platforms looks different from building authority on classic search engines. 

Instead of building a large quantity of backlinks, the focus is on editorial quality, relevance, and context.

Third-party brand mentions and backlinks from credible news sites and media outlets carry the most weight, which is why digital PR is so popular for improving AI search visibility. 

Concluding Thoughts: The Second Index Built by LLMs 

To wrap up, LLMs are actively building a parallel, semantics-based index that runs alongside Google’s syntax-based one. 

It presents the immediate need for marketers to alter the way they optimize web content to perform well in search. 

Instead of targeting single keywords, it’s about forming a consistent, high-quality narrative about your brand in AI knowledge graphs. 

Do you need expert help adapting to the second index?

Sign up for AI Discover, our service that’s exclusively for improving your visibility on AI-powered search engines!   
