# Elasticsearch: Full-Text Search and Index Management

> Learn how to use Elasticsearch for full-text search, document indexing, and query DSL with practical examples for backend engineers.

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene. You interact with it over a REST API using JSON, and it stores all data as JSON documents inside named indexes. Unlike MySQL, which finds rows by scanning or following B+ tree indexes, Elasticsearch builds an **inverted index** at write time so that every term in every text field points directly to the documents that contain it. Lookup cost therefore depends mainly on the number of matching terms and documents rather than on total index size, which keeps full-text search fast as data grows. The trade-off is that Elasticsearch is optimized for search-heavy read workloads rather than transactional writes or strict relational joins.

## Core concepts

### Index

An index is a named collection of documents that share a similar structure — analogous to a table in MySQL. Index names must be lowercase. You can have one index per entity type (e.g., `products`, `orders`) or combine related entities into a single index with distinct field sets.

```http theme={null}
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

Index health states:

* **green** — all primary and replica shards are allocated and active.
* **yellow** — all primary shards are active, but at least one replica shard is unallocated. Reads and writes work; the cluster has no redundancy for affected shards.
* **red** — at least one primary shard is unallocated. Some data is unavailable.
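The three rules above can be condensed into a few lines. The sketch below is a simplified model for illustration, not how Elasticsearch computes health internally; representing each shard as an `(is_primary, is_assigned)` pair is an assumption made here.

```python
def index_health(shards):
    """Derive green/yellow/red from (is_primary, is_assigned) pairs."""
    # Any unassigned primary means some data is unavailable.
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"
    # All primaries are up, but at least one replica is missing.
    if any(not is_primary and not assigned for is_primary, assigned in shards):
        return "yellow"
    return "green"


print(index_health([(True, True), (False, True)]))   # green
print(index_health([(True, True), (False, False)]))  # yellow
print(index_health([(True, False), (False, True)]))  # red
```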

### Document

A document is a single JSON object stored inside an index. It is the smallest unit Elasticsearch indexes and returns. Every document has an `_id` metadata field (auto-generated or user-specified) and an `_index` field indicating which index it belongs to; the JSON you originally indexed is returned under `_source`.

```json theme={null}
{
  "_index": "products",
  "_id": "1",
  "_source": {
    "title": "Wireless Headphones",
    "price": 89.99,
    "created_at": "2024-03-01",
    "description": "Over-ear noise-cancelling headphones with 30-hour battery."
  }
}
```

### Shard

Elasticsearch horizontally partitions each index into **shards**. Each shard is an independent Lucene index that can be hosted on any node in the cluster. Sharding lets you store more data than fits on a single machine and parallelize search queries across multiple nodes.

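You set `number_of_shards` at index creation time and **cannot change it in place afterward**. Changing the primary shard count means reindexing into a new index or using the split or shrink APIs, which also write to a new index. Choose a shard count that fits your expected data volume and leaves room to grow.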
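The reason the shard count is fixed becomes clearer when you look at how a document is routed. In simplified form, the target shard is a hash of the routing value (by default the document `_id`) modulo the number of primary shards. The sketch below assumes that simplified formula and uses CRC32 as a stand-in hash; Elasticsearch itself uses Murmur3, so the concrete shard numbers here will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, number_of_shards: int) -> int:
    # Stand-in hash for illustration only; Elasticsearch uses Murmur3.
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


ids = [str(i) for i in range(1000)]

# Count how many documents would map to a different shard if the
# primary shard count changed from 3 to 4.
moved = sum(pick_shard(doc_id, 3) != pick_shard(doc_id, 4) for doc_id in ids)
print(f"{moved} of {len(ids)} documents would land on a different shard")
```

Because most documents would map to a different shard under the new modulus, lookups by `_id` could no longer find existing data, which is why changing the count requires rewriting the index.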

### Replica

A **replica** is an exact copy of a primary shard hosted on a different node. Replicas serve two purposes:

1. **Fault tolerance** — if the node holding a primary shard fails, a replica is promoted to primary automatically.
2. **Read throughput** — search queries can be routed to any replica, distributing read load.

You can change `number_of_replicas` on a live index without reindexing.
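As a rough illustration, the replica count is changed through the index `_settings` endpoint. The sketch below only builds the request with the standard library and does not send it; the base URL `http://localhost:9200` and the helper name `replica_update_request` are assumptions for this example.

```python
import json
import urllib.request


def replica_update_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build a PUT /<index>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = replica_update_request("http://localhost:9200", "products", 2)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs without a cluster.
```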

***

## Mapping and field types

**Mapping** defines how Elasticsearch stores and indexes each field — the equivalent of a table schema. Elasticsearch can infer mapping from the first document you index (dynamic mapping), but for production use you should define explicit mappings to control field types and prevent unintended behavior.

```http theme={null}
PUT /products
{
  "settings": { "number_of_replicas": 0, "number_of_shards": 1 },
  "mappings": {
    "properties": {
      "id":          { "type": "integer" },
      "title":       { "type": "keyword" },
      "price":       { "type": "double" },
      "created_at":  { "type": "date" },
      "description": { "type": "text" }
    }
  }
}
```

<Warning>
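  You can add new fields to an existing mapping, but you cannot change the type of an existing field or remove it. To change a field's type, create a new index with the corrected mapping, reindex your data into it (for example with the `_reindex` API), and then switch your application over, typically through an alias. Delete the old index only after the new one is verified.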
</Warning>

### Key field types

| Type               | Behavior                                                        | Use when                               |
| ------------------ | --------------------------------------------------------------- | -------------------------------------- |
| `keyword`          | Not analyzed; exact matching, sorting, and aggregations         | IDs, status codes, tags, enum values   |
| `text`             | Split into terms by the configured analyzer; full-text search   | Product names, descriptions, body text |
| `integer` / `long` | Numeric integer                                                 | Counts, IDs, ages                      |
| `float` / `double` | Floating-point                                                  | Prices, scores, coordinates            |
| `date`             | ISO 8601 string or epoch milliseconds                           | Timestamps                             |
| `boolean`          | `true` / `false`                                                | Flags                                  |

The critical distinction is between `keyword` and `text`:

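* `keyword` fields index the raw string as a single term. They support exact-match, prefix, wildcard, and range queries, as well as sorting and aggregations.
* `text` fields are tokenized (split into individual terms by an analyzer) and support full-text queries. The trade-off is that `text` fields cannot be sorted or aggregated by default; doing so requires enabling `fielddata`, which is memory-intensive, or adding a `keyword` sub-field.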
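To make the difference concrete, the sketch below approximates what ends up in the index for each type. `analyze_text` is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase); the real analyzer follows Unicode segmentation rules and is configurable, so treat this as an assumption-laden model.

```python
import re


def analyze_text(value: str) -> list:
    # Rough approximation of the standard analyzer: split, then lowercase.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def analyze_keyword(value: str) -> list:
    # A keyword field indexes the whole value as one unmodified term.
    return [value]


title = "Wireless Headphones"
text_terms = analyze_text(title)        # ['wireless', 'headphones']
keyword_terms = analyze_keyword(title)  # ['Wireless Headphones']

# A term query compares its value against indexed terms without analysis.
print("Wireless" in text_terms)                # False: the indexed term is lowercased
print("wireless" in text_terms)                # True
print("Wireless Headphones" in keyword_terms)  # True: exact whole-value match
```

This is why a `term` query with a capitalized value typically finds nothing on a `text` field, while the same value matches a `keyword` field.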

***

## Query DSL

Elasticsearch's **Query DSL** lets you express searches as JSON objects sent in the request body of a `GET /<index>/_search` request. The response contains a `hits.hits` array of matching documents, each with a `_score` representing its relevance to the query.

### match\_all

Returns every document in the index.

```http theme={null}
GET /products/_search
{
  "query": { "match_all": {} }
}
```

### term

Exact-match query for `keyword`, numeric, date, or boolean fields. Does not analyze the query value.

```http theme={null}
GET /products/_search
{
  "query": {
    "term": { "id": { "value": 1 } }
  }
}
```

### match

Full-text query for `text` fields. Analyzes the query string using the same analyzer as the field.

```http theme={null}
GET /products/_search
{
  "query": {
    "match": { "description": "noise cancelling headphones" }
  }
}
```

### range

Returns documents where a field value falls within a specified range.

```http theme={null}
GET /products/_search
{
  "query": {
    "range": {
      "price": { "gte": 20, "lte": 100 }
    }
  }
}
```

### bool

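Combines multiple queries with boolean logic. Use `must` (AND), `should` (OR), and `must_not` (NOT). A fourth clause, `filter`, behaves like `must` but does not contribute to `_score` and can be cached, so it is usually the better place for exact or range conditions that only need a yes/no answer.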

```http theme={null}
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "headphones" } },
        { "range":  { "price": { "lte": 150 } } }
      ],
      "must_not": [
        { "term": { "title": { "value": "out of stock" } } }
      ]
    }
  }
}
```
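When bool queries are assembled in application code, a small builder keeps the nesting manageable. The helper below is a hypothetical sketch (the name `bool_query` and its parameters are not part of any client library); it omits empty clauses and places the price condition in `filter`, the clause type that must match but does not affect scoring.

```python
import json


def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Build a search body with a bool query, skipping empty clauses."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {"query": {"bool": {name: items for name, items in clauses.items() if items}}}


body = bool_query(
    must=[{"match": {"description": "headphones"}}],
    filter_=[{"range": {"price": {"lte": 150}}}],
)
print(json.dumps(body, indent=2))
```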

### multi\_match

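Runs the same query string against multiple fields simultaneously. With the mapping above, `description` is analyzed as full text, while `title` is a `keyword` field and matches only when the entire query string equals the stored value exactly.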

```http theme={null}
GET /products/_search
{
  "query": {
    "multi_match": {
      "query":  "wireless headphones",
      "fields": ["title", "description"]
    }
  }
}
```

### Highlighting

You can ask Elasticsearch to return the matched fragment with the matching terms wrapped in HTML tags.

```http theme={null}
GET /products/_search
{
  "query": { "match": { "description": "noise cancelling" } },
  "highlight": {
    "pre_tags":  ["<mark>"],
    "post_tags": ["</mark>"],
    "fields":    { "description": {} }
  }
}
```

***

## How the inverted index works

Elasticsearch builds an **inverted index** for every `text` field. A normal (forward) index maps documents to words; an inverted index maps words to documents. This is what makes full-text search fast.

**Build time (indexing)**:

1. The analyzer splits the field value into terms (tokenization, lowercasing, stop-word removal, stemming depending on configuration).
2. For each term, Elasticsearch records the document ID, the positions of the term within the document, and how often the term occurs there (term frequency).
3. The resulting mapping from term → document list is stored in the Lucene segment files on disk.
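The build steps above can be sketched in plain Python. This is a toy model that assumes a very simple analyzer (split on non-alphanumerics, lowercase, no stop words or stemming) and keeps everything in a dictionary; Lucene's on-disk segment format is far more compact and is not represented here.

```python
import re
from collections import defaultdict


def analyze(text):
    # Simplified analyzer: tokenize on non-alphanumerics and lowercase.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "Over-ear noise-cancelling headphones with 30-hour battery.",
    2: "Portable waterproof speaker",
}
index = build_inverted_index(docs)
print(index["headphones"])  # {1: [4]}
print(index["speaker"])     # {2: [2]}
```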

**Query time**:

1. The query string is analyzed using the same analyzer.
2. Elasticsearch looks up each resulting term in the inverted index to get a list of document IDs.
3. For multi-term queries, Elasticsearch intersects (AND) or unions (OR) the document lists.
4. Documents are scored with BM25, the default similarity since Elasticsearch 5.0 (earlier versions used classic TF-IDF), and sorted by score.

Because the term-to-document mapping is precomputed at index time, search does not scan documents — it performs a direct lookup.
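The query-time steps can be sketched the same way. The function below is an illustrative model, not Lucene's implementation: it assumes a simple analyzer, the textbook BM25 formula with a Lucene-style IDF, and the commonly cited defaults `k1=1.2` and `b=0.75`. The third sample document is made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def search(docs, query, operator="or", k1=1.2, b=0.75):
    """Return [(doc_id, score)] sorted by descending BM25 score."""
    # Index time: term -> {doc_id: term frequency}, plus document lengths.
    index = defaultdict(Counter)
    lengths = {}
    for doc_id, text in docs.items():
        terms = analyze(text)
        lengths[doc_id] = len(terms)
        for term in terms:
            index[term][doc_id] += 1
    avg_len = sum(lengths.values()) / len(lengths)

    # Query time: analyze, look up postings, then intersect (AND) or union (OR).
    query_terms = analyze(query)
    postings = [set(index.get(term, {})) for term in query_terms]
    if not postings:
        return []
    combine = set.intersection if operator == "and" else set.union
    matched = combine(*postings)

    scores = {}
    for doc_id in matched:
        score = 0.0
        for term in query_terms:
            tf = index.get(term, {}).get(doc_id, 0)
            if tf == 0:
                continue
            n = len(index[term])  # number of documents containing the term
            idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "Over-ear noise-cancelling headphones with 30-hour battery.",
    2: "Portable waterproof speaker",
    3: "Wired headphones",  # hypothetical extra document
}
print([doc_id for doc_id, _ in search(docs, "noise cancelling headphones")])
print([doc_id for doc_id, _ in search(docs, "noise cancelling headphones", operator="and")])
```

With OR semantics, documents 1 and 3 both match, and document 1 ranks first because it contains all three terms, two of which are rare. With AND semantics, only document 1 remains.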

***

## Index management

### Create an index with settings

```http theme={null}
PUT /products
{
  "settings": {
    "number_of_shards":   3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title":       { "type": "keyword" },
      "description": { "type": "text" },
      "price":       { "type": "double" },
      "created_at":  { "type": "date" }
    }
  }
}
```

### Index a document

```http theme={null}
POST /products/_doc/1
{
  "title":       "Wireless Headphones",
  "description": "Over-ear noise-cancelling headphones with 30-hour battery.",
  "price":       89.99,
  "created_at":  "2024-03-01"
}
```

### Update a document (partial)

```http theme={null}
POST /products/_update/1
{
  "doc": { "price": 79.99 }
}
```
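A partial update merges the `doc` object into the stored source: nested objects are merged recursively, while scalar values and arrays are replaced. The sketch below models that merge in Python as an approximation of the documented behavior; in the real engine the merged document is then reindexed as a whole, since stored documents are not modified in place.

```python
def merge_partial(existing: dict, partial: dict) -> dict:
    """Recursively merge `partial` into a copy of `existing`."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays simply replace the old value.
            merged[key] = value
    return merged


stored = {"title": "Wireless Headphones", "price": 89.99, "created_at": "2024-03-01"}
updated = merge_partial(stored, {"price": 79.99})
print(updated["price"], updated["title"])  # 79.99 Wireless Headphones
```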

### Delete a document

```http theme={null}
DELETE /products/_doc/1
```

### Bulk operations

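The `_bulk` API processes multiple index, create, update, and delete operations in a single request. The body is newline-delimited JSON: each action line is followed by a source line where one is required, and the body must end with a newline. Operations in a bulk request are **not atomic**: individual operations can fail without rolling back the others, so check the `errors` flag and the per-item results in the response.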

```http theme={null}
POST /_bulk
{"index": {"_index": "products", "_id": "2"}}
{"title": "Bluetooth Speaker", "price": 49.99, "created_at": "2024-04-01", "description": "Portable waterproof speaker"}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 75.00}}
{"delete": {"_index": "products", "_id": "3"}}
```
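Because bulk bodies are newline-delimited JSON and failures are reported per item, application code usually needs two small pieces: one to serialize the operations and one to pick failures out of the response. The helpers below are a hypothetical sketch (`bulk_body` and `failed_items` are names invented here), and the sample response is hand-written to mirror the general shape of a bulk response with a top-level `errors` flag and an `items` list.

```python
import json


def bulk_body(operations):
    """Serialize (action, metadata, source_or_None) tuples as NDJSON."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the trailing newline is required


def failed_items(response):
    """Return (action, _id, error) for every failed item in a bulk response."""
    if not response.get("errors"):
        return []
    failures = []
    for item in response["items"]:
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result["error"]))
    return failures


body = bulk_body([
    ("index", {"_index": "products", "_id": "2"}, {"title": "Bluetooth Speaker", "price": 49.99}),
    ("update", {"_index": "products", "_id": "1"}, {"doc": {"price": 75.00}}),
    ("delete", {"_index": "products", "_id": "3"}, None),
])
print(body, end="")

# Hand-written sample response for illustration only.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "2", "status": 201}},
        {"update": {"_id": "1", "status": 404, "error": {"type": "document_missing_exception"}}},
    ],
}
print(failed_items(sample_response))
```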

### Check index health

```http theme={null}
GET /_cat/indices?v
```

***

## When to use Elasticsearch vs. MySQL vs. Redis

| Requirement                                   | Best fit               | Reason                                    |
| --------------------------------------------- | ---------------------- | ----------------------------------------- |
| Full-text search with relevance scoring       | Elasticsearch          | Inverted index with BM25 scoring          |
| Transactional writes with ACID guarantees     | MySQL                  | MVCC, two-phase commit, foreign keys      |
| Simple key lookups at sub-millisecond latency | Redis                  | In-memory, O(1) hash lookup               |
| Range queries on numeric or date fields       | MySQL or Elasticsearch | B+ tree (MySQL) or range filter (ES)      |
| Aggregations over large document sets         | Elasticsearch          | Distributed aggregation framework         |
| Relational joins across normalized tables     | MySQL                  | JOIN optimizer, foreign key constraints   |
| Session storage, counters, leaderboards       | Redis                  | Purpose-built data structures             |
| Log analytics and time-series search          | Elasticsearch          | Scalable inverted index + date histograms |

<Note>
  Elasticsearch search is near real-time rather than immediately consistent. After you index a document, it becomes searchable only after the next **refresh** (by default every 1 second), although a `GET` by document ID returns the latest version right away. Do not use Elasticsearch as your primary database for transactional data that must be immediately consistent.
</Note>
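The refresh behavior can be pictured with a toy model. The class below is a deliberately simplified sketch (the name `ToyIndex` and its two-dictionary design are assumptions for illustration); it only shows that a newly indexed document is retrievable by ID immediately but appears in search results after a refresh. On a real cluster, passing `refresh=wait_for` on the index request makes the call return only once the document is searchable.

```python
class ToyIndex:
    """Toy model of refresh: writes are buffered until refresh() exposes them to search."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet refreshed
        self._searchable = {}  # visible to search

    def index(self, doc_id, source):
        self._buffer[doc_id] = source

    def get(self, doc_id):
        # Lookups by ID see the latest write, refreshed or not.
        if doc_id in self._buffer:
            return self._buffer[doc_id]
        return self._searchable.get(doc_id)

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search_ids(self):
        return sorted(self._searchable)


idx = ToyIndex()
idx.index("1", {"title": "Wireless Headphones"})
print(idx.search_ids())  # []
print(idx.get("1"))      # {'title': 'Wireless Headphones'}
idx.refresh()
print(idx.search_ids())  # ['1']
```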
