How I Built a High-Performance Geocoding Engine from Scratch

A Practical Guide to Designing a Lightning-Fast Address-to-Location Engine in Rust


During my years working in logistics and automotive, I kept running into the same technical bottleneck. Whenever a route needed planning or a delivery had to be scheduled, there was an address behind it. And that address had to be resolved into coordinates.

This task is known as geocoding. For some systems, a delay of a few hundred milliseconds is acceptable. But in logistics, especially when processing large volumes or optimizing entire fleets, every additional millisecond has an impact. In a typical logistics morning, it’s routine to geocode 1–5 million stops within minutes; shaving even 200–300 µs per lookup can save tens of minutes end-to-end.

I wanted to see what’s possible when you reduce everything to the essentials without relying on heavy infrastructure, complex setups, or oversized hardware. My goal was simple: resolve addresses as fast as possible, using open data, efficient formats, and focused code.

So I built a geocoding engine from scratch. It’s pure Rust, memory-mapped (mmap) and keyed by a minimal-perfect hash (boomphf). The server is read-only at runtime; freshness comes from nightly rebuilds. It processes address queries and returns precise coordinates, with sub-millisecond latency and very low resource usage. The system works without a database, runs entirely on a memory-mapped file, and stays fully read-only during operation.

What is Geocoding?

Geocoding is essentially address translation. Give the geocoder a street address, and it returns the latitude and longitude of that location on Earth. For example, the address “Pariser Platz 1, 10117 Berlin, Germany” (the area of the Brandenburger Tor in Berlin) might geocode to roughly 52.5163° N, 13.3777° E. In everyday life, you use geocoding whenever you type an address into a maps app and see a pin on the map. Geocoding can be forward (address → coordinates) or reverse (coordinates → nearest address). In this article, I want to focus on forward geocoding: given a full address, find its coordinates almost instantly.

But why does speed matter? In logistics, many workflows begin with a batch of addresses that need to be turned into coordinates, often in large volumes. Delivery platforms geocode thousands of stops each morning to prepare optimized routes. Autonomous vehicles receive destination addresses that must be resolved before navigation can begin. And location validation systems check every incoming order or shipment in real time to confirm that the address is accurate and complete. These lookups happen continuously, sometimes at a rate of hundreds of thousands per hour. In such scenarios, even small delays per address accumulate quickly. A geocoder that can respond in microseconds allows these systems to stay responsive at scale, reduce lead times, and support smooth daily operations.

What the Engine Produces

The geocoder takes an address query and responds with a lightweight, consistent result. Each response includes the precise coordinates, a geohash, a rough accuracy estimate in meters, and metadata about the original data source. It also returns a normalized address and a formatted version that’s suitable for display.

For example, a query for:

GET /geocode?addr=Brunnenstraße 107, 13355 Berlin

returns a response like the one below. Note the method field, which indicates how the point was obtained (e.g., “rooftop”, “interpolated”, “postcode_centroid”):

{
  "formatted": "Brunnenstraße 107, 13355 Berlin, Deutschland",
  "components": {
    "house_number": "107",
    "street": "Brunnenstraße",
    "postcode": "13355",
    "city": "Berlin",
    "state": "Berlin",
    "country": "Deutschland"
  },
  "location": {
    "lat": 52.50934168844568,
    "lon": 13.419302240918833,
    "geohash": "u33d9r59h",
    "method": "rooftop",
    "accuracy_m": 3
  },
  "lookup_ns": 320000,
  "source": "osm+oa"
}

This information includes the rooftop-level coordinate, an estimate of the spatial accuracy in meters, a geohash representation of the point, and metadata about the source dataset. This result is typically returned in well under one millisecond, even on standard hardware.

The formatted field represents the human-readable form of the address. The components field holds the structured parts, such as city, postcode, and house number.

accuracy_m is a 95% error‑radius estimate in meters derived from the match method. Typical ranges:
- rooftop/entrance: 2–10 m
- building‑centroid: 5–20 m
- interpolated house number: 20–150 m (short urban blocks → low end; long rural segments → high end)
- street‑level: 20–150 m
- postcode centroid: 500–5,000 m (area‑based)
- city centroid: 1–20 km (area‑based)

The source flag is “osm”, “oa”, or “osm+oa” when both datasets agree.

By keeping the structure predictable and compact, the response can be parsed quickly and reliably by downstream systems, whether it’s used for routing, enrichment, or display.
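To make that concrete, here is a minimal sketch of how a downstream Rust service might model the response with serde; the field names mirror the JSON above, but these types are my own illustration, not part of the engine.

use serde::Deserialize;
use std::collections::HashMap;

// Hypothetical client-side types mirroring the response shown above.
#[derive(Debug, Deserialize)]
struct GeocodeResponse {
    formatted: String,
    components: HashMap<String, String>, // house_number, street, postcode, ...
    location: Location,
    lookup_ns: u64,
    source: String,
}

#[derive(Debug, Deserialize)]
struct Location {
    lat: f64,
    lon: f64,
    geohash: String,
    method: String,
    accuracy_m: u32,
}

// Parse one response body into the structs above.
fn parse_response(body: &str) -> Result<GeocodeResponse, serde_json::Error> {
    serde_json::from_str(body)
}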

What Is a Geohash?

Geohash is a compact way to reference a place on the earth by encoding latitude and longitude into a short string. It works by dividing the world into a grid, then further subdividing each cell and translating each division into a Base32 character. The longer the geohash, the smaller the grid cell — and the more precise the location.

For example, a 5-character geohash represents an area roughly 5 km wide, whereas 8 characters zoom into tens of meters. As you add letters, you zoom in.

A handy feature of geohash is that nearby places often share the same prefix. If two addresses are geographically close, their geohash strings tend to match in the first few characters. This property makes geohash useful not just for encoding. It becomes a lightweight way to group nearby locations together.

To avoid border misses, reverse geocoding scans the target cell and its eight neighbors.

3×3 geohash neighborhood. Search the center cell (prefix p) plus N, NE, E, SE, S, SW, W, NW; perform prefix range scans per cell.
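The scan itself is a small amount of code. Here is a sketch assuming the geohash crate's neighbors helper; the 6-character prefix in the example comment is an illustrative choice:

use geohash::{neighbors, GeohashError};

// Collect the nine cell prefixes to scan: the query point's own cell plus
// its eight neighbors. Each prefix then drives a range scan over the
// geohash-sorted records.
fn cells_to_scan(cell: &str) -> Result<Vec<String>, GeohashError> {
    let n = neighbors(cell)?;
    Ok(vec![
        cell.to_string(),
        n.n, n.ne, n.e, n.se, n.s, n.sw, n.w, n.nw,
    ])
}

// Example: cells_to_scan("u33dc0") yields "u33dc0" plus its eight
// surrounding 6-character cells.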

In my Rust engine, every address record includes its precision-9 geohash. We sort the records by that value. Because neighboring addresses share prefixes, they also sit next to each other in the sorted list. This makes future features like reverse geocoding fast and simple: to find addresses around a point, you compute its geohash, then scan records that share the same prefix. It becomes a search by string range, not spatial math. I haven’t implemented reverse geocoding yet, but plan to add it soon.

Once the geohashes are stored in the sorted blob, close addresses naturally cluster. This spatial clustering gives us a clear path to efficient geographic queries without heavy indexing structures. We can rely on simple string comparisons and let the file layout guide us.

Geohash Precision from World to Street

How Geohash Fits into the Lookup Flow

The previous section covered how geohash encodes location and creates spatial groupings. Now let’s look at how this plays a role inside the actual geocoding engine, both during the build and at runtime.

Each address in the system is stored as a small binary record. It includes latitude and longitude, accuracy in meters, a source flag, and the address parts in both canonical and display form. Alongside this, we store a geohash with precision 9.

Precision 9 covers an area of around 4.7 × 4.7 meters, roughly the size of a small room or entrance. That level of detail is well-suited for rooftop-level results and still keeps the string short and indexable.

Here’s what a few addresses in Berlin might look like after geohashing:

{
  "address": "Brunnenstraße 107, Berlin",
  "geohash": "u33dc0fj0"
},
{
  "address": "Brunnenstraße 110, Berlin",
  "geohash": "u33dc0fj3"
},
{
  "address": "Torstraße 1, Berlin",
  "geohash": "u33dc0fsn"
}

All of them start with the same prefix, u33dc0, which represents a shared area on the grid. When sorted by geohash, they appear next to each other in the file, and that sorted layout is exactly what the engine relies on (it is also what will make reverse geocoding straightforward later). Here are a few example records from the dataset:

1 unter den linden 10117 berlin 52.5164 13.3777 u33db2m65 1 unter den linden 10117 berlin de 1 Unter den Linden 10117 Berlin Berlin DE
1 brandenburger tor 10117 berlin 52.5163 13.3777 u33db2m3e 1 brandenburger tor 10117 berlin de 1 Brandenburger Tor 10117 Berlin Berlin DE
1 hauptstr 10117 berlin 52.5200 13.4050 u33dc0cpp 1 hauptstr 10117 berlin de 1 Hauptstr 10117 Berlin Berlin DE
1 kanzlerstr 09127 chemnitz 50.8325 12.9081 u311jtz2u 1 kanzlerstr 09127 chemnitz de 1 Kanzlerstr 09127 Chemnitz Sachsen DE
5 marienplatz 80331 munchen 48.1372 11.5756 u281z7j5e 5 marienplatz 80331 munchen de 5 Marienplatz 80331 München Bayern DE
107 honinger weg 50969 koln 50.9155 6.94124 u1hctuykq 107 honinger weg 50969 koln de 107 Höninger Weg 50969 Köln Nordrhein-Westfalen DE

During the build step, we assign geohashes to all address records and sort the full list by their value. This ensures that spatially close addresses are also physically close in the final file layout.

Implementation note: many libraries (including Rust geohash) expect (lon, lat) as Coordinate { x, y }. Use precision 9 for rooftop-level cells.
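A minimal sketch of that call, assuming a recent version of the geohash crate (which takes a geo_types Coord re-exported by the crate; older releases name it Coordinate):

use geohash::{encode, Coord, GeohashError};

fn main() -> Result<(), GeohashError> {
    // Note the (lon, lat) order: x is longitude, y is latitude.
    let brandenburger_tor = Coord { x: 13.3777, y: 52.5163 };

    // Precision 9 is roughly a 4.7 m × 4.7 m cell, fine enough for rooftop hits.
    let cell = encode(brandenburger_tor, 9)?;
    println!("{cell}"); // the sample records above list u33db2m3e for this point

    Ok(())
}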

At runtime, this layout enables efficient fallbacks. If a query only includes a postcode or a city, the engine doesn’t need to scan all records. It can take a known coordinate (like a postcode centroid), compute its geohash, and scan only the records that start with the same prefix. That covers a reasonable area and returns nearby results quickly.

Because the file is sorted by geohash, this is just a range scan over a section of the data. There’s no tree traversal or coordinate math required during lookup.
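Under the simplifying assumption that the geohashes sit in a plain sorted array (the real blob stores full records, but the idea is the same), the prefix scan boils down to two binary searches:

// Find every entry whose geohash starts with `prefix` in a slice that is
// already sorted. Matching entries form one contiguous block, so two
// binary searches are enough to bound the range.
fn prefix_range<'a>(sorted: &'a [String], prefix: &str) -> &'a [String] {
    // First search: the first entry that is >= the prefix.
    let start = sorted.partition_point(|g| g.as_str() < prefix);
    // Second search: where the entries stop sharing the prefix.
    let len = sorted[start..].partition_point(|g| g.starts_with(prefix));
    &sorted[start..start + len]
}

fn main() {
    let mut cells = vec![
        "u33dc0fj0".to_string(),
        "u33db2m3e".to_string(),
        "u33dc0fsn".to_string(),
        "u33dc0fj3".to_string(),
    ];
    cells.sort();
    // Everything in the shared u33dc0 cell, nothing else.
    println!("{:?}", prefix_range(&cells, "u33dc0"));
}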

In short, geohash helps us cluster addresses during the build, and limit the search space during lookups. Precision 9 gives enough detail for rooftop matches, and its prefix makes group-based lookups efficient and predictable.

To make all this work, each query goes through a tightly defined flow, from parsing the address to reading a memory-mapped record. The next part starts with how addresses are normalized and looked up using a minimal perfect hash.

How Addresses Are Canonicalized and Indexed

Before any lookup can happen, the incoming address needs to be parsed and brought into a consistent form. This step is important. Street names can vary in casing, spacing, or abbreviations. House numbers might include extra characters. Some users enter “Straße”, others type “Str.” or “Str” or “Strasse”.

To keep lookups fast and reliable, I normalize every address into a canonical format. The parser breaks the input into components (house number, street, postcode, city, state, and country) and applies a few simple rules. At ingest, I use libpostal to expand abbreviations and variants (e.g., Str., Straße, Strasse → brunnenstrasse), lower-case the text, and strip diacritics; at query time, a lightweight parser applies the same normalization to build the canonical key.

For example, the input:

Brunnenstraße 107, 13355 Berlin

…becomes (country | state | city | street | house-number | postcode — all lower-cased, diacritics removed, abbreviations expanded):

de|berlin|berlin|brunnenstrasse|107|13355
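The production pipeline leans on libpostal for the heavy lifting, but a stripped-down sketch of the canonical-key step (hand-rolled character folding instead of libpostal, purely for illustration) looks like this:

// Illustrative stand-in for the real normalizer: lower-case, fold the
// German special characters, and join the parts in a fixed order. The
// production build uses libpostal, which also expands abbreviations.
struct AddressParts<'a> {
    country: &'a str,
    state: &'a str,
    city: &'a str,
    street: &'a str,
    house_number: &'a str,
    postcode: &'a str,
}

fn fold(part: &str) -> String {
    part.to_lowercase()
        .replace('ß', "ss")
        .replace('ä', "a")
        .replace('ö', "o")
        .replace('ü', "u")
}

fn canonical_key(p: &AddressParts) -> String {
    [p.country, p.state, p.city, p.street, p.house_number, p.postcode]
        .map(fold)
        .join("|")
}

fn main() {
    let parts = AddressParts {
        country: "DE",
        state: "Berlin",
        city: "Berlin",
        street: "Brunnenstraße",
        house_number: "107",
        postcode: "13355",
    };
    assert_eq!(
        canonical_key(&parts),
        "de|berlin|berlin|brunnenstrasse|107|13355"
    );
}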

This canonical string is what the engine actually indexes. It serves as the key in a minimal perfect hash function, a structure that maps each unique string to an exact numeric index. No collisions, no lookup trees, and no fallback probing. Because an MPHF returns an index for any input, the engine verifies membership by checking a short fingerprint (or canonical key bytes) stored with the record before using it. If the fingerprint doesn’t match, the lookup is treated as ‘not found’.

The library I used for this is boomphf, which builds a minimal perfect hash from the full set of canonicalized addresses at blob build time. The hash takes just a few bits per entry, and produces an exact lookup index in constant time.
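At build time this is little more than handing boomphf the full key set. A sketch, assuming boomphf's Mphf::new and try_hash, with an illustrative gamma of 1.7:

use boomphf::Mphf;

fn main() {
    // Canonical keys collected during the nightly build (tiny sample here).
    let keys: Vec<String> = vec![
        "de|berlin|berlin|brunnenstrasse|107|13355".to_string(),
        "de|berlin|berlin|unter den linden|1|10117".to_string(),
        "de|bayern|munchen|marienplatz|5|80331".to_string(),
    ];

    // gamma trades a little extra space for faster construction.
    let mphf = Mphf::new(1.7, &keys);

    // Every key from the build set maps to a unique index in 0..keys.len().
    for key in &keys {
        println!("{key} -> {}", mphf.hash(key));
    }

    // Inputs outside the build set may come back as None, but can also map
    // to a valid-looking index, which is why each record carries a
    // fingerprint that gets checked before the result is returned.
    let unknown = "de|berlin|berlin|nonexistent|1|00000".to_string();
    println!("{:?}", mphf.try_hash(&unknown));
}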

At runtime, the engine follows this path:

input string
   ↓
parse and normalize → canonical string
   ↓
MPHF → index
   ↓
offset[index] → byte position in mmap
   ↓
read → candidate record
   ↓
verify fingerprint (or canonical bytes)
   ↓
match → return record     |     no match → not found

Each lookup avoids branching logic and fallback probing: it’s a direct path from string to memory address.

Minimal Perfect Hash Function (MPHF)

This works well because the hash and offset array are both baked into the blob file. Once the hash is loaded, each canonical address string can be mapped directly to its corresponding record location with just two memory accesses.

It also fails fast: if the fingerprint (or canonical‑bytes) comparison fails, the address isn’t in the index.

How the Blob File Is Structured

Every address record is stored inside a single binary file: compact, read-only, and designed for fast access. This file is called the blob.

The blob is built once per day, typically as part of a nightly CI run. It contains everything the lookup engine needs: the hash function, the offsets, and the records themselves.

The structure is simple and linear:

[header]
[MPHF bytes]
[offset array]      // one offset per address
[address records]   // variable-length binary records

Each part plays a clear role:

  • The minimal perfect hash (MPHF) maps the canonical address string to a unique index.
  • The offset array maps that index to a byte position in the file.
  • The record at that position contains all the data needed to answer the query.
Together, these three pieces make up the on-disk blob and define the end-to-end lookup path.

All of this is loaded into memory using mmap(). The memory map makes the file feel like an array in RAM. The operating system handles paging in the background. The application doesn't need to read from disk or manage buffers.

Access becomes a pointer arithmetic operation:

  1. Use the MPHF to get an index.
  2. Use the offset array to get the byte position.
  3. Jump to that position and read the record.
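Here is a simplified sketch of steps 2 and 3, assuming the memmap2 crate and an illustrative layout in which each record begins with a u16 length prefix; the real blob encodes more than that:

use std::fs::File;
use memmap2::Mmap;

// Read one record given its MPHF index. `offsets_start` is where the
// offset array begins (after the header and the serialized MPHF).
fn read_record<'a>(blob: &'a Mmap, offsets_start: usize, idx: usize) -> &'a [u8] {
    // Step 2: fetch the byte position from the offset array (u64 per entry).
    let pos = offsets_start + idx * 8;
    let off = u64::from_le_bytes(blob[pos..pos + 8].try_into().unwrap()) as usize;

    // Step 3: jump there and read the record. A u16 length prefix is assumed
    // here; the real format encodes per-field lengths.
    let len = u16::from_le_bytes(blob[off..off + 2].try_into().unwrap()) as usize;
    &blob[off + 2..off + 2 + len]
}

fn main() -> std::io::Result<()> {
    let file = File::open("addr_index_2025-08-07.blob")?;
    // The OS pages the file in on demand; nothing is read eagerly.
    let blob = unsafe { Mmap::map(&file)? };

    // Step 1 (MPHF -> index) is covered above; 4096 is a placeholder for
    // the computed start of the offset array.
    let record = read_record(&blob, 4096, 42);
    println!("record is {} bytes", record.len());
    Ok(())
}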
Storage Layout

Each address record starts with a small fixed-size header — latitude, longitude, accuracy, source flags, and geohash. After that come the variable-length address parts, both in canonical and display form.

For example, a record includes:

  • lat, lon as f64
  • accuracy_m as u16 meters (0–65,535); two bytes per record keep the accuracy model simple and expressive.
  • source_flags (bitmask: OSM, OA, etc.)
  • geohash (9 bytes, precision-9)
  • fingerprint of the canonical key (e.g., 32–64 bits) or canonical key bytes (length‑prefixed) for exact match verification
  • Address parts: house number, street, postcode, city, state, country (canonical + display)

The fingerprint (or canonical bytes) allows a membership check at read time, which is necessary because an MPHF maps any input to an index.
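As a rough picture of the fixed part (field names and exact widths here are my illustration, not the engine's actual byte layout):

// Illustrative fixed-size portion of one address record; the
// variable-length canonical and display strings follow it in the blob,
// each with a small length prefix.
#[repr(C)]
struct RecordHeader {
    lat: f64,           // WGS84 latitude
    lon: f64,           // WGS84 longitude
    accuracy_m: u16,    // error-radius estimate in meters (0-65,535)
    source_flags: u8,   // bitmask: OSM, OpenAddresses, ...
    geohash: [u8; 9],   // precision-9 cell as ASCII characters
    fingerprint: u64,   // hash of the canonical key, verified at read time
}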

On average, each record is around 110 bytes, small enough to keep the memory footprint tight even with millions of entries.

The blob stays entirely read-only. The server never mutates it. Updates happen by building a new version and swapping it out atomically, for example, via a new file path or environment variable.
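One way to do that swap without interrupting readers is an atomically swappable handle. Here is a sketch assuming the arc-swap crate; the Index struct is a stand-in for the mapped blob plus its MPHF and offsets:

use std::sync::Arc;
use arc_swap::ArcSwap;

// Stand-in for the loaded index; in the real engine this would hold the
// memory map, the MPHF, and the offset-array bounds.
struct Index {
    path: String,
}

// Request handlers call `current.load()` and keep the Arc for the duration
// of one lookup; the nightly refresh calls `store()` with the new blob.
fn swap_in(current: &ArcSwap<Index>, fresh: Index) {
    current.store(Arc::new(fresh));
}

fn main() {
    let current = ArcSwap::from_pointee(Index {
        path: "addr_index_2025-08-06.blob".into(), // yesterday's blob (illustrative)
    });
    swap_in(&current, Index {
        path: "addr_index_2025-08-07.blob".into(),
    });
    println!("serving {}", current.load().path);
}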

This separation between write-time and read-time keeps the design clean. Lookups never block. There’s nothing to lock or cache. The engine just maps the blob, reads memory, and returns a result.

How the Blob Gets Built Every Night

The geocoder doesn’t rely on a live database or background workers. Instead, it uses a prebuilt snapshot of the full address index generated once per day as part of a CI job.

This process starts from scratch every night. It pulls in fresh data, cleans it up, merges everything, and writes out a new blob file.

The steps are fixed and deterministic:

1) Fetch OpenStreetMap extract + diff
2) Download OpenAddresses CSV files per Bundesland
3) extract_osm → osm.tsv
4) extract_oa → oa.tsv
5) merge_and_rank → merged.tsv
6) interpolate → addr.tsv
7) centroid_builder → postcode_centroid.tsv, city_centroid.tsv
8) build_blob → addr_index_YYYY-MM-DD.blob
9) compress + upload to Azure Blob or S3

Each step has a single job:

  • extract_osm pulls structured address nodes and ways from the latest .pbf extract.
  • extract_oa parses OpenAddresses CSVs per state (Berlin, Bayern, etc.).
  • merge_and_rank combines both sources, resolves duplicates, and picks the most accurate version.
  • interpolate fills in missing house numbers from ranges.
  • centroid_builder adds fallback records for postcodes and cities.
  • build_blob writes the actual memory-mapped blob with MPHF, offsets, and records.

The whole pipeline runs nightly on a 16-vCPU CI runner; build times scale roughly with machine size.

The result is a file like:

addr_index_2025-08-07.blob

This file is immutable and versioned by date. The blob is uploaded to Azure Blob Storage or S3. On cold start, the API server downloads the blob and memory-maps it locally. After that, every lookup is local and fast.

This build-once, read-only model keeps the system simple and reliable. Each build is complete and self-contained.

What This Design Enables

Once the blob is built and loaded, the engine is ready to serve lookups with minimal latency and very little resource usage. There’s no warm-up phase, no preloading, and no caching logic – the file structure and the operating system take care of that.

The complete index covers millions of addresses and still fits comfortably in memory at full precision, including all offsets and metadata.

The lookup path stays short:

• Canonicalize the input address.

• Use the hash to get an index.

• Use the index to read an offset.

• Jump to that byte in the memory-mapped file.

• Return the record.

On a warmed server, end-to-end lookups typically finish in a few hundred microseconds (around 200–400 µs p50) with tails under a millisecond, which roughly works out to a few thousand requests per core per second. Right after a restart or blob swap, expect brief millisecond-level spikes until the page cache settles, then it returns to the sub-ms profile.

Because the system doesn’t rely on a live database, scaling is a matter of replicating the blob. Multiple API servers can map the same file and serve requests independently. This keeps deployment simple and avoids infrastructure overhead.

In practice, this setup is well-suited for logistics, mapping platforms, on-device geocoding, or any application that needs fast, predictable address resolution without the weight of a full spatial database.

Closing Thoughts

This project started as a practical challenge: how fast can an address lookup get if you remove everything that doesn’t need to be there?

What I ended up with wasn’t just a faster geocoder. For me, it was a reminder of how far you can go with a small, well-shaped system. No layers, no background queues, no service orchestration. Just clear data, focused transformations, and a file you can memory-map.

The work also gave me a better sense of how data wants to be laid out when it’s read much more than it’s written. Sorting by geohash, writing once, serving many, these ideas aren’t new, but putting them together for this use case felt like the right tool meeting the right shape of problem.

And maybe that’s the real outcome here: not just speed or throughput, but clarity. A simple build step. A direct read path. No surprises during lookup.

Geocoding, like many things in logistics, tends to disappear into the background when it works. But underneath that smoothness, there’s room for good engineering even at the level of byte layouts and prefix strings.

This is the kind of work I enjoy: sharp edges, measurable results, and a clear sense of done.

Cheers!

This article was originally published here: https://levelup.gitconnected.com/how-i-built-a-high-performance-geocoding-engine-from-scratch-17449df01315
