NLP for Unstructured Urban Data: Mapping Local Markers to GPS
When the Address Is "Near the Red Water Tank"
Every logistics executive who has worked on urban delivery in the Global South has hit the same wall. The customer's address reads: "House no. 4, near the red water tank, behind the masjid, opposite Raju's shop." No pin code. No street name. Certainly no latitude or longitude.
This is not an edge case. In Mumbai, Lagos, Dhaka, Nairobi and many other cities, the majority of residential addresses are not geocoded. Structures of uncertain legal status and densely populated neighborhoods that are home to millions of people are invisible to the digital world of coordinates and street addresses. The reference points residents use are specific to their communities and often unknown to delivery executives: the red water tank on the corner, the blue gate on the main road, the traffic light that has been broken for the last 11 years. All of these are perfect reference points for the resident, yet none is represented in the digital infrastructure of an address.
Identifying and classifying these addresses (often labelled "invalid" or "hard to read") at scale is one of the biggest engineering challenges for any last-mile delivery company. It is also where we are seeing great success with our NLP and logistics AI technology.
Why Standard Geocoding Fails Here
Most geocoding workflows assume that the input arrives in a structured format: door number, street name, postal code and city. Most of the time that assumption holds, and results can be taken directly from the Google Maps API, the HERE API or an OpenStreetMap-based geocoder such as Nominatim. Landmark-based addressing fundamentally breaks those assumptions.
Consider what a standard geocoding API does with "third lane after the overbridge, blue building, ground floor". The API either returns nothing, matches to the nearest named street with very low confidence or returns a plausible-looking result that is 400 meters out of position.
At scale, even a 5% error rate becomes critical. With 10,000 delivery points processed every day, it means 500 stops a day pinned to the wrong place on the map, with a driver sent to each of them. The effects are not minor. In a crowded urban last-mile network, small positioning errors compound into routing inaccuracy, and many of them end in failed deliveries, re-deliveries, customer dissatisfaction and excess fuel consumption. All of these costs eat into margins that are already thin in the highly competitive last-mile delivery business.
The engineering challenge here is not so much about geocoding better but about creating a system that understands the semantic geography of an informal urban space - a fundamentally different mapping challenge.
The NLP Architecture: From Landmark Text to Coordinates
Solving this requires a multi-stage pipeline, with each stage handling a different layer of the address parsing problem.
Stage 1: Entity Recognition and Landmark Extraction
The first task is identifying what kind of information exists in the raw address string. A transformer-based Named Entity Recognition (NER) model, fine-tuned on a corpus of informal address data, can classify tokens into categories: physical landmarks (water tank, temple, school), directional modifiers (near, opposite, behind, next to), distance qualifiers (100 meters, two minutes' walk), and structural descriptors (blue building, red gate, broken wall).
This is a far cry from standard NER tasks. The entities are hyperlocal. The language is often mixed: code-switching between English and Hindi, Tamil or Swahili is quite common. The entities are also context-dependent. A phrase like "near the park" might refer to a park that was demolished ten years ago but whose name survives as a location in local memory.
Training data is the critical bottleneck here. Libera's approach, forged through ElasticRun's years of operating India's largest logistics network across 2,400+ warehouses, involves collecting and labelling delivery attempt data at scale — driver GPS traces, successful delivery confirmations, and failed attempt notes — to build a training corpus that reflects how addresses actually work in the field.
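To make the token categories from Stage 1 concrete, here is a minimal sketch in Python. It is not the fine-tuned transformer described above: it swaps in a plain lexicon-and-regex tagger purely to show the shape of the output. The lexicons, the label names and the function tag_address are all hypothetical.

```python
import re

# Hypothetical mini-lexicons, for illustration only. A production system
# would learn these categories from labelled delivery data instead.
LANDMARKS = {"water tank", "temple", "school", "masjid", "shop"}
DIRECTIONS = {"next to", "near", "opposite", "behind"}
DESCRIPTORS = {"red", "blue", "broken"}
DISTANCE_RE = re.compile(r"\b\d+\s*(?:meters?|metres?|m)\b")


def tag_address(text: str) -> list[tuple[str, str]]:
    """Return (span, label) pairs in the order they appear in the address."""
    lowered = text.lower()
    found = []  # (start offset, matched text, label)
    lexicons = (
        ("LANDMARK", LANDMARKS),
        ("DIRECTION", DIRECTIONS),
        ("DESCRIPTOR", DESCRIPTORS),
    )
    for label, phrases in lexicons:
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, lowered):
                found.append((match.start(), match.group(), label))
    for match in DISTANCE_RE.finditer(lowered):
        found.append((match.start(), match.group(), "DISTANCE"))
    found.sort()  # order by position in the string
    return [(span, label) for _, span, label in found]
```

Calling tag_address on the address from the opening section yields the directional modifiers, the descriptor "red" and the landmarks as separate labelled spans, which is the input the later stages need. A lexicon like this cannot handle code-switched or misspelled text, which is exactly why the article's approach relies on a model trained on field data.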
Stage 2: Landmark Disambiguation and Knowledge Graph Lookup
So, now that we have our entities, the system needs to find where they are located. To do this we use a locally enriched knowledge graph.
You won’t find ‘Raju’s shop’ or the ‘red water tank near Govindpuri metro’ in your average map database. But you will, in the knowledge graph of our crowdsourced delivery logistics platform, built from operational delivery history, OSM contributions, drivers' waypoint updates, and satellite imagery annotations. Every single successful delivery to a landmark-referenced address becomes a node in this graph.
The disambiguation layer combines proximity scoring with semantic similarity. When an address string includes the phrase "near the Hanuman temple, Sector 7", we retrieve every Hanuman temple entry in the knowledge graph within the candidate geographic area. We then filter them by historical delivery performance at each location, the semantic probability of the phrase given the other words in the address string, and the distance from the customer's last known location.
This is AI in supply chain management applied at the microgeographic level, using accumulated operational intelligence to resolve ambiguity that no static map database could handle. The real world is too unpredictable to model fully in advance, and no amount of up-front data modelling can represent it completely.
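The three signals named in the disambiguation step can be combined in many ways. The sketch below assumes a simple weighted sum, which the article does not specify. The Candidate fields, the weights and the 500-meter decay scale are hypothetical values chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lat: float
    lon: float
    delivery_success_rate: float  # share of past deliveries that succeeded, 0..1
    semantic_score: float         # fit with the rest of the address string, 0..1


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def rank_candidates(candidates, last_known, weights=(0.4, 0.4, 0.2), scale_m=500.0):
    """Sort candidates best-first by a weighted sum of the three signals.

    The weights and scale_m are assumptions, not values from the article.
    """
    w_history, w_semantic, w_proximity = weights
    lat0, lon0 = last_known

    def score(c):
        # Proximity decays from 1.0 (same spot) toward 0.0 as distance grows.
        proximity = math.exp(-haversine_m(lat0, lon0, c.lat, c.lon) / scale_m)
        return (w_history * c.delivery_success_rate
                + w_semantic * c.semantic_score
                + w_proximity * proximity)

    return sorted(candidates, key=score, reverse=True)
```

In practice the weights would themselves be learned from delivery outcomes rather than fixed by hand.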
Stage 3: Relative Positioning and Coordinate Inference
This stage converts direction and distance words into a coordinate offset. Take "100 meters behind the temple". "Behind" could be resolved relative to north or relative to the road the temple sits on, depending on context. The distance is loose too: in casual language, a stated "100 meters" can turn out to be as much as 80 meters or as little as 30.
This is where calibration curves learned from delivery outcomes come into play. The coordinates inferred for a micro-zone carry a systematic bias, and the curves correct it using the GPS traces of drivers who completed successful deliveries there. So when we see "opposite school" in Dwarka Sector 4, it is unlikely to mean the exact point across the road from the school's main gate. It more probably means a point 40 meters to the northeast of that gate.
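A calibrated phrase can be thought of as a learned (bearing, distance) pair applied to the landmark's coordinate. The sketch below assumes that representation and uses a flat-earth approximation, which is adequate over tens of meters. The function names and the calibration table are hypothetical, and the single entry simply encodes the "40 meters to the northeast" example above.

```python
import math

EARTH_RADIUS_M = 6_371_000


def offset_point(lat, lon, bearing_deg, distance_m):
    """Move distance_m from (lat, lon) along bearing_deg (0 = north, 90 = east).

    Uses a local flat-earth approximation, fine for short offsets.
    """
    bearing = math.radians(bearing_deg)
    dlat = distance_m * math.cos(bearing) / EARTH_RADIUS_M
    dlon = distance_m * math.sin(bearing) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def resolve_phrase(anchor, phrase, calibration):
    """Apply the calibrated (bearing, distance) for a phrase to an anchor point."""
    bearing_deg, distance_m = calibration[phrase]
    return offset_point(anchor[0], anchor[1], bearing_deg, distance_m)


# Hypothetical per-micro-zone table: "opposite" has been observed to mean
# roughly 40 m to the northeast (bearing 45 degrees) of the landmark.
CALIBRATION = {"opposite": (45.0, 40.0)}
```

For an anchor gate at an illustrative coordinate such as (28.59, 77.05), resolve_phrase(anchor, "opposite", CALIBRATION) returns a point slightly north and east of it.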
The result is a probability distribution over candidate coordinates, not a single point. The highest-confidence candidate is passed to the navigation system, but the uncertainty score is also surfaced. That matters operationally, because a delivery with high coordinate uncertainty should trigger a driver call or WhatsApp confirmation before the trip begins, rather than after a failed attempt.
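One way to turn raw candidate scores into such a distribution is a softmax, with uncertainty taken as one minus the top probability. Both choices are assumptions made for this sketch, as is the 0.5 confirmation threshold; the article does not say how its uncertainty score is computed.

```python
import math


def to_distribution(scores):
    """Softmax: convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def pick_with_uncertainty(candidates, scores, confirm_above=0.5):
    """Return the best candidate plus an uncertainty flag.

    confirm_above is a hypothetical threshold: above it, the order should
    be confirmed with the customer before dispatch.
    """
    probs = to_distribution(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    uncertainty = 1.0 - probs[best]
    return {
        "coordinate": candidates[best],
        "uncertainty": uncertainty,
        "needs_confirmation": uncertainty > confirm_above,
    }
```

When one candidate clearly dominates, the flag stays off and the coordinate goes straight to navigation. When several candidates score alike, the flag is raised before the trip begins.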
Stage 4: Feedback Loop and Continuous Learning
The architecture is only as good as its feedback mechanism. Every delivery outcome, whether a first-attempt success, a retry or an address that could not be found at all, can be treated as a valid training example. The GPS coordinate captured at the moment of a successful Electronic Proof of Delivery (ePOD) is arguably the "gold standard" of feedback, because it records where a pickup or delivery actually took place for a given address string.
The whole system forms a closed loop of supply chain automation: the more transactions it handles, the better the AI works. A system that has handled its 10,000th delivery to "near the red water tank, Govindpuri" will be significantly better at the job than it was after the first 100 deliveries to the same place.
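The closed loop can be illustrated with something much simpler than model retraining: a running average of ePOD coordinates per normalized address string. This is a stand-in for the learning described above, not a description of it, and the class and method names are hypothetical.

```python
from collections import defaultdict


class AddressMemory:
    """Keeps a running mean of confirmed delivery coordinates per address."""

    def __init__(self):
        # Each entry holds [delivery count, mean latitude, mean longitude].
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])

    @staticmethod
    def _key(address):
        # Normalize case and whitespace so trivial variants share one entry.
        return " ".join(address.lower().split())

    def record_epod(self, address, lat, lon):
        """Fold one successful-delivery GPS fix into the running mean."""
        stats = self._stats[self._key(address)]
        stats[0] += 1
        stats[1] += (lat - stats[1]) / stats[0]
        stats[2] += (lon - stats[2]) / stats[0]

    def estimate(self, address):
        """Return (lat, lon, count) or None if the address is unseen."""
        stats = self._stats.get(self._key(address))
        if stats is None:
            return None
        return stats[1], stats[2], stats[0]
```

The count returned by estimate doubles as a crude confidence signal: an address seen thousands of times deserves more trust than one seen once.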
The Micro-Sector Layer: Beyond PIN Codes
One of the key decisions in designing last-mile delivery solutions for informal urban zones is to stop using the PIN code as the unit of geography. A PIN code in an Indian city can cover 5-15 square kilometres, while the actual delivery area within an urban cluster typically spans around 200 meters. A unit that coarse cannot serve as the basic geography for routing.
Libera's Address AI engine uses micro-sector segmentation. Micro-sectors are segments of the road network, typically 50-200 meters in length, derived from a combination of delivery clustering analysis, road network topology and landmark density. Every derived address is associated with a micro-sector, and drivers are assigned routes at the micro-sector level rather than at the PIN code level.
This has a direct impact on the efficiency of route planning software and dispatch optimization. When delivery zones are defined at the micro-sector level, route optimization algorithms can generate sequences that actually match how drivers navigate informal urban space: by landmark progression and beat familiarity, not by abstract coordinate minimization.
The result is a shorter receive-to-out-on-road time, fewer re-deliveries and greater driver confidence in reaching some of the city's toughest slum communities. A driver services an assigned micro-sector more quickly and effectively when navigating by contextual reference points than when forced to rely on street-level data that can contradict real-world conditions. When the Libera system has enough information to pinpoint drop-off and pick-up points from recognizable objects such as vendors, trees, concrete barriers and walls, it does not override those local data points with street-level data.
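Associating a derived coordinate with a micro-sector can be sketched as a nearest-centroid lookup. The article derives micro-sectors from clustering, road topology and landmark density; the version below assumes the sectors already exist and are summarized by a centroid each. The sector IDs, coordinates and the 200-meter cutoff are hypothetical.

```python
import math


def approx_distance_m(a, b):
    """Equirectangular distance in meters; accurate enough at city scale."""
    lat1, lon1 = a
    lat2, lon2 = b
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(x, y)


def assign_micro_sector(point, sector_centroids, max_distance_m=200.0):
    """Return the ID of the nearest micro-sector, or None if none is close.

    sector_centroids maps a sector ID to its (lat, lon) centroid.
    """
    best_id, best_distance = None, float("inf")
    for sector_id, centroid in sector_centroids.items():
        distance = approx_distance_m(point, centroid)
        if distance < best_distance:
            best_id, best_distance = sector_id, distance
    return best_id if best_distance <= max_distance_m else None
```

Returning None for a far-away point is deliberate: an address that fits no known micro-sector is better flagged for review than forced into the wrong beat.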
Integration with the Broader Supply Chain Stack
Address intelligence technology is not a standalone tracking system. To realize its full potential, it must be integrated into the broader logistics management system and supply chain control tower architecture.
Libera's address parsing engine is embedded in the platform's order management layer and runs at booking confirmation. High-confidence addresses are auto-geocoded and pushed through to the routing algorithm. Low-confidence addresses are routed to a verification step before dispatch: either a map pin confirmation sent to the customer via WhatsApp, or a masked call between the driver and the customer.
This integration into the real-time tracking and exception-handling layer ensures that address uncertainty is surfaced and resolved proactively, not reactively. The cost difference between resolving an ambiguous address at booking time versus resolving it at the point of failed delivery is significant: one is a 90-second interaction, the other is a re-delivery cycle with full associated costs.
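The booking-time branching described in this section reduces to a small decision function. The sketch below is illustrative only: the 0.8 threshold, the enum names and the rule for choosing between WhatsApp and a masked call are assumptions, not details taken from the Libera platform.

```python
from enum import Enum


class NextStep(Enum):
    AUTO_GEOCODE = "auto_geocode"
    WHATSAPP_PIN = "whatsapp_pin_confirmation"
    MASKED_CALL = "masked_driver_call"


def route_order(confidence, customer_has_whatsapp, auto_threshold=0.8):
    """Decide what happens to an address at booking confirmation.

    confidence is the parser's 0..1 score for the resolved coordinate.
    auto_threshold is a hypothetical cutoff for skipping verification.
    """
    if confidence >= auto_threshold:
        return NextStep.AUTO_GEOCODE
    # Low confidence: verify before dispatch. Preferring WhatsApp when it
    # is available is an assumption made for this sketch.
    if customer_has_whatsapp:
        return NextStep.WHATSAPP_PIN
    return NextStep.MASKED_CALL
```

Keeping the decision in one place makes the threshold easy to tune against the cost trade-off above: a short confirmation at booking versus a full re-delivery cycle.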
The Larger Significance
There is something worth acknowledging beyond the engineering: this problem matters for reasons that go beyond operational efficiency metrics.
Informal settlements are highly prevalent in many cities across Asia, Africa and Latin America. Millions of their residents lack a formal address and are therefore excluded from the digital value chains that depend on this infrastructure. Building natural language processing systems that understand geography as these residents describe it is not just about improving logistics; it is about enabling access to the modern economy for some of the poorest people in the world. Where formal addressing has long been a barrier to participation in the economy, it is important that those designing new systems consider the impact on the excluded.
The red water tank is not an obstacle to precision navigation. With the right architecture, it is the precision. The engineering challenge is building systems smart enough to understand that.
Libera, powered by ElasticRun, delivers AI-native logistics infrastructure purpose-built for the complexity of real urban environments. Explore our Address AI and last mile delivery capabilities at libera.run.