In this article, you'll learn how to design the backend system and APIs of an Uber-like application: the system design of the Uber application.
- There are two types of users for the Uber application: drivers and travelers/riders.
- Drivers need to routinely report their current location to the backend service and flag their availability to pick up travelers. Travelers/riders get to see all the nearby available drivers and can request a ride; nearby drivers are then notified that a rider is waiting to be picked up. Once a driver and rider are matched on a ride, they can continually see each other's current location until the destination is reached. On arriving at the destination, the driver marks the journey complete and becomes available for the next rider.
Account Registration:
- A user can log in to a passenger account as a Passenger,
- or to a driver account as a Driver.
- A passenger should be able to set their current location and search for nearby cabs/drivers.
- Requesting a ride: a passenger should be able to request a ride by setting the start and destination locations.
- The system should show the ETA to the destination.
- Passengers should be able to see their previous trip history.
- A driver should be able to inform the system of their availability and receive ride requests from nearby passengers.
- After the driver accepts a ride request, the passenger should be able to see the driver's details and the ETA of the cab.
- Once the ride is accepted, both driver and passenger should be able to see each other's location.
- The driver should be able to mark the trip as complete.
- Rating/feedback feature.
- The service should be highly available.
- The passenger's search for a cab/driver should time out after a defined threshold, say 60 seconds, after which the passenger has to invoke the search request again.
- The system should be highly scalable.
Life Cycle of Uber Ride/Trip:
- The core APIs would be: getMap, getLocation, findETA, requestTrip, confirmDriverAvailability, acceptTrip, cancelTrip, completeTrip, getTripStatus, updateLocation, rateUser, and getTripHistory.
Capacity Estimation and Constraints:
- Let's assume we have 100M passengers/riders and 1M drivers in total, with 1M passengers and 100K cab drivers active per day. Let's also assume that every active driver sends their current location to the backend system every 5 seconds. When a passenger places a request for a ride, the system should be able to connect to nearby drivers in real time.
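These assumptions translate into a concrete write load. A quick back-of-envelope calculation (all figures are the assumptions above; the 36-byte record size is a hypothetical estimate):

```python
# Back-of-envelope capacity estimation for the driver location-update write path.
# All figures are the assumptions stated in this article, not real Uber numbers.

ACTIVE_DRIVERS_PER_DAY = 100_000   # 100K cab drivers operating per day
UPDATE_INTERVAL_SEC = 5            # each active driver pings every 5 seconds

# Peak write QPS from driver location updates alone:
location_writes_per_sec = ACTIVE_DRIVERS_PER_DAY / UPDATE_INTERVAL_SEC

# Assuming each update is ~36 bytes (driver_id 8B + lat 8B + lon 8B + ts 8B + padding),
# the daily ingest volume would be:
BYTES_PER_UPDATE = 36
daily_updates = ACTIVE_DRIVERS_PER_DAY * (24 * 3600 // UPDATE_INTERVAL_SEC)
daily_ingest_gb = daily_updates * BYTES_PER_UPDATE / 1e9

print(f"location writes/sec: {location_writes_per_sec:,.0f}")  # 20,000 writes/sec
print(f"daily ingest: {daily_ingest_gb:.1f} GB")
```

So even before any rider traffic, the backend must absorb on the order of 20K location writes per second, which is why the location index is kept in a distributed in-memory store rather than a single relational database.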
Uber backend Micro Service Architecture Design:
- There are tons of services supporting each trip, and terabytes of data are generated along the way. When Uber started, it had a monolithic architecture: a backend service, a frontend application, a single database, and a couple of real-time services. The application servers were written in Python, using a Python-based framework. This did not work well when Uber started rolling the service out to different regions, so after 2014 the Uber architecture evolved into a service-oriented architecture, as described below:
- From the API gateway, all the internal endpoints are connected. Each service is an independently deployable unit, so we can work on an individual service without disturbing the others.
- As soon as the customer/passenger opens the app on their phone, it shows a list of nearby available drivers. The passenger then enters the pick-up and drop-off locations and presses the request button.
- The request is sent to the nearby drivers who are available at that moment. Once a driver accepts, the ride is dispatched and the driver's info is shown in the rider's Uber app.
- At this point, two things can happen: the driver and rider continue with the trip, or one of them cancels. If they go forward with the ride, Uber tracks the driver's location in the background, which is used to generate the ride receipt after the trip completes. The rider is then charged based on time and distance, and both driver and rider see the receipt in their apps. Uber completes millions of rides per day, which sounds pretty amazing, but it could also be pretty embarrassing if some part of the system is not efficiently designed.
- I'll describe the system and services at a very high level. First, the user opens the Uber app, which calls the Trip Dispatcher microservice; it queries the cab location index to show all the nearby drivers in the area. These locations are kept up to date by another microservice, the Driver Location Manager. The Trip Dispatcher then calls the ETA Calculator microservice, which calculates the ETA for the driver to appear in front of the rider. Once the trip starts, a Trip Recorder acts as the on-ride component, tracking the driver's GPS signals along with the time and distance of the ride.
- Finally, a Map Matcher and a Price Calculator act as the post-trip components: they take the input from the Trip Recorder, generate the ride receipt, and send it to both passenger and driver.
- Now we have a complete picture of Uber's microservices architecture, at least from the passenger booking point of view. The next question is how the cab location index supports this high volume of reads and writes coming from both riders and drivers. As I mentioned, it is a distributed cache where the cab locations are stored in memory. But how does it do it?
- It is done by selecting the right database sharding strategy to avoid problems like hot partitions. We employ a distributed cache to balance the load of reads and writes, and we store the data in memory to make queries really fast, so a passenger does not have to wait five minutes to find the best cab. The main thing we need to take care of is hot partitions.
System Design of the Uber System:
- Here is the complete architecture for Uber or any taxi-aggregation platform. All the major components are shown, but instead of jumping right into explaining each and every component, I am going to concentrate on one particular component: dispatch optimization.
How the Dispatch System Works on Map and Location Data:
- We have to model our maps and location data properly. Since the Earth is spherical, it is pretty hard to do summarization and approximation using raw latitude and longitude data. To solve this problem, Uber uses the Google S2 library. This library takes the spherical map data and divides it into tiny cells (for example, 1 km × 1 km each). When we join all these cells, we get the complete map.
- Each cell is given a unique ID, which makes it easy to spread this data across a distributed system and store it. Whenever we want to access the data for a particular cell, we can go straight to the server where that data is present. Here, we can use consistent hashing based on the cell ID.
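A minimal sketch of that idea, assuming hashed cell IDs placed on a ring of servers (this is an illustrative toy, not Ringpop or Uber's actual implementation):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Maps cell IDs to servers via consistent hashing (toy sketch)."""

    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` virtual points on the ring for smoother balance.
        self.ring = []  # sorted list of (hash, server)
        for s in servers:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{s}#{v}"), s))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, cell_id):
        # Walk clockwise from the cell's hash to the first server point.
        h = self._hash(str(cell_id))
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.server_for("cell-9"))  # always the same server for the same cell
```

The payoff of consistent hashing is that adding or removing a server only moves roughly 1/N of the cells, instead of reshuffling everything.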
- The Google S2 library can also compute the coverage for any given shape. For example, if we draw a circle on the map and want to figure out all the cabs available inside that circle, we feed the center and radius to the S2 library, and it automatically filters out all the cells that contribute to that circle/area.
- In this way we get all the relevant cell IDs, and from these we can easily filter the data belonging to those cells, which gives us the list of cabs available in that area/circle. We can also calculate ETAs from that.
- So, when we want to match a rider/passenger to a driver, or show the number of cars available in a region, we just draw a circle of about a two-kilometer radius, list all the cabs inside it using the Google S2 library, and then check the ETA for each.
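The cell-covering idea can be approximated with a flat lat/lon grid instead of real S2 cells. Everything below (the 1 km cells, the 111 km-per-degree constant, the cab coordinates) is a simplifying assumption for illustration, not the S2 library's API:

```python
import math

CELL_KM = 1.0       # grid resolution, ~1 km x 1 km as in the article
KM_PER_DEG = 111.0  # rough km per degree of latitude (simplifying assumption)

def cell_id(lat, lon):
    """Map a coordinate to a grid-cell id (simplified stand-in for an S2 cell ID)."""
    return (int(lat * KM_PER_DEG // CELL_KM), int(lon * KM_PER_DEG // CELL_KM))

def covering_cells(lat, lon, radius_km):
    """Return all cell ids that can intersect a circle of radius_km around (lat, lon)."""
    steps = math.ceil(radius_km / CELL_KM)
    cx, cy = cell_id(lat, lon)
    return {(cx + dx, cy + dy)
            for dx in range(-steps, steps + 1)
            for dy in range(-steps, steps + 1)}

# All cabs whose cell falls inside the covering are candidates for the ride request.
cabs = {"cab-1": (12.9716, 77.5946), "cab-2": (12.99, 77.62)}
cover = covering_cells(12.9716, 77.5946, radius_km=2)
nearby = [c for c, (la, lo) in cabs.items() if cell_id(la, lo) in cover]
print(nearby)
```

A real deployment would query the S2 covering against the in-memory cell index described above, but the filtering logic is the same: covering cells first, exact distance/ETA second.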
What is the role of the ETA and how is it checked:
- For example, suppose 4-5 cabs are available within a particular circle of about a two-kilometer radius. We have to calculate each cab's ETA or distance from the passenger.
- Say the passenger is at the center of that circle. We could compute the shortest straight-line distance with any algorithm, but that won't give an accurate ETA: the cab driver can't just drive in a straight line from his point to the rider's place; he has to follow the connected roads. So we have to find the ETA or distance along the road network and figure it out for all the cabs.
- Considering the road path and conditions, the actual ETAs can be calculated. Based on these, we can filter out the cabs suitable for this particular passenger and then send the notification to the chosen drivers.
Explanation of the Functionality of Each Component:
Before jumping into the system design, first understand all the components needed for the Producer/Consumer (Supply/Demand) part and the dispatch optimization component.
WAF & Load balancer:
- As you can see in the image above, the cabs are the producers/supply and the users are the consumers/demand. Every 5 seconds, the cabs keep sending their location data to the Kafka REST API. Every call passes through the Web Application Firewall, then hits the NGINX load balancer (LB), and the location update is pushed to Kafka; the data is then consumed by various components of the Big Data layer.
- A copy of the location is also sent to the database and to the dispatch optimization component to keep the cab's location up to date. We need the Web Application Firewall for security purposes: we can block requests from blocked IPs and from bots, and we can also block requests from regions where Uber has not yet launched.
- Next is the load balancer, which can operate at different layers: layer 3, layer 4, and layer 7. Layer 3 balancing is IP-based, for example routing all IPv4 traffic to one set of servers and IPv6 traffic to a different set. Layer 4 balancing works on transport-level information such as TCP connections and ports, and layer 7 is application-level load balancing.
Kafka REST APIs provide the endpoint that consumes the location data from every cab. If, for example, we have a thousand cabs running in a city and each sends its location every 5 seconds, then every 5 seconds the Uber backend receives a thousand hits, i.e. a thousand location points, through the Kafka API.
That data is buffered and put into Kafka, consumed by different components of the Big Data layer, and a copy of it is saved to a NoSQL DB. While a ride/trip is happening, the latest location is also sent to the dispatch component to keep its state machine updated.
We also have other REST APIs, and I will explain those components as well. An important component here is the WebSocket. Why do we need WebSockets?
Unlike normal HTTP requests, WebSockets are really helpful in this kind of application because we need an asynchronous way of sending messages from client to server and from server to client at any point in time. That means a connection stays established between the cab app (or the rider app) and the server: the WebSocket keeps the connection open to all the applications, and whenever something changes in the dispatch system or any server-side component, data is exchanged back and forth between the application and the server.
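The bidirectional push pattern can be illustrated in-process. The two asyncio queues below stand in for the open WebSocket connection (an assumption made for the sketch; a real deployment would use an actual WebSocket library):

```python
import asyncio

# Sketch of WebSocket-style push: the server can send the client a dispatch
# event at any time over the already-open connection, without the client polling.

async def rider_app(inbox: asyncio.Queue, outbox: asyncio.Queue):
    await outbox.put({"type": "request_ride", "cell": "cell-9"})  # client -> server
    return await inbox.get()                                      # server -> client push

async def dispatch_server(inbox: asyncio.Queue, outbox: asyncio.Queue):
    req = await inbox.get()
    # ... match a driver here, then push the assignment over the open connection
    await outbox.put({"type": "driver_assigned", "driver": "cab-1", "for": req["cell"]})

async def main():
    to_server, to_client = asyncio.Queue(), asyncio.Queue()
    event, _ = await asyncio.gather(rider_app(to_client, to_server),
                                    dispatch_server(to_server, to_client))
    return event

print(asyncio.run(main()))
```

The key property shown here is that neither side blocks waiting on a request/response cycle; either end can send whenever it has something to say, which is exactly what polling-based HTTP cannot do cheaply.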
The Producer-Consumer (Supply-Demand) and WebSocket components are mainly written in Node.js, as Node.js is really good at asynchronous, small-message workloads and is an event-driven framework.
Now let me explain how the dispatch system works. Uber's dispatch system (DISCO) is built using Node.js; the advantage is that, being an asynchronous, event-driven framework, the server can push a message to the application whenever it wants.
Now the next question is: how do we scale the servers inside the DISCO component? To scale them, Uber uses something called Ringpop, which has the functionality below:
- It does consistent hashing to distribute the work between these servers.
- It uses RPC calls to pass work from one server to another when needed.
- It uses the "Gossip" protocol, which lets every server know the other servers' responsibilities. All the servers do the same kind of work, but the responsibility for computing a specific location is assigned to a particular server.
Why do we need the Gossip protocol?
- The advantage of the gossip protocol is that we can easily add and remove servers. When we add a server, responsibility is redistributed to the new server and reduced for the others, and through gossip the rest of the servers learn what work the new server is responsible for.
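Here is a toy simulation of how a fact ("server s5 joined") spreads to every node through random peer-to-peer exchanges, with no central coordinator. This is a sketch of the dissemination idea only, not Ringpop's actual SWIM-based protocol:

```python
import random

# Toy gossip round: each node picks a random peer and the two merge their
# membership views. Knowledge held by one node spreads to all nodes in a few
# rounds, with no central coordinator.

def gossip_round(views):
    for node in list(views):
        peer = random.choice([n for n in views if n != node])
        merged = views[node] | views[peer]
        views[node], views[peer] = merged, merged

views = {f"s{i}": {f"s{i}"} for i in range(5)}  # each node knows only itself
views["s0"].add("s5")                            # only s0 knows the new server yet

rounds = 0
while any("s5" not in members for members in views.values()):
    gossip_round(views)
    rounds += 1
print(f"every node learned about s5 after {rounds} round(s)")
```

Because knowledge roughly doubles its reach each round, convergence takes on the order of log(N) rounds, which is why gossip scales well as servers are added and removed.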
- Now let me explain how the Uber system functions in real time when a passenger places a request for a cab or a trip. The WebSocket holds connections to both the passenger app and the cab app, so when the user places a ride request, the request first lands on the WebSocket, which hands it over to the Demand (Consumer) service.
- The Demand service knows the kind of cab the rider is requesting, for example "I need an Uber mini" or "I need a Sedan". It then asks the Supply (Producer) service for a cab of that kind at a particular location. The Supply service knows the location of the passenger, i.e. the cell ID of the user's position on the map; as I have already mentioned, the Google S2 library breaks the map into tiny cells.
- If the user is present in a particular cell, the system knows that cell's ID. The Demand service supplies the cell ID to the Supply service, which then contacts one of the servers in the ring of servers. With consistent hashing, the responsibility is evenly distributed: suppose, just for example, we have about 10 cells in total, with the responsibility for those cells divided among the available servers.
- Suppose the user is requesting from cell 9. The Supply service knows which server owns cell 9, hits that server, and asks it to find a cab for the passenger present in cell 9. That server draws a circle on the map within some radius, figures out all the cells from which cabs can be found, makes the list of all the cabs in those cells, gets the ETA for each cab using the REST service called the Maps/ETA service, and sorts the cabs based on ETA.
- It returns all this information to the Supply service, and the Supply service, using the WebSocket, sends the request to the first few cabs nearest to the passenger. Whichever driver accepts first gets assigned to that passenger.
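The ranking step above can be sketched as follows; the `road_eta_minutes` stub is a hypothetical stand-in for the Maps/ETA service (a real system would query the road network):

```python
# Given candidate cabs from the nearby cells, rank them by road ETA and offer
# the ride to the closest few drivers first.

def road_eta_minutes(cab_location, rider_location):
    # Stub for the Maps/ETA microservice; Manhattan distance is used here only
    # as a crude proxy for time along a grid of roads.
    (x1, y1), (x2, y2) = cab_location, rider_location
    return (abs(x1 - x2) + abs(y1 - y2)) * 2

def rank_cabs(cabs, rider_location, offer_count=3):
    """Return the cab ids with the lowest ETA, best first."""
    ranked = sorted(cabs.items(),
                    key=lambda kv: road_eta_minutes(kv[1], rider_location))
    return [cab_id for cab_id, _ in ranked[:offer_count]]

cabs = {"cab-1": (2, 3), "cab-2": (0, 1), "cab-3": (5, 5), "cab-4": (1, 0)}
print(rank_cabs(cabs, rider_location=(0, 0)))  # nearest cabs get the offer first
```

The Supply service would then push the request to these top-ranked drivers over their WebSocket connections and assign the ride to whoever accepts first.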
- Sometimes the Supply service won't directly talk to each and every server. It just passes the request to one of the servers, and that server internally hands the request over to the other servers responsible for computing or figuring out the cabs in the relevant cells. The requests are passed from one server to another using RPC calls.
- Once all the servers have figured out the cabs' ETAs, they all respond to the Supply service, and the Supply service takes care of notifying the drivers and matching the demand with the supply.
- Suppose we need to add more servers to the existing dispatch optimization system, for example to handle the traffic from a newly added city, so we add two more servers to the Ringpop ring. Initially the responsibility of these servers is unknown. Ringpop learns the newly added cell IDs from the Maps/ETA component and distributes the responsibility for those cells to the new servers.
- It works exactly the same way when we take a server down: Ringpop reshuffles the cell IDs, re-assigning the responsibility for the computation of each affected cell to one of the remaining servers that is free.
- As I have already mentioned, the Google S2 library is used to break the map into different cells, which makes it easy to locate the cabs near any particular passenger's location; that is the use of the S2 library.
- Now, how do we build maps, or use maps, in the application? Earlier, Uber used Mapbox because of Google's pricing strategy, but now Uber heavily uses Google's map framework: Google Maps to render maps in the mobile apps, and the Google Maps API to calculate the ETA from point A to point B, i.e. from the pick-up point to the destination. Earlier, Uber used to do all of this on its own: it repeatedly traced the cabs' movements and GPS points to build the road network system and calculated ETAs itself. But Uber has since moved on and currently uses Google's libraries heavily.
Finding the Preferred Access Points:
- A preferred access point is the place from which passengers near a given location usually book their cab. Now the question that comes to mind is: how does the system learn it? It learns from the spots where Uber drivers or cabs repeatedly stop, near entry/exit gates, or near shops, malls, or universities.
- Because cabs often can't enter the campus, Uber learns this and automatically shows customers that the cab can only pick up from those specific points. Uber uses different algorithms and machine learning to keep figuring out these preferred access points automatically.
How ETAs are Calculated:
- Now I will explain a little about how ETAs are calculated and why the ETA is such an important component of Uber or any cab-provider service. Suppose, for example, a passenger requests a cab from a particular point and there are, say, 4 cabs available nearby. When the user requests a cab, the Demand service asks the Supply service to figure out cabs for the passenger.
- The service tries to figure out all the cabs which are near this particular passenger: it draws the circle, figures out the 4 cabs that are free to take the ride, and then calculates the ETA from each cab's place to the passenger's place along the road system. But this approach alone doesn't always work well, because it can lead to bigger ETAs than necessary.
- For example, suppose a 5th cab, which is about to finish another ride, is very near the passenger. Since that trip is about to end in a few minutes and this 5th cab is much nearer to the rider, it is obviously a better selection than any of the other 4 cabs. So Uber includes all the different factors, like turn costs, traffic conditions, and so on, to calculate the ETA, and based on these ETAs, sometimes not just the idle cabs but also cabs that are already serving a trip are included in the cab assignment for the passenger.
ETA Calculation (Alternate Way):
For ETA calculation, we can divide this into two major components:
- First is how to compute the route from the passenger’s pickup point to the destination with the least cost
- Second is the estimated time taken to traverse the route.
The main idea here is to convert the map into a graph representation, where each intersection is a vertex and each road segment connecting two vertices is an edge. Then we can employ a routing algorithm such as Dijkstra's to find the shortest path between source and destination. But on a full map graph this is expensive (Dijkstra's algorithm runs in roughly O(E log V) time), so the larger the graph, the longer it takes to find the shortest path between two points.
Here, what we can do is partition the entire graph and precompute the best path within each partition, which significantly brings down the runtime. Once we have the most optimal path to go from point X to point Y, we can use the current traffic information to calculate the ETA. Traffic can be influenced by the time of day, the weather conditions, or any concerts or gatherings happening at that moment; all of these affect the time it takes the driver to get from X to Y, and we model them by adjusting the weights on the edges according to the traffic conditions.
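A minimal sketch of the routing step, assuming a toy graph whose edge weights are traffic-adjusted traversal times in minutes (the graph data is made up for illustration):

```python
import heapq

# Dijkstra's algorithm over a road graph: vertices are intersections, edge
# weights are traversal times (minutes) already adjusted for current traffic.

def shortest_eta(graph, source, dest):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dest:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, minutes in graph.get(node, []):
            nd = d + minutes
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")  # destination unreachable

graph = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 5)],
    "C": [("B", 1), ("D", 8)],
    "D": [],
}
print(shortest_eta(graph, "A", "D"))  # A -> C -> B -> D = 2 + 1 + 5 = 8 minutes
```

The partitioning trick from the paragraph above would precompute `shortest_eta` results inside each partition, so that at query time only the much smaller boundary graph has to be searched.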
Now the passenger is happy, as his ride is booked and he has an ETA for the driver to arrive, so let's dive deeper into the on-trip component. This is the part where the driver presses the "start ride" button and the backend system tracks the trip's current location. We have a message broker that listens to all the events sent by the drivers through the Uber app. We could use any message broker service here, such as Apache Kafka, AWS Kinesis, Azure Event Hubs, or Google Pub/Sub. We can use Kafka streams for our design because Kafka is open source and cost-effective, as it doesn't charge based on the number of messages in the channel.
Once the trip location is sent to the Kafka stream, it gets forwarded to the Trip Recorder service, which writes it to a location store (any NoSQL DB, such as MongoDB). We can use Cassandra, as it is highly scalable: you can keep adding more machine nodes and keep storing data, and we don't want to lose trip details because of a host failure.
Various machine learning models, trained on historical trip data, can also be used to optimize the ETA calculation process and give near-accurate ETAs.
Database Design:
Coming to the database part: earlier, Uber used an RDBMS, the PostgreSQL database, for operations. Driver/passenger profile information, GPS points, everything was saved in that database. It couldn't scale as Uber rolled out the service in different cities, so they built a new NoSQL-style database on top of MySQL, called Schemaless.
When they are building this database below are the points they considered:
- The first one is that it should be horizontally scalable, i.e. you can linearly add capacity by adding nodes in different regions/cities, and all of them together act as one database. If you don't want to design your own, you can use Bigtable, Cassandra, MongoDB, or any other NoSQL DB, since they behave in the same way.
- The second consideration while building Schemaless was write and read availability. As I have already mentioned, every 5 seconds the cabs send their GPS location to the Kafka REST API; those points are sent to Kafka for different processing, written to the NoSQL DB for record purposes, and also sent to the state machines. That means it is a write-heavy application.
- Also, when a user requests a cab, all the latest cab information is fetched from the DB to show to the customer in the application. That means there are tons of reads happening as well as tons of writes. So the system should handle heavy reads and heavy writes, and it should never have downtime; we have never heard of Uber being down even for a minute, because every minute people are requesting cabs, and we can't just say we are doing some maintenance. Hence, the system should always be available, no matter what.
- When Uber launches the service in a new city, it tries to build a data center near that city to give seamless service. Otherwise, the nearest existing data center is selected and the data is served from that location.
Use of Analytics in Uber System:
- Now, coming to the analytics part: what is analytics? In simple words, it is just making sense of the data we have. Uber does that a lot, because it needs to understand the users and the behavior of the cab drivers. With that understanding, we can optimize the system, minimize the cost of operation, and improve user satisfaction.
- So now let's discuss the different tools and frameworks that Uber utilizes to do a lot of analytics. As I have already mentioned, there are tons of GPS data flowing into the system from the drivers, and also a lot of information coming in from the users/passengers. All that data is saved in the NoSQL DB, in the RDBMS, or directly in HDFS. If we are not sending the data directly to HDFS, we can dump all the data from the NoSQL DB and put it onto HDFS.
- Sometimes, for certain kinds of analysis, we need the data in real time, and that data can be consumed directly from Kafka. Let me briefly describe every component of the Big Data layer that you see in the main architectural diagram.
Big Data module (HDFS, Hive, Maps)
- The Hadoop platform has a lot of analytics-related tools that we can use to build analytics on the existing data. We can take a constant dump of the data in the database into HDFS, and on top of that we can use tools like Hive and Pig to extract the data we want from HDFS.
- The next component is the Maps/ETA component. With it, we consume the historical data along with the real-time data, and we can also retrace the previous map data we have; together these let us build new maps and improve the map data we currently hold, combining the historical data with the real-time information coming from the cabs, such as traffic conditions and the cabs' speeds.
- We can use this data to compute ETAs. When there is a request for cabs, the dispatch servers contact these Big Data components for the ETA calculation. Uber is now also using machine learning and simulation-based AI techniques to calculate the ETA with more accuracy.
Challenges to be Considered:
In a slow or intermittently disconnecting network scenario, how will the passenger be able to use the Uber app uninterrupted?
If the network gets disconnected in the middle of a trip, how should the billing for that passenger be handled?
In this blog, I have explained the system design of the Uber backend, along with the microservices architecture that Uber leverages to build backend APIs communicating with each other in a distributed system. Along the way, we covered:
- How to design a microservices-based architecture for any backend application.
- The use of Kafka in real-time data analysis and processing.
- How ETAs are calculated in any cab-provider service like OLA, Uber, etc.
- What geo-spatial design is and how it helps in such applications.
- What the Gossip protocol is and why it is used in distributed systems.