Nuances and Pitfalls in building scalable Header platforms for e-commerce websites
2K views
Oct 30, 2023
This session highlights the nuances of building header platforms for e-commerce websites. Headers are usually the most important widget on e-commerce sites, as they enable users to view their cart, use global navigation, sign in/sign out, view notifications, etc., specifically when dealing with scale, resiliency, and availability. Software Architecture Conference 2023 (https://softwarearchitecture.live/) #SoftwareArchitectureConf23 #CSharpTV #csharpcorner
0:00
Hi, my name is Damodran. I'm an engineer with eBay and I work mostly on seller- and buyer-focused
0:05
experiences for eBay and all things web. I work predominantly on JavaScript and yeah
0:11
I'd also like to say my thanks to the organizers of this conference for picking my talk when I
0:19
submitted it. And yeah, let's just get started. So the topic for today was nuances and pitfalls
0:25
in building scalable header platforms for e-commerce websites. So the first thing that
0:29
I mean, I normally get from people is what header, what is that
0:33
I mean, is that so complex? Right. So headers are the first widgets that show up on any given page
0:40
Like, say, for example, if we're now talking about e-commerce websites, say, for example, like eBay
0:46
the snapshot that you see here is an example of the header widget that shows up on MWeb, on apps and on DWeb and so on
0:55
So like what is so complex about it? just a text box, a couple of buttons and so on
1:00
But at first glance, that's what it might look like. But when you take a deeper look like it
1:06
it actually brings in a lot of functionality. The first and foremost thing is that your platform users
1:12
would not be able to sign in and sign out if this thing is not going to show up;
1:16
they would not be able to do their searches on the site if it doesn't show up
1:20
And all those quick links and the important links that actually have all those core functionalities
1:25
hidden inside your website, say for example if you're an e-commerce website and you want users to buy and sell all those
1:31
quick links that are there would not show up if the header doesn't show up and again all your site
1:36
navigation for the entire site if they want to be accessing the site through specific categories
1:43
that would not be possible again if the header did not show up and again things like your profile
1:50
pages, your account settings pages, your site-wide settings. Say, for example, you want to
1:56
set a specific shipping provider, or you want to set a specific currency, or you want to set a
2:01
specific language, those things would not be available again, you would not be able to view
2:05
your watchlist, see your notifications, access your shopping cart, all of that. So at first glance
2:12
it might look very simple. But when you take a closer look, you really understand like
2:17
how much of core functionality this one small part of the page actually encompasses fully
2:23
and when we talk of header platforms, we basically talk of the entire project of the header that is
2:31
maintained by specific teams that run these projects and in the talk today, we'll just see
2:38
like how eBay went about scaling this and just building this. So when we talk of header components
2:44
there are visual components. I mean, the ones that users can actually click access, I mean
2:49
interact with and so on, the snapshots that you see on the screen right now. But there are also a lot of non-visual components because the header platform is
2:57
something that is like cross-functional and is going to be horizontal and present
3:01
across all pages. So it also serves as a mechanism for you to be able to send
3:05
stuff across all your pages on the website. Say, for example, you want to send a
3:10
specific script that has to load across pages, then the header platform is a good
3:14
candidate for actually serving that. So what exactly happens is these non-visual components are
3:22
things like service workers, things like HTTP clients, service clients, those scripts that would help you manipulate
3:30
cookies and all other partner platform integrations. Like say, for example, when you're talking of a website or
3:37
eBay scale, there's a lot of things that happen behind the scenes. Like if you just scaffold an app, the first thing
3:43
you would have to deal with is tracking, experimentation, rate limiter, and all of these things
3:48
So you don't really want your app to be integrating with every single of these platforms that are
3:53
there within eBay. Instead, you give this responsibility to the header platform, and that
3:57
takes care of one single horizontal function that is present across all pages. And it takes care of
4:03
doing all these kinds of interactions and integrations behind the scenes. And those are the
4:08
non-visual components that we're talking about. When we're actually going about formulating an architecture for this kind of a setup
4:18
we also have to take into account the scale that we're operating. For example, if you take a look at eBay, for instance
4:25
there's over 600 plus apps on production right now, at least front-end apps, and there's over 100 teams
4:31
that are actually maintaining all these apps. They are on multiple stacks and frameworks
4:36
and there's like so many different partner platforms that you'll have to integrate with
4:41
and since the header is present across all pages, like at the time of peak traffic, it can
4:46
clock close to a billion impressions in traffic and the most important part there is that since
4:52
it's the first widget that shows up on the screen and it's above the fold you just have a window of
4:56
25 to 30 milliseconds to operate so your header has to show up within that amount of time for
5:02
actually improving perceived performance. So having all these in mind, we have to actually take care of a lot of things
5:11
when we design this kind of a system. The first thing is that the availability of the system
5:17
you don't want the header to go down because it actually brings in the most crucial functionality
5:22
as we saw earlier. And you would absolutely have to have at least near-zero downtime,
5:30
or the usual scale of 99.999% availability. And I mean, obviously you need to have
5:38
a good amount of fault tolerance. If there are any errors, you don't want the system
5:42
to just go boom on the screen. And you would want to show some sort of a fallback there
5:50
And you also want something like isolation from failures. You don't want any errors from the domain teams
5:57
to creep into the header or the other way around. You want these teams to be operating independently
6:03
You're talking about a scale where there's over a 100 different teams that manage
6:09
so many different apps in production and they have their own release cycles and you don't want your release cycles
6:15
to be tied with them and so on. They have to be independent
6:19
In the event there is any error, you would want the recovery time to be very minimal
6:26
And as usual, we need to have a fast response time of about 20 to 30 milliseconds
6:30
And we obviously know the scale of traffic we're dealing with. And we also need to take care of all the other things like tracking
6:37
experimentation, testing, being able to debug if there are issues. And we also need this to be front end framework agnostic
6:44
because when you talk of so many different teams, you have so many different I mean, engineers who have their own opinions and they have their own frameworks
6:52
Like, there are people who work on React. There are people who work on something called Marko, which is actually open-sourced by eBay
6:58
And there are people who are working on Angular and so on. So you would want your app to be framework agnostic and not be tied to all these things, which are, I mean, all these choices made by the engineers of the domain pages
7:13
And obviously, you would also have to account for personalization, which and we'll just get to that in a moment
7:19
Even before we start, right, I just wanted to go over the structure of a generic eBay app. So eBay operates in the usual microservice architecture and it makes use of Process Manager 2 (PM2), wherein every box in production would have about one master process
7:35
followed by three worker processes. And eBay believes in general about MPAs, which are multi-page apps
7:42
They are not heavy into single-page applications because of the various performance implications
7:47
and SEO implications associated with single-page apps. So the apps are mostly multi-page applications and this is the usual structure of an app in
7:57
eBay and we are like heavily invested into Node.js. There are of course some apps that are very old
8:03
that are still on the Java stack but all the new apps that are I mean are getting built are again
8:09
on the Node.js stack and this is how the stack would pretty much look like. So when a request
8:13
comes into like one of these servers it would go through a set of pipeline handlers and these
8:19
Handlers, they actually set the eBay context. They do a lot of these calls behind the scenes
8:25
like the tracking calls, the authentication calls, authorization calls, fetching apps, secrets
8:29
tokens, all of that. And at the end of it all, the request comes to your app code
8:34
And at this point, you want to be able to render the header. So what we do is we first send the
8:41
call for the header render, I mean, well ahead of the pipeline, so that the call is actually
8:48
resolved at the time it comes to the app code. So that is the first thing we do.
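To make that flow concrete, here is a minimal sketch of the idea, assuming a generic Express-style Node.js app; the fetchHeaderHtml and setEbayContext helpers are hypothetical stand-ins, not eBay's actual pipeline handlers.

```js
// Hypothetical sketch of the request pipeline in an Express-style Node.js app.
const express = require('express');
const app = express();

// Kick off the header render as early as possible in the pipeline,
// so the promise is (usually) resolved by the time the app code needs it.
app.use((req, res, next) => {
  req.headerPromise = fetchHeaderHtml(req); // fire and hold, do not await here
  next();
});

// Other pipeline handlers: context, auth, tracking, secrets, tokens, etc.
app.use((req, res, next) => { setEbayContext(req); next(); });

// App code: by now the header call has had the whole pipeline's time to resolve.
app.get('/', async (req, res) => {
  const headerHtml = await req.headerPromise;
  res.send(`<!doctype html><html><body>${headerHtml}<main>...page body...</main></body></html>`);
});

// Illustrative stubs so the sketch runs on its own.
async function fetchHeaderHtml(req) { return '<header>eBay header</header>'; }
function setEbayContext(req) { req.ebayContext = { locale: 'en-US' }; }

app.listen(3000);
```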
8:53
And we also have, as usual, NPM modules, because we're using a Node.js-based architecture. We do have our own
8:59
I mean, the header NPM adapter module. And we'll see how that factors into the entire architecture
9:06
in a bit. And I mean, the next thing we want to talk about is render speed. As I mentioned earlier
9:11
eBay is mostly into MPAs, which are multi-page applications. And the bare minimum criteria
9:16
for any framework that is there inside eBay is for it to be able to do server-side rendering
9:21
It should also be able to stream responses and it should be able to do something called
9:26
progressive rendering, wherein, in this example, right, you see this, I mean, this image on the left side
9:32
which waits for all the services to complete and then it loads the page whereas the image on the
9:37
right side is an example where as the services finish and respond you are going to just render
9:43
that fragment of the template and flush it to the browser so you see an improvement in terms of
9:48
perceived performance wherein on the left side there's nothing on the screen it just
9:52
waits for everything to load, whereas the one on the right side starts loading stuff immediately.
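As a rough illustration of that progressive, streamed rendering, here is a minimal plain-Node.js sketch; getHeaderHtml and getSearchResults are hypothetical stand-ins for a fast cached fragment and a slower downstream service.

```js
// A minimal sketch of progressive (streamed) rendering with plain Node.js.
const http = require('http');

const getHeaderHtml = async () => '<header>eBay header</header>'; // fast, cached fragment
const getSearchResults = () =>
  new Promise(resolve =>
    setTimeout(() => resolve('<ul><li>item 1</li><li>item 2</li></ul>'), 500)); // slow service

http.createServer(async (req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  // Flush the above-the-fold header fragment immediately...
  res.write('<!doctype html><html><body>');
  res.write(await getHeaderHtml());
  // ...then flush the rest of the page as the slower services respond.
  res.write(await getSearchResults());
  res.end('</body></html>');
}).listen(3000);
```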
9:57
And you want the header to load in such a way that it is able to show up within about 25 to
10:02
30 milliseconds, immediately. The next thing is that we were talking about scale, and we also need to
10:09
have the header absolutely available. So in the earlier screens, I spoke about the header having
10:15
visual and non-visual components and now the visual components can be further split into static
10:21
fragments and dynamic fragments so the thing here is that static fragments are the ones that show up
10:27
pretty much common to a large segment of users, whereas dynamic fragments are very
10:32
personalized data say for example your shopping cart is going to have a different sort of items
10:37
from what I might have in my shopping cart and so on. So the important thing is that you would be able to access dynamic fragments
10:45
only when you're able to have the static fragments show up. So only if I have the static header to show up
10:53
I would be able to hover over the cart icon to be able to see
10:56
all the items in the shopping cart and so on. So this kind of a diff is actually driven by the UI and the UX itself
11:07
And this actually factors a lot into the kind of architecture we go for this. So how do we go about scaling the static portions
11:14
That's one thing. So the thing is that if you are able to generalize the header
11:19
but still be able to differentiate it based on some specific aspects
11:24
then you would be able to serve the same kind of header for millions of users belonging to that specific segment
11:29
So when we talk about aspects here, what we are talking about is, say for example, eBay has a US site and a UK site
11:35
They do have different pages like the search page, home page, the item pages, the product pages and so on
11:42
They do have different variants. They have to serve the header on M-Web. They also have to serve it on D-Web
11:47
eBay also has its own design system, which is operated by a framework called Skin
11:53
and since there are so many pages and only a handful amount of teams that are available to operate it
12:00
you'd obviously see that some pages are on an older version of design systems
12:04
and some are on the latest version of the design system. So you would not want to serve the latest header
12:12
onto a page that is having a very old design system. They don't blend homogeneously and would just stand out
12:18
and it could look very awkward. So the other thing is that we also have different flavors of the header
12:24
We have the simple flavor, which is the one that you see on the right side
12:29
and the prominent one, which is called the full flavor, which you see on the left side with the prominent search box
12:34
The thing is that when you're on the search page, the search functionality is the most prominent thing that has to show up
12:39
But if you're on the shopping cart page, the goal of the intent is for that transaction to be converted completely
12:47
You don't want to distract the user by showing up a prominent search box there and again, wanting them to search
12:54
So based on these aspects, the site, the pages, the variants, the version of the design system themes and things like the header flavors
13:03
and the languages, we would be able to formulate a key from these aspects
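As a sketch of what formulating such a key could look like, the aspect names and values below are illustrative assumptions, not eBay's actual schema.

```js
// A sketch of formulating a cache key from the aspects mentioned above
// (site, page, device variant, design-system version, header flavor, language).
function headerCacheKey(aspects) {
  const { site, page, variant, designSystemVersion, flavor, language } = aspects;
  return [site, page, variant, designSystemVersion, flavor, language].join('|');
}

// Every request falling into the same segment maps to the same key,
// so one cached static header can be reused for millions of users.
const key = headerCacheKey({
  site: 'ebay-us',
  page: 'homepage',
  variant: 'mweb',
  designSystemVersion: 'skin-v12',
  flavor: 'full',
  language: 'en',
});
console.log(key); // "ebay-us|homepage|mweb|skin-v12|full|en"
```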
13:08
And if you're able to compute a response for this specific key, and since this key is going to be pretty much common
13:13
for a large number of users in a specific segment, say, for example, the eBay US site homepage, which is on the language English and is following
13:22
a specific version of a design system on M-Web, right? That is a specific static header that
13:26
is going to be served for millions of users. So if you are able to compute this before
13:31
and you're able to cache it, you would be able to repurpose the same header for millions
13:35
of users. That's pretty much the idea. And like we were also talking about design considerations
13:42
with isolation and independence. The need for isolation is that you don't want errors
13:46
from each of the header and the domain pages to affect the other one
13:51
In terms of independence, we don't want the release cycles to be tied up
13:56
and we want all the engineers of different teams to be able to operate independently
13:59
so that we are able to roll out code with good velocity
14:04
The other one is that we want the header teams and the domain teams to be able to conduct
14:09
their own A/B tests and experiments. The thing is that since we are actually operating in a Node.js-based environment, the first thing is the header NPM module
14:19
But the catch here is that this NPM module is not going to do the render
14:25
It's just going to make downstream service calls. As I earlier established, the first bare minimum criteria is that we expect all the front end frameworks to be able to do server side rendering
14:35
and we are operating in a multipage environment. The thing is that the only things that
14:41
these modules do is that they make service calls if they don't already have a response
14:46
The only things they expose to the pages, say for example the homepage or the search page and so on, would be tag libraries. And these can be on different frameworks: you could be on Svelte, you could be on SolidJS, you could be on React, you could be on Marko, and we would just have
15:03
those tag libraries for those specific frameworks exposed. And these domain pages here, which we call
15:10
as the various pages within eBay, the homepage search page and all, they don't have to bother
15:14
about the render of the HTML and they don't have to bother about bundling
15:19
resources because all of that is now taken care of by the render service which
15:23
actually performs the rendering. We'll just get to that in a moment. As I earlier mentioned, we just have a window of about 25 to 30 milliseconds for the header
15:33
to show up. The thing is that this can obviously only be accomplished by some form of heavy caching
15:39
and we are able to cache this because we already are going to split the visual components as I
15:44
I earlier mentioned the static and dynamic fragments, and the static fragments are something that can be
15:49
served to millions of users belonging to a specific segment. If we cache that
15:53
we are obviously doing good in terms of the render speed. The first thing here is that
16:00
when a request comes in into an eBay app, and it's going to see the header NPM module
16:06
the header NPM module is going to have the tag libraries exposed and it's going to attempt to render
16:11
which means it's going to make a service call. But even before making the service call, it's going to look up into a system of caches
16:18
Like we have a cache that is tied to every single process in production
16:22
which means you would have three different caches in production, which is the L1 cache. And we do have a shared cache behind the scenes
16:27
So if it's present in the cache that is tied to a specific process, it would flush that immediately
16:35
If not, it would look up into the backup cache and repopulate the main cache and flush it back
16:39
And the last thing is, if it's not present in any of these caches
16:43
it is going to obviously make a service call to the renderer service, which is going to perform the render and return the rendered HTML back to us
16:50
So through this system of caching, we are able to serve the header in less than 30 milliseconds
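A simplified sketch of that lookup order, as it might live inside the header NPM adapter; the Map-based caches and the callRendererService stub are stand-ins for illustration only.

```js
// Lookup order described above: per-process L1 cache -> shared backup cache -> renderer service.
const l1Cache = new Map();      // tied to this worker process
const backupCache = new Map();  // stand-in for the shared backup cache

async function getHeaderHtml(key) {
  // 1. Cache tied to this specific process: flush immediately if present.
  if (l1Cache.has(key)) return l1Cache.get(key);

  // 2. Backup cache: repopulate the main cache and flush it back.
  if (backupCache.has(key)) {
    const html = backupCache.get(key);
    l1Cache.set(key, html);
    return html;
  }

  // 3. Last resort: call the renderer service, then populate both caches.
  const html = await callRendererService(key);
  backupCache.set(key, html);
  l1Cache.set(key, html);
  return html;
}

// Illustrative stub for the renderer service call.
async function callRendererService(key) {
  return `<header data-key="${key}">eBay header</header>`;
}
```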
16:56
And the reason why: we actually maintain this cache within the NPM module here,
17:02
but we don't have a distributed cache because again, that introduces a new network call
17:07
and there is again a network latency involved and there is again some form of uncertainty that's
17:13
actually tied to that call, if the call fails, if the render fails and so on. So for that purpose,
17:17
we don't really have this option of going to a distributed cache or a shared caching server and
17:24
things like that. We would have to have the cache local to every single box within,
17:29
I mean, every single domain page. And yeah, that is pretty much about caching here. And as I mentioned,
17:35
right since we have this last option of making the network call here for actually doing the render
17:42
what happens is we are introducing some form of network latency and we have some form of
17:46
uncertainty there. So a lot of things can go wrong: the render can fail, the service call can fail, and
17:51
so on. And sometimes even the Node.js worker processes restart if there are any errors, like
17:58
out-of-memory errors and so on. So we do have the last backup cache, which is a disk-
18:03
based cold cache that is bootstrapped into memory. In case the worker restarts,
18:12
it is going to be able to go back to this cold cache and pull it up and immediately serve the header so that there is a near zero downtime for the header
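A rough sketch of that disk-based cold cache idea; the file path, snapshot format, and refresh interval are assumptions for illustration.

```js
// On worker start, bootstrap the in-memory cache from a snapshot on disk so the
// header can still be served immediately after a restart (e.g. an out-of-memory crash).
const fs = require('fs');
const COLD_CACHE_FILE = '/tmp/header-cold-cache.json'; // illustrative location

const l1Cache = new Map();

// Bootstrap: pull the cold cache into memory when the worker (re)starts.
if (fs.existsSync(COLD_CACHE_FILE)) {
  const entries = JSON.parse(fs.readFileSync(COLD_CACHE_FILE, 'utf8'));
  for (const [key, html] of entries) l1Cache.set(key, html);
}

// Periodically snapshot the warm cache back to disk for the next restart.
setInterval(() => {
  fs.writeFileSync(COLD_CACHE_FILE, JSON.stringify([...l1Cache.entries()]));
}, 60_000).unref();
```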
18:21
I was also mentioning earlier that we are able to split these visual components into static and
18:25
dynamic fragments, and that static fragments are entirely static while the dynamic fragments have very
18:31
specific user-personalized data. But that is not entirely true, because even the static fragments
18:38
have some hints of personalization. Say, for example, you log in and you need to see a greeting message
18:42
like this. I mean, I'm just referring to the image on the right side here. So we do have a
18:48
hint of personalization there. And so what we do is, just before the HTML is going to be flushed out
18:54
to the browser, we would just do an injection of specific user values like this, I mean into the
19:02
response, so that it shows up on the screen immediately.
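A minimal sketch of that injection step; the placeholder token and greeting markup are assumptions, not eBay's actual template.

```js
// Inject user-specific values into the cached static header just before the HTML
// is flushed to the browser. The {{firstName}} token is an illustrative placeholder.
const cachedStaticHeader =
  '<header><span class="greeting">Hi {{firstName}}!</span></header>';

function personalizeHeader(staticHtml, user) {
  // Only a small hint of personalization lives in the static fragment,
  // so a simple token replacement right before flushing is enough.
  return staticHtml.replace('{{firstName}}', user ? user.firstName : 'there');
}

console.log(personalizeHeader(cachedStaticHeader, { firstName: 'Alex' }));
```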
19:09
The other, I mean the last, part of the setup is again the renderer service. The renderer service is the one that actually
19:13
performs the HTML rendering on the server and again it also has a two-tier caching system for
19:20
faster responses and this cache can obviously be a distributed cache because it's not accessed pretty
19:26
often but we haven't actually found a need for that yet and we are able to serve this without
19:32
a cache that is actually distributed. The thing specific about this renderer service is that it
19:38
is framework agnostic which means your domain page which is like the home page the search page or the
19:43
cart page the checkout page or anything it can be on any sort of a view library and they don't really
19:48
have to bother, nor does the header team have to bother what is being used by the domain pages
19:53
because the entire render is going to be decoupled away from the domain pages, and they can be on any
19:59
view library they want, and we can be on any sort of a view library that we want. The only thing here
20:07
is that the only point of interface here, again, is just the adapter. And if you're making any
20:13
changes to the adapter, we ensure that it's actually included as part of the monthly software
20:19
upgrades which are automated. And so if a team actually on boards into the software upgrade
20:25
they would just get the latest version of the specific adapter that they require. So when we talk of this renderer service, there are like two problems that can happen. So if you just
20:34
think of the case when you're on the homepage and you have about X boxes on the homepage pool that
20:39
is trying to serve the header and a specific variant of the header.
20:43
Say, for example, the X boxes are trying to serve requests for the header and it's not present in the cache.
20:51
And so it tries to make requests to the downstream service, which is the renderer service
20:55
And the renderer service can obviously get a lot of calls in that fashion
21:01
So there is no point if you're trying to process all these requests at the same time
21:05
because they are all similar requests that just happened to occur simultaneously
21:12
So we don't want to overstress the downstream server by processing all of these requests
21:18
Instead, we just process only one out of the N requests and we just repurpose the same response for the remaining N-1 requests
21:25
They call this request collapsing. That is, like, it solves a very specific problem
21:30
in distributed systems in terms of handling cascading failures. That is one thing that we use for this
21:38
That is actually employed using the Hystrix library that was popularized by Netflix
21:44
We use that through a library called TrueBar. That helps you build scalable pipelines for large scale apps of this size
21:55
So we are able to hold those requests and we just process one out of the N requests and we repurpose the same response for the remaining N-1 requests
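A hand-rolled sketch of the request-collapsing idea; in production this comes from the Hystrix-based tooling mentioned above, not custom code like this, and the stubs are purely illustrative.

```js
// N simultaneous requests for the same aspect key result in one downstream call;
// the remaining N-1 callers reuse the same pending promise.
const inFlight = new Map();

function renderHeaderCollapsed(key) {
  if (inFlight.has(key)) return inFlight.get(key); // reuse the pending response

  const promise = callRendererService(key)
    .finally(() => inFlight.delete(key)); // allow fresh calls once resolved
  inFlight.set(key, promise);
  return promise;
}

async function callRendererService(key) {
  console.log('actual downstream call for', key); // logged once per burst
  return `<header data-key="${key}">eBay header</header>`;
}

// Three "simultaneous" requests, one downstream call.
Promise.all([
  renderHeaderCollapsed('ebay-us|homepage|mweb'),
  renderHeaderCollapsed('ebay-us|homepage|mweb'),
  renderHeaderCollapsed('ebay-us|homepage|mweb'),
]).then(([a, b, c]) => console.log(a === b && b === c)); // true
```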
22:02
Again, these are actually computed based on the aspects. So we know that the N requests are similar because we are able to actually compute the key for the aspects and we know that all these requests are pretty similar
22:13
So we don't have to process them. The next one is, we do have circuit breakers in place. I think there was a session about circuit breakers in this conference, and we also pretty much use circuit breakers, because when a downstream system is already strained, you don't want
22:29
to fire more and more calls and not give it an opportunity to recover from the error.
22:34
So that is like a bad pattern. So it helps to alleviate back pressure on downstream servers and also prevent cascading
22:42
failures and it helps to lower the stress on downstream systems. What basically happens is the app monitor
22:49
would monitor the downstream service for markdowns. If it sees a markdown, it'll just open
22:55
the circuit and not make the service call, instead it'll just trigger the trap for serving
23:01
the fallbacks immediately so that there's no delay in the response. At the same time, behind the scenes
23:08
it keeps monitoring the status of the service by making pings at regular intervals and
23:14
seeing if the service comes back up. If it comes back up, then it again closes the circuit and
23:19
it just continues making the calls. So we talked about static fragments of the header so far.
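A tiny, hand-rolled circuit breaker sketch just to illustrate the open/fallback/probe/close cycle described above; the thresholds, timings, and fallback markup are assumptions, and the real implementation is eBay's Hystrix-based infrastructure rather than code like this.

```js
// Open the circuit after repeated failures, serve a fallback header immediately,
// and probe again later to close the circuit once the service recovers.
const breaker = { failures: 0, open: false };
const FAILURE_THRESHOLD = 3;
const RETRY_AFTER_MS = 30_000;
const FALLBACK_HEADER = '<header>eBay</header>'; // minimal fallback markup

async function renderHeaderWithBreaker(key) {
  if (breaker.open) return FALLBACK_HEADER; // circuit open: no downstream call

  try {
    const html = await callRendererService(key);
    breaker.failures = 0; // healthy call: reset the failure count
    return html;
  } catch (err) {
    breaker.failures += 1;
    if (breaker.failures >= FAILURE_THRESHOLD) {
      breaker.open = true; // stop hammering the strained downstream service
      // Probe again later and close the circuit if the service has recovered.
      setTimeout(() => { breaker.open = false; breaker.failures = 0; }, RETRY_AFTER_MS).unref();
    }
    return FALLBACK_HEADER;
  }
}

async function callRendererService(key) {
  return `<header data-key="${key}">eBay header</header>`;
}
```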
23:26
And let's now have a look at how the dynamic portions work. So the dynamic portions pretty much follow
23:33
the established patterns of building any sort of a distributed system. At least, it's held
23:42
heavily in place by eBay's infrastructure that includes, by default, the Hystrix patterns and all the request collapsing that we earlier
23:49
discussed before. So what basically happens is the static header loads on the page and then the user interacts with it. Say, for example, if
23:59
I'm going to click on the watch list here, it has to show up a modal or a fly
24:04
out with all the items there. And for that, it would just make an AJAX call to
24:08
the downstream service and that would reach a separate production pool. That's also one of the reasons why we split this into static and dynamic fragments
24:18
because we now have two different production pools and we are able to scale these independently
24:22
The static fragments can be scaled independently and the dynamic fragments can be scaled independently and the errors that occur in one don't impact the other and so on
24:31
It also makes the static header extremely resilient because it has to absolutely show up
24:37
even for the dynamic portions to work. If you're going to access, say, for example
24:42
your notifications, you just hover over the bell icon there on eBay's website, it would show up a list of notifications
24:49
For that, we make an Ajax call that reaches the dynamic platform
24:54
which takes care of the HTML rendering on the server. Internally, it makes downstream service calls
24:58
which again follows the backend for front-end approach here. And these things can again involve, I mean, layers of caching and DB sharding and so on
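A client-side sketch of that interaction, assuming hypothetical DOM ids and a hypothetical /header/dynamic/cart endpoint on the dynamic-fragment pool.

```js
// The static header is already on the page; hovering the cart icon triggers an
// AJAX call to the dynamic-fragment pool, which returns server-rendered HTML
// for the flyout. DOM ids and the endpoint URL are illustrative only.
document.querySelector('#header-cart-icon').addEventListener('mouseenter', async () => {
  const flyout = document.querySelector('#header-cart-flyout');
  if (flyout.dataset.loaded) return; // fetch once per page view

  const res = await fetch('/header/dynamic/cart', { credentials: 'include' });
  flyout.innerHTML = await res.text(); // HTML rendered by the dynamic platform
  flyout.dataset.loaded = 'true';
  flyout.hidden = false;
});
```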
25:11
eBay has its servers in three different infrastructures that are held at three different places
25:17
So in terms of coordinating with them and all that, all the usual established patterns of distributed systems come into play here
25:24
And yeah, putting it together, we have a separate pool for the static portions
25:29
and also a separate pool for the dynamic fragments. we are able to scale them independently and the errors in one don't impact the other making them
25:38
more resilient. And for instance, the entire set of one billion impressions that were clocked at the
25:46
time of peak traffic for the static fragments was just served using 40 boxes
25:50
on production because we rely heavily on caching again and we are able to
25:56
I mean cache that in a manner that I earlier discussed before and so the entire set of one
26:01
billion impressions get served with about 40 boxes in production. The dynamic portions, they are frequently accessed and have to be
26:10
recomputed for every single user. We have about 500 boxes that serve this in production
26:16
This is handled again by eBay's own cloud infrastructure. They don't use Azure or AWS or anything like that
26:23
They have their own cloud platform that can flex up and flex down based on the traffic
26:28
The last part of this is like being able to debug entries
26:33
We earlier saw that there was a multi-tier caching system like this in place
26:37
We obviously need to know which part of this system actually ended up serving the request
26:45
If we are having any issues, we need to be able to debug that specifically, and a setup is already in place
26:53
We are also able to bypass the cache completely to load and test a fully cacheless experience
26:59
In terms of testing, there are end to end pipelines that we have. I mean, automation pipelines that are written that
27:07
obviously will do, I mean, testing at random times for the
27:14
most important pages, at least for eBay to ensure that the header always shows up
27:17
And in terms of tracking also, we do have, I mean, impression tracking for every single page that loads up, whether the header loaded up or not,
27:24
and things like that. And for observability and monitoring, we do monitor the CPU usage of the production servers
27:31
And we also check for out of memory errors and error counts. And if there are any issues that
27:37
happening, it would immediately start, I mean, all the alarm bells would start going off. And
27:44
the last bit of this is, I mean, we are tracking so much on the server, and we obviously need to
27:50
also know: after showing up on the page, did the header actually work
27:55
properly, right? So for this, we log, again, the client-side errors back into CAL, which is what eBay
28:02
calls the Central Application Logger. And from that, we are able to pull in our data
28:08
if we had any sort of errors on the client side when the header loaded.
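A client-side sketch of that kind of error reporting; the /log/client-errors endpoint and the payload shape are assumptions, not CAL's real interface.

```js
// Report header errors from the browser back to a central logging endpoint.
window.addEventListener('error', (event) => {
  // Only report errors originating from the header's own bundle (illustrative filter).
  if (!event.filename || !event.filename.includes('header')) return;

  navigator.sendBeacon('/log/client-errors', JSON.stringify({
    message: event.message,
    source: event.filename,
    line: event.lineno,
    page: location.pathname,
  }));
});
```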
28:15
And lastly, cache eviction, right? So we have established that there are independent pools. I mean, every page is on
28:20
a separate pool within eBay. If they do rollouts, it's going to reset the cache because it's an in-memory cache
28:26
and that's fine. But what happens if the renderer service does a rollout
28:30
Then what happens? This is handled by eBay's config management team that's
28:36
actually going to publish a config which any apps that are interested is going to listen to
28:41
Once they subscribe for that specific change, they would do something called staggered cache eviction
28:46
so that they don't flood the downstream service again, I mean, just trying to evict all the cache entries and repopulating all of them again
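A sketch of what staggered cache eviction could look like; the config client, config key, and jitter window are all illustrative assumptions.

```js
// Stand-in for eBay's config management client (the real subscription API is internal).
const configClient = {
  subscribe(key, cb) { /* in reality, receives pushed config changes from the config service */ },
};
const l1Cache = new Map();
const backupCache = new Map();

// When the renderer service rolls out, a config change is published; each subscribed
// box waits a random delay before evicting, so the downstream service isn't flooded
// by every box re-rendering at the same time.
configClient.subscribe('header.renderer.version', (newVersion) => {
  const jitterMs = Math.floor(Math.random() * 5 * 60_000); // spread evictions over ~5 minutes
  setTimeout(() => {
    l1Cache.clear();      // evict this box's stale entries only
    backupCache.clear();
    console.log(`header cache evicted for renderer version ${newVersion}`);
  }, jitterMs);
});
```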
28:55
Yeah, that is pretty much it. So the important takeaway here was sometimes it's not about just designing a distributed system
29:03
You have sometimes the UI and the UX can also drive your design decisions
29:08
In this case, we were able to split that into two segments separately, and we were able to sort of isolate them away
29:16
and scale them in a manner that they don't impact each other
29:21
So that was a big takeaway in this. And I've included all the references that I've shared for this session
29:28
that I spoke about in this talk. And you can always refer to those.
29:33
I'm hoping there'll be access to the slides sometimes. Yeah, that is pretty much it
29:38