Dealing with millions of webhooks and surviving Black Friday
With last year's 300% increase in traffic, the holiday season, which starts at the end of October, is the busiest season of the year. Black Friday, Cyber Monday and Christmas shopping all lead to a flood of orders and updates to customers, products, and more.
As an app developer, this is the most important season of the year. You do not want your app to fail, and you do not want the stress of downtime. You need to be prepared. To help you out, we asked our largest partners, Shappz and AppLab (DotCommerce), how they deal with millions of webhooks.
When the data in your application needs to be as up to date as possible, you should use the webhook functionality within the API. A webhook is an event-driven HTTP callback with a JSON or XML payload. This payload is posted to a predefined endpoint set by the app builder. Lightspeed eCom webhooks wait a maximum of 5 seconds for an HTTP 200 response before closing the connection.
Sounds like an efficient way of pushing data, doesn't it? But as with many subjects, the effects of scale apply here. Webhooks are interesting, but they become even more interesting when you need to deal with millions of them.
If your application processes webhooks on the fly and lets Lightspeed wait until each webhook has been processed, your server can easily become overloaded. Soon it will look as if your server is under a DDoS attack.
In addition, letting Lightspeed wait for 5 seconds means that it takes 5 seconds before your app or another app receives the next webhook. Failing to accept the webhook (returning HTTP 200) within those 5 seconds results in a penalty, and the webhook is requeued at the end of the list with a delay. The more penalties your application has, the longer it has to wait in the queue. This could lead to deactivation of the webhook or a ban of your host.
So how should you deal with a large number of webhooks?
Both Shappz and AppLab gave us insights into how they deal with millions of webhooks. To receive and process those millions of webhooks, they use a so-called Job-Farm with Job-workers. A Job-Farm is a collection of individual tasks waiting to be picked up by a Job-worker. A Job-worker can be dedicated to a specific task (single-skilled) or can handle all tasks (multi-skilled).
As soon as a webhook is posted to one of their servers, they accept it and store it in a Job-Farm. Nothing more, nothing less. This significantly drops the response time, from an average of 4 seconds to less than 0.1 seconds.
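The accept-and-store step can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the partners' actual implementation: the names `job_farm`, `accept_webhook`, `WebhookHandler` and `serve`, as well as the host and port, are assumptions made for the example, and the in-process queue stands in for a real queueing backend.

```python
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process stand-in for the Job-Farm. A real deployment would
# use an external queue so that jobs survive a restart of the web process.
job_farm = queue.Queue()


def accept_webhook(raw_body, farm):
    """Store the raw payload untouched and return the HTTP status to send."""
    farm.put(raw_body)  # no parsing, no business logic, no API calls
    return 200


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the announced number of bytes, enqueue, answer at once.
        length = int(self.headers.get("Content-Length", 0))
        status = accept_webhook(self.rfile.read(length), job_farm)
        self.send_response(status)
        self.end_headers()


def serve(host="127.0.0.1", port=8080):
    """Start the receiver (assumed host and port); blocks until interrupted."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver. The handler does nothing that depends on the payload contents, which is what keeps the response time short.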
This Job-Farm is usually built upon a queueing backend such as Redis (an in-memory data structure store), Beanstalkd (for example through the Pheanstalk PHP client) or Amazon SQS. A task is stored within a list and handled according to the FIFO (first in, first out) principle. This is important if you want to process your webhooks in the order you received them.
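To make the FIFO principle concrete, here is a small sketch of a job list that hands out tasks in the order they arrived. The class `FifoJobFarm` and the event names are hypothetical and only serve the illustration; an in-memory `deque` is used in place of a real backend.

```python
from collections import deque


class FifoJobFarm:
    """Minimal first-in, first-out job list (illustrative only)."""

    def __init__(self):
        self._jobs = deque()

    def enqueue(self, job):
        self._jobs.append(job)  # new jobs join the back of the list

    def dequeue(self):
        # The oldest job leaves from the front; None means the farm is empty.
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)


farm = FifoJobFarm()
for event in ("order.created", "order.paid", "order.shipped"):
    farm.enqueue(event)

handled = [farm.dequeue() for _ in range(3)]
print(handled)  # ['order.created', 'order.paid', 'order.shipped']
```

With a Redis list, pushing with RPUSH and popping with LPOP gives the same ordering.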
Job-workers are processes that continuously check whether new jobs have been added to the farm. As soon as a job is added, a worker starts processing it. The worker processes are constantly monitored so that a failure does not go unnoticed. As soon as a worker crashes or experiences hiccups, a new worker should automatically be started. This prevents crashed workers from halting job processing and your queues from piling up.
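The polling and restart behaviour described above might look roughly like the sketch below. It is a simplified, single-worker illustration under stated assumptions: `run_worker` and `supervise` are hypothetical names, the farm is a standard-library `queue.Queue`, and a job that crashes its worker is simply dropped. In practice a process supervisor usually takes the role of `supervise`, and failed jobs are typically retried or set aside for inspection.

```python
import queue
import threading


def run_worker(farm, handle, stop):
    """Poll the farm and process jobs until `stop` is set."""
    while not stop.is_set():
        try:
            job = farm.get(timeout=0.1)  # wake up regularly to check `stop`
        except queue.Empty:
            continue
        try:
            handle(job)  # an exception here "crashes" this worker
        finally:
            farm.task_done()  # the job counts as finished even if it failed


def supervise(farm, handle, stop):
    """Start a worker again whenever it crashes; return the restart count."""
    restarts = 0
    while not stop.is_set():
        try:
            run_worker(farm, handle, stop)
        except Exception:
            restarts += 1  # a real supervisor would also log the failure
    return restarts
```

Running `supervise` in its own thread or process keeps the queue draining even when an individual job raises an error.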
What are the benefits of a Job-Farm and Job-workers?
A major advantage is that you have better control over your infrastructure. Accepting a webhook consumes very little CPU and memory. As the jobs are processed one by one per running worker, you can easily control the amount of CPU and memory allocated to the workers.
Job-Farms are ideally suited for scalability. If your app becomes more popular or a merchant has done an import, you can increase the number of workers to process the pending tasks faster.
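As a rough illustration of sizing the worker pool to the backlog, the sketch below estimates how many workers are needed to clear the pending tasks within a target time. The function `workers_needed`, its parameters and the default limits are assumptions for the example; real throughput per worker has to be measured for your own jobs.

```python
import math


def workers_needed(pending_jobs, jobs_per_worker_per_second, target_seconds,
                   min_workers=1, max_workers=50):
    """Estimate the worker count that clears the backlog in `target_seconds`."""
    capacity_per_worker = jobs_per_worker_per_second * target_seconds
    wanted = math.ceil(pending_jobs / capacity_per_worker)
    # Never drop below the minimum or exceed what the infrastructure allows.
    return max(min_workers, min(max_workers, wanted))


# 10,000 pending jobs, 5 jobs per second per worker, cleared within a minute.
print(workers_needed(10_000, 5, 60))  # 34
```

The upper limit matters as much as the estimate: it keeps a sudden import from claiming more CPU and memory than the workers are supposed to use.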
Last but not least, with such a mechanism in place, Lightspeed is less likely to deactivate or ban your webhook for poor performance.