CMS_WebFarmServer and scaling

Hey there,

I'm coming here because we just had an incident that almost cost us our production environment.

We're working on an Xperience Website with this architecture:

  • One CMS instance

  • One Frontend instance that can scale out to up to 4 instances

All that in Azure.

We're currently running a lot of tests on our scaling, which means a lot of server creation and destruction.

A few minutes ago, we had the rather unpleasant surprise of seeing our SQL Server instance go up to 100% with nothing unusual running. Let's look into the problem.

Proc_CMS_WebFarmTask_DeleteOrphanedTasks => min 15s, avg 15s, max 15s

That's not good. 15s felt a lot like a SQL timeout and indeed it was.

Let's dive.

The body of DeleteOrphanedTasks uses a NOT IN (...) that gets all the task IDs in WebFarmServerTasks.
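Since the SQL itself could not be posted, here is a rough sketch of the kind of query described above. It is only an illustration: the table and column names are simplified stand-ins, not the real Xperience schema, the actual procedure body may differ, and it runs on an in-memory SQLite database rather than SQL Server. It shows the NOT IN shape and an equivalent NOT EXISTS form for comparison.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE WebFarmTask (TaskID INTEGER PRIMARY KEY);
        CREATE TABLE WebFarmServerTask (
            ServerID INTEGER NOT NULL,
            TaskID   INTEGER NOT NULL
        );
        CREATE INDEX IX_ServerTask_TaskID ON WebFarmServerTask (TaskID);
        """
    )
    conn.executemany(
        "INSERT INTO WebFarmTask (TaskID) VALUES (?)",
        [(i,) for i in range(1, 11)],
    )
    # Tasks 1..4 are still assigned to a server; tasks 5..10 are orphaned.
    conn.executemany(
        "INSERT INTO WebFarmServerTask (ServerID, TaskID) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 3), (2, 4)],
    )
    conn.commit()
    return conn


def orphaned_ids_not_in(conn):
    """The NOT IN shape: every task whose ID is absent from the server-task list."""
    rows = conn.execute(
        """
        SELECT TaskID FROM WebFarmTask
        WHERE TaskID NOT IN (SELECT TaskID FROM WebFarmServerTask)
        ORDER BY TaskID
        """
    ).fetchall()
    return [r[0] for r in rows]


def delete_orphaned_tasks(conn):
    """Same result expressed as a correlated NOT EXISTS; returns rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM WebFarmTask
        WHERE NOT EXISTS (
            SELECT 1 FROM WebFarmServerTask st
            WHERE st.TaskID = WebFarmTask.TaskID
        )
        """
    )
    conn.commit()
    return cur.rowcount
```

Either form has to look at the whole server-task list, so the more rows stale servers leave behind, the more work each cleanup run does. Whether NOT EXISTS is actually faster than NOT IN on SQL Server depends on the plan, indexes, and column nullability, so treat this as a shape to test, not a guaranteed fix.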

And at that moment, I got it.

Let's check our servers...

18 servers with ServerEnabled seems like a lot, as we only had 1 instance at that moment (+1 CMS + 1 staging slot, I guess). Let's check what's going on...

(A query returning the last ping per ServerId.)

So some servers haven't answered for more than 12h but are still considered enabled (ServerEnabled)?
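For anyone who wants to run the same kind of check, here is a minimal sketch of a "last ping per server" query that flags enabled servers that have been silent for more than 12 hours. The table and column names (WebFarmServer, WebFarmServerMonitoring, ServerPing) are assumptions made for the example and may not match the real schema, and it runs on in-memory SQLite, not SQL Server.

```python
import sqlite3
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # fixed-width, so string order equals time order


def build_demo_db():
    """Hypothetical, simplified tables with four servers in different states."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE WebFarmServer (
            ServerID INTEGER PRIMARY KEY,
            ServerEnabled INTEGER NOT NULL
        );
        CREATE TABLE WebFarmServerMonitoring (
            ServerID INTEGER NOT NULL,
            ServerPing TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO WebFarmServer VALUES (?, ?)",
        [(1, 1), (2, 1), (3, 1), (4, 0)],
    )
    conn.executemany(
        "INSERT INTO WebFarmServerMonitoring VALUES (?, ?)",
        [
            (1, "2024-01-02 11:59:00"),  # fresh
            (2, "2024-01-01 08:00:00"),  # older ping for server 2
            (2, "2024-01-01 20:00:00"),  # last ping, 16h before "now"
            (4, "2023-12-01 00:00:00"),  # stale, but server is disabled
        ],
    )
    conn.commit()
    return conn


def stale_enabled_servers(conn, now, max_age=timedelta(hours=12)):
    """Return (ServerID, last ping) for enabled servers silent longer than max_age.

    Servers that never pinged at all are returned with a last ping of None.
    """
    cutoff = (now - max_age).strftime(TS_FORMAT)
    return conn.execute(
        """
        SELECT s.ServerID, MAX(m.ServerPing)
        FROM WebFarmServer s
        LEFT JOIN WebFarmServerMonitoring m ON m.ServerID = s.ServerID
        WHERE s.ServerEnabled = 1
        GROUP BY s.ServerID
        HAVING MAX(m.ServerPing) IS NULL OR MAX(m.ServerPing) < ?
        ORDER BY s.ServerID
        """,
        (cutoff,),
    ).fetchall()
```

With the demo data and `now = datetime(2024, 1, 2, 12, 0, 0)`, servers 2 and 3 come back as stale, while server 1 (fresh ping) and server 4 (disabled) do not.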

Once I deleted all the tasks and servers that no longer existed, the DB load went back to normal.

Has anyone else had this problem? What is the normal behavior for our use case? Does Xperience support autoscaling, or do I have to manage the server list on our side? I feel a process should do the cleanup at least hourly.

Furthermore, DELETE TOP is a really bad pattern. We already had a problem in K13 with the same kind of query: you were doing almost the same thing to delete logs, and with a large log intake it could easily break the system. The query lists all the rows and takes the first x. If the listing takes longer than the timeout, the whole cleanup breaks.
You have an identity field. Get the min/max of your query and delete on Id > X. It will be lightning fast in comparison.
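To make the idea concrete, here is a small sketch of deleting by identity range instead of "list everything, take the first x". It assumes a hypothetical EventLog table with an integer identity column EventID, runs on in-memory SQLite instead of SQL Server, and deletes everything up to a chosen cutoff ID in fixed-size ID windows. None of the names come from the actual product.

```python
import sqlite3


def build_demo_db(row_count=2500):
    """Hypothetical log table with a dense integer identity column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EventLog (EventID INTEGER PRIMARY KEY, Message TEXT)"
    )
    conn.executemany(
        "INSERT INTO EventLog (EventID, Message) VALUES (?, ?)",
        [(i, f"event {i}") for i in range(1, row_count + 1)],
    )
    conn.commit()
    return conn


def delete_up_to(conn, max_id_to_delete, batch_size=1000):
    """Delete all rows with EventID <= max_id_to_delete, one ID window at a time.

    Each DELETE is a range seek on the primary key, so no statement has to
    enumerate the whole table before it can start removing rows.
    """
    row = conn.execute(
        "SELECT MIN(EventID) FROM EventLog WHERE EventID <= ?",
        (max_id_to_delete,),
    ).fetchone()
    if row[0] is None:
        return 0  # nothing to delete

    low = row[0]
    deleted = 0
    while low <= max_id_to_delete:
        high = min(low + batch_size - 1, max_id_to_delete)
        cur = conn.execute(
            "DELETE FROM EventLog WHERE EventID BETWEEN ? AND ?",
            (low, high),
        )
        conn.commit()  # short transactions keep locks and log growth small
        deleted += cur.rowcount
        low = high + 1
    return deleted
```

For example, `delete_up_to(build_demo_db(), 2000, batch_size=500)` removes IDs 1 to 2000 in four small statements. If the IDs are very sparse, some windows will be empty, so the window size would need tuning on real data.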

[edit] It seems we can't post SQL here. Not very practical :D

Environment

  • Xperience by Kentico version: [31.5.0]

  • .NET version: [10]

  • Execution environment: [Private cloud (Azure)]
