Understanding ATS memory usage

Understanding ATS memory usage

Veiko Kukk
Hi,

I'm confused about ATS memory configuration. I have a server with CentOS 7, ATS 7.1.1, 64 GB of memory and ~10 TB of disk.
The traffic_server process uses ~23 GB of memory even though the RAM cache is configured at 8 GB:
CONFIG proxy.config.cache.ram_cache.size INT 8589934592
ATS is using a raw partition on an HDD.

* Why does it swap when there is page cache, which is essentially free memory that could be reclaimed before swapping? vm.swappiness is 10; I had it set to 0 too, and then the system does not swap.
* Considering ATS uses O_DIRECT with raw partitions and its own memory management for the disk cache, does that mean ATS does not use the kernel page cache at all?
* Would ATS benefit from a larger RAM cache, considering it has its own disk buffer management?

Also, strangest of all, there are frequent drops in the traffic_server process's memory usage: after around 50 minutes, 10 GB of memory is released and immediately consumed again. Attaching a screenshot.

Regards,
Veiko


[Attachment: ats_memory_jumps.png (23K)]

Re: Understanding ATS memory usage

Veiko Kukk
Hi,

Does really nobody know how ATS uses memory?

Veiko


Re: Understanding ATS memory usage

Alan Carroll-2
It's a complex subject, hard to cover in an email. A few notes:

1) You shouldn't let an ATS box swap; that almost always ends badly. It's better to add more RAM or adjust the configuration to avoid it. I think we set swappiness to 0.

2) The cache directory takes memory independent of the RAM cache: 10 bytes per directory entry. The number of directory entries is roughly the cache disk size divided by the average object size as set in records.config.

3) ATS does not use the kernel page cache for its own cache operations.

4) A larger RAM cache almost always improves performance, but the yield curve can differ quite a lot. The RAM cache allows cached data to be served from RAM instead of disk; however, once it covers most of the working set, additional RAM yields marginal benefits. E.g. putting an object fetched once a day in the RAM cache helps, but not very much.

5) I think what you're seeing in your graph is cache directory synchronization to disk. To do that, ATS allocates memory for a copy of the cache directory, copies the directory there, then writes it out. It should be doing that somewhat piecemeal, because a full duplicate of the cache directory can be very large.
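For the numbers in this thread, points 2) and 5) can be put into a quick back-of-the-envelope sketch. This assumes the default proxy.config.cache.min_average_object_size of 8000 bytes; if records.config overrides that value, adjust accordingly.

```python
# Rough estimate of ATS cache-directory memory (point 2) and of the
# extra allocation during directory sync (point 5).
# Assumes the default proxy.config.cache.min_average_object_size (8000 bytes).
cache_disk_bytes = 10 * 10**12        # ~10 TB raw partition, as in this thread
avg_object_size = 8000                # bytes; ATS default, may be overridden
bytes_per_dir_entry = 10              # per point 2 above

entries = cache_disk_bytes // avg_object_size
dir_gb = entries * bytes_per_dir_entry / 10**9
print(f"~{entries:,} directory entries -> ~{dir_gb:.1f} GB resident")
print(f"a full directory copy during sync would add up to ~{dir_gb:.1f} GB more")
```

By this rough estimate, ~12.5 GB of directory plus the 8 GB RAM cache accounts for most of the ~23 GB resident size reported above, and a (possibly partial) copy of the directory during sync is on the same order as the ~10 GB periodic jumps.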




Re: Understanding ATS memory usage

Veiko Kukk
Hi again,

That mysterious task that happens every ~50-51 minutes causes requests/responses to slow down drastically, even time out.
Requests that usually take a few hundred milliseconds now take over 30 s and time out. This happens only while ATS's memory consumption suddenly drops, and it affects both bypassed URLs and cache hits.
The ATS version is 7.1.1, and this looks like a serious bug to me.

Regards,
Veiko




Re: Understanding ATS memory usage

Leif Hedstrom



That sounds suspiciously like kernel paging activity; maybe it's spending that time flushing dirty pages? Maybe transparent huge pages? Or tweak the sysctls for the dirty page ratios?

The other thing to possibly look at is the directory entry sync period of ATS. Whenever we sync those to disk, we consume both more memory and more disk I/O, and maybe you are putting too much pressure on the VM (i.e. maybe you need to turn down the RAM cache or reduce the number of directory entries you have).
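A sketch of the records.config knobs this suggestion points at; the values below are illustrative assumptions, not recommendations, so verify names and defaults against your ATS version's documentation before changing anything.

```
# Raising the average object size reduces the number of directory entries
# (disk size / average object size), and therefore both the steady-state
# directory memory and the size of the sync-time copy.
# 8000 is the default; 32768 is only an illustrative value.
CONFIG proxy.config.cache.min_average_object_size INT 32768

# Turning down the RAM cache also relieves overall memory pressure;
# 4 GB here is likewise only an example.
CONFIG proxy.config.cache.ram_cache.size INT 4294967296
```

Note that changing min_average_object_size resizes the cache directory, which typically means the on-disk cache is reinitialized.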

— Leif





Re: Understanding ATS memory usage

Veiko Kukk
Hi,

I should have noted before that, during that timeframe, there is no higher disk activity than average: no higher load, no disk latency, no CPU load. Nothing abnormal except slow ATS.
ATS is running on CentOS 7 directly on a dedicated hardware server.

The dirty-pages-related config that has always been in place on that server:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 40
vm.swappiness = 0

# free -m
              total        used        free      shared  buff/cache   available
Mem:         128831       27841        2436        4331       98554       95722
Swap:          4095           0        4095

As you can see, there is a lot of available memory.

I don't see how writing dirty pages could slow down ATS when there is no indication of excessive load on any system resource.
And there's this strange regularity: it happens every ~50 minutes, as if some periodic (cron-like) task were being run inside ATS that delays all other tasks.

ATS is using a 9 TB raw partition, if that information is relevant.

Could you point me to the documentation dealing with the directory entry sync periods of ATS?


-- 
Veiko






Re: Understanding ATS memory usage

Veiko Kukk
Hi again,

I'd really appreciate it if somebody could point me in the right direction on how to solve this.
Whatever ATS does every ~50 minutes has a strong effect on response times.
ATS is used in reverse proxy mode, and we run regular tests against a test URL on the proxied server(s) (excluded from caching in the ATS config).
This test GET is run as an HAProxy health check every ~15 seconds for two local HAProxy backends, which both pass requests to a single local ATS.

It's quite a complex setup, but the point is that the tests run frequently and give information about ATS response times over a long period.

Total test runs today: 6364
Tests that took over 7s today: 50

Distribution of requests; the first column is response time, the second the cumulative number of requests under that value.
100 1292
300 4351
500 5194
700 5578
900 5794
1200 5985
1400 6058
1800 6143
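Read cumulatively, the distribution above works out to rough percentiles; a small sketch using the numbers as given (times assumed to be in milliseconds):

```python
# Convert the cumulative counts above into fractions of the 6364 test runs.
total = 6364
cumulative = [(100, 1292), (300, 4351), (500, 5194), (700, 5578),
              (900, 5794), (1200, 5985), (1400, 6058), (1800, 6143)]

for ms, count in cumulative:
    print(f"<= {ms:4d} ms: {count / total:6.1%}")
```

In other words, roughly 68% of checks finish within 300 ms, yet about 3.5% take longer than 1.8 s, which fits a periodic stall rather than steady load.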

Here is the output of the test log that contains all the extremely slow responses. The test response size is only 609 bytes. Usually response time fluctuates around

2018-01-26T01:13:32.150186+00:00 12412
2018-01-26T01:13:32.150188+00:00 20803
2018-01-26T02:05:04.536931+00:00 29764
2018-01-26T02:05:04.536936+00:00 27271
2018-01-26T02:05:04.536941+00:00 10233
2018-01-26T02:56:26.968987+00:00 9511
2018-01-26T02:56:26.968989+00:00 30084
2018-01-26T02:56:26.968991+00:00 27337
2018-01-26T04:39:21.947460+00:00 24171
2018-01-26T04:39:21.947462+00:00 12042
2018-01-26T04:39:21.947464+00:00 36979
2018-01-26T04:39:31.954116+00:00 7369
2018-01-26T04:39:31.954118+00:00 32305
2018-01-26T04:39:31.954120+00:00 19779
2018-01-26T04:47:42.349748+00:00 29177
2018-01-26T04:47:42.349754+00:00 26212
2018-01-26T04:47:42.349757+00:00 21645
2018-01-26T04:47:42.349759+00:00 24932
2018-01-26T05:39:04.925435+00:00 32361
2018-01-26T05:39:04.925438+00:00 33587
2018-01-26T05:39:04.925440+00:00 8173
2018-01-26T05:39:04.925443+00:00 28149
2018-01-26T05:39:04.925445+00:00 29115
2018-01-26T06:30:27.643170+00:00 7423
2018-01-26T06:30:27.643172+00:00 32271
2018-01-26T06:30:27.643174+00:00 18927
2018-01-26T06:30:27.643179+00:00 27849
2018-01-26T06:30:37.644023+00:00 15160
2018-01-26T07:21:50.231681+00:00 19208
2018-01-26T07:21:50.231684+00:00 14984
2018-01-26T08:13:12.874501+00:00 16876
2018-01-26T08:13:22.885389+00:00 14007
2018-01-26T09:04:35.509167+00:00 9016
2018-01-26T09:04:35.509172+00:00 9356
2018-01-26T09:55:58.052277+00:00 24137
2018-01-26T09:55:58.052280+00:00 23709
2018-01-26T09:55:58.052282+00:00 19901
2018-01-26T09:55:58.052284+00:00 19034
2018-01-26T10:47:10.614261+00:00 23419
2018-01-26T10:47:10.614263+00:00 18967
2018-01-26T11:38:32.984318+00:00 14425
2018-01-26T11:38:32.984324+00:00 9797
2018-01-26T11:38:32.984326+00:00 11161
2018-01-26T11:38:32.984329+00:00 16228
2018-01-26T12:29:45.511517+00:00 15580
2018-01-26T12:29:45.511520+00:00 11439
2018-01-26T13:20:58.023816+00:00 21360
2018-01-26T13:20:58.023818+00:00 19488
2018-01-26T13:20:58.023821+00:00 14737
2018-01-26T13:20:58.023823+00:00 17118


The question is: why does ATS regularly slow down? Are there some internal management jobs that use the same single queue as incoming requests, so that requests just wait in the queue until the internal processes finish?
It's getting worse as ATS uptime increases, triggering HAProxy health check timeouts. After a restart it's not that bad again for a while.

How do I get rid of this regular slowness?

Best regards,
Veiko







Re: Understanding ATS memory usage

Bryan Call-2
I came across this command, and it has helped track down some latency issues caused by other processes (ss -s). Can you run it during the time you are seeing latency issues and post the results here?

dstat -c --top-cpu -d --top-bio --top-latency -n

-Bryan






Re: Understanding ATS memory usage

Veiko Kukk
Hi Bryan,

The system in general is under light load; no other processes are causing latency.



Veiko

