This post is the last of a 3-part series on how to monitor DynamoDB performance. Part 1 explores its key performance metrics, and Part 2 explains how to collect these metrics.

Medium launched to the public in 2013 and has grown quickly ever since. Growing fast is great for any company, but it requires continuous infrastructure scaling, which can be a significant challenge for any engineering team (remember the fail whale?). Anticipating this growth, Medium adopted DynamoDB as one of its primary data stores, which successfully helped them scale up rapidly. In this article we share the DynamoDB lessons Medium has learned over the last few years, and discuss the tools they use to monitor DynamoDB and keep it performant.
## Throttling: the primary challenge

As explained in Part 1, throttled requests are the most common cause of high latency in DynamoDB, and they can also cause user-facing errors. Properly monitoring requests and provisioned capacity is therefore essential for Medium to ensure an optimal user experience. Medium uses Datadog to track the number of reads and writes per second on each of its tables, and to compare actual usage to provisioned capacity. In a snapshot of one of their Datadog graphs, actual usage stayed well below provisioned capacity except for one brief spike.
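The article doesn't show how these numbers are pulled, so here is a minimal sketch of the same comparison using boto3 (an assumption; Medium's own tooling and the Datadog integration are not shown). Provisioned capacity comes from the table definition, and consumed capacity from CloudWatch:

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumed client library; the article doesn't name one

TABLE = "posts"  # hypothetical table name
dynamodb = boto3.client("dynamodb")
cloudwatch = boto3.client("cloudwatch")

# Provisioned read capacity, from the table definition.
table = dynamodb.describe_table(TableName=TABLE)["Table"]
provisioned_rcu = table["ProvisionedThroughput"]["ReadCapacityUnits"]

# Consumed read capacity over the last hour, from CloudWatch.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": TABLE}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)

# CloudWatch reports a sum per period; divide by the period length to get
# average reads per second, which is comparable to provisioned RCU.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    rate = point["Sum"] / 300
    print(f"{point['Timestamp']}: {rate:.1f} consumed vs {provisioned_rcu} provisioned")
```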
Unfortunately, tracking your remaining whole-table capacity is only the first step toward accurately anticipating throttling. Even though you provision a specific amount of capacity for a table (or a global secondary index), the actual request-throughput limit can be much lower. As described by AWS here, DynamoDB automatically partitions your tables behind the scenes and divides their provisioned capacity equally among these smaller partitions. That's not a big issue if your items are accessed uniformly, with each key requested at about the same frequency as the others: in that case, your requests will be throttled right around the point you reach your provisioned capacity, as expected. However, some elements of a Medium “story” can't be cached, so when a story goes viral, some of its assets are requested extremely frequently. These assets have “hot keys,” which create an extremely uneven access pattern.

Since Medium's tables can grow to 1 TB and can require tens of thousands of reads per second, they are highly partitioned. For example, if Medium has provisioned 1,000 reads per second for a particular table, and that table is actually split into 10 partitions, then a popular post will be throttled at 100 requests per second at best, even if the throughput allocated to the other partitions is never consumed.
The challenge is that, although partitioning is well documented, the AWS console does not expose the number of partitions in a DynamoDB table. In order to anticipate throttling of hot keys, Medium therefore calculates the number of partitions it expects for each table, using the formula described in the AWS documentation, and then estimates the throughput limit of each partition by dividing the table's total provisioned capacity by the expected number of partitions.
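The article defers to the AWS documentation for the formula itself. The version AWS historically published estimated partition count from both provisioned capacity (roughly 3,000 RCU or 1,000 WCU per partition) and data size (roughly 10 GB per partition); treat the exact constants here as an assumption, since AWS has since removed them from the docs. A sketch under that assumption:

```python
import math

def estimated_partitions(rcu, wcu, size_gb):
    """Estimate a table's partition count using the formula AWS
    historically documented (constants are assumptions, see above)."""
    by_capacity = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(size_gb / 10)
    return max(by_capacity, by_size)

def per_partition_read_limit(rcu, partitions):
    # Provisioned throughput is divided evenly across partitions,
    # so this is the best case for a single hot key.
    return rcu / partitions

# Hypothetical table matching the article's example: 1,000 provisioned
# reads per second spread over 10 partitions (here, a ~95 GB table).
parts = estimated_partitions(rcu=1000, wcu=100, size_gb=95)
print(parts, per_partition_read_limit(1000, parts))  # -> 10 100.0
```

Dividing 1,000 provisioned reads per second by 10 partitions recovers the article's best-case limit of 100 requests per second for a single hot post.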
Next, Medium logs each request and feeds the log into an ELK stack (Elasticsearch, Logstash, and Kibana) so that the team can track the hottest keys. At one point, for example, a single post on Medium was receiving more requests per second than the next 17 combined. If the number of requests per second for a post starts to approach the estimated per-partition limit, the team can take action to increase capacity.
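The article doesn't show what these request logs contain. A hypothetical sketch of per-request JSON logging (all field names are illustrative) that a shipper such as Logstash could index into Elasticsearch, where an aggregation on the key field surfaces the hottest keys:

```python
import json
import logging
import time

logger = logging.getLogger("dynamodb-requests")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_request(table, key, operation):
    """Emit one JSON line per DynamoDB request; a log shipper
    (e.g. Logstash) can then index these for hot-key analysis."""
    logger.info(json.dumps({
        "timestamp": time.time(),
        "table": table,
        "key": key,              # e.g. the post ID being read
        "operation": operation,  # GetItem, Query, ...
    }))

# Hypothetical usage: record every read as it is issued.
log_request("posts", "post-12345", "GetItem")
```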
Note that since partitioning is automatic and invisible, two “semi-hot” posts could land in the same partition; in that case, they may be throttled even before this strategy would predict. Nathaniel Felsen from Medium describes in detail, in this post, how his team tackles the “hot key” issue.

DynamoDB's API automatically retries queries that are throttled, so the vast majority of Medium's throttled requests eventually succeed. Still, since it can be difficult to predict when DynamoDB will throttle requests on a partitioned table, Medium also tracks how throttling affects its users. Using Datadog, Medium created the two graphs below: the top graph, “as seen by the apps,” tracks requests that failed despite retries, while the bottom graph tracks every throttled request “as seen by CloudWatch.” Note that there are about two orders of magnitude more throttling events than failed requests.
![DynamoDB throttling as seen by the apps and as seen by CloudWatch](https://imgix.datadoghq.com/img/blog/how-to-collect-dynamodb-metrics/2-02b.png)
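As a closing note on those automatic retries: the retry budget in the AWS SDKs is configurable. A minimal boto3 sketch (again an assumption; the article doesn't name Medium's SDK) that raises the ceiling before throttling surfaces as an application error:

```python
import boto3
from botocore.config import Config

# Retry throttled requests with exponential backoff up to 10 times
# before surfacing an error to the application.
config = Config(retries={"max_attempts": 10, "mode": "standard"})
dynamodb = boto3.client("dynamodb", config=config)

# A throttled GetItem is now retried transparently; only requests that
# exhaust all attempts raise ProvisionedThroughputExceededException.
item = dynamodb.get_item(
    TableName="posts",                     # hypothetical table
    Key={"post_id": {"S": "post-12345"}},  # hypothetical key schema
)
```

Raising `max_attempts` only widens the window before a throttle becomes a user-facing failure, which is exactly why Medium monitors failures “as seen by the apps” separately from raw throttling counts.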