CPU Spikes in Azure App Services

We recently ran into buggy behavior with the Azure load balancers which sit in front of an App Service and distribute traffic between the server instances. It manifested as peaky CPU load on one of the horizontally scaled-out instances in the App Service Plan. At seemingly random times, monitoring alerts would fire for dropped requests and request latencies going over SLA thresholds. Each time, the source was the same: one of the app service instances was suffering extremely high CPU load.

Unevenly distributed CPU Load

Initial Triage

To compensate, we increased the number of instances to 5-6 S3 instances (4 CPU, 7 GB RAM, ~400 ACU), costing about €1,500 p/m. Even with 5 instances you can still see the imbalance in the above image. The blue server is peaking at ~85% CPU utilization, while the pink & black instances are completely underutilized. At 4 instances this was even more pronounced, and eventually the blue server would crumple under the load as the number of queued connections and requests increased.

This was bad! Aside from a server instance going down:

  1. The average CPU utilization for the service plan was still comparatively low so auto-scale rules weren’t being triggered to bring additional instances online.
  2. Manually added instances were still prone to the same uneven distribution as pink & black. We were throwing a lot of money at the problem for little gain.

App Service Load Balancer “Logic”

At this point we escalated the issue to Azure Support

Hi Azure Support, you appear to have a problem with your load balancers!

They responded

Nah, we’re good, they’re working as expected, you appear to have a problem with your app!

Frustrating! But it was starting to make sense. Azure load balancers (both the PaaS product, and those which sit in front of App Services) are network-level technology. They use a 5-tuple hash algorithm to decide where to route traffic, based on:

  • Source IP Address
  • Source Port
  • Destination IP Address
  • Destination Port
  • IP protocol number

It’s not a pure, random-distribution, round-robin approach. And it doesn’t take into account things like the application & URL routes of the HTTP service being called.
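As a toy illustration (this is not Azure’s actual hashing algorithm, just a sketch of the idea), a 5-tuple hash router is a pure function of the connection tuple:

```csharp
using System;

public static class FiveTupleDemo
{
    // Toy stand-in for a 5-tuple hash. Azure's real algorithm isn't public;
    // the point is only that routing is a pure function of the connection
    // tuple, with no knowledge of what the request will cost to serve.
    public static int RouteTo(string srcIp, int srcPort, string dstIp, int dstPort, int protocol, int instanceCount)
    {
        var tuple = $"{srcIp}:{srcPort}->{dstIp}:{dstPort}/{protocol}";
        var hash = tuple.GetHashCode();
        // Map the hash onto one of the backend instances
        return (hash & 0x7FFFFFFF) % instanceCount;
    }
}
```

The same connection tuple always lands on the same instance, so a handful of long-lived or expensive connections can pile onto one box while the others sit idle.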

Diving into App Insights

A little more digging and we get to the source of the issue. The app service plan was hosting 2 separate applications, with 2 very different CPU utilization profiles.

  • Application A is an API service doing database calls, CosmosDB calls, and performing a lot of business rule processing on the retrieved data in each request. These were CPU-intensive requests.
  • Application B is a very lightweight traffic redirection service. These requests were extremely light on CPU utilization.

While the overall distribution of requests is being evenly spread across all the server instances, there is a very uneven distribution of Application traffic to each instance.

Unevenly distributed CPU Load by Application

In the above bar chart the green segments are the lightweight click & tracking requests. The blue segments are the CPU-intensive API calls. For this 24-hour period, the 5th server, RD***D3, is getting hammered by CPU-intensive requests (accounting for > 75% of its request count), whereas the 1st server, RD**98, spends most of its time serving simple static content and 302 redirects.

Separate the Apps

The fix was pretty obvious in hindsight, and is alluded to in a single line of Azure documentation:

Isolate your app into a new app service plan when the app is resource intensive

Thankfully our API endpoints were already in two separate .NET projects, deployed to 2 separate app services. If they’d been different endpoints in the same project, this would have been a more time-consuming fix. Moving an App Service to a different plan is straightforward within the same region and resource group.

First, we moved the CPU-heavy API application into its own dedicated App Service Plan. Then we changed the scale-up & scale-out configuration for each of the two app service plans to accommodate their specific CPU utilization profiles.

Old Configuration

Plan 1
    Scale Up:     S3 (4CPU / 7GB Ram / 400ACU)
    Scale Out:    6
    Running Cost: €1,500 p/m

New Configuration

Plan 1
    Scale Up:       P2v2 (2CPU / 7GB Ram / 420ACU)
    Scale Out:      3
    Running Cost:   ~€750 p/m

Plan 2
    Scale Up:       P1v2 (1CPU / 3.5GB Ram / 210ACU)
    Scale Out:      2
    Running Cost:   ~€250 p/m

With these changes we’re seeing a whole host of improvements:

  • CPU Utilization is more evenly distributed
  • We’re making more efficient use of resources on a smaller number of instances
  • Auto Scale rules based on Avg. CPU Utilization are working
  • Fewer total instances are running equating to a 33% cost saving

Eoin Campbell

Building BuyIrish.com

Over the past few days myself & a few colleagues have been chipping away at a little hackathon project to try and help drive consumers towards Irish retailers in the run-up to Xmas. The premise was simple: could we use our existing knowledge and capabilities around product data acquisition and web crawling/scraping to create a search engine/aggregator site allowing a consumer to search for products and gift ideas (incl. price & availability) across lots of Irish retailers and product categories?

The result… https://buyirish.com

Gathering the Data

The first major challenge we had was acquiring the data. Typically, we’ll need to tailor our crawlers on a per-site basis. Is information available in the page source, or dynamically injected by javascript? How is that data presented? Do we need to tailor the parsers to identify specific places on the page to extract content? Do we need to get around anti-bot measures or use web proxies? Thankfully Alan & our DAX (Data Acquisition) team are really on the ball and they came up with a clever solution using a combination of services to gather the data in a fairly generic manner.

A master list of scraped datasets was maintained centrally, and each time we added an additional retailer, their dataset was appended to this master list. This allowed us to build an extremely simple .NET Core console tool which performed the following logic:

using var webClient = new WebClient();

var rJson = webClient.DownloadString(_masterListUri);
var r = JsonConvert.DeserializeObject<RetailerRoot>(rJson);

foreach (var retailer in r.Retailers)
{
    var pJson = webClient.DownloadString(retailer.Uri);
    var p = JsonConvert.DeserializeObject<Products>(pJson);

    foreach (var product in p.Products)
    {
        //Send it to somewhere I guess ¯\_(ツ)_/¯
    }
}

A Rough Design

We knew that to get this up and going quickly we didn’t want to start muddling around with an Azure SQL instance, or have to start modelling and pushing migrations to a relational DB with something like Entity Framework. And we already use CosmosDB extensively in our core platform. Our production APIs serve millions of requests daily from Cosmos, so we knew it would be a good candidate for direct storage and could deal with requests at scale. BUT… we also knew that the equivalent of the following query wasn’t going to get great results from a relevancy point of view.

SELECT     *
FROM       data
WHERE      data.ProductName LIKE '%term%'
OR         data.Description LIKE '%term%'
OR         data.Tags LIKE '%term%'

A little light bedtime reading later, and we thought we had our answer. We could bulk-load the data directly into Cosmos and then point an Azure Cognitive Search instance at the CosmosDB container. ACS would keep its own index up to date based on a high-watermark timestamp check every 60 minutes. It would also give us the benefit of result relevancy scoring, and the ability to tweak the scoring profiles if needed.

Bulkloading into CosmosDB

Getting the data into Cosmos proved exceptionally simple with the new v3 Cosmos SDK. One of the first tests performed involved bulk-upserting 50,000 JSON documents into the container, and it only took fractions of a second. It’s also very easy to scale the throughput provisioning up & down on the fly via code. You can check out the v3 SDK samples here

public async Task BulkTest(IEnumerable<Item> items)
{
    var clientOptions = new CosmosClientOptions { AllowBulkExecution = true }; //Enable Bulk Execution
    using var client = new CosmosClient(_endpoint, _authKey, clientOptions);
    var database = client.GetDatabase(_databaseName);
    var container = database.GetContainer(_containerName);
    var requestOptions = new ItemRequestOptions { EnableContentResponseOnWrite = false }; //Blind Upserts, don't get a hydrated response

    await container.ReplaceThroughputAsync(10000); //On-the-fly Throughput Scale Up

    var concurrentTasks = new List<Task>();
    foreach (var item in items)
    {
        concurrentTasks.Add(container.UpsertItemAsync(item, new PartitionKey(item.PartitionKey), requestOptions));
    }

    await Task.WhenAll(concurrentTasks);
    await container.ReplaceThroughputAsync(1000); //On-the-fly Throughput Scale Down
}

Creating an Azure Cognitive Search Index

With the data safely in Cosmos, next we set about setting up the Azure Cognitive Search instance. Configuring this was a breeze. You can create a new instance directly from the CosmosDB resource, and there’s a walk-through wizard to get the indexer set up using an hourly high-watermark check on the _ts timestamp

One thing that took a little bit of trial and error was the composition of the index, in terms of what should be retrievable, searchable, orderable and facetable. More than once we had to purge/drop and recreate the index, because the configuration can’t be modified once the index is created. This is very manageable with a small initial dataset of a few hundred thousand products, but I can imagine it would be more work in production, where we have ~10^7 product updates happening every day.
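For reference, an index definition is roughly a set of fields with those attributes toggled per field. The field names below are illustrative, not our actual schema:

```json
{
  "name": "products-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "retrievable": true },
    { "name": "productName", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "description", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "retailer", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "price", "type": "Edm.Double", "sortable": true, "retrievable": true }
  ]
}
```

Because these per-field attributes are baked in at creation time, getting them wrong means the purge/drop-and-recreate dance described above.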

It’s worth mentioning that a number of my friends also recommended Elasticsearch as an alternative to ACS. It’s definitely on the bucket list to read up on, and one feature that ACS is sorely missing is the concept of consistently random results, which would have been nice in order to give a fair distribution of views to similar products across multiple retailers, or to build an “Inspire Me” function to return completely random results from a * search.

Serving the results

While I continued to get my arms around Cosmos & ACS, my colleague Daniel was busy getting the website up and running. For simplicity’s sake, and since it was what we’re both most familiar with, the front-end was assembled using a vanilla .NET 4.8 MVC5 project with a Bootstrap theme and some custom CSS/JS.

The use case is pretty simple.

BuyIrish Site

When the consumer arrives on the site, they can search for a term and ACS will return the most relevant products in order, based on its internal scoring algorithm and a tweaked scoring profile we’ve provided. (More on that below.)

The search query returns 6 results at a time, and an infinite-scroll JavaScript plugin handles fetching the next paginated set of product cards for that search term. In addition to the search results, the ACS response contains a facet result list based on the retailers that carry those products. These are displayed on the left-hand side (including a result count per retailer), and if the consumer wants to narrow the results to just one retailer they can click to filter. Finally, the consumer can also press “Inspire Me” and a random keyword will be chosen to provide a selection of different products.

Tweaking the algorithm

After some initial testing we noticed some discrepancies in the results. The problem was that some retailers provided extremely verbose product descriptions which might repeat a search term multiple times, while another retailer with a more relevant product might only mention the term once in the product title.

For example, if a consumer searched for “ACME Phone” you might have 2 different products at 2 different retailers. The 1st product is more relevant, whereas the 2nd gets a better hit-rate based on keyword prevalence.

Retailer     Name          Tags          Description
Retailer 1   ACME Phone    ACME, Phone   This is a phone
Retailer 2   Phone Cover   ACME, Phone   Works with ACME Phone Model X, ACME Phone Model Y, ACME Phone Model Z, ACME Phone Model Q

The solution was to provide a scoring profile which overrides the weights for these fields, giving a much higher weighting to term occurrences in the product name than in the product description.
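A scoring profile lives in the index definition and looks roughly like the following; the field names and weight values here are illustrative, not our exact tuning:

```json
{
  "scoringProfiles": [
    {
      "name": "boostProductName",
      "text": {
        "weights": {
          "productName": 5,
          "tags": 2,
          "description": 1
        }
      }
    }
  ],
  "defaultScoringProfile": "boostProductName"
}
```

With weights like these, one occurrence of “ACME Phone” in a product name outscores several repetitions buried in a verbose description.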

Stats, Stats, Stats

We wanted to get some very lightweight metrics on two things initially.

  1. What are people searching for?
  2. What are people clicking on?

To achieve this, we created a very lightweight click handler in the app that performs the 302 redirect to the retailer site. Every product is assigned a GUID formed from an MD5 hash of its product URL as it’s ingested into the system. This allows us both to confirm uniqueness for upserts and to quickly retrieve the Cosmos document from the click handler and redirect the consumer to the product details page URL.

var result = await _cosmosService.GetItemAsync(retailer, productId);

if (result != null)
{
    var properties = new Dictionary<string, string>
    {
        { "et", "click" },
        { "id", productId },
        { "pn", result.ProductName },
        { "rt", retailer },
        { "ts", DateTime.UtcNow.ToString() }
    };

    _telemetryService.TrackEvent("click", properties, null);

    return Redirect(result.Address);
}

return null;
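The GUID-from-URL scheme mentioned above can be sketched as follows (the method name is mine, not our actual code). An MD5 digest is 16 bytes, which is exactly the size of a GUID, so each URL maps deterministically onto an id:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ProductId
{
    // Derive a stable GUID from a product URL. The same URL always yields
    // the same GUID, so repeated ingestion runs upsert rather than duplicate.
    public static Guid FromUrl(string productUrl)
    {
        using var md5 = MD5.Create();
        var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(productUrl));
        return new Guid(hash); // 16-byte hash -> 16-byte GUID
    }
}
```

Determinism is the whole point: the ingestion tool and the click handler can both compute the id from the URL without any shared lookup table.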

The statistics themselves are then persisted directly to App Insights using the Application Insights SDK as customEvents, with a dictionary of metadata attached. This is a nice quick solution (ignoring the default 90-day data retention issue) as it allows us to write some quick Kusto queries to see how things are performing.

We’ve also added App Insights and Google Analytics to the front end to capture generic usage information as well.

customEvents
| where timestamp > ago(8h)
| where name == 'click'
| extend retailer = tostring(tolower(customDimensions["rt"]))
| extend product = tostring(tolower(customDimensions["pn"]))
| summarize sum(itemCount) by retailer // or by product, or both
| order by sum_itemCount desc

Some outstanding bugbears

One thing which is still causing some headaches is fuzzy search. Azure Cognitive Search supports the Lucene query syntax, so it should be possible to use keyword modifiers like ~ to specify fuzzy matching on certain words. This, however, led to spurious results. While beneficial for searches like tshirt~ to find results for t-shirt, it caused much poorer results for misspellings or keywords that clearly weren’t covered by any retailer. hurling~ led to hits for Haflinger horse-related products, and attempting to supply numeric modifiers like hurling~1 tanked the results entirely.

The //TODO List

These types of hackathon projects are great. They really highlight how quickly you can get something live. But they also quickly highlight why doing things right, in a maintainable fashion, is important. Right now this solution is missing a lot of “little” things which, combined, add up to a far more mature solution. We’ll see how things go over the coming days and weeks and maybe, if it gets some traction, we’ll revisit to look at the following.

  • Add CI/CD and a build/release pipeline to automate the deployment
  • Setup a non-production environment for testing
  • Add support for Soft-Deletion to Cosmos & ACS
  • Add support for selective re-running of specific retailers
  • Move the ETL Console Tool -> Azure Functions
  • Make the Search a little more robust and forgiving to spelling mistakes/relevancy issues
  • Add additional horizontally scalable instances
  • Swap out the free 90 day SSL for a WildCard SSL Cert (or better a LetsEncrypt config)
  • Add App Insights Continuous Export + Stream Analytics to get Click/Search data into our Synapse environment
  • Migrate the web app from .NET Framework + MVC5 => .NET Core + MVC6
  • Migrate the static content to Blob Storage

New Tech is Fun

Overall, this was a fun little project. It’s always nice to get out of the day-to-day JIRA backlog and explore some new technology, and I can definitely see us having a use for Azure Cognitive Search at some point in the future on our product roadmap. Thanks to everyone who chipped in to get it built: Daniel P, Alan, Daniel G, Bogdan, Dorothy, Enda & John.

Eoin Campbell

Data Partitioning Strategy in Cosmos DB

Over the past 6 months, I’ve been overseeing the design and implementation of a number of new projects for ChannelSight’s Buy It Now platform. One of the key technologies at the center of these projects is Cosmos DB, Microsoft’s globally distributed, multi-model database.

We’ve been moving a variety of services, which previously sat on top of SQL Azure, over to Cosmos DB for a variety of different reasons: performance, the ability to scale up/down fast, load distribution; and we’re very happy with the results so far. It’s been a baptism by fire, involving a little bit of trial and error, and we’ve had to learn a lot as we went along.

This week I ran a workshop for our dev team on Cosmos DB. Partitioning was the area we spent most time discussing, and it is probably THE most important thing to spend time on, during the planning stages, when designing a Cosmos DB container. Partitioning defines how Cosmos DB internally divides and portions up your data. It affects how it’s stored internally. It affects how it’s queried. It affects several hard limits you can hit. And it affects how expensive the service is to use.

Getting your partitioning strategy correct is key to successfully utilizing CosmosDB; getting it wrong could end up being a very costly mistake.

Microsoft provides some guidance on partitioning, but that didn’t stop me from making a number of errors in my interpretation of what a good partitioning strategy should look like and why.

What is partitioning?

First, it’s important to understand what partitioning is in relation to Cosmos DB, and why it needs to partition our data. Cosmos DB lets you query your data with very low latency at any scale. In order to achieve this, it needs to spread your data out among lots of underlying infrastructure along some dimension that you specify. This is your partition key. All rows or documents which share the same partition key will end up stored on, and accessed from, the same logical partition. Imagine a very simple document like so:

{
    "id": "1",
    "name": "Eoin",
    "city": "Dublin"
}

If you decided to specify /city as the partition key, then all the Dublin documents would be stored together, all the London documents together, and so on.

How is Cosmos DB structured?

Before we get down to the level of partitions, let’s look at the various components that make up your Cosmos DB account.

Cosmos DB Account Structure

  • The top layer in the diagram represents the Cosmos DB account. This is analogous to the server in SQL Azure
  • Next is the database. Again, analogous to the database in SQL Azure
  • Below the database, you can have many containers. Depending on the model you’ve selected, your container will be either a Collection, a Graph or a Table
  • A container is served by many physical partitions. You can think of a physical partition as the assigned metal that serves your container. Each physical partition has a fixed amount of SSD-backed storage and compute resources; and these things have physical limitations
  • Each physical partition then hosts many logical partitions of your data
  • And each logical partition, in turn, holds many items (documents, nodes, rows)

We’re using Cosmos DB with the SQL API, and since we’re coming from a SQL Azure background, there is somewhat of a SQL/relational slant on the opinions below. I’ll be talking about Collections and Documents rather than speaking more generically, but the same principles apply to a Graph or a Table.

When you create a new collection, you have to provide it with a partition key. This tells the database along what dimension of your data (what property in your documents) it should split the data up. Each document with a differing partition key value will be placed in a different logical partition. Many logical partitions will be placed in a single physical partition. And those many physical partitions make up your collection.

Partition Limitations

While logical partitions are somewhat nebulous groupings of like documents, physical partitions are very real and have 2 hard limits on them.

  1. A physical partition can store a maximum of 10GB of data
  2. A physical partition can facilitate at most 10,000 Request Units (RU)/s of throughput.

A physical partition may hold one or many logical partitions. Cosmos DB will monitor the size and throughput limitations for a given logical partition and seamlessly move it to a new physical partition if needs be. Consider the following scenario where two large logical partitions are hosted in the one physical partition.

Physical Partition   Logical Partition   Current Size   Current Throughput   OK
P1                   /city=Dublin        3GB            2,000 RU/s           ✓
P1                   /city=London        6GB            5,000 RU/s           ✓

Cosmos DB’s resource manager will recognise that the entire P1 partition is about to hit a physical limitation. It will seamlessly spread these two logical partitions out to two separate physical partitions which are capable of dealing with the increased storage and load.

Physical Partition   Logical Partition   Current Size   Current Throughput   OK
P2                   /city=Dublin        5GB            4,000 RU/s           ✓
P3                   /city=London        7GB            8,000 RU/s           ✓

However, if a single logical partition attempts to grow beyond the size of a single physical partition, then you’ll receive an error from the API: "Errors":["Partition key reached maximum size of 10 GB"]. This would obviously be very bad, and you would need to reorganise and repartition all this data, breaking it down into smaller partitions by a more granular value.

Physical Partition   Logical Partition   Current Size   Current Throughput   OK
P2                   /city=Dublin        5GB            4,000 RU/s           ✓
P3                   /city=London        10GB           10,000 RU/s          ✗

Microsoft provides some information here on how the number of required partitions is calculated, but let’s look at a practical example.

  • You configure a collection with 100,000 RU/s capacity T
  • The maximum throughput per physical partition is 10,000 RU/s t
  • Cosmos allocates 10 physical partitions to support this collection N = T/t
  • Cosmos allocates key space evenly over 10 physical partitions so that each holds 1/10 of logical partitions
  • If a physical partition P1 approaches its storage limit, Cosmos will seamlessly split that partition into P2 and P3, increasing your physical partition count N = N+1
  • If you return later and increase the throughput to 120,000 RU/s T2 such that T2 > t*N, Cosmos will split one or more of your physical partitions to support the higher throughput

Data Size, Reads & Writes

In an ideal situation, your partition key should give you several things

  1. An even distribution of data by partition size.
  2. An even distribution of request unit throughput for read workloads.
  3. An even distribution of request unit throughput for write workloads.
  4. Enough cardinality in your partitions that, over time, you will not hit those physical partition limitations

Finding a partition strategy that satisfies all of those goals can be tricky.

On the one extreme, you could choose to place everything in a single partition, but this puts a hard limit on how scalable your solution is, as we’ve seen above. On the other hand, you could put every single document into its own partition, but this might have implications if you need to perform cross-partition queries, or utilize cross-document transactions.

So what is an appropriate partition strategy?

When building a relational database the dimensions upon which you normalize or index your data tend to be obvious. 1:Many relationships are obvious candidates for a foreign-key relationship; any column that you regularly apply a WHERE or ORDER BY clause to becomes a candidate for an index. Choosing a good partition key isn’t always as obvious and changing it after the fact can be difficult. You can’t update the partition key attribute for a collection without dropping and recreating the collection. And you can’t update the partition key value of a document, you must delete and recreate that document.

All of the following are valid approaches in certain scenarios but have caveats in others.

Partitioning by Tenant Id or “Foreign Key”

One candidate for partitioning might be a tenant Id, or some value that’s an obvious candidate for an important foreign-key entity in your RDBMS. If you’re building a product catalog, this might be the Manufacturer of each product. But if you don’t have an even distribution of data/workload per tenant, this might not be a good idea. Some of our clients have 100 times more data and 1,000 times more traffic than others. Using manufacturer Id uniformly across the board would create performance bottlenecks for some of the bigger clients, and we would very quickly hit storage limits for a single physical partition.

Single Container vs. Multiple Containers

Another option for dividing data along the “Tenant” dimension would be to first shard your application into one container per tenant. This allows you to isolate each tenant’s data in completely separate collections, and then subsequently partition that data along other dimensions using more granular data points. This also has the benefit that a single client’s workload won’t impact your other clients. This did not make sense for us: with a 1,000 RU minimum per collection, the majority of our smaller clients would never hit that limit, and we couldn’t have passed on the cost of standing up that many collections.

Partitioning by Dates & Times

You could also partition your data by a Date or DateTime attribute (or some part thereof). If you have a small, consistent write workload of timeseries data, then partitioning by some time component (e.g. yyyy-MM-dd-HH) would allow you to subsequently query or fetch sets of data efficiently in 1-hour windows. Often, however, this kind of timeseries data is high volume (audit logs, HTTP traffic) and as such you end up with an extremely high write workload on a single partition (the current hour) while every other partition sits idle.

Therefore it often makes more sense to partition your data (logs) by some other dimension (process id) to distribute that write workload more evenly.

Partitioning by a Hybrid Value

Taking the above into consideration, the answer might involve some sort of hybrid value, mixing data points from several different attributes of your document.

An application audit log for your platform might partition data by {SolutionName}/{ComponentName} so that you can efficiently search logs for one area of your system. If the data is not needed long-term, you could specify time-to-live values on the documents so that they self-expire after a rolling period of days.

HTTP traffic logs, for impression and click data, might be partitioned by {yyyy-MM-dd}/{client}/{campaign} so that data and write workloads are partitioned at the level of an individual client and individual marketing campaign for a given day. You can then efficiently query that data for specific date ranges, clients and campaigns for reporting aggregation later.
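As a sketch, building such a hybrid key is just deterministic string composition (the helper name here is hypothetical):

```csharp
using System;

public static class PartitionKeys
{
    // Hypothetical helper: compose a {yyyy-MM-dd}/{client}/{campaign}
    // partition key so writes spread across clients and campaigns,
    // rather than all landing on the current time bucket.
    public static string ForTrafficLog(DateTime utcTimestamp, string clientId, string campaignId)
        => $"{utcTimestamp:yyyy-MM-dd}/{clientId}/{campaignId}";
}
```

For example, PartitionKeys.ForTrafficLog(new DateTime(2020, 11, 27), "client-a", "xmas") yields "2020-11-27/client-a/xmas", which doubles as an efficient filter for date-range reporting per client and campaign.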

Dynamic Partitioning for Multiple Documents Types

For our solution we had a very specific requirement for our product search query. For a given Manufacturer’s product SKU, we wanted to look up all the retailers that carried that product. In the end we settled on the following strategy:

Put all documents in a single collection

We started out with essentially two types of documents

  1. A Product document, which contained a SKU & Manufacturer data
  2. A Retailer Data document, which contained a reference id to the Product document

Use a common base entity for all documents

We then implemented a small abstract base class which all documents would inherit from. The PartitionKey string property is used as the partition key for the entire collection.

public abstract class CatalogBaseEntity
{
    [JsonProperty(PropertyName = "id")]
    public string Id { get; set; }

    public string PartitionKey { get; set; }

    public abstract string Type { get; }
}

Use different value sets for the Partition Key based on Document Type

For our Product documents, the value of the partition key is set to Manufacturer-<UniqueManufacturerId>. Since the product metadata for a single product is quite small, we’ll never hit the 10GB storage cap for a single manufacturer.

For our RetailerData document, the value of the partition key is set to Product-<ProductDocumentId>.
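Put together, the two document types might be sketched like so (property names are illustrative, and the base class mirrors CatalogBaseEntity above; the real entities carry more data):

```csharp
// Minimal mirror of the base class shown earlier
public abstract class CatalogBaseEntity
{
    public string Id { get; set; }
    public string PartitionKey { get; set; }
    public abstract string Type { get; }
}

public class ProductEntity : CatalogBaseEntity
{
    public override string Type => "Product";

    public ProductEntity(string uniqueManufacturerId)
    {
        // All products for one manufacturer share a logical partition
        PartitionKey = $"Manufacturer-{uniqueManufacturerId}";
    }
}

public class RetailerDataEntity : CatalogBaseEntity
{
    public override string Type => "RetailerData";

    public RetailerDataEntity(string productDocumentId)
    {
        // All retailer records for one product share a logical partition
        PartitionKey = $"Product-{productDocumentId}";
    }
}
```

Because each document type derives its key from a different attribute, both lookups described below stay within a single logical partition.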

Querying the API for Product data

We now have a very efficient search system for our product data.

When our API receives a SKU query, we’ll first do a lookup for a Product document for a single manufacturer + sku. This is a single, non-partition-crossing query.

Next, we take the Id of that Product document and do a subsequent query for all the associated Retailer Data documents. Again, since this data is partitioned by the Product.Id, it’s a non-partition-crossing query limited to a finite set of results.

Hopefully that was a useful insight into partitioning data in Cosmos DB. I’d love to hear other people’s experiences with it. And if there’s anything I’ve misinterpreted, drop me a note in the comments so I can correct it. Like I said, this has been a big learning experience for us here.

Eoin Campbell

You’re writing a console app and you want to continue accepting input from the user over multiple lines until they stop typing. Essentially, “How do I do ReadToEnd() on the command line?” Alternatively, you want to be able to redirect input from another file.

Turns out it’s quite easy to do.

class Program
{
    static void Main(string[] args)
    {
        using (var sr = new StreamReader(Console.OpenStandardInput(), Console.InputEncoding))
        {
            var input = sr.ReadToEnd();
            var tokens = input.Replace(Environment.NewLine, " ").Split(' ');
            Console.WriteLine($"Tokens: {tokens.Length}");
        }
    }
}

For the interactive example you’ll have to terminate the input with a CTRL-Z (or a CTRL-D on Linux).

And you can now redirect/pipe to STDIN from other files.

Get-Content .\input.txt | .\stdin-test.exe

Eoin Campbell


Qluent is a Fluent Queue Client for Azure storage queues

Qluent is a simple fluent API and set of wrapper classes around the Microsoft Azure Storage SDK, allowing you to interact with storage queues using strongly typed objects, like in the code snippet below. You can see lots of other ways to use it in the documentation on GitHub.

var queue = await Builder
await queue.PushAsync(new Entity());

So why did I build this?

Back in March, some colleagues and I ran into an issue with a legacy project that we’d inherited at work. At random times a queue consumer would just get stuck and stop dequeuing messages. When we went to debug it we discovered the code responsible for dequeueing, processing and deleting the message was buried in an assembly, and the source code was … unavailable :confounded:.

After much hair-pulling and assembly decompilation, we eventually tracked down the bug, but it got me thinking about a couple of things:

  1. Setting up an Azure storage queue through the SDK is a little tedious. There’s quite a bit of ceremony involved to create a CloudStorageAccount, a CloudQueueClient and a CloudQueue, to ensure it exists, and to deal with serialization/deserialization.

  2. There are some aspects of the SDK I dislike. Specifying the large majority of settings on the methods (such as message visibility timeouts etc…), rather than as configuration on the CloudQueueClient itself, seems wrong. It leaves lots of sharp corners for the developer to get caught on after they fetch a queue from DI and want to interact with it.

  3. There are lots of tricky scenarios to account for even in simple messaging use cases, such as idempotency issues, handling retries and dealing with poison messages.

  4. Developers shouldn’t need to worry about writing consumers/dispatchers. They should just need to worry about getting their message handled.

The Goal: Keep it simple

What I really wanted to provide was a very simple fluent API for creating a CloudQueue and a message consumer around that CloudQueue. Creating a consumer is simply a matter of providing a type, a queue and a message handler, and starting it up.

var consumer = Builder
    .ThatHandlesMessagesUsing((msg) =>
    {
        Console.WriteLine($"Processing {msg.Value.Property}");
        return true;
    });

await consumer.Start();

The library is intentionally meant to simplify things. Often times I’ll find myself having to scaffold something and spending way too long focusing on the infrastructure code to support message queuing when I should be focusing on the actual problem I’m trying to solve. That’s what this is for. It is a simple wrapper around Azure storage queues to make working with them a little easier.

However, there are lots of complicated things you may find yourself needing to do in a distributed environment: complex retry policies; complicated routing paths; pub/sub models involving topics and queues; the list goes on. If that’s the case, then perhaps you should be looking at a different technology stack (Azure Service Bus, Event Hubs, Event Grid, Kafka, NServiceBus, MuleSoft etc…)

Below you can see some of the features that the library supports.


Creating a Queue

Queues can be created by simply specifying a storage account, queue name and a type for your message payload. You can purge the queue and obtain an approximate count of messages from it. All operations are async awaitable, and all support taking a CancellationToken.

var q = await Builder

await q.PurgeAsync(); 

var count = await q.CountAsync();

Basic Push/Pop Operations

Basic queue operations include push, pop and peek for one or multiple messages.

var person = new Person("Eoin");
await q.PushAsync(person);

var peekedPerson = await q.PeekAsync();
var poppedPerson = await q.PopAsync();
IEnumerable<Person> peekedPeople = await q.PeekAsync(5);
IEnumerable<Person> poppedPeople = await q.PopAsync(5);

Receipted Deletes

You can also control the deletion of messages from the CloudQueue using the Get and Delete overrides. Under the hood this uses pop receipts to subsequently remove the message; if the visibility timeout elapses first, the message will reappear on the queue.

var wrappedMessage = await q.GetAsync();

try
{
    //attempt to process wrappedMessage.Value
    await q.DeleteAsync(wrappedMessage);
}
catch (Exception ex)
{
    //message will reappear on queue after timeout
}

Queues can also be configured to support

  • Delayed visibility of messages
  • Message TTLs
  • Visibility timeouts for dequeue events
  • Automatic rerouting of poison messages after a number of dequeue & deserialize attempts
  • Customized object serialization

The message consumer provides a simple way to asynchronously poll a queue. It supports

  • Message Handlers
  • Failed Processing Handlers (Fallback)
  • Exception Handlers
  • Flow control for when exceptions occur (Exit or Continue)
  • Custom Queue Polling Policies
  • Integration with NLog for Logging

And there’s more detailed info in the GitHub repo’s README.md.

Get It on Nuget

I’d really appreciate feedback on it, so if you want to try it out, you can get it on NuGet.

Eoin Campbell
