
The right tools aren’t enough.

It’s not enough to have the right tools; you have to use them in the right way.

Over the last several months we’ve noticed that our sending processes (which send emails to our customers) have been running much slower than their theoretical limit, in some cases performing no better than the system they replaced. This was somewhat disturbing, as we’d built the new system specifically to outperform the old one by a factor of at least 5. Seeing that we could barely send as much as before was quite disappointing. We’d definitely had peaks of high capacity, which showed us that our concept was sound; there was just a bottleneck somewhere that we hadn’t been able to track down.

We had settled on redis as a suitable in-memory key-value store for the pieces of our process that we needed to perform most often, and overall we were quite happy with it. Eventually, though, after a few days of dedicated bug snooping, we realized that redis was our bottleneck.

We were quite surprised by this, as it was the last place we would have considered had we not been looking. Test after test showed that it was definitely redis that was eating up all our processing time. Though it was hard to believe, we started looking at each of our methods that used redis to see where the problem might lie.

After a fair number of profiling runs and some well-placed print statements, we tracked the bottleneck down to a single function:

public function isBlacklisted($emailaddress)

This is a function that takes an email address as a parameter and checks to see if the domain of that email address matches any of our blacklisted domains. If it does, the function returns true and no mail is sent to that address.

Now, we store our domain blacklist in redis as a set. In general, it’s very cheap to look something up in a set: a membership check should be a constant-time operation, O(1), regardless of set size. Our domain blacklist also isn’t huge; it’s only about 33,000 entries. Checking for a single entry in the set should be very fast, but that’s not what we were seeing. Every single call to this function was taking between 500 ms and 1 second.

We started looking more closely at what this function was doing, and we came across this:

$blacklist = $this->_redisConn->getAllSetMembers('sender_bad_domains');
return in_array($domainName, $blacklist);

Wait, is that retrieving the WHOLE set from redis? And then doing the member check in PHP?

Yes, that’s exactly what was going on. Every time this function was called (which is *very* often), we asked redis to give us the whole set of blacklisted domains, all 33,000 of them, checked whether the email’s domain was in that list, and returned, only to do the whole thing again the next time the function was called. There wasn’t even a layer of persistence to hang on to that big ol’ list between function calls; we just asked redis for it again, and again, and again.

What we should have been doing instead was something like this:

return $this->_redisConn->IsMember('sender_bad_domains', $domainName);

That lets redis do what it’s made to do, which is check whether the domain we care about is a member of the sender_bad_domains set. Before doing that, however, we decided to confirm that this was really the problem by adding a simple bit of persistence to the list of domains that we fetch from redis. Instead of asking for it every time, we ask for it once and store it in a member variable of our class, then run our in_array against that stored list. This isn’t the ideal solution, but as a quick test it would show us whether we had found the real culprit. Now, instead of asking redis for the full blacklist 2000 times within a single execution of our script, we ask just once. The results?
We saw network traffic leaving the redis server drop from 200Mbps down to 6Mbps and then down to 300Kbps and stay there.

This is much more impressive when you consider the traffic graph:

{F100}

[Figure caption: bugfix]

This change allowed us to get right back up to where our theoretical limit should have been, and improved our sending speed dramatically. Instead of taking up to 15 minutes to process 2000 emails, it now takes 30 seconds.
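The quick test described above, fetching the list once and keeping it in a member variable, might look roughly like the sketch below. It is not our production code: `CountingSetStore` and `CachedBlacklistChecker` are hypothetical names, and the store is an in-memory stand-in that only counts how often the full set is requested (its `sMembers()` method is shaped like the phpredis call of the same name).

```php
<?php
// Hypothetical stand-in for the redis connection. It serves a fixed list
// and counts how many times the whole set is fetched.
final class CountingSetStore
{
    public int $fetches = 0;

    /** @var string[] */
    private array $members;

    public function __construct(array $members)
    {
        $this->members = $members;
    }

    // Returns every member of the set (the key is ignored in this stand-in).
    public function sMembers(string $key): array
    {
        $this->fetches++;
        return $this->members;
    }
}

final class CachedBlacklistChecker
{
    private CountingSetStore $store;

    /** @var string[]|null Blacklist kept between calls; null until first use. */
    private ?array $blacklist = null;

    public function __construct(CountingSetStore $store)
    {
        $this->store = $store;
    }

    public function isBlacklistedDomain(string $domainName): bool
    {
        if ($this->blacklist === null) {
            // Only the first call pays for transferring the whole set.
            $this->blacklist = $this->store->sMembers('sender_bad_domains');
        }
        // Still a linear scan in PHP, as in the original code.
        return in_array($domainName, $this->blacklist, true);
    }
}
```

However many times `isBlacklistedDomain()` is called on one checker instance, the store is asked for the set only once. The `in_array` scan over the cached list remains, which is part of why this is a test and not the real fix.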

I’m sure the astute reader is currently screaming “WAIT, YOU ARE STILL NOT DOING IT RIGHT!” and of course, you are correct. We are still asking redis for the whole set, instead of just asking if the thing we care about is in the set, so there is still room for improvement.
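For completeness, here is a minimal sketch of what doing it right could look like: extract the domain and ask the store a single membership question. The `SetStore` interface, the `InMemorySetStore` stand-in, and the `BlacklistChecker` class are all hypothetical names invented so the example runs without a redis server. Lowercasing the domain is also an assumption; it only makes sense if the blacklist itself is stored in lowercase.

```php
<?php
// Hypothetical minimal interface for "is this value in that set?".
interface SetStore
{
    public function sIsMember(string $key, string $member): bool;
}

// In-memory stand-in so the sketch runs without a redis server.
final class InMemorySetStore implements SetStore
{
    /** @var array<string, array<string, true>> */
    private array $sets = [];

    public function sAdd(string $key, string $member): void
    {
        $this->sets[$key][$member] = true;
    }

    public function sIsMember(string $key, string $member): bool
    {
        return isset($this->sets[$key][$member]);
    }
}

final class BlacklistChecker
{
    private SetStore $store;

    public function __construct(SetStore $store)
    {
        $this->store = $store;
    }

    public function isBlacklisted(string $emailAddress): bool
    {
        $at = strrpos($emailAddress, '@');
        if ($at === false) {
            // Assumption: input without an "@" has no domain to check.
            return false;
        }
        // Assumption: the blacklist stores domains in lowercase.
        $domainName = strtolower(substr($emailAddress, $at + 1));

        // One small question to the store instead of fetching 33,000 entries.
        return $this->store->sIsMember('sender_bad_domains', $domainName);
    }
}
```

With the phpredis extension, a thin adapter around `Redis::sIsMember()` could implement the same interface, so only the domain name and a yes/no answer would cross the network on each call.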

The moral of this particular story is that it doesn’t matter if you are using the latest, fancy, hipster, webscale technology; if you are using it poorly, it won’t help you.

– Gabriel
