
Multi-tenancy options

Multi-tenancy is a key part of building a SaaS product. You want to amortize your software investment across different paying customers. Customers should never be able to access any other customer’s data. And it is often the case that customers don’t want their data intermingled with other customers’ data, though that depends on the type of data and the customer needs.

By separating customers into tenants, you can achieve this data separation. There are a number of levels of multi-tenancy. In increasing order of isolation, they are:

  • no multi-tenancy. In this situation, all the customer data is co-mingled in the database. Any data that is unique to a customer is tied to that customer with an id.
  • logical multi-tenancy. Isolation is enforced in code and the database (every table has a ‘tenant id’ column, and every query filters or joins on it). When you use this type of isolation, you want to resolve the tenant as soon as possible, often via a different hostname or path. You also want to ensure that any users accessing tenant data actually belong to that tenant. With this approach, mistakes in your code can be exploited to ‘escape’ the tenant boundary and read other customers’ data. However, one advantage is operational simplicity: one version of the code to maintain and one version of the database. There may be support for this in your framework, or you may be rolling your own.
  • logical multi-tenancy supported by the database. Some databases support row level security. In this scenario, each tenant has a different database user (or session setting), and the data isolation is enforced by the database itself. Your code only has to look up the correct user or tenant context for a given tenant; there’s a brief sketch of this after the list.
  • container level multi-tenancy. In this scenario, you run separate containers for each tenant. If you are using a solution like Kubernetes, you can run them in different namespaces to increase the isolation. The operational complexity increases (I did mention Kubernetes, did I not?) but it becomes far more difficult for an attacker to use the access of one tenant to get another tenant’s data. However, now you can have multiple versions of the codebase running. This can be a blessing and a curse, as it allows each client to control their version (if you enable it). This can increase support burden depending on the complexity of your application. You could also choose to run the latest code on every container, upgrading all containers every time a change is made to your software.
  • virtual machine multi-tenancy. Here you use different databases and virtual machines for each tenant. You can leverage common security defense-in-depth practices at the network level, using network access controls and firewalls. This physical isolation makes it even harder for an attacker to escape and view other tenants’ data. However, it increases your operational costs both in terms of complexity (are you going to force everyone to upgrade across the entire fleet?) and support (there may be configuration and/or code drift between the different VMs). If you pursue this, it behooves you to automate the creation of these virtual machines.
  • physical hardware isolation. With this choice, you actually run different hardware for each tenant, possibly in different data centers. This is the most secure, but the most operationally intensive. There are some options for API driven hardware setup, but the isolation, while a boon for security, makes updates and upgrades more difficult.
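
Here is a minimal PostgreSQL sketch of the logical approaches above, using row level security. The table, column, and setting names are invented for illustration; your framework may offer an equivalent out of the box, and a per-tenant database user is the other common variant.

CREATE TABLE accounts (
  id        bigserial PRIMARY KEY,
  tenant_id integer NOT NULL,  -- every tenant-owned table carries this column
  name      text NOT NULL
);

-- Row level security: the database, not your code, filters rows by tenant.
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON accounts
  USING (tenant_id = current_setting('app.current_tenant')::integer);

-- The application resolves the tenant (from hostname, path, etc.) and sets it
-- once per connection. Connect as a non-owner role: table owners bypass row
-- level security by default.
SET app.current_tenant = '42';
SELECT * FROM accounts;  -- returns only tenant 42's rows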

What is the best option for your SaaS solution? It depends on the security needs of your customers as well as your cost structure and your operational maturity. The higher the level of isolation, the harder it is to run and upgrade the various systems.

Load Testing Weirdness With AWS Aurora

So I was doing a load test and saw behavior that reminded me that sometimes you just need to test.

Ran a test at 1500 requests/second, both with many smaller servers (20ish) and with a smaller number of bigger servers (2-3). Saw some weird behavior: a number of 500 errors (bad gateway). Didn’t see these errors under a lower load.

Looked at the database (an Aurora cluster with a single read and a single write instance) and saw that it was maxed out (CPU pegged, connections at max, couldn’t even connect at times).

Thought I needed to upgrade the database, so I upgraded the write instance. It was late and I failed to notice that the upgrade flipped the read and the write instances. So now the read instance was at the bigger server size and the write instance was at the smaller (original) server size. Then I re-ran the load test and everything went swimmingly (response time under 500 ms, where before it had spiked to 100 secs or more).

Great, problem solved. The larger instance size solved it.

But wait, it didn’t. The app was connecting to the primary endpoint, which is the master write node. I didn’t believe it, so I double checked and matched test times against connection spikes to the db.

So somehow, flipping the cluster to a different primary Aurora instance (but no change in db size) caused a radical change in system behavior under heavyish load for a distributed PHP application.

Mysteries.

Greenfield webapp data storage decision tree

Choices, choices

I saw a discussion about storing data in one of my slack channels and saw a line too good to leave in the slack.

Here’s a data storage decision tree for 95% of applications. Do you have data to durably store?

1. Use an open source relational database

(The original poster specified PostgreSQL, but MySQL/MariaDB are viable alternatives in my mind. Each has different strengths.)

The modern, open source RDBMS is flexible and scalable (even web scale [warning, video has cursing]). It’s free from licensing fees (though you can pay for support). There are turnkey solutions for managing it in the cloud. There are plenty of developers who know how to use it (even more developers know how to use SQL) and many DBAs and sysadmins who know how to tune it. You can scale it out by using read replicas and scale up by buying better hardware (or VMs). Every language has a library that talks to an RDBMS. The database will maintain data integrity. Many tools that non-technical users favor can connect to it (even Excel, if you install the right ODBC driver).

There are plenty of other solutions out there (filesystems, NoSQL variants, XML databases, data lakes). They exist for a reason. For certain problems and at certain scales they are better solutions than the Swiss army knife of an RDBMS. But the default decision should always be an RDBMS, and the onus should be on the other solution to justify its presence. For 95% of problems, your default should be MySQL/MariaDB or Postgres.

Let AWS RDS handle database scutwork

RDS is a service I’ve mentioned in the past, but it’s fantastic. You can outsource large chunks of database administration to AWS. Tasks you can forget about include backups, failover, read only replicas, and OS and DB upgrades.

This is a great fit for spinning up databases for small scale to large scale systems and prototyping.

Things to keep in mind if you start using RDS:

  • The database is launched into a VPC and will have a security group around it. You’ll need to allow IP addresses or security groups access to the port the database is living on or your connections will time out.
  • The database RDS creates is a normal database that you can manage like you can any other database you have set up and installed, but there are certain limitations (for example, no MySQL UDFs). Read the documentation and understand the limitations, but be aware they are constantly changing. I suggest subscribing to the AWS Database blog RDS category for updates.
  • RDS uses EBS under the covers and has the performance constraints of that technology. For the largest scale production systems you’ll want to test before jumping in whole hog.
  • If you are using MySQL or PostgreSQL and are running into concurrency problems, Aurora may be worth evaluating.
  • If you want to have backups past thirty-five days for peace of mind or compliance concerns, you’ll need manual snapshots.
  • RDS only supports certain RDBMS and limits databases to certain sizes. If you want to run anything else on AWS, you will need to self manage your DB on EC2 or look at other data management solutions. Here are some other gotchas.
  • When using RDS you aren’t freed from all database administration tasks. There are still users to manage, indices to add, and queries to tune (see the sketch after this list). Most of your RDBMS skillset is still applicable. You’ll also need to decide when to schedule DB and OS upgrades and backups, and how to size your instances. And you still need to design the architecture of your RDS system, including standbys and read-only replicas, plus other configuration at both the network and database level.
  • You can manage RDS system attributes via CloudFormation, Terraform and the CLI in the same way you manage other AWS infrastructure. That said, an RDS instance is stateful, so you can’t treat it entirely as “cattle”.
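
To make the remaining administration concrete, here is a small MySQL-flavored sketch of the sort of work that stays with you on RDS. The user, schema, and index names are made up for illustration.

-- Access control is still your job: a hypothetical read-only reporting user.
CREATE USER 'reporting'@'%' IDENTIFIED BY 'use-a-real-secret';
GRANT SELECT ON app_db.* TO 'reporting'@'%';

-- So is query tuning: add indexes where your workload actually needs them.
CREATE INDEX idx_orders_customer_id ON app_db.orders (customer_id);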

You can learn more about RDS in the extensive documentation.

Always break Rails migrations into the smallest chunks possible, and other lessons learned

So this was a bit of a sticky wicket that I recently extracted myself from, and I wanted to make notes so I don’t make the same mistake again. I was adding a new table that related two existing tables, and added the following code:

class CreateTfcListingPeople < ActiveRecord::Migration
  def change
    create_table :tfc_listing_people do |t|
      t.integer :listing_id, index: true
      t.string :person_id, limit: 22, index: true

      t.timestamps null: false
    end

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings

  end
end

However, I didn’t notice that the datatype of the person.id column (which is a varchar) was `id` varchar(22) COLLATE utf8_unicode_ci NOT NULL

This led to the following error popping up in one of the non production environments:

2018-02-27T17:10:05.277434+00:00 app[web.1]: App 132 stdout: ActionView::Template::Error (Mysql2::Error: Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation '=': SELECT COUNT(*) FROM `people` INNER JOIN `tfc_listing_people` ON `people`.`id` = `tfc_listing_people`.`person_id` WHERE `tfc_listing_people`.`listing_id` = 42):

I was able to fix this with the following alter statement (from this SO post): ALTER TABLE `tfc_listing_people` CHANGE `person_id` `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL.

But in other environments, there was no runtime error. There was, however, a partially failed migration, masked by some unrelated test failures and a process gap around a team handoff. The create table statement had succeeded, but the add_foreign_key :tfc_listing_people, :people statement had failed.

I ran this migration statement a few times (pointer on how to do that): ActiveRecord::Migration.add_foreign_key :tfc_listing_people, :people. Then, via this SO answer, I was able to find the latest foreign key error message:

2018-03-06 13:23:29 0x2b1565330700 Error in foreign key constraint of table sharetribe_production/#sql-2c93_4a44d:
 FOREIGN KEY (person_id)  REFERENCES people (id): Cannot find an index in the referenced table where the
referenced columns appear as the first columns, or column types in the table and the referenced table do not match for constraint. Note that the internal storage type of ENUM and SET changed in tables created with >= InnoDB-4.1.12, and such columns in old tables cannot be referenced by such columns in new tables.
Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-foreign-key-constraints.html for correct foreign key definition.

So, again, just running the alter statement to change the collation of the tfc_listing_people table worked fine. However, while I could handcraft the fix on both staging and production and did so, I needed a way to have this change captured in a migration or two. I split apart the first migration into two migrations. The first created the tfc_listing_people table, and the second looked like this:

class ModifyTfcListingPeople < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE  `tfc_listing_people` CHANGE  `person_id`  `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL
    SQL

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings
  end
  def down
    remove_foreign_key :tfc_listing_people, :people
    remove_foreign_key :tfc_listing_people, :listings
  end
end

Because I’d hand crafted the fixes on staging and production, I manually inserted a value for this migration into the schema_migrations table to indicate that the migration had been run in those environments. If I hadn’t had two related but different migration actions, I might not have had to go through these manual gyrations.
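
For reference, marking a hand-applied migration as run is just an insert into schema_migrations; the version value below is a placeholder, so substitute the timestamp prefix from your migration’s filename.

-- Placeholder version number: use your migration file's timestamp prefix.
INSERT INTO schema_migrations (version) VALUES ('20180306123456');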

My lessons from this episode:

  • pay close attention to any errors and failed tests, no matter how innocuous. This is a variation of the “broken window theory”
  • break migrations into small pieces, which are easier to debug and to migrate back and forth
  • knowing SQL and having an understanding of how database migrations work (they are cool, but they aren’t magic, and sometimes they leak) was crucial to debugging this issue

A Few SQL Database Links

So, here are two different relational database links that I am currently interested in.

The first is PostgreSQL Exercises. This is a fun way to sharpen your SQL. While I’m not a big user of PostgreSQL, I have enjoyed it the few times I’ve touched it. And the basics of thinking in SQL are useful across any database.

The second is Modern SQL. I was pointed to this site (in particular this page on pivots) after asking a question on a slack (so I can’t link to it). My question was about how to transform a set of event values (X happened Y ago, then Z happened at Y+1, and so on) into counts (X happened N times, Z happened M times). But the entire site is worth reading if you are a SQL nerd, especially the use cases.
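
Roughly, that kind of transformation comes down to grouped counts, with conditional aggregation for the pivoted form. Something like this, using an invented events table:

-- One row per event occurrence.
SELECT event_type, COUNT(*) AS occurrences
FROM events
GROUP BY event_type;

-- Pivoted: each event type becomes its own column.
SELECT SUM(CASE WHEN event_type = 'X' THEN 1 ELSE 0 END) AS x_count,
       SUM(CASE WHEN event_type = 'Z' THEN 1 ELSE 0 END) AS z_count
FROM events;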

Bonus, check out this Twitter thread about using a single SQL database as a data source for multiple microservices. Pragmatism!

DynamoDB: What’s left to manage

AWS recently announced that DynamoDB will now scale read and write capacity automatically.  While there was already a lot of database administration that DynamoDB took care of (backups, underlying infrastructure provisioning), setting the proper capacities initially, and updating them as your application changed, was a key task that fell to the user. No more.

I posted a link to the news to a discussion channel I participate in, and someone asked: “what’s left to manage?”. Drawing from that discussion, here are a few items remaining:

  • Choosing appropriate partition keys.  Make sure access is spread uniformly across them.
  • Choosing the right primary key. Since you typically want to avoid table scans and can only query by primary key, making sure you pick the right one is important.  (I would also call this “data model design”.)
  • Enforcing data integrity, initially and over time.  This is a challenge with every NoSQL solution.
  • Creating the appropriate global secondary indexes for your application.
  • Securing and controlling access to your data.

These are all still important tasks, but DynamoDB is getting easier and easier to use for high performance applications for which NoSQL is a good fit.  (And for which you don’t mind being tied to AWS.)

Restoring a single table from an Amazon RDS backup

When you use SQL, how do you write delete statements at the database prompt?

A delete statement typically looks like this: delete from table_name where column_name = 'foo';. I usually write it in this order:

  1. delete
  2. delete where column_name = 'foo';
  3. delete from table_name where column_name = 'foo';

Even though this is a pain because you have to move back and forth (I really need to look into vi keybindings for mysql), it prevents you from sending this command by accident: delete from table_name;, which deletes all the data in your table.  (Another alternative is to never use the interactive client and always write out your delete statements in a file and run that file to delete data.)
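
Another guard rail, if your tables are transactional (InnoDB in MySQL), is to run interactive deletes inside a transaction so you can check the damage before it becomes permanent. A sketch, using the same table_name example as above:

START TRANSACTION;
DELETE FROM table_name WHERE column_name = 'foo';
-- ROW_COUNT() reports how many rows the last statement affected.
SELECT ROW_COUNT();
-- If the number looks right, COMMIT; if not, ROLLBACK and nothing has changed.
COMMIT;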

But, recently, I did exactly that, because I forgot.  I deleted all the data from one table in our production database.  It was billing data, so rather important.  Luckily, I am using Amazon RDS and had set up backup retention.

I wanted to outline what I did to recover from this.

  • I took a deep breath.
  • I wrote a message on the slack channel documenting what had happened and the possible customer impact.
  • Depending on which data is removed, it’s possible you will want to put the application in maintenance mode and/or inform your customers of the issues.  What I deleted was used rarely enough that I didn’t have to take these steps.
  • I looked at how to restore an Amazon RDS backup.
  • I restored the missing data.
  • I communicated that things were back to normal to internal stakeholders.

Unfortunately, it wasn’t clear how to restore a single table.  I’m used to being able to download a .sql file and hand-edit it, but that’s not an option.  Stack Overflow wasn’t super helpful.  But if there’s any time you want clarity, it’s when you are restoring production data.  You don’t want to compound the problem by screwing up something else.

So, here’s how to restore a single table from an Amazon RDS backup:

  • Note the time just before you deleted the data.  (Another reason the slack message is nice.  chatops ftw.)
  • Start up another instance restored to that point in time.  I named it something obvious like ‘has-data-from-tablename’.
  • Twiddle your thumbs anxiously while the new instance starts up.
  • The instance is put into your default security group (as of this writing) which probably doesn’t allow mysql access.  Make sure you modify this security group to allow access.
  • When the instance is up, do a dump of the table you need: mysqldump -t --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h has-data-from-tablename.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name tablename > restore-table_name.sql; (-t omits the create database/table statements.)
  • If your table has had writes since you deleted everything, you may need to manually pull down the current data from the production system and merge it into restore-table_name.sql; I was able to avoid this step.
  • Load the data using mysql: mysql --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h production.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name < restore-table_name.sql;
  • Review to make sure the data is correct (a quick sanity-check sketch follows this list).
  • Test the application.
  • Update the slack channel, and do any other notifications you need to (customers, internal contacts, etc).
  • Revoke the default security group access you allowed above.
  • Delete the ‘has-data-from-tablename’ instance.
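
The review step can be as simple as comparing row counts against what the dump contained and spot-checking the newest rows. The table and column names below are placeholders:

-- Compare with the row count in restore-table_name.sql.
SELECT COUNT(*) FROM table_name;

-- Spot-check the most recent rows (assuming a created_at column exists).
SELECT * FROM table_name ORDER BY created_at DESC LIMIT 10;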

Note this only works if you caught your mistake within the backup retention window. (Make sure you set that up.)  We aren’t multi AZ or clustered, so I’m not sure how that would affect things.

Happy deep breathing!

Why Use an ETL Tool?

transformation photo
Photo by AlicePopkorn

I’m a big fan of ETL tools.  The one with which I am most familiar is Kettle, aka Pentaho Data Integration.  When I was working for 8z, we used it heavily to pull data from other systems, process it, and update our databases.  While ETL systems are not without their flaws, I think their strengths are such that everyone who is moving data around should consider them.  This is more true now than in the past because there is a lot more data flowing everywhere, and there are several viable open source ETL tools, so you don’t have to spend thousands or tens of thousands of dollars to get started.

What are the benefits of ETL tools?

  • There are pre-built components for common data tasks (connecting to a database, parsing a flat file) that have been tested and debugged by many, many people.  It’s hard to overemphasize how much time this can save, allowing you to focus on business logic.
  • You operate at a higher level of abstraction.
  • There is support for other performance features like parallel jobs that you can configure.
  • The GUI makes data flow obvious.
  • You can write your own components that leverage existing libraries.

What are the detriments?

  • Possible to version control, impossible to merge.
  • Limits of components mean you sometimes have to contort your data flows, or drop down to write your own component.
  • Some components (at least for Kettle) are not open source.
  • You have to roll your own testing framework.  I did.
  • You have to learn another tool.

Don’t re-invent the wheel!  Your data movement problem may very well be a super special snowflake, but chances are it isn’t.  Every line of code you write is another you have to maintain.  When you are confronted with a data movement problem, take a look at an ETL tool like Kettle and see if you can stand on the shoulders of giants.  Here’s a list of open source ETL tools to evaluate.

Cassandra Dev Day Denver

data photo
Photo by justgrimes

Most of my datastore experience is with RDBMSes like MySQL, Oracle and PostgreSQL (though I did work with some key-value stores like BerkeleyDB back in the day). So when a full day, free intro to Cassandra was offered, I jumped on it, even though it was in Centennial. You can view the schedule, speakers and talk synopses for the day. There were two tracks, beginner and advanced. Since I didn’t know anything about Cassandra, I followed the beginner track.

First, though, it was amazing how many people were there. The two main companies behind the talk, Pearson Education and DataStax, a vendor providing a commercial, supported version of Cassandra, ended up having to provide two overflow rooms, and it was still standing room only for some of the talks. Quite a nice turnout, and I think the sponsors were pleasantly shocked. I was also surprised by the number of folks from Boulder. I happened to sit next to two folks from Westminster and Superior, and ended up having a common friend or colleague with each. Small world.

I learned a ton about Cassandra, from its internals, to its topology (the ring’s the thing) to abstractions that let you query it (CQL, which is a subset of SQL) to data modeling to using the java driver, which makes accessing Cassandra almost as easy as using JDBC. While there are some SQL concepts that appear to map fairly well to Cassandra, I put quotes around them below to remind myself of the fact that a Cassandra ‘table’ isn’t the same as a RDBMS table, ditto for ‘row’, ‘primary key’ and other important concepts.

I think the biggest takeaway for me was that Cassandra is a “write many, read once” system. Because you can only query efficiently on one or two keys, if you have multiple queries, you want to write the data multiple times in a denormalized system, one ‘table’ for each query. Because of this, Cassandra shines in use cases where you are doing a lot of inserts, have known queries, and need speed and availability (sensor data was mentioned several times).

How does this actually work? Here’s an example (as best I understand it–here are some others from people who actually have experience using this technology):

If you have click stream data, in standard apache format, and you want to be able to have it stored in a database and highly available for a few specific queries, Cassandra might be a good choice. Here’s a line of my clickstream, from my blog:

157.55.39.40 - - [15/Oct/2014:08:03:30 -0600] "GET /wordpress/archives/date/2007/07 HTTP/1.1" 200 66258 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

This is time-based data with some valuable information, and some not so valuable information. What things might I be interested in querying on? Well, I might care about the user agent, request time, IP address, status code, or URL path requested. I probably don’t care about the HTTP method, HTTP version or the bytes served. But, for the sake of this example, let’s say my application needs to show the location of the most recent 1000 users for a given country, for a fancy widget for my website. I will use MaxMind or a similar service for mapping IP addresses to countries. That’s all I care about. (Yes, I know this is contrived–I had to revise this example a couple of times to make it fit with Cassandra’s usage model.) I would set up this ‘table’ in Cassandra:


CREATE TABLE IF NOT EXISTS location (
  country text,
  time timestamp,
  user_agent text,
  status_code int,
  path text,
  PRIMARY KEY (country, time)
);

In this table, country is the partition key, and time is the clustering key. That means that this query: select * from location where country = 'USA' limit 1000; will be screaming fast. If I wanted to look at paths requested by time, I would not add an index to this table, but instead create another whole table, say request_path. Then my insertion code would write to both location and request_path. Clients that wanted path information would use that table. Yup, denormalization is the name of the game.
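
To make that second table concrete, here is my guess at what request_path might look like; this is a sketch following the same contrived example, not something from the talk. The DESC clustering order is what makes the most recent entries come back first.

CREATE TABLE IF NOT EXISTS request_path (
  path text,
  time timestamp,
  country text,
  user_agent text,
  status_code int,
  PRIMARY KEY (path, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- 'Recent requests for a given path' is now a single-partition read:
SELECT * FROM request_path WHERE path = '/wordpress/' LIMIT 1000;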

This means that Cassandra has certain specific use cases, and that trying to use it as a general purpose data storage and query engine is foolish. Several presenters mentioned that Cassandra plus Apache Spark for general queries was a good solution.

Denormalization as standard operating procedure isn’t the only mind bending facet of Cassandra. Others:

  • The presenters also talked about the high availability and replication of Cassandra–you can actually configure it to be data center aware so that it automatically replicates data across different data centers.
  • For each keyspace (a set of tables, so similar to a schema), you specify a replication factor–how many times each piece of data is stored.
  • For every read or write, you specify how many nodes must either agree or accept the data, respectively.
  • Adding nodes is the preferable way to deal with scale. You can add a node easily, and Cassandra will auto partition data and spread existing data across the new node.
  • The biggest Cassandra setup they mentioned was 75K nodes.
  • Each ‘row’ can have up to 2 billion records. If a row stretches across nodes, you’ll kill performance.
  • There’s a process called ‘compaction’ which is similar to Java’s garbage collection, and just like GC, you have to pay attention to how compaction works, because it will affect performance.

All in all, very interesting day, and I appreciated the experience. One more (interesting) tool to add to the toolbox.