Secrets of scaling JIRA and Confluence for real enterprise use

While JIRA and Confluence can be easily installed and used by almost anyone, once they start to grow across several teams with different expectations you will soon be facing their dark side.

As part of my DevOps work at Citrix, I had the chance (or curse) of being handed a few JIRA and Confluence instances to take care of. I was told that taking care of these should not take more than 10% of my time. Almost 4 years later I can say that the initial estimate was a bit off: the real-life figure proved to be around 60-75%.

It is true that this included lots of migrations from other systems, but even leaving those aside, I would say that most of my time is spent nurturing these systems so they can host more projects and more users.

Database

While Atlassian products can run with SQL Server or Oracle databases, you would be a fool not to run them on PostgreSQL.

You may wonder why. It is simple: PostgreSQL is the same database that Atlassian uses in development, testing and production. All their instances run on PostgreSQL, which means that the chance of hitting a database-related bug is far lower on PostgreSQL than on any other database.

If you do not believe me, try some queries on https://jira.atlassian.com or subscribe to the release notes and you will see.

The database is configured to use about 50% of the memory of the machine. Yes, it is running on the same machine as JIRA and Confluence, but that is for convenience and to reduce external dependencies.

Operating system and JVM

I would not advise anyone to use Windows to host a Tomcat-based service, unless they are unable to deal with the command line.

We are running all Atlassian products on Debian (base or an Ubuntu derivative) because there is a third-party APT repository which provides the Oracle JVM. As Oracle did their best to make it harder to install Java and keep it updated, the best option we have found so far is the webupd8 PPA repository, which works very well with Debian too.

Use only JVM 8 and always keep it updated; we have this fully automated. That is not only for security but also for picking up bug fixes that affect the products. The cool part is that it is safe to upgrade the JVM while the products are running, so there is no need to restart them right away.

Virtualized or bare metal

In January 2013 we switched from running Jira and Confluence on a VM to bare metal. It seems that any virtualization system slows them down too much. Imagine that: on the same configuration, Jira indexing went from 3½ hours to about 15-20 minutes.

If you want speed, especially for things that affect the duration of downtime, use SSDs. For storing attachments you can use anything you want, since speed is not important there; we currently store them on a NetApp filer.

Here is the hardware configuration on which we run our main Jira, Confluence and Crowd:

800GB Intel SSDs
160GB RAM, split 50% for database using pg_tune, 20GB Jira, 15GB Confluence, 2GB Crowd, rest being free to be used by the operating system for caching.
32 cores, which are underused, average load is below 3 (~10%)
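The arithmetic behind that split can be checked with a few lines of Python. This is only a sketch using the figures listed above; the function names are made up for illustration, and it treats the JVM heap sizes as the whole JVM footprint, which slightly understates real memory use.

```python
# Memory budget for the single host, using the figures from the list above.
TOTAL_RAM_GB = 160
DATABASE_SHARE = 0.5  # half of the RAM is given to PostgreSQL

# JVM heap sizes in GB, as listed above.
JVM_HEAPS_GB = {"jira": 20, "confluence": 15, "crowd": 2}


def memory_budget(total_gb, database_share, heaps_gb):
    """Return (database_gb, jvm_total_gb, left_for_os_cache_gb)."""
    database_gb = total_gb * database_share
    jvm_total_gb = sum(heaps_gb.values())
    left_gb = total_gb - database_gb - jvm_total_gb
    return database_gb, jvm_total_gb, left_gb


def cpu_utilisation(load_average, cores):
    """Rough utilisation estimate: load average divided by core count."""
    return load_average / cores


if __name__ == "__main__":
    db, jvms, cache = memory_budget(TOTAL_RAM_GB, DATABASE_SHARE, JVM_HEAPS_GB)
    print(f"database: {db:.0f} GB, JVM heaps: {jvms} GB, OS cache: {cache:.0f} GB")
    print(f"CPU utilisation at load 3: {cpu_utilisation(3, 32):.0%}")
```

With these numbers, 80 GB goes to the database, 37 GB to the three JVM heaps, and roughly 43 GB is left for the operating system to use as file cache.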

JIRA has almost 400,000 issues and about 2,500 users, and we are confident that we will not have to upgrade the hardware before we get to 2-3M issues.

Why we are running all these on a single host

All these products have interdependencies, so by keeping them all on the same machine we do not introduce new external dependencies.

We never had a problem with one service taking down the machine; since they are all JVMs, they are already isolated from each other.

Because load spikes almost never happen in parallel, it makes sense to share the CPU power and make better use of it.

Consider that a restart of JIRA or Confluence used to take 15-25 minutes; now we can restart and even upgrade them with a downtime of about 5-6 minutes.

For upgrades we are using an open source tool that I wrote back in 2002, atlassian-updater.py. That is because the Atlassian installers do not seem to be designed to minimize downtime or to keep your customizations in place.

Staging instances

How surprised would you be to find out that we run the staging instances on the same machine as production?

This saved us lots of money and, even more importantly, time, because we can perform a live clone of the production database in less than 5 minutes, and we are talking about databases whose archived dumps are about 1GB each.

The staging instances are upgraded automatically when Atlassian releases a new version, so we are able to test it before putting it in production.
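The post does not say how the clone is made, so the following is only one possible approach, sketched under assumptions: a dump-and-restore pipeline using the standard PostgreSQL command line tools. The database names are hypothetical, and the helper only builds the commands so they can be reviewed before anything is run.

```python
import shlex


def clone_commands(source_db, target_db):
    """Build shell commands for a dump-and-restore clone of a database.

    Assumption: both databases live on the same PostgreSQL server and the
    current user may drop and create the target. pg_dump's custom format
    is piped straight into pg_restore, so no intermediate file is written.
    """
    src = shlex.quote(source_db)
    dst = shlex.quote(target_db)
    return [
        f"dropdb --if-exists {dst}",   # throw away the previous staging copy
        f"createdb {dst}",             # start from an empty database
        f"pg_dump --format=custom {src} | pg_restore --no-owner --dbname={dst}",
    ]


if __name__ == "__main__":
    # "jira" and "jira_staging" are made-up names for illustration.
    for command in clone_commands("jira", "jira_staging"):
        print(command)
```

Because pg_dump works from a consistent snapshot, production can keep serving requests while the copy is taken. The staging application should be stopped while its database is dropped and recreated.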

Q&A

Why not use Atlassian Cloud instead

Atlassian Cloud works best for small to medium clients. You have very limited control over what you can do with the cloud versions, and only a few plugins are available.

Do you use the TAM service aimed at enterprise customers

Currently not.

How about the Data Center editions

While Atlassian managed to build the best issue tracker and wiki solution, I would also say that high availability is not among their core competencies. It took them years to build init.d scripts for their services, and even today we have to use our own versions because the default ones are not able to kill the service if it gets stuck.

Customers had been asking for an HA mode for Atlassian products for years, and when Atlassian finally decided to do something about it, they delivered something that is still far from an HA system. Let me explain: as long as these products cannot be upgraded one node at a time while the other nodes keep serving requests, I would not call them real HA products.

Atlassian released the Data Center editions back in 2014, and these seem to be handy if you want to spread the load across several hosts. Still, I doubt there is any Atlassian instance in the world that cannot cope with the hardware of a single multi-CPU machine. Considering that I am still seeing lots of bug fixes related to Data Center deployments in the products, and that most plugins also have problems with them, this seems like an expensive recipe for spending more money and time in order to lower your uptime.


Most painful thing an admin would have to deal with

Saying NO to JIRA change requests. Once your users discover how many things you can do with JIRA, they will start to bombard you with feature creep.

The amount of entropy you have in JIRA will affect the stability and performance of the system. It is up to you to limit the number of customizations made. Running JIRA in an enterprise cannot be done the same way as running an instance that has only a few projects.

What surprises should I expect

If you have a Google Search Appliance, it could easily take down your Confluence instance.

When it comes to JIRA, be careful about how people are using the JIRA REST API, because they could easily produce a DDoS on the system if they are greedy and do not cache data.

Also, when you have 1000+ users, you will want to feed the logs to a log monitoring tool that can filter the noise out of them, allowing you to spot the entries that do matter. GrayLog does a pretty good job, but installing and using it does take a lot of time.
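What "caching data" could look like on the client side is sketched below: a small time-based cache wrapped around whatever function actually calls the REST API. The class name, the 5-minute default and the stubbed fetch function are all assumptions for illustration; nothing here is part of JIRA itself.

```python
import time


class TTLCache:
    """Cache the results of an expensive fetch function for a fixed time."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch        # function that really hits the server
        self._ttl = ttl_seconds    # how long a cached answer stays valid
        self._clock = clock        # injectable so the cache can be tested
        self._entries = {}         # key -> (expiry_time, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: no request is sent
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def fetch_issue(key):
        # Stand-in for a real REST call; replace with your own client code.
        print(f"requesting {key} from the server")
        return {"key": key}

    issues = TTLCache(fetch_issue, ttl_seconds=300)
    issues.get("ABC-1")  # goes to the server
    issues.get("ABC-1")  # served from the cache
```

A script that polls the same issues or filters every few seconds sends one request per key per TTL window instead of one per poll.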
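Before committing to a full log platform, the idea of filtering noise can be tried with a few lines of Python. This is a rough sketch, not a replacement for a monitoring tool: the severity names assume log4j-style log lines, and the noise patterns are hypothetical examples that you would replace with whatever floods your own logs.

```python
import re

# Keep only lines at these severities (assumes log4j-style level names).
SEVERITY = re.compile(r"\b(WARN|ERROR|FATAL)\b")

# Hypothetical examples of messages that are loud but harmless.
KNOWN_NOISE = [
    re.compile(r"Broken pipe"),
    re.compile(r"deprecated", re.IGNORECASE),
]


def filter_noise(lines, noise=KNOWN_NOISE):
    """Yield warning-or-worse lines that match none of the noise patterns."""
    for line in lines:
        if not SEVERITY.search(line):
            continue
        if any(pattern.search(line) for pattern in noise):
            continue
        yield line.rstrip("\n")


if __name__ == "__main__":
    # Made-up sample lines for illustration.
    sample = [
        "2016-01-01 10:00:00 INFO  user logged in\n",
        "2016-01-01 10:00:01 WARN  method foo is deprecated\n",
        "2016-01-01 10:00:02 ERROR could not reindex issue\n",
    ]
    for line in filter_noise(sample):
        print(line)
```

The same function works on an open log file, since it only iterates over lines.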