Database performance

The following brain dump is a quick run through of the performance optimisation and design considerations when using relational databases, particularly in the context of web applications. It doesn't go into full data warehousing techniques, though it mentions some of their denormalisation mechanisms. It also ignores the web side of performance handling. Clearly any caching of content is best done at the highest level, e.g. in a reverse proxy cache such as Squid, so that neither the web application nor the database sees the request. The same principle can be applied down the layers - with caching function decorators or template fragments - so that a query is only passed to the database if absolutely necessary.


Database types


Databases are characterised as two opposite types: OLTP (online transaction processing) - i.e. high edit rate databases - and DSS (decision support systems) - i.e. heavily reporting-oriented, large dataset databases. The database engine settings suited to editing versus querying tend to differ, and everything from data normalisation in the schema design down to storage block sizes on the hard disk pulls in opposite directions for the optimal performance of each type.


Unfortunately in reality databases are most often required to do both, but this division should always be borne in mind, splitting along these lines whenever possible. At the macroscopic scale that could mean setting up entirely separate servers with replication from the OLTP database to a read only DSS one, or switching configurations at night when running reports. At the microscopic scale it could mean writing a stored procedure so that it fetches a dataset into memory and performs all the edits on it there, rather than doing a read then a write for each row (see the sketch below).
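As a minimal PL/SQL sketch of that set based approach (the orders table and its columns are hypothetical): one bulk read into memory, the edits applied in memory, then one bulk write back.

    DECLARE
      TYPE t_ids    IS TABLE OF orders.id%TYPE;
      TYPE t_totals IS TABLE OF orders.total%TYPE;
      l_ids    t_ids;
      l_totals t_totals;
    BEGIN
      -- fetch the working set in one round trip
      SELECT id, total BULK COLLECT INTO l_ids, l_totals
        FROM orders
       WHERE status = 'OPEN';

      -- do the per row edits in memory
      FOR i IN 1 .. l_ids.COUNT LOOP
        l_totals(i) := l_totals(i) * 1.1;
      END LOOP;

      -- write all the changes back in one bulk statement
      FORALL i IN 1 .. l_ids.COUNT
        UPDATE orders SET total = l_totals(i) WHERE id = l_ids(i);
    END;
    /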


Query optimization


Identify the queries that need optimization. This may range from simply noting how long a query takes to run, to logging SQL timings at the web layer, to using full database tools.
Oracle for instance has AWR (the Automatic Workload Repository), which allows the set up of an automated snapshot timings log and HTML report generation filtered by things such as query time, block reads etc.
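For instance, a rough sketch of taking a manual snapshot and producing a report in sqlplus (AWR typically requires the Diagnostics Pack licence):

    -- take a manual AWR snapshot
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
    -- generate an HTML report between two snapshot ids using the
    -- interactive script shipped with Oracle
    @?/rdbms/admin/awrrpt.sql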


The next step is to see what is wrong with the way these problem queries are executing that makes them so slow or resource hungry.
This involves using explain to see which execution plan the database optimizer is choosing for the query. Although not part of the SQL92 standard, some form of this command is available in practically every database, e.g. Postgres, MySQL, Oracle etc.
In Oracle you can also, for example, 'set autot trace exp' (short for set autotrace traceonly explain) in sqlplus, which then just displays the explain plan and cost derived from the stats for each query rather than running it.
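For example (users is a hypothetical table here):

    -- Postgres / MySQL style: show the chosen plan
    EXPLAIN SELECT name FROM users WHERE id = 123;
    -- Postgres: also run the query and report the actual timings
    EXPLAIN ANALYZE SELECT name FROM users WHERE id = 123;

    -- Oracle equivalent
    EXPLAIN PLAN FOR SELECT name FROM users WHERE id = 123;
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- or in sqlplus, show plans and costs without executing the queries
    SET AUTOTRACE TRACEONLY EXPLAIN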


By default the optimizer used will tend to be the cost based one rather than the older rule based one. A cost based optimizer evaluates the possible ways of running the query and caches the plan that gives the lowest cost according to the stats (see below), whilst a rule based one just follows a fixed set of rules for execution. Given the costs involved, appropriate indexes can be added, or an index only table or materialized view used.


These latter features may not be available in all databases. Index only tables suit tables where most or all of the columns are indexed, so the table data can be stored entirely in the index structure rather than in a separate table, removing the performance hit of accessing the table as well as the index. A many to many mapping table is a good candidate for this, as in the sketch below.
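A hedged Oracle sketch of an index organized mapping table (articles and tags are hypothetical tables):

    CREATE TABLE article_tags (
      article_id NUMBER REFERENCES articles(id),
      tag_id     NUMBER REFERENCES tags(id),
      CONSTRAINT article_tags_pk PRIMARY KEY (article_id, tag_id)
    ) ORGANIZATION INDEX;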

Materialized views (mvs) are local tables that are automatically synched with their source tables, based on data changes, time periods, etc. When the source table is used in queries the optimizer will automatically substitute the mv where appropriate. For data change based mvs, an mv log should also be added to speed up refreshes.
Remote database tables, and costly complex queries over a number of local tables, are both good candidates for mvs.

Even if a database does not have mvs, the same approach can be used with ordinary tables: instead of 'create materialized view ... as select <my very time consuming query>' one can use 'create table as ...', then manually add triggers to keep the table acting as an mv up to date, or add cron jobs to refresh it periodically, as sketched below.
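A hedged Oracle sketch, assuming a hypothetical orders table, followed by the plain table emulation:

    -- materialized view of a costly aggregate, fully refreshed nightly
    CREATE MATERIALIZED VIEW order_totals_mv
      REFRESH COMPLETE
      START WITH SYSDATE NEXT SYSDATE + 1
      AS SELECT customer_id, SUM(total) AS order_total
           FROM orders GROUP BY customer_id;

    -- an mv log on the source table enables fast (incremental) refreshes
    CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;

    -- the same idea in a database without mvs: an ordinary table
    CREATE TABLE order_totals AS
      SELECT customer_id, SUM(total) AS order_total
        FROM orders GROUP BY customer_id;
    -- ...then refresh it from triggers on orders, or a periodic cron job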


Note that unnecessary indexes may slow a query down, since they are another object the optimizer has to consider (and every write has to maintain). Once you have got the explain plan looking OK, rerun the actual query to confirm.


Before going more deeply into optimisation it should be remembered that adding ordinary views can be very effective in performance terms, by breaking common queries out into saved definitions that act as pseudo-tables. This also has the benefit of keeping hard coded SQL complexity out of the code outside the database and making the application more maintainable, so it is the first approach that should be taken. A good database abstraction layer of views is especially useful when using data from other applications where the schema is outside your control. In some databases these views can also be indexed, or moved to mvs, for even better performance.
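A minimal sketch, assuming hypothetical customers and orders tables:

    -- a view hiding a hard coded join from the application code
    CREATE VIEW customer_orders AS
      SELECT c.id AS customer_id, c.name, o.id AS order_id, o.total
        FROM customers c
        JOIN orders o ON o.customer_id = c.id;

    -- the application now just queries the pseudo-table
    SELECT name, total FROM customer_orders WHERE customer_id = 42;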


Cost based optimization is based on the stats for a table. Run analyze to manually regenerate these stats - something you may need to do after adding an index - then rerun explain and check whether the full table scans have been minimized. For more complex multiple table queries the first table used (the driving table) is scanned and matching rows are then found in the joined tables (nested loops), so adding indexes or other features should be informed by the order of these loops. Alternatively a sort merge join may be faster than nested loops, and a hash join faster still. All of these will be chosen automatically by the optimizer based on the indexes you add.
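For example, refreshing the stats on a hypothetical orders table:

    -- Postgres: refresh the planner stats for one table
    ANALYZE orders;

    -- Oracle: the older analyze command, or the preferred dbms_stats package
    ANALYZE TABLE orders COMPUTE STATISTICS;
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORDERS');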


With Oracle you can also create your own stats table (normally these stats are generated by analyze) from which the costs are calculated, and Postgres allows manual stats manipulation too. If local stats exist for a table then these are used in place of the default ones. The benefit is that you can modify the stats to test how queries will perform against projected future data, i.e. if the dataset were much larger or the value distribution much different. For example a sequence-populated column has a completely even distribution of one distinct value per row, whilst another column may have 40% of rows equal to one value and another hundred values spread unevenly across the rest. Uneven distributions like this affect performance, so adding a map of the distribution - called a histogram - to your stats helps the optimizer deal with them.
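A hedged sketch of the knobs involved, assuming a hypothetical orders table with an unevenly distributed status column (MY_STATS is just an arbitrary name for the user stats table):

    -- Oracle: gather a histogram on the skewed column
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORDERS', method_opt => 'FOR COLUMNS status SIZE 254');

    -- Oracle: keep stats in your own table so they can be exported and edited
    EXEC DBMS_STATS.CREATE_STAT_TABLE(USER, 'MY_STATS');
    EXEC DBMS_STATS.EXPORT_TABLE_STATS(USER, 'ORDERS', stattab => 'MY_STATS');

    -- Postgres: increase the sample size (and histogram detail) for a column
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
    ANALYZE orders;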


The recommended approach for Oracle and Postgres, once a table is appropriately indexed, is to tune query performance via the stats. The older Oracle option was to add hints, e.g. select /*+ RULE */ blah from foo; - but this is now frowned upon, since it can fix the execution plan into an approach that becomes inefficient as the data changes over time.


Indexes


An index is a separate map of column value vs. row id which the database can check to find the data instead of the table. So if you run a lot of queries like 'select name from table where id = 123', an unindexed table has to do a full table scan, checking the id of every row for 123. If the id column were indexed, the database would just check the index and then pull the matching rows from the table. Finally, if there were an index on (id, name), it could bypass the table altogether and just read the index - which may well be in memory - and so take almost no time at all.
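As a sketch, with a hypothetical users table:

    -- the query to speed up
    SELECT name FROM users WHERE id = 123;

    -- an index on id avoids the full table scan
    CREATE INDEX users_id_idx ON users (id);

    -- a composite (covering) index means the table need not be read at all
    CREATE INDEX users_id_name_idx ON users (id, name);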


Different indexes suit different cases. The default is the BTree, used as standard in both relational and object databases, where it also serves as a storage type (e.g. LargePloneFolders).
BTrees suit queries that access under roughly 15% of a table's rows, where the data is evenly distributed and of high cardinality (many different values). They also come in a reverse key variant (which suits sequentially generated key values), and as function based indexes, which allow an index to be used where a column normally appears wrapped in a function - so it is the function result vs. row id that is mapped.
Bitmap indexes are for low cardinality columns, and are ideal for read only systems and very large tables.
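A few Oracle examples of these variants (the orders and users tables and their columns are hypothetical):

    CREATE INDEX orders_id_rev_idx ON orders (id) REVERSE;            -- reverse key btree
    CREATE INDEX users_upper_name_idx ON users (UPPER(name));         -- function based index
    CREATE BITMAP INDEX orders_status_bmx ON orders (status);         -- low cardinality column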


A key gotcha is that indexes may be disabled accidentally. Watch out for queries that check for nulls, use a function on the indexed column (e.g. in some versions of Postgres and MySQL, mynum = '123' can be treated as an implicit conversion on the column - hence the index isn't used), compare with another column in the same table, or use leading wildcards in like clauses. Also remember that if your index was created automatically by adding a constraint such as a primary key, then removing that constraint will drop the index too.
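A sketch of the kinds of predicate to watch for, assuming a hypothetical users table with indexes on the columns used (whether each one actually disables the index depends on the database and version):

    SELECT * FROM users WHERE num IS NULL;            -- null check
    SELECT * FROM users WHERE num = '123';            -- implicit type conversion
    SELECT * FROM users WHERE UPPER(name) = 'SMITH';  -- function on the indexed column
    SELECT * FROM users WHERE name LIKE '%smith';     -- leading wildcard
    SELECT * FROM users WHERE num = backup_num;       -- comparison with another column in the same table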


Bind Variables, Memory and Pooling


It is important that frequently used queries where just the parameters change are written in exactly the same form - no extra spaces, different field order etc.
One benefit of an ORM or other database abstraction layer is that this generally happens anyway, because the SQL is programmatically generated. If the database engine gets a query that only differs in its parameters then it should recognize this and substitute bind variables for those params (assuming the cursor settings allow this). The query and its best execution plan can then be loaded rapidly from memory, with just the bind variables substituted to get the result.
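A rough sqlplus illustration of the same statement text being reused with different binds (users is a hypothetical table; ORMs and DB-API drivers do the equivalent when you pass parameters separately from the SQL):

    VARIABLE user_id NUMBER
    EXEC :user_id := 123;
    SELECT name FROM users WHERE id = :user_id;
    EXEC :user_id := 456;
    SELECT name FROM users WHERE id = :user_id;   -- reuses the cached plan, only the bind changes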


This is clearly noticeable when you just rerun the same query - the first execution will often take more than double the time of running it again. Generally execution plans are held in global memory, whilst result sets live in a connection session's memory pool, so for some applications it may prove a performance benefit to tie a particular database session to a web session.


For web applications a connection pool should always be used*. Connection creation can take a significant time, especially with a separate database server (i.e. a couple of seconds), and without pooling a web application will recreate a connection for every request - giving very poor performance and risking a DoS-like effect on the database by creating an unlimited number of connections.

Another option is to maintain connections via the web session machinery. This may be a better solution where the number of concurrent users is low and their usage is quite user specific (see above) - e.g. a low usage OLTP web application. It also allows cursor caching, where a cache of the cursors used in queries is kept for a particular user, saving the cost of rerunning queries and pagination commands (rownum or limit etc.) each time.
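On the database side, one related Oracle knob is the session cursor cache, which keeps recently closed cursors ready for reuse; this is only a sketch of that setting - the pooling itself belongs in the connector or framework layer (see the footnote below):

    ALTER SESSION SET SESSION_CACHED_CURSORS = 50;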


* Connection pools are available for most database connectors, or failing that at the framework level. For example in Python I have packaged a pooling connector for django and Oracle, django-oraclepool, which uses cx_Oracle's pooling, whilst the Zope framework pools connections within its object database.
