 | Level: Introductory Gustavo Arocena, Technical manager , IBM Toronto Lab
05 Jun 2003 After you've adjusted your database parameters, maximized parallelism, and fine-tuned your indexes, are you still looking for ways to improve database performance? How about the SQL itself? This article offers some concrete suggestions on how to organize the logic in your SQL stored procedures for optimal response time.
Introduction
There are many sources of advice available on how to tune database systems and applications. Articles like DB2 Tuning Tips for OLTP Applications, previously published on the IBM® DB2® Developer Domain, provide advice on topics ranging from tablespace and index design to memory allocation for buffer pools, exploiting transaction and data parallelism and analyzing query plans. These topics are the ABCs of performance tuning.
However, specific advice on how to organize the logic in the stored procedures themselves with an eye on performance is not that common. That is the kind of advice I provide in this article. Even though the article focuses on SQL procedures, most of the information provided here is applicable to SQL logic embedded in applications or in stored procedures written in other languages.
Background and terminology
Before going into the details, let's first review some basic terminology and concepts related to procedural SQL in DB2. Procedural SQL constructs (such as scalar variables, IF statements and WHILE loops) were introduced in DB2 with the release of DB2 Universal DatabaseTM (UDB) Version 7. Prior releases of DB2 supported C and JavaTM as languages for stored procedures. Version 7 introduced SQL stored procedures, along with a number of other features that facilitate the development of OLTP applications (for example, temporary tables, application savepoints and identity columns).
When an SQL procedure is created, DB2 separates the SQL queries in the procedure body from the procedural logic. To maximize performance, the SQL queries are statically compiled into sections in a package. (For a statically compiled query, a section consists mainly of the access plan selected by the DB2 optimizer for that query. A package is a collection of sections. For more information on packages and sections, please refer to the DB2 SQL Reference, Volume 1.) On the other hand, the procedural logic is compiled into a DLL (dynamically linked library).
During the execution of a procedure, every time control flows from the procedural logic to an SQL statement, there is a "context switch" between the DLL and the DB2 engine. (In DB2 V8, SQL procedures run in "unfenced mode", i.e., in the same addressing space as the DB2 engine. Therefore the context switch we refer to here is not a full context switch at the operating system level, but rather a change of layer within DB2.) Reducing the number of context switches in procedures that are invoked very often (such as procedures in an OLTP application) or that process large numbers of rows (for example, procedures that perform data cleansing) can have a noticeable impact on their performance. Several of the tips in this article aim precisely at reducing these context switches.
The SQL procedural language (commonly referred to as SQL PL) was first allowed only in SQL procedures (DB2 Universal Database V7 GA). Later on (DB2 UDB V7.2), a subset of the language started to be supported in SQL functions and trigger bodies. This subset of SQL PL is known as inline SQL PL. The word "inline" highlights an important difference with the full language. Whereas an SQL PL procedure is implemented by statically compiling its individual SQL queries into sections in a package, an inline SQL PL function is implemented, as the name suggests, by inlining the body of the function into the query that uses it. We'll revisit inline SQL PL later on, along with examples of its use.
Now let's look at some specific things you can do to enhance performance when you're using SQL procedural language.
Avoid multiple statements when just one will do
Let's begin with a simple coding tip. A sequence of single row INSERTs like this one:
INSERT INTO tab_comp VALUES (item1, price1, qty1);
INSERT INTO tab_comp VALUES (item2, price2, qty2);
INSERT INTO tab_comp VALUES (item3, price3, qty3);
|
can be rewritten as:
INSERT INTO tab_comp VALUES (item1, price1, qty1),
(item2, price2, qty2),
(item3, price3, qty3);
|
The multi-row INSERT will require roughly one third of the time required to execute the three original statements. Isolated, this improvement might seem negligible, but if the code fragment is executed repeatedly (for example in a loop or in a trigger body), the improvement can be significant.
Similarly, a sequence of SET statements like:
SET A = expr1;
SET B = expr2;
SET C = expr3;
|
can be written as a single VALUES statement:
VALUES expr1, expr2, expr3 INTO A, B, C;
|
This transformation preserves the semantics of the original sequence if there are no dependencies between any two statements. To illustrate this, consider:
SET A = monthly_avg * 12;
SET B = (A / 2) * correction_factor;
|
Converting the previous two statements to:
VALUES (monthly_avg * 12, (A / 2) * correction_factor) INTO A, B;
|
does not preserve the original semantics because the expressions before the INTO keyword are evaluated "in parallel." This means that the value assigned to B is not based on the value assigned to A, which was the intended semantics of the original statements.
From multiple SQL statements to a single SQL expression
Like other programming languages, the SQL language provides two types of conditional constructs: procedural (IF and CASE statements) and functional (CASE expressions). In most circumstances where either type can be used to express a computation, using one or the other is a matter of taste. However, logic written using CASE expressions is not only more compact, but also more efficient than logic written using CASE or IF statements.
Consider the following fragment of SQL PL code:
IF (Price <= MaxPrice) THEN
INSERT INTO tab_comp(Id, Val) VALUES(Oid, Price);
ELSE
INSERT INTO tab_comp(Id, Val) VALUES(Oid, MaxPrice);
END IF;
|
The condition in the IF clause is only being used to decide what value is inserted in the tab_comp.Val column. To avoid the context switch between the procedural and the dataflow layers, the same logic can be expressed as a single INSERT with a CASE expression:
INSERT INTO tab_comp(Id, Val)
VALUES(Oid,
CASE
WHEN (Price <= MaxPrice) THEN Price
ELSE MaxPrice
END);
|
It's worth noting that CASE expressions can be used in any context where a scalar value is expected. In particular, they can be used on the right-hand side of assignments. For example:
IF (Name IS NOT NULL) THEN
SET ProdName = Name;
ELSEIF (NameStr IS NOT NULL) THEN
SET ProdName = NameStr;
ELSE
SET ProdName = DefaultName;
END IF;
|
can be rewritten as:
SET ProdName = (CASE
WHEN (Name IS NOT NULL) THEN Name
WHEN (NameStr IS NOT NULL) THEN NameStr
ELSE DefaultName
END);
|
In fact, this particular example admits an even better solution:
SET ProdName = COALESCE(Name, NameStr, DefaultName);
|
Exploit the set-at-a-time semantics of SQL
Procedural constructs such as loops, assignment and cursors allow us to express computations that would not be possible to express using just SQL DML statements. But when we have procedural statements at our disposal, there is a risk that we could turn to them even when the computation at hand can, in fact, be expressed using just SQL DML statements. As we've mentioned earlier, the performance of a procedural computation can be orders of magnitude slower than the performance of an equivalent computation expressed using DML statements. Consider the following fragment of code:
DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
OPEN cur1;
FETCH cur1 INTO v1, v2;
WHILE SQLCODE <> 100 DO
IF (v1 > 20) THEN
INSERT INTO tab_sel VALUES (20, v2);
ELSE
INSERT INTO tab_sel VALUES (v1, v2);
END IF;
FETCH cur1 INTO v1, v2;
END WHILE;
|
To begin with, the loop body can be improved by applying the transformation discussed in the last section:
DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
OPEN cur1;
FETCH cur1 INTO v1, v2;
WHILE SQLCODE <> 100 DO
INSERT INTO tab_sel VALUES (CASE
WHEN v1 > 20 THEN 20
ELSE v1
END, v2);
FETCH cur1 INTO v1, v2;
END WHILE;
|
But upon closer inspection, the whole block of code can be written as an INSERT with a sub-SELECT:
INSERT INTO tab_sel (SELECT (CASE
WHEN col1 > 20 THEN 20
ELSE col1
END),
col2
FROM tab_comp);
|
In the original formulation, there was a context switch between the procedural and the dataflow layers for each row in the SELECT statements. In the last formulation, there is no context switch at all, and the optimizer has a chance to globally optimize the full computation.
On the other hand, this dramatic simplification would not have been possible if each of the INSERT statements targeted a different table, as shown below.
DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
OPEN cur1;
FETCH cur1 INTO v1, v2;
WHILE SQLCODE <> 100 DO
IF (v1 > 20) THEN
INSERT INTO tab_default VALUES (20, v2);
ELSE
INSERT INTO tab_sel VALUES (v1, v2);
END IF;
FETCH cur1 INTO v1, v2;
END WHILE;
|
However, the set-at-a-time nature of SQL can also be exploited here:
INSERT INTO tab_sel (SELECT col1, col2
FROM tab_comp
WHERE col1 <= 20);
INSERT INTO tab_default (SELECT col1, col2
FROM tab_comp
WHERE col1 > 20);
|
When looking at improving the performance of existing procedural logic, any time spent in eliminating cursor loops will likely pay off.
Improve cursor performance
If the logic in your stored procedures does require cursors, here are a few things to keep in mind to maximize their performance.
First of all, make sure you don't use an isolation level higher than what you need. The isolation level determines the amount of locking that DB2 applies on the rows that the procedure reads or updates. The higher the isolation level, the more locking DB2 will perform, and therefore, the less concurrency between applications competing for the same resources. For instance, a procedure using Repeatable Read (RR) isolation level will cause share locks on any row it reads, whereas a procedure using Cursor Stability (CS) will only lock the current row of any updatable cursor. The isolation level for SQL procedures can be specified with the DB2_SQLROUTINE_PREPOPTS registry variable. For example, to set the isolation level of SQL procedures to Uncommitted Read (the lowest level, which should be used for procedures that access read-only data), use the following command:
db2set DB2_SQLROUTINE_PREPOPTS="ISOLATION UR"
|
Note: For this setting to take effect, the db2 instance has to be restarted. The default isolation level in DB2 is Cursor Stability. But of course, to preserve the correctness of an application, it is sometimes necessary to use Repeatable Read. What is important to remember is to revert DB2_SQLROUTINE_PREPOPTS back to a lower isolation level once procedures requiring Repeatable Read are created.
One more thing worth mentioning regarding isolation level is that DB2 allows us to override the default isolation level in individual queries, as shown below:
DECLARE cur1 CURSOR FOR SELECT col1 FROM tab_comp WITH UR;
|
The query above will execute with isolation level UR regardless the isolation level specified in DB2_SQLROUTINE_PREPOPTS.
A related issue to keep in mind when trying to improve cursor performance is cursor updatability. A cursor is deletable if the rows it ranges over can be updated or deleted using the WHERE CURRENT OF clause in UPDATE or DELETE statements. When a cursor is deletable, DB2 has to obtain exclusive locks (as opposed to share locks) on rows, and cannot perform row blocking. An exclusive lock on a row prevents other applications from even reading the row (they have to wait until the lock is released, unless their isolation level is UR), whereas row blocking reduces database manager overhead for cursors by retrieving a block of rows in a single operation.
Row blocking can only happen for non-deletable cursors. That is why it is important to let DB2 know how a cursor is going to be used. Cursors can be explicitly declared as non-deletable by specifying the FOR READ ONLY clause in the SELECT statement, or as deletable by using the FOR UPDATE clause in the SELECT statement. Based on this information (and also on the BLOCKING option described below), DB2 will decide whether to use row blocking for a given cursor or not.
By default, DB2 will always use row blocking for cursors defined using the FOR READ ONLY clause, unless the BLOCKING NO bind option has been specified. On the other hand, DB2 will use row blocking for ambiguous cursors (cursors that are not defined as either FOR READ ONLY or FOR UPDATE) if the BLOCKING ALL bind option is used.
To make a long story short: when possible, use the FOR READ ONLY clause in your cursor definitions; if your procedures contain ambiguous cursors, use the BLOCKING ALL bind option. To set the value of the BLOCKING bind option, we also use the DB2_SQLROUTINE_PREPOPTS registry variable. For example, to set the isolation level of SQL procedures to Uncommitted Read and row blocking to BLOCKING ALL, use the following command:
db2set DB2_SQLROUTINE_PREPOPTS="ISOLATION UR BLOCKING ALL"
|
Blocking can be particularly important for procedures returning large result sets.
Other bind options can be also specified for stored procedures by using the DB2_SQLROUTINE_PREPOPTS registry. Please see the topic "Setting Up the SQL Procedures Environment" in the Application Development Guide: Building and Running Applications
for more information. Also, a full explanation about isolation levels, locking and blocking is beyond the scope of this article. See the DB2 Administration Guide: Performance and the SQL Reference entry for DECLARE CURSOR for a complete explanation.
In the absence of side-effects, use SQL functions
As we mentioned in the introduction, SQL procedures and SQL functions are implemented using different technologies. Queries in an SQL procedure are compiled individually, each of them becoming a section in a package. The compilation occurs when the procedure is created, and the queries are not recompiled until the procedure is recreated or until its associated package is rebound.
On the other hand, queries in SQL functions are compiled together, as if the function body were a single query. The compilation occurs every time a statement that uses the function is compiled.
Unlike what happens in SQL procedures, procedural statements in SQL functions are not executed in a different layer than dataflow statements. Therefore, there is no context switch every time control flows from a procedural to a dataflow statement or vice versa.
As a result of these differences, a given piece of procedural code will usually execute much faster when implemented as a function than when implemented as a procedure. But of course, there is a catch. Functions can only contain statements that do not alter the state of the database (for example, INSERT, UPDATE or DELETE statements are not allowed). And also only a subset of the full SQL PL language is allowed in SQL functions (no CALL statements, no cursors, no condition handling).
Despite these restrictions, most SQL procedures with no side-effects can be converted into SQL functions. For example, the following procedure:
CREATE PROCEDURE GetPrice (IN Vendor CHAR(20),
IN Pid INT,
OUT price DECIMAL(10,3))
LANGUAGE SQL
BEGIN
IF Vendor = 'Vendor 1' THEN
SET price = (SELECT ProdPrice
FROM V1Table WHERE Id = Pid);
ELSE IF Vendor = 'Vendor 2' THEN
SET price = (SELECT Price
FROM V2Table WHERE Pid = GetPrice.Pid);
END IF;
END
|
is equivalent to the following function:
CREATE FUNCTION GetPrice (Vendor CHAR(20), PId INT)
RETURNS DECIMAL(10,3)
LANGUAGE SQL
BEGIN
DECLARE price DECIMAL(10,3);
IF Vendor = 'Vendor 1' THEN
SET price = (SELECT ProdPrice
FROM V1Table WHERE Id = Pid);
ELSE IF Vendor = 'Vendor 2' THEN
SET price = (SELECT Price
FROM V2Table WHERE Pid = GetPrice.Pid);
END IF;
RETURN price;
END
|
Note that, whereas the CALL statement is used to call a procedure, a VALUES statement is needed to call a function from the command line:
VALUES (GetPrice('IBM', 324))
|
On the other hand, you can call functions, unlike procedures, in any context where an expression is allowed:
SELECT VName FROM Vendors WHERE GetPrice(Vname, Pid) < 100;
SET price = GetPrice(Vname, Pid);
|
So, as the title of this section indicates, consider using SQL functions instead of SQL procedures when you're just extracting data from the database without performing any changes.
Use temporary tables for temporary data
In Version 7, DB2 introduced temporary tables. Operations on temporary tables are usually much faster than operations on regular tables. Let's look at some of the reasons:
- To begin with, the creation of a temporary table does not involve the insertion of entries in the catalogs, and use of the temporary table does not involve access to the catalogs either; therefore, there is no catalog contention.
- Because temporary tables are only accessible to the application that created them, there is no locking involved in their manipulation.
- If the NOT LOGGED option is specified, operations on temporary tables are not logged (at the expense, of course, of the possibility of rolling back changes). Therefore, if your stored procedures generate large amounts of temporary data that are only meant to be used within a session with the database, storing those data in temporary tables can result in significant performance gains.
Before any use of a temporary table in an SQL procedure, the table definition must be available in the compilation environment. For example, in the following CLP script (which uses '%' as statement terminator), the only purpose of the table definition is to make possible the creation of the SQL procedure:
CONNECT TO sample %
DECLARE GLOBAL TEMPORARY TABLE TT(C1 INT, C2 CHAR(20)) %
CREATE PROCEDURE INSTT(P1 INT, P2 CHAR(20))
BEGIN
INSERT INTO SESSION.TT VALUES(P1, P2);
END %
CONNECT RESET %
|
After the CONNECT RESET command, the temporary table will no longer exist. At runtime, the application will have to ensure that the table exists before the first query that uses it is executed. This last observation brings up an important point we haven't mentioned yet: any query that references a temporary table will be compiled dynamically, even if the query is written as static SQL. Like any other dynamic query, after the query is compiled, it will stay in compiled form in the package cache. The next time the same query is executed, DB2 will recompile it only if it is not found in the cache.
If you plan to create relatively big temporary tables and run several queries on them, consider defining indexes and running runstats on them (the latter obviously after the table has been populated). More on this in the next section.
One last comment on using temporary tables in SQL procedures: if you need to return a result set based on a temporary table that is created in the same procedure, the result set must be defined in a nested compound statement, as shown in the example below:
CREATE PROCEDURE INSTT2(P1 INT, P2 CHAR(20))
BEGIN
DECLARE GLOBAL TEMPORARY TABLE TT(C1 INT, C2 CHAR(20)) %
INSERT INTO SESSION.TT VALUES(P1, P2);
BEGIN
DECLARE C1 CURSOR WITH RETURN FOR SELECT * FROM SESSION.TT;
END;
END %
|
The reason the result set must be defined in a nested compound statement is that the DECLARE GLOBAL TEMPORARY TABLE is an executable statement, and executable statements can only be written after declaration statements such as DECLARE CURSOR. If we declared the table in the outer scope after the cursor definition, the table would not be available in the compilation environment when the DECLARE CURSOR statement is compiled, and therefore the compilation would fail.
Keep the DB2 optimizer informed
When a procedure is created, its individual SQL queries are compiled into sections in a package. The DB2 optimizer chooses an execution plan for a query based, among other things, on table statistics (for example, table sizes or the relative frequency of data values in a column) and indexes available at the time the query is compiled. When tables suffer significant changes, it may be a good idea to let DB2 collect statistics on these tables again. And when statistics are updated, or when new indexes are created it may also be a good idea to rebind the packages associated with SQL procedures that use the tables, to let DB2 create plans that exploit the latest stats and indexes.
Table statistics can be updated using the RUNSTATS command. To rebind the package associated with an SQL procedure, you can use the REBIND_ROUTINE_PACKAGE built-in procedure (available in DB2 Version 8). For example, the following command can be used to rebind the package for procedure MYSCHEMA.MYPROC:
CALL SYSPROC.REBIND_ROUTINE_PACKAGE('P', 'MYSCHEMA.MYPROC', 'ANY')
|
where 'P' indicates that the package corresponds to a procedure and 'ANY' indicates that any of the functions and types in the SQL path are considered for function and type resolution (see the Command Reference entry for the REBIND command for more details.)
Conclusion
In this article, I have provided an assortment of hints and tips that may help to improve the performance of SQL procedures (check out Yip et al.'s book for a good introduction to SQL PL). As a general rule, first make sure the basics of system performance (hardware and OS) and database manager (buffer pools, containers and table spaces, and so on) are covered. The DB2 configuration advisor can be very helpful with the latter. Then make sure that the plans for the key queries in your application are appropriate. Finally, look into improving your stored procedures and applications with the advice given in this article. Good luck with your tuning efforts!
Acknowledgments
Thanks to Lee Johnson, Paul Yip, Drew Bradstock and Clara Liu for their comments on a draft of this article.
Resources
- [1] An, Yongli and Shum, Peter, DB2 Tuning Tips for OLTP Applications, DB2 Developer Domain, July 2001.
- [2] Yip, Paul et al, DB2 SQL Procedural Language for Linux, UNIX and Windows, Prentice Hall, 2003. See http://www-106.ibm.com/developerworks/db2/library/books/sqlplbook/index.html for more information.
About the author  | |  | Gustavo Arocena is a technical manager in the DB2 SQL Compiler area. He joined the IBM Toronto Lab in 1998 and is currently responsible for SQL Procedures and SQL Parser development. Gustavo holds a Master's degree in Computer Science from the University of Toronto in the area of database query languages. |
Rate this page
|  |