# Query Optimization

last updated 5-nov-20

## Query processing overview

Steps in executing SQL query-DBMS:

• Check query syntax
• Validate query-checks data dictionary; verifies objects referred to are database objects and requested operations are valid
• Translate query into relational algebra operations
• Rearrange relational algebra operations into most efficient form
• Use data on table sizes, indexes, order of tuples, distribution of values, to determine how the query will be processed:
• Estimate the "cost" of alternatives and choose the plan with the least estimated cost
• Consider the number of disk accesses, amount of memory, processing time, and communication costs, if any
• Execution plan is then coded and executed

## Relational Algebra Translation

SQL Select...From...Where command usually translates into combination of relational algebra PROJECT...., JOIN, SELECT, respectively.

Quick review of Relational Algebra operations

SELECT-unary operator: σp(table-name)

• p is a predicate, called the θ (theta) condition
• Returns entire rows that satisfy θ

PROJECT-unary operator: Πproj-list(table-name)

• Proj-list is a list of columns
• Returns unique combinations of values for those columns

JOIN-binary operator: table1 table2

• Compares table1 and table2, which have a common column (or columns with same domain)
• Chooses rows from each that match on common column
• Combines those rows, but shows common column only once

## Query Tree

Example:

```SELECT schedule, room
FROM Student NATURAL JOIN Enroll NATURAL JOIN Class
WHERE Major='Math'```
• Graphical representation of the operations and operands in the relational algebra expression
• Leaf nodes are relations
• Unary or binary operations are internal nodes (one child or two, respectively)
• An internal node can be executed when its operands are available (children relation has been computed)
• Node is replaced by the result of the operation it represents
• Root node is executed last, and is replaced by the result of the entire tree

### Doing SELECT early

• Same SQL statement can be translated to different relational algebra statements
• Performing SELECT early reduces size of intermediate nodes
• Push SELECT as far down the tree as possible
• For conjunctive SELECT, do each part on its own tree instead of waiting for join

So the optimized access path is:

We can show the improvement by calculating the number of blocks accessed (read or written) to complete the query. See bottom of this page.

## Some Properties of Natural JOIN

### Associativity

( Student Enroll ) Class
is functionally the same as (but maybe faster than)
Student ( Enroll Class )

### Commutativity, ignoring column order

Enroll Class is same as Class Enroll but not likely faster or slower

Many similar rules exist

## Relational Algebra Equivalences

1. All joins and products are commutative.

• R x S = S x R
• R θ S = S θ R
• R S = S R

2. Joins and products are associative

• (R x S) x T = R x (S x T)
• (R θ S) θ T = R θ (S θ T)
• (R S) T = R (S T)

3. Select is commutative

• σp ( σq (R)) = σqp (R))

4. Conjunctive selects can cascade into individual selects or vice versa

• σp &q&…&z (R) = (σpq...( σ z (R))...))

5. Successive projects can reduced to the final project.

• If list1, list2, … listn are lists of attribute names and each of the listi contains listi-1,
then Πlist1list2 (...Πlistn (R)...) = Πlist1(R)
• So only the last project has to be executed

6. Select and project sometimes commute

• If p involves only the attributes in projlist, then select and project commute
• Πprojlist ( σp (R)) = σp ( Πprojlist (R))

7. Select and join (or product) sometimes commute

• If p involves only attributes of one of the tables being joined, then select and join commute
σp (R S) = (σp (R)) S
• Only if p refers just to R

8. Select sometimes distributes over join (or product)

• For p AND q, where p involves only the attributes of R and q only the attributes of S the select distributes over the join
• σp AND q (R S) = ( σp (R)) ( σq (S))

9. Project sometimes distributes over join (or product)

• If projlist can be split into separate lists, list1 and list2, so that list1 contains only attributes of R and list2 contains only attributes of S, then
• Πprojlist (R S) = ( Πlist1 (R)) ( Πlist2 (S))

10. Union and intersection are commutative

• R ∪ S = S ∪ R
• R ∩ S = S ∩ R
• Set difference is not commutative.

11. Union and intersection are individually associative.

• (R ∪ S) ∪ T = R ∪ (S ∪ T)
• (R ∩ S) ∩ T = R ∩ (S ∩ T)
• Set difference is not associative

12. Select distributes over union, intersection, and difference

• σp (R ∪ S) = σp (R) ∪ σp (S)
• σp (R ∩ S) = σp (R) ∩ σp (S)
• σp (R - S) = σp (R) - σp (S)

13. Project distributes over union, intersection, and difference

• Πprojlist (R ∪ S) = (Πprojlist (R)) ∪ (Πprojlist (S))
• Πprojlist (R ∩ S) = (Πprojlist (R)) ∩ (Πprojlist (S))
• Πprojlist (R - S) = (Πprojlist (R)) - (Πprojlist (S))

14. Project is idempotent-repeating it produces the same result

• Πprojlist (R)( Π`projlist(R)) = ΠprojlistR)

15. Select is idempotent

• σpp(R)) = σp(R)

## Heuristics for Optimization

Do operations that generate smaller tables first.

• Do selection as early as possible. Use cascading, commutativity, and distributivity to move selection as far down the query tree as possible
• Use associativity to rearrange relations so the selection operation that will produce the smallest table will be executed first
• If a product appears as an argument for a selection, where the selection involves attributes of the tables in the product, change the product to a join
• If the selection involves attributes of only one of the tables in the product, apply the selection to that table first
• Do projection early. Use cascading, distributivity and commutativity to move the projection as far down the query tree as possible.
• Examine all projections to see if some are unnecessary
• If a sequence of selections and/or projections have the same argument, use commutativity or cascading to combine them into one selection, one projection, or a selection followed by a projection
• If a sub-expression appears more than once in the query tree, and the result it produces is not too large, compute it once and save it

## Cost Factors

Cost factors of executing a query

• Processing costs once data is in main memory
• Cost of writing and storing intermediate results
• Communication costs
• Cost of writing final results to storage

Most significant factor is the number of disk/drive accesses, the read and write costs

System uses statistics stored in the data dictionary and knowledge about the size, structure and access methods of each file

## Estimating Access Cost

Access cost-number of blocks brought into main memory for reading or written to secondary storage as result

Tables may be stored in

• packed form-blocks contain only tuples from one table
• unpacked form, where tuples are interspersed with tuples from other tables

If unpacked, may have to assume every tuple of relation is in a different block

If packed, estimate the number of blocks from tuple size, number of tuples, and capacity of the block.

Assume packed for homework, tests.

Some useful notation:

• t(R), number of tuples in relation R
• b(R), number of blocks needed to store R
• bf(R), number of tuples of R per block, called the blocking factor of R
• If R is packed, then b(R) = t(R) / bf(R)

### Example:

If the Student relation has blocksize of 4000 bytes, and there are t(Student)=10,000 student records each 200 bytes long, then bf(Student)=20 records fit per block (4000/200). We need 10000/20 or b(Student)=500 blocks to hold this file in packed form.

For the query and optimized version, let's assume the following sizes:

b(Student) = 500, t(Student) = 10000
b(Enroll) = 1250, t(Enroll) = 250,000
(average of 25 enrolled classes per student, 40-50 complete a 4 year degree), 250000 * 20 bytes / 4000 = 1250
b(Class) = 125, t(Class) = 1000 , 1000* 500 bytes / 4000 = 125

A join requires roughly the product of the block counts.

### Unoptimized examples

Student Enroll = 500*1250 blocks = 625,000 block accesses to do this join.
The result set, however, would be 10,000 students * 25 classes * 220 bytes /4000= 13,750 blocks, possibly written somewhere.

(Student Enroll) Class = 13,750 * 125 blocks = 1,718,750 block accesses.
Result set should be same number of tuples 250,000 * 720 bytes / 4000 = 45,000 blocks

Now read the 45,000 blocks to select the Math majors (~ 400?) --> result set is 400*720 / 4000 = 72 blocks. Re-read for projection.

Total reading of 625,000 + 1,718,750 + 45,000 = 2,388,750 blocks

### Optimized  example

Student select Math majors, read 500 blocks → 400*200 bytes/ 4000 or 20 blocks written.

Join (MathStudents Enroll) = 20 * 1250 = 25000 blocks accessed→result is 400 * 25 *220 /4000 = 550 blocks written.

Join with Class = 550 * 125 = 68,750 block aecesses→result set is 400 * 720 / 4000 = 72 blocks, then the projection

500 + 25000 + 68,750 = 94,250 blocks read prior to projection

### Bottom line

Prior to joins, use select and maybe projects to reduce the amount of blocks that need to participate in the join.