Query Optimization

last updated 5-nov-20

Query processing overview

Steps in executing SQL query-DBMS:

 


Relational Algebra Translation

SQL Select...From...Where command usually translates into combination of relational algebra PROJECT...., JOIN, SELECT, respectively.

Quick review of Relational Algebra operations

SELECT-unary operator: σp(table-name)

PROJECT-unary operator: Πproj-list(table-name)

JOIN-binary operator: table1 join table2

 


Query Tree

Example:

query tree example
SELECT schedule, room
FROM Student NATURAL JOIN Enroll NATURAL JOIN Class       
WHERE Major='Math'

Doing SELECT early

So the optimized access path is:

We can show the improvement by calculating the number of blocks accessed (read or written) to complete the query. See bottom of this page.

 


Some Properties of Natural JOIN

Associativity

( Student |X| Enroll ) |X| Class
is functionally the same as (but maybe faster than)
Student|X| ( Enroll |X| Class )

Commutativity, ignoring column order

Enroll |X| Class is same as Class|X| Enroll but not likely faster or slower

Many similar rules exist

 


Relational Algebra Equivalences

1. All joins and products are commutative.

2. Joins and products are associative

3. Select is commutative

4. Conjunctive selects can cascade into individual selects or vice versa

5. Successive projects can reduced to the final project.

6. Select and project sometimes commute

7. Select and join (or product) sometimes commute

8. Select sometimes distributes over join (or product)

9. Project sometimes distributes over join (or product)

10. Union and intersection are commutative

11. Union and intersection are individually associative.

12. Select distributes over union, intersection, and difference

13. Project distributes over union, intersection, and difference

14. Project is idempotent-repeating it produces the same result

15. Select is idempotent

 


Heuristics for Optimization

Do operations that generate smaller tables first.

 


Cost Factors

Cost factors of executing a query

Most significant factor is the number of disk/drive accesses, the read and write costs

System uses statistics stored in the data dictionary and knowledge about the size, structure and access methods of each file

 

 


Estimating Access Costquery tree example

Access cost-number of blocks brought into main memory for reading or written to secondary storage as result

Tables may be stored in

If unpacked, may have to assume every tuple of relation is in a different block

If packed, estimate the number of blocks from tuple size, number of tuples, and capacity of the block.

Assume packed for homework, tests.

Some useful notation:

Example:

If the Student relation has blocksize of 4000 bytes, and there are t(Student)=10,000 student records each 200 bytes long, then bf(Student)=20 records fit per block (4000/200). We need 10000/20 or b(Student)=500 blocks to hold this file in packed form.

For the query and optimized version, let's assume the following sizes:

b(Student) = 500, t(Student) = 10000
b(Enroll) = 1250, t(Enroll) = 250,000
(average of 25 enrolled classes per student, 40-50 complete a 4 year degree), 250000 * 20 bytes / 4000 = 1250
b(Class) = 125, t(Class) = 1000 , 1000* 500 bytes / 4000 = 125

A join requires roughly the product of the block counts.

Unoptimized examples

Student |X| Enroll = 500*1250 blocks = 625,000 block accesses to do this join.
The result set, however, would be 10,000 students * 25 classes * 220 bytes /4000= 13,750 blocks, possibly written somewhere.

(Student |X| Enroll) |X| Class = 13,750 * 125 blocks = 1,718,750 block accesses.
Result set should be same number of tuples 250,000 * 720 bytes / 4000 = 45,000 blocks

Now read the 45,000 blocks to select the Math majors (~ 400?) --> result set is 400*720 / 4000 = 72 blocks. Re-read for projection.

Total reading of 625,000 + 1,718,750 + 45,000 = 2,388,750 blocks

Optimized  example

Student select Math majors, read 500 blocks → 400*200 bytes/ 4000 or 20 blocks written.

Join (MathStudents |X| Enroll) = 20 * 1250 = 25000 blocks accessed→result is 400 * 25 *220 /4000 = 550 blocks written.

Join with Class = 550 * 125 = 68,750 block aecesses→result set is 400 * 720 / 4000 = 72 blocks, then the projection

500 + 25000 + 68,750 = 94,250 blocks read prior to projection

Bottom line

Prior to joins, use select and maybe projects to reduce the amount of blocks that need to participate in the join.