Query processing overview
Steps in executing SQL query-DBMS:
- Check query syntax
- Validate query-checks data dictionary; verifies objects referred to are
database objects and requested operations are valid
- Translate query into relational algebra operations
- Rearrange relational algebra operations into most efficient form
- Use data on table sizes, indexes, order of tuples, distribution of
values, to determine how the query will be processed:
- Estimate the "cost"
of alternatives and choose the plan with the least estimated cost
the number of disk accesses, amount of memory, processing time, and communication
costs, if any
- Execution plan is then coded and executed
Relational Algebra Translation
SQL Select...From...Where command usually translates into combination
of relational algebra PROJECT...., JOIN, SELECT, respectively.
Quick review of Relational Algebra operations
SELECT-unary operator: σp(table-name)
- p is a predicate, called the θ (theta) condition
- Returns entire rows that satisfy θ
PROJECT-unary operator: Πproj-list(table-name)
- Proj-list is a list of columns
- Returns unique combinations of values for those columns
JOIN-binary operator: table1
- Compares table1 and table2, which have a common column (or columns with
- Chooses rows from each that match on common column
- Combines those rows, but shows common column only once
SELECT schedule, room
FROM Student NATURAL JOIN Enroll NATURAL JOIN Class
- Graphical representation of the operations and operands in the relational algebra
- Leaf nodes are relations
- Unary or binary operations are internal nodes (one child or two, respectively)
- An internal node can be executed when its operands are available (children relation has been computed)
- Node is replaced by the result of the operation it represents
- Root node is executed last, and is replaced by the result of the entire
Doing SELECT early
- Same SQL statement can be translated to different relational algebra statements
- Performing SELECT early reduces size of intermediate nodes
- Push SELECT as far down the tree as possible
- For conjunctive SELECT, do each part on its own tree instead of waiting
So the optimized access path is:
We can show the improvement by calculating the number of blocks accessed (read or written) to complete the query. See bottom of this page.
Some Properties of Natural JOIN
( Student Enroll ) Class
is functionally the same as (but maybe faster than)
Student ( Enroll
Commutativity, ignoring column order
Enroll Class is same as Class Enroll but not likely faster or slower
Many similar rules exist
Relational Algebra Equivalences
1. All joins and products are commutative.
- R x S = S x R
- R θ S = S θ R
- R S = S R
2. Joins and products are associative
- (R x S) x T = R x (S x T)
- (R θ S) θ T = R θ (S θ T)
- (R S) T = R (S T)
3. Select is commutative
- σp ( σq (R)) = σq (σp
4. Conjunctive selects can cascade into individual selects or vice versa
- σp &q&…&z (R) = (σp (σq...(
σ z (R))...))
5. Successive projects can reduced to the final project.
- If list1, list2, … listn are lists of attribute names and each of the listi
then Πlist1 (Πlist2 (...Πlistn (R)...)
- So only the last project has to be executed
6. Select and project sometimes commute
- If p involves only the attributes in projlist, then select and project
- Πprojlist ( σp (R)) = σp ( Πprojlist (R))
7. Select and join (or product) sometimes commute
- If p involves only attributes of one of the tables being joined, then select
and join commute
σp (R S) = (σp (R)) S
- Only if p refers just to R
8. Select sometimes distributes over join (or product)
- For p AND q, where p involves only the attributes of R and q only the attributes
of S the select distributes over the join
- σp AND q (R S) = ( σp (R)) ( σq (S))
9. Project sometimes distributes over join (or product)
- If projlist can be split into separate lists, list1 and list2, so that
list1 contains only attributes of R and list2 contains only attributes of
- Πprojlist (R S) = ( Πlist1 (R)) ( Πlist2 (S))
10. Union and intersection are commutative
- R ∪ S = S ∪ R
- R ∩ S = S ∩ R
- Set difference is not commutative.
11. Union and intersection are individually associative.
- (R ∪ S) ∪ T = R ∪ (S ∪ T)
- (R ∩ S) ∩ T = R ∩ (S ∩ T)
- Set difference is not associative
12. Select distributes over union, intersection, and difference
- σp (R ∪ S) = σp (R) ∪ σp
- σp (R ∩ S) = σp (R) ∩ σp
- σp (R - S) = σp (R) - σp (S)
13. Project distributes over union, intersection, and difference
- Πprojlist (R ∪ S) = (Πprojlist (R)) ∪ (Πprojlist (S))
- Πprojlist (R ∩ S) = (Πprojlist (R)) ∩ (Πprojlist (S))
- Πprojlist (R - S) = (Πprojlist (R)) - (Πprojlist (S))
14. Project is idempotent-repeating it produces the same result
- Πprojlist (R)( Π`projlist(R)) = ΠprojlistR)
15. Select is idempotent
Heuristics for Optimization
Do operations that generate smaller tables first.
- Do selection as early as possible. Use cascading, commutativity, and distributivity
to move selection as far down the query tree as possible
- Use associativity to rearrange relations so the selection operation that
will produce the smallest table will be executed first
- If a product appears as an argument for a selection, where the selection
involves attributes of the tables in the product, change the product to a
- If the selection involves attributes of only one of the tables in the product,
apply the selection to that table first
- Do projection early. Use cascading, distributivity and commutativity to
move the projection as far down the query tree as possible.
- Examine all projections to see if some are unnecessary
- If a sequence of selections and/or projections have the same argument,
use commutativity or cascading to combine them into one selection, one projection,
or a selection followed by a projection
- If a sub-expression appears more than once in the query tree, and the result
it produces is not too large, compute it once and save it
Cost factors of executing a query
- Cost of reading files
- Processing costs once data is in main memory
- Cost of writing and storing intermediate results
- Communication costs
- Cost of writing final results to storage
Most significant factor is the number of disk/drive accesses, the read and write
System uses statistics stored in the data dictionary and knowledge about the size, structure and access methods of each file
Estimating Access Cost
Access cost-number of blocks brought into main memory for reading or written
to secondary storage as result
Tables may be stored in
- packed form-blocks contain only tuples from one table
- unpacked form, where tuples are interspersed with tuples from other tables
If unpacked, may have to assume every tuple of relation is in a different block
If packed, estimate the number of blocks from tuple size, number of tuples,
and capacity of the block.
Assume packed for homework, tests.
Some useful notation:
- t(R), number of tuples in relation R
- b(R), number of blocks needed to store R
- bf(R), number of tuples of R per block, called the blocking factor of R
- If R is packed, then b(R) = t(R) / bf(R)
If the Student relation has blocksize of 4000 bytes, and there are t(Student)=10,000 student records each 200 bytes long, then bf(Student)=20 records fit per block (4000/200).
We need 10000/20 or b(Student)=500 blocks to hold this file in packed form.
For the query and optimized version, let's assume the following sizes:
b(Student) = 500, t(Student) = 10000
b(Enroll) = 1250, t(Enroll) = 250,000
(average of 25 enrolled classes per student, 40-50 complete a 4 year degree), 250000 * 20 bytes / 4000 = 1250
b(Class) = 125, t(Class) = 1000 , 1000* 500 bytes / 4000 = 125
A join requires roughly the product of the block counts.
Student Enroll = 500*1250 blocks = 625,000 block accesses to do this join.
The result set, however, would be 10,000 students * 25 classes * 220 bytes /4000= 13,750 blocks, possibly written somewhere.
(Student Enroll) Class = 13,750 * 125 blocks = 1,718,750 block accesses.
Result set should be same number of tuples 250,000 * 720 bytes / 4000 = 45,000 blocks
Now read the 45,000 blocks to select the Math majors (~ 400?) --> result set is 400*720 / 4000 = 72 blocks. Re-read for projection.
Total reading of 625,000 + 1,718,750 + 45,000 = 2,388,750 blocks
Student select Math majors, read 500 blocks → 400*200 bytes/ 4000 or 20 blocks written.
Join (MathStudents Enroll) = 20 * 1250 = 25000 blocks accessed→result is 400 * 25 *220 /4000 = 550 blocks written.
Join with Class = 550 * 125 = 68,750 block aecesses→result set is 400 * 720 / 4000 = 72 blocks, then the projection
500 + 25000 + 68,750 = 94,250 blocks read prior to projection
Prior to joins, use select and maybe projects to reduce the amount of blocks that need to participate in the join.