Steps in executing SQL query-DBMS:

- Check query syntax
- Validate query-checks data dictionary; verifies objects referred to are database objects and requested operations are valid
- Translate query into relational algebra operations
- Rearrange relational algebra operations into most efficient form
- Use data on table sizes, indexes, order of tuples, distribution of
values, to determine how the query will be processed:
- Estimate the "cost" of alternatives and choose the plan with the least estimated cost
- Consider the number of disk accesses, amount of memory, processing time, and communication costs, if any

- Execution plan is then coded and executed

** SQL Select...From...Where** command usually translates into combination
of relational algebra PROJECT...., JOIN, SELECT, respectively.

Quick review of Relational Algebra operations

SELECT-unary operator: σ_{p}(table-name)

- p is a predicate, called the θ (theta) condition
- Returns entire rows that satisfy θ

PROJECT-unary operator: Π_{proj-list}(table-name)

- Proj-list is a list of columns
- Returns unique combinations of values for those columns

JOIN-binary operator: table1 table2

- Compares table1 and table2, which have a common column (or columns with same domain)
- Chooses rows from each that match on common column
- Combines those rows, but shows common column only once

Example:

SELECT schedule, room FROM Student NATURAL JOIN Enroll NATURAL JOIN Class WHERE Major='Math'

- Graphical representation of the operations and operands in the relational algebra expression
- Leaf nodes are relations
- Unary or binary operations are internal nodes (one child or two, respectively)
- An internal node can be executed when its operands are available (children relation has been computed)
- Node is replaced by the result of the operation it represents
- Root node is executed last, and is replaced by the result of the entire tree

- Same SQL statement can be translated to different relational algebra statements
- Performing SELECT early reduces size of intermediate nodes
- Push SELECT as far down the tree as possible
- For conjunctive SELECT, do each part on its own tree instead of waiting for join

So the optimized access path is:

We can show the improvement by calculating the number of blocks accessed (read or written) to complete the query. See bottom of this page.

**( Student Enroll ) Class **

is functionally the same as (but maybe faster than)

**Student ( Enroll
Class )**

* Enroll Class* is same as

Many similar rules exist

1. All joins and products are commutative.

- R x S = S x R
- R
_{θ}S = S_{θ}R - R S = S R

2. Joins and products are associative

- (R x S) x T = R x (S x T)
- (R
_{θ}S)_{θ}T = R_{θ}(S_{θ}T) - (R S) T = R (S T)

3. Select is commutative

- σ
_{p}( σ_{q}(R)) = σ_{q}(σ_{p}(R))

4. Conjunctive selects can cascade into individual selects or vice versa

- σ
_{p &q&…&z}(R) = (σ_{p}(σ_{q}...( σ_{ z}(R))...))

5. Successive projects can reduced to the final project.

- If list
_{1}, list_{2}, … list_{n}are lists of attribute names and each of the list_{i}contains list_{i-1},

then Π_{list1}(Π_{list2}(...Π_{listn}(R)...) = Π_{list1}(R) - So only the last project has to be executed

6. Select and project sometimes commute

- If p involves only the attributes in projlist, then select and project commute
- Π
_{projlist}( σ_{p}(R)) = σ_{p}( Π_{projlist}(R))

7. Select and join (or product) sometimes commute

- If p involves only attributes of one of the tables being joined, then select
and join commute

σ_{p}(R S) = (σ_{p}(R)) S - Only if p refers just to R

8. Select sometimes distributes over join (or product)

- For p AND q, where p involves only the attributes of R and q only the attributes of S the select distributes over the join
- σ
_{p AND q}(R S) = ( σ_{p}(R)) ( σ_{q}(S))

9. Project sometimes distributes over join (or product)

- If projlist can be split into separate lists, list1 and list2, so that list1 contains only attributes of R and list2 contains only attributes of S, then
- Π
_{projlist}(R S) = ( Π_{list1}(R)) ( Π_{list2}(S))

10. Union and intersection are commutative

- R ∪ S = S ∪ R
- R ∩ S = S ∩ R
- Set difference is not commutative.

11. Union and intersection are individually associative.

- (R ∪ S) ∪ T = R ∪ (S ∪ T)
- (R ∩ S) ∩ T = R ∩ (S ∩ T)
- Set difference is not associative

12. Select distributes over union, intersection, and difference

- σ
_{p}(R ∪ S) = σ_{p}(R) ∪ σ_{p}(S) - σ
_{p}(R ∩ S) = σ_{p}(R) ∩ σ_{p}(S) - σ
_{p}(R - S) = σ_{p}(R) - σ_{p}(S)

13. Project distributes over union, intersection, and difference

- Π
_{projlist}(R ∪ S) = (Π_{projlist}(R)) ∪ (Π_{projlist}(S)) - Π
_{projlist}(R ∩ S) = (Π_{projlist}(R)) ∩ (Π_{projlist}(S)) - Π
_{projlist}(R - S) = (Π_{projlist}(R)) - (Π_{projlist}(S))

14. Project is idempotent-repeating it produces the same result

- Π
_{projlist}(R)( Π`_{projlist}(R)) = Π_{projlist}R)

15. Select is idempotent

- σ
_{p}(σ_{p}(R)) = σ_{p}(R)

Do operations that generate smaller tables first.

- Do selection as early as possible. Use cascading, commutativity, and distributivity to move selection as far down the query tree as possible
- Use associativity to rearrange relations so the selection operation that will produce the smallest table will be executed first
- If a product appears as an argument for a selection, where the selection involves attributes of the tables in the product, change the product to a join
- If the selection involves attributes of only one of the tables in the product, apply the selection to that table first
- Do projection early. Use cascading, distributivity and commutativity to move the projection as far down the query tree as possible.
- Examine all projections to see if some are unnecessary
- If a sequence of selections and/or projections have the same argument, use commutativity or cascading to combine them into one selection, one projection, or a selection followed by a projection
- If a sub-expression appears more than once in the query tree, and the result it produces is not too large, compute it once and save it

Cost factors of executing a query

- Cost of reading files
- Processing costs once data is in main memory
- Cost of writing and storing intermediate results
- Communication costs
- Cost of writing final results to storage

Most significant factor is **the number of disk/drive accesses**, the read and write
costs

System uses statistics stored in the data dictionary and knowledge about the** size, structure and access methods **of each file

Access cost-number of blocks brought into main memory for reading or written to secondary storage as result

Tables may be stored in

**packed form**-blocks contain only tuples from one table-
**unpacked form**, where tuples are interspersed with tuples from other tables

If unpacked, may have to assume every tuple of relation is in a different block

If packed, estimate the number of blocks from tuple size, number of tuples, and capacity of the block.

**Assume packed for homework, tests.**

Some useful notation:

- t(R), number of tuples in relation R
- b(R), number of blocks needed to store R
- bf(R), number of tuples of R per block, called the
*blocking factor*of R - If R is packed, then b(R) = t(R) / bf(R)

If the Student relation has

blocksize of 4000bytes, and there aret(Student)=10,000student records each200 byteslong, thenbf(Student)=20records fit per block (4000/200). We need 10000/20 orb(Student)=500blocks to hold this file in packed form.For the query and optimized version, let's assume the following sizes:

b(Student) = 500, t(Student) = 10000(average of 25 enrolled classes per student, 40-50 complete a 4 year degree), 250000 * 20 bytes / 4000 = 1250

b(Enroll) = 1250, t(Enroll) = 250,000

b(Class) = 125, t(Class) = 1000, 1000* 500 bytes / 4000 = 125

A join requires roughly the product of the block counts.

**Student Enroll** = 500*1250 blocks = 625,000 block accesses to do this join.

The result set, however, would be 10,000 students * 25 classes * 220 bytes /4000= **13,750 blocks**, possibly written somewhere.

**(Student Enroll) Class **= 13,750 * 125 blocks = 1,718,750 block accesses.

Result set should be same number of tuples 250,000 * 720 bytes / 4000 = 45,000 blocks

Now read the 45,000 blocks to select the Math majors (~ 400?) --> result set is 400*720 / 4000 = 72 blocks. Re-read for projection.

Total reading of 625,000 + 1,718,750 + 45,000 =

2,388,750 blocks

Student select Math majors, read 500 blocks → 400*200 bytes/ 4000 or 20 blocks written.

Join (MathStudents Enroll) = 20 * 1250 = 25000 blocks accessed→result is 400 * 25 *220 /4000 = 550 blocks written.

Join with Class = 550 * 125 = 68,750 block aecesses→result set is 400 * 720 / 4000 = 72 blocks, then the projection

500 + 25000 + 68,750 =** 94,250 blocks read **prior to projection

Prior to joins, use select and maybe projects to reduce the amount of blocks that need to participate in the join.