Distributed Databases

last updated 11/18/14

Distributed Database Systems

Advantages of Distribution --by design

Compared to a single, centralized system that provides remote access, distributed system advantages are

Disadvantages (more likely scenario)

 


Architecture Design Considerations

Factors the designer considers in choosing an architecture for a distributed system

Architecture Alternatives

 

 

 

 


Types of Distributed Systems

Homogeneous
Heterogeneous
  • All nodes use the same hardware and software
  • Nodes have different hardware or software
  • Require translations of codes and word lengths due to hardware differences
  • Translation of data models and data structures due to software differences
  • XML can help here

 


Software Components of DDBMS

Data communications component (DC)

Local database management system (DBMS)

Global data dictionary (GDD)

Distributed database management system component (DDBMS)

Not all sites necessarily have all these components

 

 

 

 


DDBMS Functions

Provide the user interface--needed for location transparency

Locate the data--directs queries to proper site(s)

Process queries--local, remote, compound (global)

Provide network-wide concurrency control and recovery procedures

Provide data translation in heterogeneous systems

 

 


Data Placement Alternatives

Centralized-- all data at one site only

Replicated-- all data duplicated at all sites; read one, write all

Partitioned

Hybrid-- Combination of the others (some centralized, some partitioned with replication)

Factors in data placement decisions

 


Types of Transparency

Data distribution transparency -- user is not aware of how the data is distributed

DBMS heterogeneity transparency -- user is not aware of the different databases and architectures

Transaction transparency

Performance transparency

 

 


ACIDity of Distributed Transaction

Although a distributed transaction is consistent, maintaining isolation in a multi-database is an important issue

Even if local sites are serializable, subtransactions of two distributed transactions might be serialized in different orders at different sites

Although a distributed transaction is consistent, maintaining atomicity in a multidatabase is an important issue

Guaranteeing that subtransactions of a distributed transaction either all commit or all abort in spite of failures (e.g., message loss, site crash) requires the use of a two-phase commit protocol

 

 

 


Global Serializability

Theorem: If all sites use a two-phase locking protocol and a two-phase commit protocol is used, transactions are globally serializable

Transactions are serialized in the same order at every site – the order in which the transactions committed

Global deadlock can be another result of implementing two-phase locking and two-phase commit protocols

System uses deadlock detection algorithms or timeout to deal with this.

 

 


Replication

Information is often replicated in a distributed system

Major implementation problem: how do you keep the replicas synchronized when a replicated data item is updated?

Implementation of Replication

DBMSs provide replica control modules to make replication invisible to the application

Typical implementation: read one/write all

As compared with non-replicated systems, the performance of read is better, but is worse for write operations

Synchronous update systems: all replicas updated as part of transaction. Supports serializability, but performance bad, deadlocks frequent, and cannot handle disconnected sites

Asynchronous update systems: one replica updated as part of transaction. Others updated after transaction commits. Performance better, deadlocks less frequent, and disconnected sites can be supported, but serializability is sacrificed.

Practical systems are generally asynchronous

 


Transaction Management for DDBMS

Each site that initiates transactions has a transaction coordinator to manage transactions that originate there

Additional concurrency control problem for distributed databases: multiple-copy inconsistency problem

 


Locking Protocols

Extension of two-phase locking protocol for transaction processing

Single-site lock manager

Distributed lock manager

Primary copy

Majority locking

 


Global Deadlock Detection

Each site has local wait-for graph-detects only local deadlock

Need global wait-for graph

 


Timestamping Protocols

One site could issue all timestamps

Instead, multiple sites could issue them

 


Recovery from failures

Must guarantee atomicity and durability of transactions

Failures include usual types, plus

Network partitioning

Handling Node Failure

 

 


Commit Protocols

Two-phase commit protocol

Implemented as an exchange of messages between the coordinator and the cohorts

Phase 1:
Prepare message
(coordinator to cohort)
  • If cohort wants to abort, it aborts
  • If cohort wants to commit, it moves all update log records to non-volatile store and forces a prepared record to its log
  • Cohort sends a (ready or aborting) vote message to coordinator
Phase 1:
Vote message
(cohort to coordinator): Cohort indicates ready to commit or aborting.
  • If any are aborting, coordinator decides abort
  • If all are ready, coordinator decides commit and forces commit record to its log
  • Coordinator sends commit/abort message to all cohorts that voted ready
Phase 2:
Commit/abort message
(coordinator to cohort):
  • Cohort commits locally by forcing a commit record to its log. Or, if abort message, it aborts
  • Cohort sends done message to coordinator
Phase 2:
Done message
(cohort to coordinator):
  • When coordinator receives done message from all cohorts, it writes a complete record to its log

 

 


Distributed Query Processing

Queries can be

Must consider cost of transferring data between sites

Semijoin operation is sometimes used when a join of data stored at different sites is required

Steps in Distributed Query Processing

  1. Accept userís request
  2. Check requestís validity
  3. Check userís authorization
  4. Map external to logical level
  5. Determine request processing strategy
  6. For heterogeneous system, translate query to DML of data node(s)
  7. Encrypt request
  8. Determine routing (job of Data Communications component)
  9. Transmit messages (job of DC component)
  10. Each node decrypts its request
  11. Do update synchronization
  12. Each data nodeís local DBMS does its processing
  13. Nodes translate data, if heterogeneous system
  14. Nodes send results to requesting node or other destination node
  15. Consolidate results, edit, format results
  16. Return results to user

 

 


Correctness

Application designers must be aware of the fact that real-world systems do not always support ACID executions even if all transactions are consistent