CS360 Final
Chapter 6. ER-Diagram?
ER Diagram -(Reduction)-> Relation Schemas
Entity Set
- Strong entity set
(entity set).pk => (schema).pk
- composite attribute: flatten (separate attribute for each component)
- multivalued attribute: new schema $R$ w/
(ES).pk + (Multivalued attribute)
- e.g.
INST_PHONE = (**ID**, **phone_number**)
- e.g.
student(**ID**, name, tot_cred)
- Weak entity set
(identifying strong ES).pk + discriminator => (schema).pk
section(*course_id*, *sec_id*, *semester*, *year*)
+----------------+                      +----------------+
| COURSE         |                      | SECTION        |
|                |    +------------+    |                |
| **course_id**  |----| sec_course |----| *sec_id*       |
| title          |    +------------+    | *semester*     |
| credits        |                      | *year*         |
+----------------+                      +----------------+
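The strong-entity-set rules above (flatten composite attributes, give each multivalued attribute its own schema) can be sketched in a few lines of Java. This is only an illustration: the class `EntitySetReduction`, the generated schema name `instructor_phone_number` (the notes call it INST_PHONE), and the composite attribute `name` with components `first_name` / `last_name` are assumptions, not part of the course material.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reduces one strong entity set to relation schemas,
// each modeled as a list of attribute names.
final class EntitySetReduction {

    // simple:      single-valued attributes (primary key attributes included)
    // composite:   attribute name -> components, flattened into the main schema
    // multivalued: each attribute gets its own schema of (pk + attribute)
    static Map<String, List<String>> reduce(String name, List<String> pk,
            List<String> simple, Map<String, List<String>> composite,
            List<String> multivalued) {
        Map<String, List<String>> schemas = new LinkedHashMap<>();
        List<String> main = new ArrayList<>(simple);
        for (List<String> components : composite.values()) {
            main.addAll(components); // one attribute per component
        }
        schemas.put(name, main);
        for (String attr : multivalued) {
            List<String> extra = new ArrayList<>(pk);
            extra.add(attr);
            schemas.put(name + "_" + attr, extra);
        }
        return schemas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> composite = new LinkedHashMap<>();
        composite.put("name", Arrays.asList("first_name", "last_name")); // assumed
        System.out.println(reduce("instructor", Arrays.asList("ID"),
                Arrays.asList("ID", "salary"), composite,
                Arrays.asList("phone_number")));
    }
}
```

Running `main` prints `{instructor=[ID, salary, first_name, last_name], instructor_phone_number=[ID, phone_number]}`.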
Representing Relationship Set
pk: union of the pks of the entity sets participating in the relationship set
- many-many: create new schema
ADVISOR = (s_id, i_id)
+----------------+                      +----------------+
| INSTRUCTOR     |                      | STUDENT        |
|                |    +------------+    |                |
| **ID**         |----|  advisor   |----| **ID**         |
| name           |    +------------+    | name           |
| salary         |                      | tot_cred       |
+----------------+                      +----------------+
- many-one: add the pk of the "one" side as an attribute of the "many" side -> foreign key constraint
- one-one: either side can be chosen
- partial participation: the added attribute may contain null values
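As a rough sketch of the cardinality rules above, the two helpers below build the resulting attribute lists. The class and method names are hypothetical, and the many-one example (an instructor belonging to one department with pk `dept_name`) is an assumed illustration that does not appear in these notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers; a schema is modeled as a plain list of attribute names.
final class RelationshipReduction {

    // many-many: new schema whose attributes (and pk) are the union of both pks.
    static List<String> manyToMany(List<String> leftPk, List<String> rightPk) {
        List<String> schema = new ArrayList<>(leftPk);
        schema.addAll(rightPk);
        return schema;
    }

    // many-one: no new schema; the "many" side gains the pk of the "one" side
    // as a foreign key. For one-one, either side may play the "many" role.
    static List<String> manyToOne(List<String> manySide, List<String> onePk) {
        List<String> schema = new ArrayList<>(manySide);
        schema.addAll(onePk);
        return schema;
    }

    public static void main(String[] args) {
        // advisor: both pks are named ID, so they are renamed s_id and i_id.
        System.out.println(manyToMany(Arrays.asList("s_id"), Arrays.asList("i_id")));
        // Assumed example: each instructor belongs to one department.
        System.out.println(manyToOne(Arrays.asList("ID", "name", "salary"),
                Arrays.asList("dept_name")));
    }
}
```

With partial participation on the "many" side, the appended foreign key column is the one that may hold nulls.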
Redundancy of Schemas
TBD
How many tables are required?: each many-many relationship and each multivalued attribute requires a new table
Extended E-R Features
Specialization: Top-down / Generalization: Bottom-up
                   +----------------+
                   | PERSON         |
                   |                |
                   | **ID**         |
              +--->| name           |<---+
              |    | street         |    |
              |    | city           |    |
              |    +----------------+    |
      +----------------+        +----------------+
      | EMPLOYEE       |        | STUDENT        |
   +->|                |<----+  |                |
   |  | salary         |     |  | tot_credits    |
   |  +----------------+     |  +----------------+
+----------------+    +----------------+
| INSTRUCTOR     |    | SECRETARY      |
|                |    |                |
| rank           |    | hours_per_week |
+----------------+    +----------------+
- Overlapping: EMPLOYEE / STUDENT
- Disjoint: INSTRUCTOR / SECRETARY
Attribute Inheritance: a lower-level ES inherits the attributes of its higher-level ES (LOWER is-a HIGHER)
- Specialization to Schema
- method 1: requires access to two tables
- schema for HIGH entity
- schema for LOW entity: HIGH.pk + local attribute
person: ID, name, street, city
student: ID, tot_cred
employee: ID, salary
- method 2: redundant information
- each ES contains local and inherited attributes
person: ID, name, street, city
student: ID, name, street, city, tot_cred
employee: ID, name, street, city, salary
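The two methods can be compared side by side with a small sketch. The class `SpecializationReduction` and its method names are hypothetical; the attribute lists are the person / student ones from the notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two ways to turn a specialization into schemas.
final class SpecializationReduction {

    // Method 1: lower-level schema = pk of the higher-level ES + local attributes.
    // Reading all attributes of a student then needs both person and student.
    static List<String> method1(List<String> highPk, List<String> local) {
        List<String> schema = new ArrayList<>(highPk);
        schema.addAll(local);
        return schema;
    }

    // Method 2: lower-level schema = all inherited attributes + local attributes.
    // One table suffices, but name/street/city are stored redundantly.
    static List<String> method2(List<String> highAttrs, List<String> local) {
        List<String> schema = new ArrayList<>(highAttrs);
        schema.addAll(local);
        return schema;
    }

    public static void main(String[] args) {
        List<String> person = Arrays.asList("ID", "name", "street", "city");
        System.out.println(method1(Arrays.asList("ID"), Arrays.asList("tot_cred")));
        System.out.println(method2(person, Arrays.asList("tot_cred")));
    }
}
```

The first call prints `[ID, tot_cred]` and the second `[ID, name, street, city, tot_cred]`, matching the two student schemas listed above.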
Completeness Constraint (Total/Partial generalization)
- Total generalization: Each HIGH must belong to some LOW
- Partial is default
Aggregation
- How to express relationships among relationships?
+-----------------------------------------------------------+
|                    +-------------+                        |
|                    |   project   |---+                    |
|                    +-------------+   |                    |
|                           |          |                    |
|                           v          |                    |
| +-------------+    +-------------+   |   +-------------+  |
| | instructor  |--->| proj_guide  |<--+---|   student   |  |
| +-------------+    +-------------+   |   +-------------+  |
|        |                             |          |         |
+--------+-----------------------------+----------+---------+
         |           +-------------+   |          |
         +---------->|  eval_for   |<--+----------+
                     +-------------+
                            ^
                            |
                     +-------------+
                     | evaluation  |
                     +-------------+
> Relationship proj_guide is redundant! => aggregate (proj_guide)
Schema: pk of aggregated relationship + pk of associated ES + any descriptive attributes
Big Data
Motivation
very large volumes of data being collected
- Volume: larger amount of data stored
- Velocity: higher rate of insertion
- Variety: many types of data
Big data query: high scalability
Big Data storage system
Distributed file system
> large collection of machines, but gives a single view to the clients
- e.g. Hadoop File System (HDFS)
- NameNode: filename -> list of block identifiers
- DataNode: block identifier -> physical location
- Data coherency: write once, read many
+-------------------+
| NN                |
| fn -> block       |
+-------------------+
          |
          |
          v
+-------------------+
| DN                |
| block -> addr     |
+-------------------+
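The two-step lookup (filename to block identifiers, block identifier to physical location) and the write-once rule can be modeled with two in-memory maps. This is a toy sketch, not the real HDFS API: the class `ToyDfs`, its methods, and the location strings such as `machine1:/blk_0` are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of the two lookups; this is not the real HDFS API.
final class ToyDfs {
    // NameNode role: filename -> list of block identifiers
    private final Map<String, List<Integer>> nameNode = new HashMap<>();
    // DataNode role: block identifier -> physical location (opaque string here)
    private final Map<Integer, String> dataNode = new HashMap<>();
    private int nextBlockId = 0;

    // Write once: a file is stored a single time and never modified afterwards.
    void write(String filename, List<String> blockLocations) {
        if (nameNode.containsKey(filename)) {
            throw new IllegalStateException("write once: " + filename + " already exists");
        }
        List<Integer> ids = new ArrayList<>();
        for (String location : blockLocations) {
            dataNode.put(nextBlockId, location);
            ids.add(nextBlockId++);
        }
        nameNode.put(filename, ids);
    }

    // Read many: resolve filename -> block ids -> locations.
    List<String> read(String filename) {
        List<String> locations = new ArrayList<>();
        for (int id : nameNode.getOrDefault(filename, new ArrayList<>())) {
            locations.add(dataNode.get(id));
        }
        return locations;
    }

    public static void main(String[] args) {
        ToyDfs dfs = new ToyDfs();
        dfs.write("log.txt", Arrays.asList("machine1:/blk_0", "machine2:/blk_1"));
        System.out.println(dfs.read("log.txt"));
    }
}
```

A second `write("log.txt", ...)` throws, which is how this sketch stands in for the write-once, read-many coherency model.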
Key-value storage
- Records partitioned across multiple machines
- queries routed by the system to the appropriate machine
- data replicated among machines (better availability), with replicas kept consistent
- What do they store?
- bytes (with associated keys)
- wide-table (attribute name + associated key)
- json (Document stores)
- supported operations
- put
- get
- delete
- optional operations
- range query
- query on non-key attributes
- KV stores are not full-fledged DBs! (NoSQL)
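A minimal sketch of the put / get / delete interface with records partitioned across machines might look like the following. Each "machine" is just an in-memory map, hash partitioning is an assumption (the notes only say queries are routed to the appropriate machine), and replication is left out entirely. The class name `PartitionedKvStore` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy key-value store: values are uninterpreted bytes with associated keys.
final class PartitionedKvStore {
    private final List<Map<String, byte[]>> partitions = new ArrayList<>();

    PartitionedKvStore(int machines) {
        for (int i = 0; i < machines; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Routing step: hash partitioning is an assumption made for this sketch.
    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(String key, byte[] value) {
        partitions.get(partitionOf(key)).put(key, value);
    }

    byte[] get(String key) {
        return partitions.get(partitionOf(key)).get(key); // null if absent
    }

    void delete(String key) {
        partitions.get(partitionOf(key)).remove(key);
    }

    public static void main(String[] args) {
        PartitionedKvStore store = new PartitionedKvStore(3);
        store.put("user:1", "Kim".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.get("user:1"), StandardCharsets.UTF_8));
        store.delete("user:1");
        System.out.println(store.get("user:1") == null);
    }
}
```

Because keys are scattered by hash, a range query or a query on a non-key attribute would have to visit every partition, which is one reason such operations are optional.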
Parallel and distributed DBs
- run on multiple machines (cluster): store data and process queries across a large number of machines
Replication and consistency
- Availability: system can run even if parts have failed
- Consistency: operations see the latest version of the data
MapReduce
Abstracts the issues of distributed and parallel environments away from the programmer (functional programming?)
- Input: "One a penny, two a penny, hot cross buns"
- Processing(split): ("One a penny", "two a penny", "hot cross buns")
- map
- reduce
map(k1, v1) -> list(k2, v2)
e.g. (void, textline: string) -> (first: string, count: int)
reduce(k2, list(v2)) -> list(k3, v3)
e.g. (first: string, counts: []int) -> (first: string, total: int)
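To make the two signatures concrete, here is a single-process word count in plain Java with no Hadoop dependency. It is a sketch under assumptions: the class `ToyWordCount` and its `run` driver are hypothetical, the grouping step stands in for the shuffle that a real framework performs between map and reduce, and treating the emitted key as a lower-cased word is my reading of the example above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of map -> group by key -> reduce.
final class ToyWordCount {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> (k3, v3): sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Driver: map every split, group values by key, then reduce each group.
    static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : splits) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "One a penny", "two a penny", "hot cross buns")));
    }
}
```

Running `main` prints `{a=2, buns=1, cross=1, hot=1, one=1, penny=2, two=1}`.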
// Map function
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
// Map function