Chapter 6. ER-Diagram?

ER Diagram -(Reduction)-> Relation Schemas

Entity Set

Strong entity set
- (entity set).pk => (schema).pk
- composite attribute: flatten (separate attribute for each component)
- multivalued attribute: new schema $R$ w/ (ES).pk + (Multivalued attribute)
  - e.g. INST_PHONE = (**ID**, **phone_number**)

student(**ID**, name, tot_cred)
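
A minimal sketch of the two rules above (flattening a composite attribute, and moving a multivalued attribute into its own schema). The class, record, and field names and the sample data are made up for illustration; only the INST_PHONE = (ID, phone_number) shape comes from the notes.

```java
import java.util.ArrayList;
import java.util.List;

class StrongEntityReduction {
    // Hypothetical entity with a composite attribute (name) and a multivalued one (phones).
    record Name(String first, String last) {}
    record Instructor(String id, Name name, List<String> phones) {}

    // instructor(ID, first_name, last_name): the composite attribute is flattened.
    record InstructorRow(String id, String firstName, String lastName) {}
    // inst_phone(ID, phone_number): one row per value of the multivalued attribute.
    record InstPhoneRow(String id, String phoneNumber) {}

    static InstructorRow flatten(Instructor i) {
        return new InstructorRow(i.id(), i.name().first(), i.name().last());
    }

    static List<InstPhoneRow> phoneRows(Instructor i) {
        List<InstPhoneRow> rows = new ArrayList<>();
        for (String p : i.phones()) {
            rows.add(new InstPhoneRow(i.id(), p));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Instructor i = new Instructor("i1", new Name("Ada", "Lovelace"),
                List.of("555-0100", "555-0101"));
        System.out.println(flatten(i));
        System.out.println(phoneRows(i));
    }
}
```

Both columns of `InstPhoneRow` together form the key, since one instructor may have several phone numbers.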

Weak entity set : (entity set).pk/discriminator => (schema).pk

section(*course_id*, *sec_id*, *semester*, *year*)
┌────────────────┐                      ┌────────────────┐
│     COURSE     │                      │    SECTION     │
│                │   ┌──────────────┐   │                │
│ **course_id**  │◀──┤  sec_course  ├───│    *sec_id*    │
│     title      │   └──────────────┘   │   *semester*   │
│    credits     │                      │     *year*     │
└────────────────┘                      └────────────────┘

Representing Relationship Set

pk: union of pks participating in relationship set
many-many: create new schema

ADVISOR = (s_id, i_id)
┌────────────────┐                      ┌────────────────┐
│   INSTRUCTOR   │                      │    STUDENT     │
│                │   ┌──────────────┐   │                │
│     **ID**     │───┤   advisor    ├───│     **ID**     │
│      name      │   └──────────────┘   │      name      │
│     salary     │                      │    tot_cred    │
└────────────────┘                      └────────────────┘

many-one: add (pk of "one" side) attribute to "many" side -> foreign key constraint
one-one: either side can be chosen
partial participation: may contain null values
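
A rough sketch of the cases above, assuming for illustration that advisor is many-to-one from student to instructor (in the diagram it is many-many, which is why ADVISOR gets its own schema). All class, record, and field names are hypothetical.

```java
import java.util.List;
import java.util.Set;

class RelationshipReduction {
    // Many-to-many: advisor gets its own schema; its pk is the union (s_id, i_id).
    record AdvisorRow(String sId, String iId) {}

    // Many-to-one: the pk of the "one" side (instructor) is added to the "many" side.
    // advisorId is null when the student does not participate (partial participation).
    record StudentRow(String id, String name, int totCred, String advisorId) {}

    // Foreign-key check: every non-null advisorId must reference an existing instructor ID.
    static boolean foreignKeyHolds(List<StudentRow> students, Set<String> instructorIds) {
        for (StudentRow s : students) {
            if (s.advisorId() != null && !instructorIds.contains(s.advisorId())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<StudentRow> students = List.of(
                new StudentRow("s1", "Kim", 30, "i1"),
                new StudentRow("s2", "Lee", 12, null));
        System.out.println(foreignKeyHolds(students, Set.of("i1")));
        System.out.println(foreignKeyHolds(students, Set.of("i9")));
    }
}
```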

Redundancy of Schemas

TBD

How many tables are required?: each many-many relationship and each multivalued attribute requires a new table

Extended E-R Feature

Specialization: Top-down / Generalization: Bottom-up

                               ┌────────────────┐
                               │     PERSON     │
                               │                │
                               │     **ID**     │
                        ┌─────▶│      name      │◀─────┐
                        │      │     street     │      │
                        │      │      city      │      │
                        │      └────────────────┘      │
               ┌────────────────┐             ┌────────────────┐
               │    EMPLOYEE    │             │    STUDENT     │
         ┌────▶│                │◀────┐       │                │
         │     │     salary     │     │       │  tot_credits   │
         │     └────────────────┘     │       └────────────────┘
┌────────────────┐           ┌────────────────┐
│   INSTRUCTOR   │           │   SECRETARY    │
│                │           │                │
│      rank      │           │ hours_per_week │
└────────────────┘           └────────────────┘

Overlapping: EMPLOYEE / STUDENT
Disjoint: INSTRUCTOR / SECRETARY

Attribute Inheritance: lower-level ES inherits the attributes of the higher-level ES: LOWER IS-A HIGHER

Specialization to Schema
- method 1: requires access to two tables
  - schema for HIGH entity
  - schema for LOW entity: HIGH.pk + local attribute
```
person: ID, name, street, city
student: ID, tot_cred
employee: ID, salary
```
- method 2: redundant infos
  - each ES contains local and inherited attributes
```
person: ID, name, street, city
student: ID, name, street, city, tot_cred
employee: ID, name, street, city, salary
```
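
To make the trade-off concrete, here is a small sketch that models each table as an in-memory map keyed by ID (an assumption for illustration; the class, record, and method names are hypothetical). Method 1 needs two lookups to assemble a full student, while method 2 needs one but stores name, street, and city twice.

```java
import java.util.Map;

class SpecializationSchemas {
    record PersonRow(String id, String name, String street, String city) {}
    // Method 1: the low-level schema holds only HIGH.pk + local attributes.
    record StudentRow(String id, int totCred) {}
    // Method 2: the low-level schema holds inherited + local attributes.
    record StudentFullRow(String id, String name, String street, String city, int totCred) {}

    // Method 1: join person and student on ID (two table accesses).
    static StudentFullRow lookupMethod1(String id, Map<String, PersonRow> person,
                                        Map<String, StudentRow> student) {
        PersonRow p = person.get(id);
        StudentRow s = student.get(id);
        if (p == null || s == null) {
            return null;
        }
        return new StudentFullRow(id, p.name(), p.street(), p.city(), s.totCred());
    }

    // Method 2: a single table access, at the cost of redundant columns.
    static StudentFullRow lookupMethod2(String id, Map<String, StudentFullRow> student) {
        return student.get(id);
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Map<String, PersonRow> person = Map.of("p1", new PersonRow("p1", "Kim", "Main St", "Seoul"));
        Map<String, StudentRow> student = Map.of("p1", new StudentRow("p1", 30));
        System.out.println(lookupMethod1("p1", person, student));
    }
}
```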

Completeness Constraint (Total/Partial generalization)

Total generalization: Each HIGH must belong to some LOW
Partial is default

Aggregation

How to express relationships among relationships?

┌─────────────────────────────────────────────────────────┐
│                     ┌─────────────┐                     │
│                     │   project   │──┐                  │
│                     └─────────────┘  │                  │
│                            │         │                  │
│                            ▼         │                  │
│ ┌─────────────┐     ┌─────────────┐  │   ┌─────────────┐│
│ │ instructor  │────▶│ proj_guide  │◀─┼───│   student   ││
│ └─────────────┘     └─────────────┘  │   └─────────────┘│
│        │                             │          │       │
└────────┼─────────────────────────────┼──────────┼───────┘
         │            ┌─────────────┐  │          │
         └───────────▶│  eval_for   │◀─┴──────────┘
                      └─────────────┘
                             ▲
                             │
                      ┌─────────────┐
                      │ evaluation  │
                      └─────────────┘

> Relationship proj_guide is redundant! => aggregate (proj_guide)

pk of aggregated relationship + pk of associated ES + any descriptive attributes
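
As a sketch of this rule applied to eval_for: the attribute names below (s_ID, project_id, i_ID, evaluation_id) are assumptions, since the notes do not list the keys of the participating entity sets.

```java
import java.util.ArrayList;
import java.util.List;

class AggregationSchema {
    // Schema = pk of aggregated relationship + pk of associated ES + descriptive attributes.
    static List<String> schemaFor(List<String> aggregatedPk, List<String> entityPk,
                                  List<String> descriptive) {
        List<String> attrs = new ArrayList<>(aggregatedPk);
        attrs.addAll(entityPk);
        attrs.addAll(descriptive);
        return attrs;
    }

    public static void main(String[] args) {
        // Hypothetical keys: proj_guide is identified by (s_ID, project_id, i_ID),
        // evaluation by evaluation_id; eval_for has no descriptive attributes here.
        List<String> evalFor = schemaFor(
                List.of("s_ID", "project_id", "i_ID"),
                List.of("evaluation_id"),
                List.of());
        System.out.println("eval_for" + evalFor);
    }
}
```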

Big Data

Motivation

very large volumes of data being collected

Volume: larger amount of data stored
Velocity: higher rate of insertion
Variety: many types of data

Big data query: high scalability

Big Data storage system

Distributed file system

> large collection of machines, but gives single view to the clients

e.g. Hadoop File System (HDFS)
- NameNode
  - filename -> list of block identifiers
- DataNode
  - block identifier -> physical location
- Data coherency
  - write once read many

┌───────────────────┐
│        NN         │
│    fn -> block    │
└───────────────────┘
          │
          │
          ▼
┌───────────────────┐
│        DN         │
│   block -> addr   │
└───────────────────┘
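
The two mappings can be sketched as a toy in-memory model. This is not the HDFS API: `MiniDfs`, `writeOnce`, and `read` are hypothetical names, and the block contents stand in for the physical location a real DataNode would track.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniDfs {
    // NameNode role: filename -> list of block identifiers.
    private final Map<String, List<Long>> nameNode = new HashMap<>();
    // DataNode role: block identifier -> stored block (stand-in for a physical location).
    private final Map<Long, String> dataNode = new HashMap<>();
    private long nextBlockId = 0;

    // Write once, read many: an existing file cannot be written again.
    void writeOnce(String filename, List<String> blocks) {
        if (nameNode.containsKey(filename)) {
            throw new IllegalStateException("file already written: " + filename);
        }
        List<Long> ids = new ArrayList<>();
        for (String block : blocks) {
            long id = nextBlockId++;
            dataNode.put(id, block);
            ids.add(id);
        }
        nameNode.put(filename, ids);
    }

    // Read: ask the NameNode for block ids, then fetch each block from the DataNode.
    String read(String filename) {
        StringBuilder sb = new StringBuilder();
        for (long id : nameNode.getOrDefault(filename, List.of())) {
            sb.append(dataNode.get(id));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MiniDfs fs = new MiniDfs();
        fs.writeOnce("notes.txt", List.of("hello ", "world"));
        System.out.println(fs.read("notes.txt"));
    }
}
```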

Key-value storage

Partitioned records across multiple machines
queries routed by system to appropriate machine, replica consistency
replicated data among machines (better availability)
What do they store?
- bytes (with associated keys)
- wide-table (attribute name + associated key)
- json (Document stores)
supported operations
- put
- get
- delete
optional operations
- range query
- query on non-key attributes
KVSs are not full-fledged DBs! (NoSQL)
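
A minimal sketch of put/get/delete over partitioned records, assuming hash partitioning on the key and one in-memory map per "machine". Replication and replica consistency are left out, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedKvStore {
    // Each map stands in for one machine holding a partition of the records.
    private final List<Map<String, byte[]>> partitions = new ArrayList<>();

    PartitionedKvStore(int machines) {
        for (int i = 0; i < machines; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Route a key to its machine; floorMod keeps the index non-negative.
    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(String key, byte[] value) {
        partitions.get(partitionOf(key)).put(key, value);
    }

    // Returns null when the key is absent.
    byte[] get(String key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    void delete(String key) {
        partitions.get(partitionOf(key)).remove(key);
    }

    public static void main(String[] args) {
        PartitionedKvStore kv = new PartitionedKvStore(3);
        kv.put("user:1", "Kim".getBytes());
        System.out.println(new String(kv.get("user:1")));
        kv.delete("user:1");
        System.out.println(kv.get("user:1") == null);
    }
}
```

Range queries and queries on non-key attributes are hard in this layout because hashing scatters neighboring keys across machines, which is one reason the notes list them as optional.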

parallel and distributed DBs

run on multiple machines (cluster): store data and process queries across a large number of machines

Replication and consistency

Availability: system can run even if parts have failed
Consistency: every operation sees the latest version

MapReduce

Abstracting the issues of distributed and parallel environment from programming (functional programming?)

Input: "One a penny, two a penny, hot cross buns"
Processing(split): ("One a penny", "two a penny", "hot cross buns")
map
reduce

map(k1, v1) -> list(k2, v2)
e.g. (void, textline: string) -> (first: string, count: int)

reduce(k2, list(v2)) -> list(k3, v3)
e.g. (first: string, counts: []int) -> (first: string, total: int)
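
The two signatures can be tried out without a cluster. The sketch below is a single-process word count in plain Java, not Hadoop code: `WordCountSketch`, `run`, and the grouping step that stands in for the shuffle are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // map(k1, v1) -> list(k2, v2): the input key is ignored; emits (word, 1) per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(Map.entry(w, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> (k3, v3): sums the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Driver: map every split, group values by key (the shuffle), then reduce each group.
    static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : splits) {
            for (Map.Entry<String, Integer> e : map(line)) {
                groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("One a penny", "two a penny,", "hot cross buns")));
        // prints {a=2, buns=1, cross=1, hot=1, one=1, penny=2, two=1}
    }
}
```

In a real framework the map calls run in parallel on different machines and the grouping happens over the network; only `map` and `reduce` are written by the programmer.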

// Map function
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>

CS360 Final
