Data Science

From TedYunWiki

Latest revision as of 22:50, 2 September 2013

The Three V's of Big Data

  • Volume: number of rows/objects/bytes
  • Variety: number of columns/dimensions/sources
  • Velocity: number of rows/bytes per unit time

(Sometimes a fourth V is added: Veracity, i.e. can we trust this data?)

Data Model

Three components:

  • Structures
  • Constraints
  • Operations
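As a rough illustration of the three components, the sketch below uses Python's built-in sqlite3 module. The table name `person`, its columns, the sample rows, and the helper `try_insert` are all hypothetical, chosen only to show a structure, some constraints on it, and a few operations over it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure: a relation with named, typed columns.
# Constraints: a primary key, NOT NULL on name, and a CHECK on age.
conn.execute(
    """
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)

# Operations: insert rows, then query them.
conn.executemany(
    "INSERT INTO person (id, name, age) VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Alan", 41)],
)


def try_insert(row):
    """Return True if the row satisfies the constraints, else False."""
    try:
        conn.execute(
            "INSERT INTO person (id, name, age) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert((3, "Grace", 85))
rejected = try_insert((4, "Nobody", -1))  # violates CHECK (age >= 0)
names = [r[0] for r in conn.execute("SELECT name FROM person ORDER BY id")]
```

The row with a negative age is refused by the database itself, which is the "data model enforcement" idea: the constraint lives with the data rather than in each application.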

What is a database? A collection of information organized to afford efficient retrieval.

Why do we need a database?

  • Sharing
  • Data model enforcement
  • Scale
  • Flexibility

Relational Algebra

http://en.wikipedia.org/wiki/Relational_algebra

Operations

  • Union $\cup$, intersection $\cap$, difference $-$
  • Selection $\sigma$
  • Projection $\Pi$
  • Join $\bowtie$
  • (Extended RA) Duplicate elimination $\delta$
  • (Extended RA) Grouping and aggregation $\gamma$
  • (Extended RA) Sorting $\tau$
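The operations listed above can be sketched in plain Python, treating a relation as a list of rows and a row as a dict from attribute name to value. This is only a minimal illustration under those assumptions, not how a database engine implements them; the function names and the sample relations `R` and `S` are hypothetical.

```python
from collections import defaultdict


def _key(row):
    """Hashable form of a row (a dict) so rows can be compared as set members."""
    return tuple(sorted(row.items()))


def distinct(rel):
    """Duplicate elimination: keep the first copy of each row."""
    seen, out = set(), []
    for row in rel:
        k = _key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def union(r, s):
    return distinct(r + s)


def intersection(r, s):
    s_keys = {_key(row) for row in s}
    return distinct([row for row in r if _key(row) in s_keys])


def difference(r, s):
    s_keys = {_key(row) for row in s}
    return distinct([row for row in r if _key(row) not in s_keys])


def select(rel, pred):
    """Selection: keep the rows for which the predicate holds."""
    return [row for row in rel if pred(row)]


def project(rel, attrs):
    """Projection: keep only the named attributes (set semantics)."""
    return distinct([{a: row[a] for a in attrs} for row in rel])


def group_aggregate(rel, by, attr, agg):
    """Grouping and aggregation: apply agg to attr within each group of by."""
    groups = defaultdict(list)
    for row in rel:
        groups[row[by]].append(row[attr])
    return [{by: k, attr: agg(v)} for k, v in groups.items()]


def sort(rel, attr):
    """Sorting: order rows by one attribute, ascending."""
    return sorted(rel, key=lambda row: row[attr])


# Hypothetical sample relations with the same schema.
R = [
    {"name": "Ada", "dept": "eng", "pay": 120},
    {"name": "Alan", "dept": "eng", "pay": 100},
    {"name": "Grace", "dept": "ops", "pay": 90},
]
S = [
    {"name": "Grace", "dept": "ops", "pay": 90},
    {"name": "Edgar", "dept": "ops", "pay": 80},
]

well_paid = select(R, lambda row: row["pay"] > 95)
depts = project(R, ["dept"])
pay_by_dept = group_aggregate(R, "dept", "pay", sum)
```

Note that union, intersection, and difference only make sense when both relations have the same attributes, and that projection removes duplicates here because classic relational algebra works on sets; SQL, by contrast, keeps duplicates unless DISTINCT is requested.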

Join

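  • Equi-join $\bowtie_{A=B}$: pair rows whose attributes $A$ and $B$ are equal
  • $\theta$-join $\bowtie_\theta$: pair rows that satisfy an arbitrary condition $\theta$ (equi-join is the special case where $\theta$ is an equality)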
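The two join variants can be sketched as a filtered cross product, again with rows as dicts. This is a naive nested-loop illustration that assumes the two relations use distinct attribute names; the functions `theta_join` and `equi_join` and the sample relations are hypothetical.

```python
def theta_join(r, s, theta):
    """Pair every row of r with every row of s; keep pairs where theta holds."""
    return [{**a, **b} for a in r for b in s if theta(a, b)]


def equi_join(r, s, a, b):
    """Theta-join whose condition is equality between r's a and s's b."""
    return theta_join(r, s, lambda x, y: x[a] == y[b])


# Hypothetical sample relations with disjoint attribute names.
employees = [
    {"emp": "Ada", "dept_id": 1},
    {"emp": "Alan", "dept_id": 2},
    {"emp": "Grace", "dept_id": 3},
]
departments = [
    {"id": 1, "dept": "eng"},
    {"id": 2, "dept": "ops"},
]

# Equi-join on employees.dept_id = departments.id
matched = equi_join(employees, departments, "dept_id", "id")

# Theta-join with an inequality condition: dept_id > id
greater = theta_join(employees, departments, lambda e, d: e["dept_id"] > d["id"])
```

Grace has no matching department, so she is absent from the equi-join result; the inequality join, on the other hand, pairs her with both departments.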