Data Modeling


DS220 Data Modeling

  • Analyze requirements
  • Identify entities and relationships
  • Identify queries
  • Specify the schema
  • Optimize, optimize, optimize
  • CQL

Case Study

KillrVideo, Inc.
Problems

  • Scalability
  • Reliability
  • Ease of use

Solutions Attempted
Relational Database Problem

  • Single points of failure
  • Scaling complexity
  • Reliability issues
  • Difficult to serve users worldwide

WHy Cassandra

  • Peers instead of master/slave
  • Linear scale performance
  • Always on reliability
  • Data can be stored geographically close to clients

Keyspaces

Top-level namespace/container
Replication parameters required

CREATE KEYSPACE killrvideo
   WITH replication = {'class': 'SimpleStrategy',
   'replication_factor': 1
   };

Use

  • USE switches between keyspaces
    USE killrvideo ;
    

Tables

  • keyspaces contain tables
  • tables contain data

Primary Keyys

Basic Data types

  • text
  • int
  • UUID (Universally Unique Identifier)
    • Generate via uuid()
    • the third part begin with 4
  • TIMEUUID embeds a TIMESTAMP value
    • Sortable
    • Generate viad now()
    • the third part begin with 1
    • TIMESTAMP
      • Milliseconds since January 1 1970 at 00:00:00 MT
      • Displayed in cqlsh as yyyy-mm-dd HH:mm:ssZ

COPY

  • Imports/exports CSV
    COPY table1 (column1, column2, column3) FROM 'table1data.csv';
    
  • Header parameter skips the first line in the file
    COPY table1 (column1, column2, column3) FROM 'table1data.csv'
    WITH HEADER=true;
    

SELECT

SELECT *
FROM table1;

SELECT column1, column2, column3
FROM table1;

SELECT COUNT(*)
FROM table1;

SELECT *
FROM table1
LIMIT 10;

Exercise 2

Create Keyspaces

 CREATE KEYSPACE killrvideo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

Create Table

CREATE TABLE videos (
  video_id timeuuid ,
  added_date timestamp ,
  description text ,
  title text ,
  user_id uuid ,
  primary KEY (video_id)
);

Describe Table

DESCRIBE TABLE videos ;

Import CSV file

COPY videos FROM 'videos.csv' WITH HEADER= true ;

Select

SELECT * FROM videos LIMIT 10;

SELECT count(*) FROM videos ;

Exercise 3

USE killrvideo ;

 CREATE TABLE videos_by_title_year (
  title text ,
  added_year int ,
  added_date timestamp ,
  description text ,
  user_id uuid ,
  video_id uuid ,
  primary KEY ((title, added_year))
) ;

COPY videos_by_title_year FROM 'videos_by_title_year.csv' WITH HEADER= true ;

SELECT * FROM videos_by_title_year WHERE title= 'Sleepy Grumpy Cat' and added_year = 2015;

if query the table with only title of added_year, the system will return an error

SELECT * FROM videos_by_title_year WHERE title= 'Sleepy Grumpy Cat' ;

The reason is that, cassandra use partition key to calculate the hash code, and use hash code to find the place of value.

keyspace --> partition --> clustering columns
Partition keys determine a grouping criteria whereas clustering columns determine ordering criteria

Exercise 4

USE killrvideo ;

CREATE TABLE bad_videos_by_tag_year_table (
  tag text,
  added_year int,
  added_date timestamp,
  title text,
  description text,
  user_id uuid,
  video_id timeuuid,
  PRIMARY KEY ((video_id))
);

DESCRIBE TABLE bad_videos_by_tag_year_table ;

COPY bad_videos_by_tag_year_table (tag, added_year, video_id, added_date, description, title, user_id) FROM 'videos_by_tag_year.csv' WITH HEADER=true;

SELECT count(*) FROM bad_videos_by_tag_year_table ;

DROP TABLE bad_videos_by_tag_year_table ;

with this example, because some video_id are same, thus the following will be treated as upsert for the table. Some records are overlaped.

Recreate a table

CREATE TABLE videos_by_tag_year (
  tag text,
  added_year int,
  video_id timeuuid,
  added_date timestamp,
  description text,
  title text,
  user_id uuid,
  PRIMARY KEY ((tag), added_year, video_id)
) WITH CLUSTERING ORDER BY (added_year DESC);

DESCRIBE TABLE videos_by_tag_year ;

COPY videos_by_tag_year FROM 'videos_by_tag_year.csv' WITH HEADER=true;

SELECT COUNT (*) FROM videos_by_tag_year ;

SELECT * FROM videos_by_tag_year WHERE tag = 'trailer' and added_year = 2015 ;

SELECT * FROM videos_by_tag_year WHERE tag = 'cql' and added_year = 2013 ;

SELECT * FROM videos_by_tag_year WHERE tag = 'cql' and added_year < 2015 ;