Data Modeling
DS220 Data Modeling
- Analyze requirements
- Identify entities and relationships
- Identify queries
- Specify the schema
- Optimize, optimize, optimize
- CQL
Case Study
KillrVideo, Inc.
Problems
- Scalability
- Reliability
- Ease of use
Solutions Attempted
Relational Database Problem
- Single points of failure
- Scaling complexity
- Reliability issues
- Difficult to serve users worldwide
WHy Cassandra
- Peers instead of master/slave
- Linear scale performance
- Always on reliability
- Data can be stored geographically close to clients
Keyspaces
Top-level namespace/container
Replication parameters required
CREATE KEYSPACE killrvideo
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 1
};
Use
- USE switches between keyspaces
USE killrvideo ;
Tables
- keyspaces contain tables
- tables contain data
Primary Keyys
Basic Data types
- text
- int
- UUID (Universally Unique Identifier)
- Generate via uuid()
- the third part begin with 4
- TIMEUUID embeds a TIMESTAMP value
- Sortable
- Generate viad now()
- the third part begin with 1
- TIMESTAMP
- Milliseconds since January 1 1970 at 00:00:00 MT
- Displayed in cqlsh as yyyy-mm-dd HH:mm:ssZ
COPY
- Imports/exports CSV
COPY table1 (column1, column2, column3) FROM 'table1data.csv';
- Header parameter skips the first line in the file
COPY table1 (column1, column2, column3) FROM 'table1data.csv' WITH HEADER=true;
SELECT
SELECT *
FROM table1;
SELECT column1, column2, column3
FROM table1;
SELECT COUNT(*)
FROM table1;
SELECT *
FROM table1
LIMIT 10;
Exercise 2
Create Keyspaces
CREATE KEYSPACE killrvideo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
Create Table
CREATE TABLE videos (
video_id timeuuid ,
added_date timestamp ,
description text ,
title text ,
user_id uuid ,
primary KEY (video_id)
);
Describe Table
DESCRIBE TABLE videos ;
Import CSV file
COPY videos FROM 'videos.csv' WITH HEADER= true ;
Select
SELECT * FROM videos LIMIT 10;
SELECT count(*) FROM videos ;
Exercise 3
USE killrvideo ;
CREATE TABLE videos_by_title_year (
title text ,
added_year int ,
added_date timestamp ,
description text ,
user_id uuid ,
video_id uuid ,
primary KEY ((title, added_year))
) ;
COPY videos_by_title_year FROM 'videos_by_title_year.csv' WITH HEADER= true ;
SELECT * FROM videos_by_title_year WHERE title= 'Sleepy Grumpy Cat' and added_year = 2015;
if query the table with only title of added_year, the system will return an error
SELECT * FROM videos_by_title_year WHERE title= 'Sleepy Grumpy Cat' ;
The reason is that, cassandra use partition key to calculate the hash code, and use hash code to find the place of value.
keyspace --> partition --> clustering columns
Partition keys determine a grouping criteria whereas clustering columns determine ordering criteria
Exercise 4
USE killrvideo ;
CREATE TABLE bad_videos_by_tag_year_table (
tag text,
added_year int,
added_date timestamp,
title text,
description text,
user_id uuid,
video_id timeuuid,
PRIMARY KEY ((video_id))
);
DESCRIBE TABLE bad_videos_by_tag_year_table ;
COPY bad_videos_by_tag_year_table (tag, added_year, video_id, added_date, description, title, user_id) FROM 'videos_by_tag_year.csv' WITH HEADER=true;
SELECT count(*) FROM bad_videos_by_tag_year_table ;
DROP TABLE bad_videos_by_tag_year_table ;
with this example, because some video_id are same, thus the following will be treated as upsert for the table. Some records are overlaped.
Recreate a table
CREATE TABLE videos_by_tag_year (
tag text,
added_year int,
video_id timeuuid,
added_date timestamp,
description text,
title text,
user_id uuid,
PRIMARY KEY ((tag), added_year, video_id)
) WITH CLUSTERING ORDER BY (added_year DESC);
DESCRIBE TABLE videos_by_tag_year ;
COPY videos_by_tag_year FROM 'videos_by_tag_year.csv' WITH HEADER=true;
SELECT COUNT (*) FROM videos_by_tag_year ;
SELECT * FROM videos_by_tag_year WHERE tag = 'trailer' and added_year = 2015 ;
SELECT * FROM videos_by_tag_year WHERE tag = 'cql' and added_year = 2013 ;
SELECT * FROM videos_by_tag_year WHERE tag = 'cql' and added_year < 2015 ;