CSV
- a
.csv
is a flat-file database - the data for each column is separated by commas, and each row is on a new line, saved simply as a file
- very portable; good for sorting; bad for searching
- if you use
with open
for opening files inpython
, it will handle the eventual closing of the file as well. So essentially, the file will remain open while you are indented inside thewith
block. - the
DictReader
presumes that the first row is going to be column headings (if your file doesn’t have headers, you’ll have to manually configureDictReader
)
🤔 What does to canonicalize mean?
- to canonicalize means to format your data in some standard way
This is how you’d define a dictionary:
And this is how you’d sort by dict value (not key):
Relational database
- programs that store data, ultimately in files, but with additional data structures that allow us to search and store data more efficiently.
- SQLite is a lightweight version of it (very much used on mobile apps)
- instead of
.csv
, the data is now stored in a binary file - inside the file live the tables (aka sheets inside a spreadsheet)
sqlite3
command-line tool for interacting with the db- you can import a
.csv
file intosqilite3
SQL
When working with data, you generally deal with these four actions:
CREATE
andINSERT
SELECT
(akaREAD
in theCRUD
world)UPDATE
DELETE
For example:
SELECT title FROM shows;
SELECT * FROM shows;
(selects all columns)
SQL supports many functions that we can use to count and summarize data:
AVG
COUNT
DISTINCT
, for getting distinct values without duplicatesLOWER
MAX
MIN
UPPER
- …
You can also add clauses, or phrases that modify the query:
WHERE
, matching results on a strict conditionLIKE
, matching results on a less strict conditionORDER BY
, ordering results in some wayLIMIT
, limiting the number of resultsGROUP BY
, grouping results in some way- …
Example queries:
sqlite> SELECT title FROM shows WHERE title LIKE "%Office%";
(%
is a wildcard)sqlite> SELECT DISTINCT(UPPER(title)) FROM shows ORDER BY UPPER(title);
SELECT UPPER(title), COUNT(title) FROM shows GROUP BY UPPER(title) ORDER BY COUNT(title);
🤔 What does it mean to normalize data?
- To reduce redundancy and ensure a single source of truth.
Data Types
BLOB
, for “binary large object”, raw binary data that might represent filesINTEGER
NUMERIC
, number-like but not quite a number, like a date or timeREAL
, for floating-point valuesTEXT
, like strings
Attributes
NOT NULL
, there must be some valueUNIQUE
, means that the value for that column must be unique for every row in the tablePRIMARY KEY
, like theid
column above that will be used to uniquely identify each rowFOREIGN KEY
, like theshow_id
column above that refers to a column in some other table
Back in our Python file, this is how you can create a squlite db programmatically:
More queries:
SELECT title FROM shows WHERE id IN (SELECT show_id FROM genres WHERE genre = "Musical");
Data Types in MySQL
INTEGER
smallint
, with fewer bitsinteger
bigint
, with more bits
NUMERIC
boolean
date
datetime
numeric(scale,precision)
, with a fixed number of digitstime
timestamp
REAL
real
double precision
, with twice as many bits
TEXT
char(n)
, a fixed number of charactersvarchar(n)
, a variable number of characters, up to some limitn
text
, a string with no limit
Imdb schema design
Indexes
- indexes tell our database program to create additional data structures so we can search and sort with logarithmic time (instead of
O(n)
):sqlite> CREATE INDEX title_index ON shows (title);
- this creates
B-trees
, like binary trees in C, with nodes organized such that we can search faster than linearly - A
B-tree
is a family tree that is very wide and not that tall (aka it tries to keep the leaf nodes as close to the root as possible)
JOIN
Problems
- SQL injection attacks: possible in any application that takes in a user input (e.g., Bobby Tables)
- race conditions: code in a multi-threaded environment can be commingled or mixed in each thread (e.g., Instagram egg with 50 million likes). The solution is to use
TRANSACTIONS
which lock a specific row in a table to only one update at a time.