Codecademy: Data Science 1
Posted on 24/07/2019, in Data Science. I created this note when I started learning Data Science on Codecademy.
A day in the life of a Data Analyst
 Data extraction with SQL
 Programming basics with Python
 Data analysis using pandas, a Python library
 Data visualization using Matplotlib, a Python library
 Machine Learning using scikit-learn, a Python library
Relational Database Management System (RDBMS)
 An RDBMS uses the SQL language to access the database.
 Popular RDBMS:
 SQLite:
 all of the data can be stored locally
 popular choice for databases in cellphones, PDAs, MP3 players, set-top boxes, and other electronic gadgets. The SQL courses on Codecademy use SQLite.
 MySQL:
 the most popular open source SQL database
 easy to use, inexpensive, reliable, large community of developers
 poor performance when scaling, open source development has lagged
 does not include some advanced features that developers may be used to
 PostgreSQL:
 open source SQL database
 shares many of the same advantages of MySQL
 foreign key support without requiring complex configuration.
 slower in performance than other databases
 Oracle DB:
 not open sourced (Oracle Corporation owns)
 for large applications, particularly in the banking industry
 SQL Server:
 Microsoft owns
 Large enterprise applications mostly use SQL Server.
 offers a free entry-level version called Express
SQL
 Just look up at this site!

The ALTER TABLE statement adds a new column to a table.
ALTER TABLE celebs ADD COLUMN twitter_handle TEXT;

Constraints that add information about how a column can be used are invoked after specifying the data type for a column.
CREATE TABLE celebs ( id INTEGER PRIMARY KEY, name TEXT UNIQUE, date_of_birth TEXT NOT NULL, date_of_death TEXT DEFAULT 'Not Applicable' );
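
To see the constraints and ALTER TABLE in action, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The inserted rows are made up for illustration; only the table definition comes from the note above.

```python
import sqlite3

# In-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE celebs (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        date_of_birth TEXT NOT NULL,
        date_of_death TEXT DEFAULT 'Not Applicable'
    )
""")

# date_of_death is omitted, so the DEFAULT value is stored (made-up row).
conn.execute(
    "INSERT INTO celebs (id, name, date_of_birth) VALUES (?, ?, ?)",
    (1, "Sample Person", "1990-01-01"),
)
default_value = conn.execute(
    "SELECT date_of_death FROM celebs WHERE id = 1"
).fetchone()[0]

# A second row with the same name violates the UNIQUE constraint.
try:
    conn.execute(
        "INSERT INTO celebs (id, name, date_of_birth) VALUES (?, ?, ?)",
        (2, "Sample Person", "1985-05-05"),
    )
    unique_violation = False
except sqlite3.IntegrityError:
    unique_violation = True

# ALTER TABLE appends the new column to the existing table.
conn.execute("ALTER TABLE celebs ADD COLUMN twitter_handle TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(celebs)")]

print(default_value, unique_violation, columns)
```

Existing rows get NULL in the newly added column until it is updated.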

AS renames a column or table in the output using an alias.
SELECT name AS 'ten' FROM movies;
DISTINCT is used to return unique values in the output. It filters out all duplicate values in the specified column(s).
LIKE can be a useful operator when you want to compare similar values. Check this for other usages.
A CASE statement allows us to create different outputs (usually in the SELECT statement). It is SQL’s way of handling if-then logic.
Cross join
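
The DISTINCT, LIKE and CASE notes above, plus a cross join, can be tried out with a small sketch like this one. It assumes a hypothetical movies table with made-up titles and ratings, and a hypothetical months table; none of these rows come from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [  # made-up rows for illustration
        ("Red Kite", "drama", 8.1),
        ("Red Moon", "drama", 6.4),
        ("Blue Lake", "comedy", 7.2),
        ("Green Hill", "comedy", 5.0),
    ],
)
conn.execute("CREATE TABLE months (month INTEGER)")
conn.executemany("INSERT INTO months VALUES (?)", [(1,), (2,)])

# DISTINCT: each genre appears once in the output.
genres = [r[0] for r in conn.execute(
    "SELECT DISTINCT genre FROM movies ORDER BY genre")]

# LIKE: % matches any run of characters, so this finds names starting with 'Red'.
red_movies = [r[0] for r in conn.execute(
    "SELECT name FROM movies WHERE name LIKE 'Red%' ORDER BY name")]

# CASE: if-then logic that builds a new output column.
verdicts = conn.execute("""
    SELECT name,
           CASE
               WHEN imdb_rating > 8 THEN 'great'
               WHEN imdb_rating > 6 THEN 'ok'
               ELSE 'poor'
           END AS verdict
    FROM movies
    ORDER BY name
""").fetchall()

# CROSS JOIN: every row of movies paired with every row of months.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM movies CROSS JOIN months").fetchone()[0]

print(genres, red_movies, verdicts, pair_count)
```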

WITH statements
WITH previous_results AS (
   SELECT ...
   ...
   ...
   ...
)
SELECT *
FROM previous_results
JOIN customers
   ON _____ = _____;
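
As a concrete but hypothetical way of filling in the template above, the sketch below assumes an orders table and a customers table that share a customer_id column. The table layouts, column names and rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# The WITH clause names an intermediate result (total spent per customer),
# which the main query then joins like an ordinary table.
rows = conn.execute("""
    WITH previous_results AS (
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer_id
    )
    SELECT customers.customer_name, previous_results.total_spent
    FROM previous_results
    JOIN customers
        ON previous_results.customer_id = customers.customer_id
    ORDER BY customers.customer_name
""").fetchall()

print(rows)
```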
NumPy with Statistics
np.percentile(d, 40) returns the value that splits array d so that 40% of the values lie below it and 60% lie above it.
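
A quick check of that reading, using a made-up array of the integers 1 to 100 (the array d here is an assumption, not course data):

```python
import numpy as np

d = np.arange(1, 101)            # made-up data: 1, 2, ..., 100
p40 = np.percentile(d, 40)       # value at the 40th percentile
share_below = np.mean(d <= p40)  # fraction of values at or below it

print(p40, share_below)
```

By default NumPy interpolates linearly between neighbouring values, so the result does not have to be an element of the array.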
histogram:
plt.hist(commutes, range=(20,50), bins=6)
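
To see what range and bins do without drawing a plot, this sketch uses np.histogram (swapped in for plt.hist, which takes the same two arguments) on a made-up commutes array:

```python
import numpy as np

# Made-up commute times in minutes; 18 and 52 fall outside the range.
commutes = np.array([18, 22, 24, 26, 31, 33, 34, 38, 41, 44, 47, 52])

# 6 equal-width bins between 20 and 50, so each bin is 5 minutes wide.
counts, edges = np.histogram(commutes, range=(20, 50), bins=6)

print(edges)   # bin boundaries
print(counts)  # number of values in each bin
```

Values outside the given range are ignored, which is why the counts sum to 10 and not 12.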
 A unimodal dataset has only one distinct peak.
 A bimodal dataset has two distinct peaks. This often happens when the data contains two different populations.
 A multimodal dataset has more than two peaks.
 A uniform dataset doesn’t have any distinct peaks.
 A symmetric dataset has equal amounts of data on both sides of the peak. Both sides should look about the same.
 A skew-right dataset has a long tail on the right of the peak, but most of the data is on the left.
 A skew-left dataset has a long tail on the left of the peak, but most of the data is on the right.
 The type of distribution affects the position of the mean and median. In heavily skewed distributions, the mean becomes a less useful measurement.
 The normal distribution is a symmetric, unimodal distribution.
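
The effect of skew on the mean and median can be illustrated with simulated data. This is only a sketch: the exponential distribution is my choice of a skew-right example, and the parameters and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

symmetric = rng.normal(loc=50, scale=5, size=100_000)  # symmetric, unimodal
skew_right = rng.exponential(scale=10, size=100_000)   # long tail on the right

# In the symmetric data, mean and median almost coincide.
sym_gap = abs(symmetric.mean() - np.median(symmetric))

# In the skew-right data, the long tail pulls the mean above the median.
skew_gap = skew_right.mean() - np.median(skew_right)

print(sym_gap, skew_gap)
```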
 Random number generator (drawing from a normal distribution):
a = np.random.normal(loc=0, scale=1, size=100000)
loc (mean of the normal distribution), scale (standard deviation of the normal distribution), size (number of random numbers)
 We expect 68% of our dataset to be between [mean - std, mean + std].
 68% of our samples will fall within ±1 standard deviation of the mean
 95% of our samples will fall within ±2 standard deviations of the mean
 99.7% of our samples will fall within ±3 standard deviations of the mean
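
These percentages can be checked empirically on simulated data. A minimal sketch, assuming a standard normal sample; it uses np.random.default_rng in place of np.random.normal only to make the seed explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=100_000)
mean, std = a.mean(), a.std()

# Fraction of samples within k standard deviations of the mean.
shares = {k: np.mean(np.abs(a - mean) <= k * std) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} std: {share:.4f}")
```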
 The binomial distribution can help us. It tells us how likely it is for a certain number of “successes” to happen, given a probability of success and a number of trials.
 The binomial distribution is important because it allows us to know how likely a certain outcome is, even when it’s not the expected one.
 Example: 70% of people buy the chicken flavor (70 out of 100 people will choose chicken), but the probability that exactly 7 out of 10 people choose chicken is quite low (only about 27%).
np.random.binomial(10, 0.30, size=10000)
# Our basketball player has a 30% chance of making any individual basket.
# He took 10 shots and made 4 of them, even though we only expected him to make 3.
# What percent chance did he have of making those 4 shots?
import numpy as np
a = np.random.binomial(10, 0.30, size=10000)
np.mean(a == 4)  # 0.1973 (simulated, varies from run to run)
# 2nd way
len(a[a == 4]) / len(a)
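
The simulated values can be compared with the exact binomial probability, P(k) = C(n, k) p^k (1 - p)^(n - k). A small sketch using only the standard library; binomial_pmf is a helper name I made up for this note.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 7 of 10 people choose chicken when each one does so with probability 0.70.
chicken = binomial_pmf(7, 10, 0.70)

# 4 of 10 baskets made when each shot succeeds with probability 0.30.
basket = binomial_pmf(4, 10, 0.30)

print(round(chicken, 4), round(basket, 4))
```

The simulation result 0.1973 above is an estimate of the second value, so it moves a little from run to run.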
Hypothesis Testing (SciPy)
 Link course.
 Engagement: the time people spend on your website.
 Performing an A/B test — are the different observations really the results of different conditions (i.e., Condition A vs. Condition B)? Or just the result of random chance?
 Conducting a survey — is the fact that men gave slightly different responses than women a real difference between men and women? Or just the result of chance?
 The individual measurements on Monday, Tuesday, and Wednesday are called samples. A sample is a subset of the entire population. The mean of each sample is the sample mean and it is an estimate of the population mean.
 Central Limit Theorem:
 Sometimes you measure one part of the population more than the others, which skews your sample toward one direction of the total population.
 If we have a large enough sample size, the sample means will be sufficiently close to the population mean, and their distribution will be approximately normal.
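
A simulation makes this visible. The sketch below assumes a hypothetical, strongly skewed population (exponential, with arbitrary parameters) and compares how far sample means scatter for small and large sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population; its true mean is close to 10.
population = rng.exponential(scale=10, size=1_000_000)
population_mean = population.mean()

def spread_of_sample_means(sample_size, n_samples=2000):
    """Standard deviation of the means of n_samples random samples."""
    samples = rng.choice(population, size=(n_samples, sample_size))
    return samples.mean(axis=1).std()

small = spread_of_sample_means(5)    # means of tiny samples scatter widely
large = spread_of_sample_means(500)  # means of big samples cluster tightly

print(population_mean, small, large)
```

The spread shrinks roughly with the square root of the sample size, so larger samples give sample means that sit closer to the population mean.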
 Hypothesis Tests:
 Hypothesis testing is a mathematical way of determining whether we can be confident that the null hypothesis is false.
 null hypothesis ($H_0$): the null hypothesis is the proposition that there is no effect or no relationship between phenomena or populations. (ThoughtCo)
 We can test these null hypotheses to show that they may be false, which in turn reveals a relationship between the components.
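 The alternate hypothesis ($H_A$ or $H_1$): the proposition that there is an effect or a relationship, i.e. the opposite of the null hypothesis.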
 Example (How to State a Null Hypothesis?): the relationship between the number of workouts per week and the weight lost. Suppose that working out 5 times a week leads to a loss of 6 kg. If we now reduce the number of workouts to 3 per week, will the weight lost be less than 6 kg?
 $H_A = H_1 = \{ \mu < 6 \}$ (Alternate hypothesis)
 $H_0 = \{ \mu \ge 6 \}$ (the weight lost does not decrease, and may even increase)
 Another way to state it: $H_0 = \{ \mu = 6 \}$ (reducing the number of workouts does not affect the weight lost)
 Other example:
 “Hyperactivity is unrelated to eating sugar” is an example of a null hypothesis.
 Type I = False Positive, Type II = False Negative. Check my article about Confusion matrix.
 Type I = FP = the null hypothesis is rejected even though it is true.
 Type II = FN = the null hypothesis is not rejected (i.e. it is accepted) even though it is false.
 P-values: A hypothesis test provides a numerical answer, called a p-value, that helps us decide how confident we can be in the result.
 A p-value is the probability of obtaining statistics at least as extreme as the ones observed, under the assumption that the null hypothesis is true.
 Example: A p-value of 0.05 means that, if there were no difference between the two population means, there would be a 5% chance of observing a difference at least as large as the one measured.
 A higher p-value threshold (significance level) is more likely to give a FP, so if we want to be very sure that the result is not due to chance alone, we select a very small threshold.
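
To make the definition of a p-value concrete, here is a sketch of a permutation test for an A/B comparison. This is a technique I swapped in for illustration, not the SciPy test from the course, and the two groups are simulated, hypothetical engagement times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical engagement times (minutes) for page versions A and B.
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.5, scale=2.0, size=200)
observed = abs(group_b.mean() - group_a.mean())

# Under the null hypothesis the labels A/B do not matter, so we shuffle
# them many times and count how often chance alone produces a difference
# at least as large as the observed one.
pooled = np.concatenate([group_a, group_b])
n_perm = 5000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:200].mean() - shuffled[200:].mean())
    if diff >= observed:
        count += 1

p_value = count / n_perm
print(observed, p_value)
```

A small p_value here means that random relabeling almost never reproduces the observed gap, which is evidence against the null hypothesis of no difference.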