### IBM Data Course 6: Data Analysis with Python

Posted on 13/04/2019, in Data Science, Python.

This note was first taken while I was working through the IBM Data Professional Certificate course on Coursera.

Go to Course 7.

## Week 1: Importing Datasets

• Why Data Analysis
• Data is everywhere: collected by data scientists or automatically (for example, when you click somewhere on a website)
• Data by itself is not information; data analysis / data science turns it into information
• Data Analysis plays an important role in
• Discovering useful info
• Predicting future or the unknown
• Example (example csv file here)
• CSV = comma-separated values.
• Tom wants to sell a car, but what price is reasonable?
• Is there data on the prices of other cars and their characteristics?
• What features of cars affect their prices? (color, brand, horsepower, else?)
• We need data, and we need to understand that data
• Understanding data
• Using CSV
• Each line represents a row
• Each column represents one of the attributes in the dataset
• Python Packages for DS
• We have divided the Python data analysis libraries into three groups.
• Scientific computing libraries:
• Pandas (data structures & tools): primary instrument -> data frame (2 dimensional table)
• Numpy (arrays & matrices)
• Scipy (integrals, solving differential equations, optimization)
• Visualization libraries
• Matplotlib (plots & graphs, most popular)
• Seaborn (based on matplotlib, plots: heat maps, time series, violin plots)
• Algorithmic libraries: machine learning -> develop a model using data + obtain predictions
• Scikit-learn (machine learning: regression, classification, clustering, …): built on Numpy, Scipy and Matplotlib
• Statsmodels (Explore data, estimate statistical models and perform statistical tests)
• Importing and Exporting Data in Python:
• Two important properties
• Format: the way data is encoded. Various formats (.csv, .json, .xlsx, .hdf, …)
• Path: where data is stored (local or online)
• In Python: pd.read_csv(), pd.read_json(), pd.read_excel(), pd.read_sql() for importing, and the corresponding df.to_csv(), df.to_json(), … for exporting
import pandas as pd

url = "path/to/data/file"
df = pd.read_csv(url)                  # import a CSV from a local path or URL

# assign column names when the file has no header row
headers = ["col1", "col2", "col3"]
df = pd.read_csv(url, header=None, names=headers)

df.head(n)                             # print the first n rows
df.tail(n)                             # print the last n rows

df.to_csv(path)                        # export to another CSV file

• Getting Started Analyzing Data in Python:
• Understand your data before you begin any analysis
• Pandas type: object, int64, float64, datetime64, timedelta[ns] (different from native python types)
• Check the data types of the columns: df.dtypes
• Return the statistical summary: df.describe()
• Full summary: df.describe(include = 'all')
• Or: df.info() (a combined sketch follows below)
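
A minimal sketch pulling these checks together, assuming a hypothetical DataFrame df already loaded with pd.read_csv:

print(df.dtypes)                    # data type of each column
print(df.describe())                # statistical summary of the numeric columns
print(df.describe(include="all"))   # full summary, including object columns
df.info()                           # concise summary: columns, non-null counts, memory usage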

## Week 2: Data Wrangling

• Pre-processing Data in Python
• Mapping data from its raw form into another format to prepare it for further analysis
• Also called data cleaning or data wrangling
• Identify + handle missing values
• Data formatting
• Data normalization (centering/scaling)
• Data Binning: creates bigger categories from a set of numerical values. It is particularly useful for comparison between groups of data.
• Turning categorical values to numeric variables
• Dealing with missing values
• They may appear as ?, N/A, or an empty cell
• Drop the missing values: drop the whole variable (column), or drop the single data entry (row) if only a few observations are affected:
df.dropna()
df.dropna(axis=0)   # drop the entire row
df.dropna(axis=1)   # drop the entire column

# drop only the rows whose value in column 'price' is missing
df.dropna(subset=["price"], axis=0, inplace=True)   # inplace=True means df is modified after this method is applied
df.dropna(subset=["price"], axis=0)                 # doesn't change df -> a good way to be sure you're performing the correct operation

• Replacing missing values:
• replace it with an average
• replace it with the most frequent value (the value that appears most often)
• replace it based on other functions
df.replace(<missing value>, <new value>)

# example: replace NaN with the column mean
import numpy as np
mean = df["col1"].mean()
df["col1"] = df["col1"].replace(np.nan, mean)

• Leaving it as missing value
• Data Formatting in Python
• Change miles per gallon (mpg) to litres per 100 km (L/100km): df["col1"] = 235/df["col1"]
• Rename a column: df.rename(columns={"col_old": "col_new"}, inplace=True)
• Sometimes the data type is incorrect:
• objects: “a”, “hello”,…
• int64: 1,3,5
• float: 1.2
• others
• Check the data type: df.dtypes
• Convert the data type: df.astype(), e.g. df["price"] = df["price"].astype("int") (see the sketch below)
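
A small sketch of these formatting steps, using hypothetical column names (city-mpg, price) for illustration:

df["city-mpg"] = 235 / df["city-mpg"]                           # convert mpg to L/100km
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)   # rename the column to match the new unit
df["price"] = df["price"].astype("float")                       # fix an incorrect data type
print(df.dtypes)                                                # check the result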
• Data Normalization in Python
• Features with different ranges are hard to compare, and the one with the larger range will influence the result the most.
• Diff approaches
• Simple feature scaling: $x_{new} = \dfrac{x_{old}}{x_{max}}$
• Min-max: $x_{new} = \dfrac{x_{old}-x_{min}}{x_{max}-x_{min}}$
• Z-score: $x_{new} = \dfrac{x_{old}-\mu}{\sigma}$, usually between -3 and 3, based on the normal distribution.
  df["col1"] = df["col1"]/df["col1"].max() // simple feature scaling
df["col1"] = (df["col1"] - df["col1"].min())/(df["col1"].max() - df["col1"].min()) // min-max
df["col1"] = (df["col1"] - df["col1"].mean())/df["col1"].std()

• Binning
• “Groups of values into bins”
import numpy as np

bins = np.linspace(min(df["price"]), max(df["price"]), 4)   # 4 equally spaced numbers -> 3 bins
group_names = ["low", "medium", "high"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

• Turning categorical variables into quantitative variables in Python
• Problem: most statistical models cannot take objects/strings as input
• Solution: create one new feature per category and assign 0 or 1 -> one-hot encoding (see the sketch below)
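
A minimal one-hot encoding sketch with pandas, assuming a hypothetical categorical column fuel-type:

import pandas as pd

dummies = pd.get_dummies(df["fuel-type"])    # one 0/1 column per category
df = pd.concat([df, dummies], axis=1)        # add the new columns to the data frame
df.drop("fuel-type", axis=1, inplace=True)   # drop the original categorical column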

## Week 3: Exploratory Data Analysis

• Exploratory Data Analysis (EDA):
• Summarize the main characteristics of the data
• Get a better understanding of the data
• Uncover relationships between variables
• Extract important variables
• Descriptive Statistics
• df.describe() : NaN will be excluded
• Summarize categorical data by using df["col"].value_counts()
• Using box-plots (Seaborn package)
• Scatter plot: relationship between 2 variables (a short sketch follows below)
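
A quick sketch of these summaries, assuming hypothetical columns drive-wheels, engine-size and price:

import seaborn as sns
import matplotlib.pyplot as plt

print(df.describe())                               # NaN values are excluded
print(df["drive-wheels"].value_counts())           # counts per category

sns.boxplot(x="drive-wheels", y="price", data=df)  # box plot of price per category
plt.show()

plt.scatter(df["engine-size"], df["price"])        # scatter plot of two numeric variables
plt.xlabel("engine-size")
plt.ylabel("price")
plt.show()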
• GroupBy in Python
• group data into categories
• find the average “price” of each car based on “body-style”
  df[['price','body-style']].groupby(['body-style'],as_index= False).mean()

• df.pivot() makes a table like in Excel, which is easier to visualize. A pivot table has one variable displayed along the columns and the other variable displayed along the rows.
• Heat map: plot the target variable over multiple variables (see the sketch below)
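
A sketch of the groupby -> pivot -> heat map flow, using hypothetical columns (price, body-style, drive-wheels):

import matplotlib.pyplot as plt

df_grp = df[["price", "body-style", "drive-wheels"]].groupby(
    ["body-style", "drive-wheels"], as_index=False).mean()
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style", values="price")

plt.pcolor(df_pivot, cmap="RdBu")   # heat map of the target over the two variables
plt.colorbar()
plt.show()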
• Correlation
• Measures to what extent different variables are interdependent
• Correlation doesn't imply causation: there is a relation between A and B, but we don't have enough information to know which one causes the other
• Correlation - Positive/negative Linear Relationship (y=ax, a>0 or a<0)
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="var1", y="var2", data=df)
plt.ylim(0,)

# or compute the correlation directly
df[["col1", "col2"]].corr()

• Correlation - Statistics
• Pearson correlation: measures the strength of the correlation between 2 features
• Correlation coefficients
• p-value
• strong correlation: correlation coefficient close to ±1 and p-value < 0.001
  import scipy.stats as stats
pearson_coef, p_value = stats.pearsonr(df['col1'], df['col2'])

• Analysis of Variance (ANOVA)
• How much does a categorical feature impact the target?
• Finds the correlation between different groups of a categorical variable.
• What we obtain from ANOVA
• F-test score: the variation between the sample group means divided by the variation within the sample groups.
• p-value: confidence degree (a small sketch follows below).
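
A minimal ANOVA sketch using scipy, with a hypothetical grouping column make and two hypothetical group labels:

import scipy.stats as stats

grouped = df[["make", "price"]].groupby("make")
f_score, p_value = stats.f_oneway(grouped.get_group("honda")["price"],
                                  grouped.get_group("jaguar")["price"])
print(f_score, p_value)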

(Figures: a small F-score vs. a large F-score.)

## Week 4: Model Development

Check the lab for a better understanding through a case study.

• Model Development
• Linear Regression:
• the predictor (independent) variable x
• the target (dependent) variable y
• $y = b_0 + b_1x$ where
• $b_0$ is intercept: lm.intercept_
• $b_1$ is slope: lm.coef_
# import
from sklearn.linear_model import LinearRegression

# create a linear regression object
lm = LinearRegression()

# define the predictor and target variables
X = df[["col1"]]
Y = df[["col2"]]

# fit
lm.fit(X, Y)

# predict
Yhat = lm.predict(X)

• Multiple Linear Regression
Z = df[["col1", "col2", "col3"]]
Y = df["coln"]
lm.fit(Z, Y)
Yhat = lm.predict(Z)

• Model Evaluation using Visualization
• Regression plot
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="col1", y="col2", data=df)
plt.ylim(0,)

• Residual plot: checks the difference between the actual values and the predicted values
• If the residuals have zero mean and are randomly spread around the x-axis, a linear model is appropriate
• If the residuals are not randomly spread around the x-axis (they follow a pattern, e.g. going from positive to negative and back), the relationship is non-linear

(Figures: residual plots for a linear relationship vs. a non-linear one.)

import seaborn as sns

sns.residplot(x=df["feature"], y=df["target"])

• Distribution plot:
• compares the distribution of the predicted values with the distribution of the actual values
• These plots are extremely useful for visualizing models with more than one independent variable or feature.
• With multiple variables, the predicted distribution is much closer to the actual one
  import seaborn as sns
ax1 = sns.distplot(df["price"], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Value", ax=ax1)

• Polynomial Regression and Pipelines
import numpy as np

f = np.polyfit(x, y, 3)   # fit a 3rd-order polynomial
p = np.poly1d(f)
print(p)                  # print out the model: ax^3 + bx^2 + cx + d

# polynomial features with multiple variables
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(x[["col1", "col2"]])

• We can normalize each of the features simultaneously:
  from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[["col1", "col2"]])
x_scale = SCALE.transform(x_data[["col1", "col2"]])

  from sklearn.pipeline import Pipeline
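
A sketch of chaining the steps above (scaling, polynomial features, linear regression) into one pipeline, using hypothetical feature and target columns:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# each step is a (name, estimator) pair; the last step is the model
steps = [("scale", StandardScaler()),
         ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
         ("model", LinearRegression())]
pipe = Pipeline(steps)

pipe.fit(df[["col1", "col2"]], df["target"])   # hypothetical feature / target columns
yhat = pipe.predict(df[["col1", "col2"]])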

• Measures for In-Sample Evaluation
• Mean Squared Error (MSE): the average of the squared differences between the predicted and actual values
  from sklearn.metrics import mean_squared_error
mean_squared_error(df["col1"], Y_predict)

• R-squared (Coefficient of Determination): how close the data is to the fitted regression line
• Using: lm.score(X, y)
• Usually between 0 and 1
• A negative value can be a sign of overfitting

$R^2 = 1$ : best fit

$R^2 = 0$ : worst fit

• Prediction and Decision Making
• See in the lab!!!

## Week 5: Model Evaluation

Check the lab for a better understanding through a case study.

• Model Evaluation and Refinement: tells us how a model performs in the real world.
• In-sample evaluation tells us how well our model fits the data already given to train it.
• Problem: it does not give us an estimate of how well the trained model can predict new data.
• Solution: split into in-sample data (training set) and out-of-sample data (test set)
• Split data set into: 70% training and 30% test:
  from sklearn.model_selection import train_test_split
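
A minimal usage sketch, with hypothetical feature/target data x_data and y_data:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0)   # 30% of the rows become the test set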

• Generalization error is a measure of how well our data does at predicting previously unseen data.
• All our error estimates are relatively close together, but they are further away from the true generalization performance. To overcome this problem, we use cross-validation.
• It’s a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set
• It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
• It is important that the validation set and the training set are drawn from the same distribution, otherwise it would make things worse.
• Validation helps us evaluate the quality of the model
• Validation helps us select the model that will perform best on unseen data
• Validation helps us avoid overfitting and underfitting.
from sklearn.model_selection import cross_val_score    # cross-validation scoring
from sklearn.model_selection import train_test_split   # make the train/test split
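
A short sketch of scoring a model with cross-validation, assuming hypothetical data x_data and y_data:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=3)   # one R^2 score per fold (3 folds)
print(scores.mean(), scores.std())                   # average performance and its spread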

• Overfitting, Underfitting and Model Selection
• In the plot of test error versus model order, models that are too simple underfit and models that are too complex overfit
• We can calculate the R-squared value for models of different orders and pick the order that performs best on the test data
• Ridge Regression: prevent overfitting
• In polynomial equations, the coefficients of the high-order terms can become very large. Ridge regression controls these coefficients by introducing a parameter alpha.
• alpha too large -> the coefficients are pushed toward zero -> underfitting
• alpha = 0 -> overfitting
• to select a good alpha, we use cross-validation
• in Python (see the sketch below)
• To choose a good alpha, start with a small value, increase it step by step, and choose the one that gives the maximum R-squared value, or do the same with MSE.
• Minimize MSE or maximize R-squared.
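
A minimal Ridge sketch, assuming hypothetical polynomial features x_poly and a target y_data:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)    # alpha controls how strongly large coefficients are penalized
ridge.fit(x_poly, y_data)   # hypothetical polynomial features and target
yhat = ridge.predict(x_poly)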
• Grid Search
• Grid Search allows us to scan through multiple free parameters with few lines of code.
• Scikit-learn has a means of automatically iterating over these hyperparameters (like alpha) using cross-validation. This method is called Grid Search.
• One advantage of Grid Search is how quickly we can test multiple parameters (see the sketch below).
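
A sketch of Grid Search over alpha with cross-validation, assuming hypothetical data x_data and y_data and hypothetical feature columns:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

parameters = [{"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}]   # hyperparameter grid to scan
grid = GridSearchCV(Ridge(), parameters, cv=4)             # 4-fold cross-validation per value
grid.fit(x_data[["col1", "col2"]], y_data)                 # hypothetical features / target

best_ridge = grid.best_estimator_                          # model refit with the best alpha
print(grid.best_params_)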