IBM Data Course 8: Machine Learning with Python (Week 4 to 6)
Posted on 26/04/2019, in Data Science, Python, Machine Learning.This note was first taken when I learnt the IBM Data Professional Certificate course on Coursera.
settings_backup_restore
Go back to Course 8 (week 1 to 3).
keyboard_arrow_right
Go to Course 9 (final course).
tocIn this post
Week 4: Clustering
What’s clustering?
 The important requirement is to use the available data to understand and identify how customers are similar to each other. Let’s learn how to divide a set of customers into categories, based on characteristics they share.
 Clustering can group data only unsupervised, based on the similarity of customers to each other.
 Having this information would allow your company to develop highly personalized experiences for each segment.
 Clustering means finding clusters in a dataset, unsupervised.
 What’s difference between clustering and classification?
 Application of clustering
 Why clustering?
 Exploratory data analysis
 Summary generation
 Outlier detection
 Finding duplicates
 Preprocessing steps
 Clustering algorithms
 Partitionedbased clustering
 Relatively efficient
 eg. Kmeans, kmedian, fuzzy cmeans
 used for medium or large size databases
 Hierarchical clustering
 produces trees of clusters
 eg. agglomerative, divisive
 very intuitive
 good for small size dataset
 Densitybased clustering
 produces arbitary shaped clusters
 eg. DBSCAN
 used for special cluster or there is noise in your dataset
 Partitionedbased clustering
kMeans Clustering
The lab of Kmeans. This is another lab from course 9.
 KMeans can group data only unsupervised based on the similarity of customers to each other.
 Partitioning clustering
 Kmeans divides the data into nonoverlapping subsets (clusters) without any clusterinternal structure
 Examples within a cluster are very similar
 Examples across different clusters are very different
 Onjective of kmeans
 To form clusters in such a way that similar samples go into a cluster, and dissimilar samples fall into different clusters.
 To minimize the “intra cluster” distances and maximize the “intercluster” distances.
 To divide the data into nonoverlapping clusters without any clusterinternal structure
 kmeans clustering  initialize k
 Initialize k=3 centrois randomly
 Choose randomly 3 points in our data points.
 Choose randomly 2 points (not in our data points)
 Find the distance to these centroids from each of data points.
 Assign each point to the closest centroid > use distance matrix
 However, because we choose randomly the centroids, the SSE (sum of squared differences between each point and its centroid) is big
 Move the centroid to the mean of each cluster
 Go back to step 2 and repeat until there are no more changes.
 Above algorithm leads to local optimal (not guarantee about global optimal)
 Run the whole process multiple times with different starting conditions (the algorithm is quite fast)
 Initialize k=3 centrois randomly
 Accuracy of kmeans
 elbow method:
 If we increase K, the mean distances are always deacrease > cannot increase K forever
 But, there is an “elbow”, the decreasing is much. We choose the K at this elbow!
 elbow method:
Hierarchical Clustering
 Hierarchical clustering algorithms build a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes.
 Strategies for hierarchical clustering generally fall into two types,
 divisive (phân tán ra) : dividing the cluster, it’s topdown.
 agglomerative (tích tụ vào) : amass or collect things, it’s bottomup > more popular!
 Hierarchical clustering is illustrated by dendrogram (sơ đồ nhánh).
 Agglomerative algorithm
 To calculate distance between clusters:
 Hierarchical clustering has longer runtimes than Kmeans!
 Advantages vs disavadtages
 Hierarchical clustering vs Kmeans
Hierarchical Clustering (lab)
Generating Random Data
from sklearn.datasets.samples_generator import make_blobs
X1, y1 = make_blobs(n_samples=50, centers=[[4,4], [2, 1], [1, 1], [10,4]], cluster_std=0.9)
# cluster_std: The standard deviation of the clusters. The larger the number, the further apart the clusters
# Choose a number between 0.51.5
Plot the data
plt.scatter(X1[:, 0], X1[:, 1], marker='o')
Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')
agglom.fit(X1,y1)
Find the distance matrix
from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(X1,X1)
Hierarchical clustering
from scipy.cluster import hierarchy
Z = hierarchy.linkage(dist_matrix, 'complete')
Save the dendrogram to a variable called dendro
dendro = hierarchy.dendrogram(Z)
Check the lab to know how to use scipy to calculate distance matrix.
Densitybased Clustering (DBSCAN Clustering)
The lab of DensityBased Clustering
 Works based on density of objects.
 When applied to tasks with arbitrary shaped clusters or clusters within clusters, traditional techniques might not be able to achieve good results.
 Densitybased clustering locates regions of high density that are separated from one another by regions of low density.
 A specific and very popular type of densitybased clustering is DBSCAN (DensityBased Spatial Clustering of Applications with Noise).
 The wonderful attributes of the DBSCAN algorithm is that it can find out any arbitrary shaped cluster without getting effected by noise.
 One of the most common clustering algorithm.
 The algorithm has no notion of outliers that is, all points are assigned to a cluster even if they do not belong in any.
 Tt does not require one to specify the number of clusters such as K in Kmeans
 2 parameters
 R (radius of neighborhood) that if include enough number of points within, we call it a dense area.
 M (min number of neighbors): the min number of data points we want in a neighborhood to define a cluster.
 DBSCAN ALgorithm: all points fall into 3 types
 Core point (red) : have enough M points around.
 Border point (yellow) : in the area of R with other point but have <M points around.
 Outlier point (gray) : don’t have any points around.
Week 5 : Recommender Systems
Intro to recommender systems
 Recommender systems try to capture these patterns and similar behaviors, to help predict what else you might like.
 There are generally 2 main types of recommendation systems:
 Contentbased : Show me more of the same of what I’ve liked before.
 Collaborative filtering : Tell me what’s popular among my neighbors, I also may like it.
 (also) Hybrid recommender systems : combines various mechanisms.
 Implementing recommender systems: Memorybased & Modelbased:
Contentbased recommender systems
The lab of Contentbased recommender systems
 tries to recommend items to users based on their profile. The user’s profile revolves around that user’s preferences and tastes.
 based on similarity between those items
 For a very new genre that the user never watch, the systems didn’t work properly! > we can use collaborative filering.
Collaborative Filtering
The lab of Collaborative Filtering
 Collaborative filtering is based on the fact that relationships exist between products and people’s interests.
 2 approaches
 userbased: based on user’s neighborhood
 itembased: based on item’s similarity
 Userbased: if you watch the same movie with your neighbor and she watches a new film, you may like that film too.
 Find the similar users > using something like Pearson Correlation Function.
 Why? Pearson correlation is invariant to scaling, i.e. multiplying all elements by a nonzero constant or adding any constant to all elements. For example, if you have two vectors X and Y,then, pearson(X, Y) == pearson(X, 2 * Y + 3). This is a pretty important property in recommendation systems because for example two users might rate two series of items totally different in terms of absolute rates, but they would be similar users (i.e. with similar ideas) with similar rates in various scales .

How?
 Process
 Select a user with the movies the user has watched
 Based on his rating to movies, find the top X neighbours
 Get the watched movie record of the user for each neighbour.
 Calculate a similarity score using some formula
 Recommend the items with the highest score
 Challenge of collaborative filtering
Week 6: Final Project
 You will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.
 If you have already sign up and have an account on Watson Studio (previous courses), you can sign in to the page of creating projects here (the guide on the course is not really helpful).