Anh-Thi DINH

Data Science 1

Posted on 10/10/2018, in Machine Learning, Data Science.

This series starts from the very beginning of learning Data Science and tries to solve the Titanic problem on Kaggle.



  • Install Anaconda (to use pandas and other necessary packages)
  • Add Anaconda to PATH: export PATH=/home/thi/anaconda3/bin:$PATH
  • If something strange happens, check this note on Python first
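
After adding Anaconda to PATH, it can help to confirm which python the shell actually picks up. The following is a minimal sketch using only the standard library; the helper names first_on_path and is_from_dir are made up for this note, and the Anaconda directory is simply the one from the export line above, so adjust it to your own install.

```python
import os
import shutil
import sys


def first_on_path(program, path=None):
    """Return the full path of the first `program` found on PATH, or None."""
    return shutil.which(program, path=path)


def is_from_dir(executable, directory):
    """Return True if `executable` lives directly inside `directory`."""
    if executable is None:
        return False
    exe_dir = os.path.dirname(os.path.realpath(executable))
    return exe_dir == os.path.realpath(directory)


if __name__ == "__main__":
    # Assumption: the install location used in the export line above.
    anaconda_bin = "/home/thi/anaconda3/bin"
    found = first_on_path("python")
    print("python on PATH:", found)
    print("running interpreter:", sys.executable)
    print("comes from Anaconda:", is_from_dir(found, anaconda_bin))
```

If the last line prints False, the export probably did not take effect in the current shell (for example, it was not added to the shell's startup file).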

Getting started

  • Loan prediction: an example of a finished project.
  • The first step in creating a project is to decide on a topic. You want the topic to be something you’re interested in and motivated to explore. It’s very obvious when people are making projects just to make them, rather than out of a genuine interest in the topic.
    • Think about what sectors or angles you’re really interested in, then find data sets relating to those sectors.
    • Review several data sets, and find one that seems interesting enough to explore.
  • Some resources
    • A directory of government data downloads
    • /r/datasets - A subreddit that has hundreds of interesting data sets
    • Awesome datasets - A list of data sets hosted on GitHub
    • A great blog post with hundreds of interesting data sets


Data analyst, data scientist and data engineer

I found many paths on the internet into the world of data science. Most of them fall into three roles: analyst, scientist, and engineer. I want to find out the differences between them and which one fits me best. I am not an expert, so this section is not an answer for you; these are only notes for myself.

  • Explanation on Dataquest.
  • Data analyst: the bridge, the driver; uses data from the past to explain the present. Common tasks include cleaning data, performing analysis, and creating data visualizations.
  • Data scientist: behind the scenes; uses data from the past to predict the future. They apply their expertise in statistics and in building machine learning models to make predictions and answer key business questions. Jobs: clean, analyze, and visualize data (like a data analyst), but with more depth and expertise in these skills, plus the ability to train and optimize machine learning models.
  • Data engineer: the workers who build and optimize the systems that allow data scientists and analysts to do their work. The data engineer ensures that data is properly received, transformed, stored, and made accessible to other users.
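
To make the three roles a bit more concrete for myself, here is a toy sketch in plain Python. Everything in it is hypothetical: the RAW sales rows are invented, and the function names ingest, describe, and predict_next are only labels for this note, not anyone's real workflow. A straight least-squares line stands in for the "machine learning model".

```python
from statistics import mean

# Hypothetical raw monthly sales records, including one malformed row.
RAW = ["2018-01,100", "2018-02,120", "bad-row", "2018-03,140", "2018-04,160"]


def ingest(rows):
    """Data engineer: receive raw rows, drop invalid ones, store clean records."""
    clean = []
    for row in rows:
        parts = row.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            clean.append((parts[0], int(parts[1])))
    return clean


def describe(records):
    """Data analyst: summarize what has already happened."""
    values = [value for _, value in records]
    return {"months": len(values), "mean": mean(values), "max": max(values)}


def predict_next(records):
    """Data scientist: fit a least-squares line and extrapolate one month ahead."""
    ys = [value for _, value in records]
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * len(ys)


if __name__ == "__main__":
    records = ingest(RAW)
    print(describe(records))
    print(predict_next(records))
```

The analyst and the scientist both depend on the cleaned records that ingest returns, which is the point of the engineer role in the list above.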

Data science, data mining, big data

I have heard a lot about these terms.

Learning platforms

  • Datacamp vs Dataquest on reddit