### DataQuest 2: Step 2 - Pandas and Numpy fundamentals

Posted on 25/09/2018, in Data Science.

These are my notes for the Data Scientist path on Dataquest. I already covered Python basics in other notes, so here I only write down the things that are new to me.


## Mission 289 - Introduction to NumPy

See also my NumPy notes.
• import numpy as np : start using NumPy
• <np.ndarray> = np.array(<list-of-lists>)
• <np.ndarray>.shape gives the dimensions of an array as a tuple (like a list, but immutable), e.g. (2, 3)
• NumPy displays a ... to summarize an array with a lot of data
• ndarray[row, column] or ndarray[row]
• Selecting

| | list of lists | ndarray | note |
| --- | --- | --- | --- |
| an element | list[1][3] | ndarray[1, 3] | |
| rows | list[:3] | ndarray[:3] | the same |
| columns | [row[4] for row in list] | ndarray[:, 4] | |
| multiple columns | [[row[1], row[3]] for row in list] | ndarray[:, [1, 3]] | |
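A quick sketch of the table above (the sample values are made up for illustration):

```python
import numpy as np

lst = [[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]]
arr = np.array(lst)

print(lst[1][3], arr[1, 3])                # one element: 8 8
print([row[4] for row in lst], arr[:, 4])  # one column: [4, 9]
print([[row[1], row[3]] for row in lst])   # multiple columns, list version
print(arr[:, [1, 3]])                      # multiple columns, ndarray version
```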
• %timeit -r 1 -n 1 python_subset() : see the time for a single run (IPython magic)
• NumPy is about 30 times quicker than a list of lists here
• Vectorized operators (elementwise): +, -, *, /, % (remainder when a is divided by b), ** (power), // (floor division, rounding down to the nearest integer)
• Alternative for / is np.divide(<ndarray1>, <ndarray2>)
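A small demo of the elementwise operators (the arrays are made up for illustration):

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([3, 4, 5])

print(a + b)            # elementwise addition -> [13 24 35]
print(a % b)            # remainder -> [1 0 0]
print(a // b)           # floor division -> [3 5 6]
print(np.divide(a, b))  # same as a / b
```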
• Max/Min/Mean/Median/Sum:
• Whole array: <ndarray>.min() or np.min(<ndarray>). The same for max, mean and sum; median only exists as the function np.median(<ndarray>), not as a method.
• Each row: <ndarray>.max(axis=1)
• Each column: <ndarray>.max(axis=0)
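The axis argument in action, on a made-up 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.max())        # 6, whole array
print(m.max(axis=1))  # [3 6], max of each row
print(m.max(axis=0))  # [4 5 6], max of each column
print(np.median(m))   # 3.5, median only exists as a NumPy function
```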
• Functions vs Methods:
• Functions act as stand-alone segments of code that usually take an input
• Methods are special functions that belong to a specific type of object
• np.expand_dims(<ndarray>, axis=<n>) adds a dimension:

zeros = np.array([0, 0, 0])               # zeros.shape is (3,)
zeros_2d = np.expand_dims(zeros, axis=0)  # zeros_2d.shape is (1, 3)
# axis=0 adds a row dimension
# axis=1 adds a column dimension

• numpy.concatenate([<ndarray1>, <ndarray2>], axis=0) : combines two ndarrays along the given axis

ones = np.ones((2, 3), dtype=int)         # shape is (2, 3)
zeros = np.array([0, 0, 0])               # 1D, shape (3,)
zeros_2d = np.expand_dims(zeros, axis=0)  # to (1, 3)
combined = np.concatenate([ones, zeros_2d], axis=0)
# result
# [[1 1 1]
#  [1 1 1]
#  [0 0 0]]

• Sort: np.argsort(<ndarray>) returns the indices that would sort <ndarray> (the sort follows the values, but the result is given as indices)

fruit = np.array(['o', 'b', 'a', 'g', 'c'])
sorted_order = np.argsort(fruit) # gives [2, 1, 4, 3, 0]
fruit[sorted_order] # gives np.array(['a', 'b', 'c', 'g', 'o'])


NumPy only supports sorting in ascending order; reverse the result with [::-1] for descending order.

You can argsort one column of an array and then apply the resulting order to the whole array with ndarray[sorted_order].
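Sorting a whole array by one of its columns, sketched on a made-up array:

```python
import numpy as np

data = np.array([[3, 30],
                 [1, 10],
                 [2, 20]])

order = np.argsort(data[:, 0])  # sort by the first column -> [1 2 0]
print(data[order])
# [[ 1 10]
#  [ 2 20]
#  [ 3 30]]
# descending order: data[order[::-1]]
```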

## Mission 290 - Boolean Indexing with NumPy

• taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1) : reads a text file into a NumPy ndarray.
• <ndarray>.dtype: see the internal datatype that has been used
• taxi = taxi[1:] removes the first row.
• np.array([2,4,6,8]) < 5 gives a boolean array
• ndarray[<bool array>] give a new ndarray but only at True indexes of bool array (remove False indexes)

ndarray # shape (4, 3)
bool1 = np.array([True, False, True, True])  # length 4, one per row
bool2 = np.array([True, False, True])        # length 3, one per column
ndarray[bool1]     # takes the rows
ndarray[:, bool2]  # takes the columns

• c[c[:,1] > 2, 1] = 99 : in the rows where column 1 is greater than 2, set column 1 to 99
• ndarray.shape gives a tuple (#rows, #columns).

• ndarray.shape[0] gives number of rows
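Boolean indexing and assignment together, on a made-up array:

```python
import numpy as np

c = np.array([[1, 1],
              [2, 5],
              [3, 9]])

mask = c[:, 1] > 2      # [False True True]
print(c[mask])          # the rows where column 1 > 2
c[c[:, 1] > 2, 1] = 99  # in those rows, set column 1 to 99
print(c)
# [[ 1  1]
#  [ 2 99]
#  [ 3 99]]
```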

## Mission 291 - Introduction to Pandas

See also my Pandas notes.
• import pandas as pd
• f500 = pd.read_csv("<file-name>", index_col=0)
• type(f500) gives pandas.core.frame.DataFrame
• f500.shape gives a tuple (500, 16)
• Series is the pandas type for one-dimensional objects (a single column; when printed, the first column shown is the index labels)
• 1D pd object $\Rightarrow$ series
• 2D pd object $\Rightarrow$ dataframe
• Unlike NumPy, pandas does not use the same type for 1D and 2D arrays.
• f500.dtypes gives type of each column
• f500.head() returns first 5 rows

• f500.head(10) returns first 10 rows
• f500.tail()
• Select single element: df.loc["Sinopec Group", "revenues"]
• Select rows:
• Single: df.loc["a"]
• List: df.loc[["a","b"]]
• Slice: df["a":"c"] or df.loc["a":"c"]
• Select columns:
• Single: df["a"] or df.loc[:,"a"]
• df["a"] and df.loc[:,"a"] both give a series; df[["a"]] gives a dataframe

• List: df[["a", "b"]] or df.loc[:,["a","b"]]
• Slice: df.loc[:, "a":"c"] (no brackets around a label slice)
• Select rows in series object
• Single: s["a"] or s.loc["a"]
• List: s[["a","b"]] or s.loc[["a","b"]]
• Slice: s["a":"c"] or s.loc["a":"c"]
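The selection rules above, checked on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                  index=["a", "b", "c"])

print(type(df["x"]))     # Series
print(type(df[["x"]]))   # DataFrame
print(df.loc["a", "y"])  # 4, a single element
print(df.loc["a":"b"])   # label slice: both endpoints included
print(df.loc[:, "x":"y"])  # column slice, no brackets around the slice
```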
• s.describe() : gives count, mean, std, min, 25%, 50%, 75%, max for a numeric series (other statistics for non-numeric ones). Often used with method chaining:

revs = f500["revenues"]
print(revs.describe())

• There is also dataframe.describe()
• for all columns: all_desc = f500.describe(include='all')
• for only object (string) columns: print(f500.describe(include=['O']))
• Methods: .max(), .min(), .mean(), .median(), .mode(), .sum() applied to both series and dataframe.
• df.<method>(axis=0) or df.<method>(axis="index") calculates down the rows, giving one result per column (the default).
• df.<method>(axis=1) or df.<method>(axis="columns") calculates across the columns, giving one result per row.
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
# or without axis=0 because it's default

• s.value_counts() : displays each unique non-null value from a series, with a count of the number of times that value occurs.
• s.value_counts(dropna=True) excludes null values (the default).
• s.value_counts(dropna=False) also includes the nulls.
• s.value_counts(normalize=True) uses proportions instead of counts.
• Top 5 most common values of a column

top5_countries = f500["country"].value_counts().head()

• df.max(numeric_only=True) only considers the numeric columns
• We can assign values in pandas using selecting tools above (like in numpy)

top5_rank_revenue["revenues"] = 0 # the whole column changes to 0
top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999 # 1 item

• boolean indexing:

s_bool = df["<column>"] == 8 # == (not =) gives a boolean series
result = df[s_bool] # filter whole rows, no .loc needed
result_name = df.loc[s_bool, "<column>"] # filter rows and pick a column, use .loc[]

• Combining boolean indexing with assignment:

f500.loc[f500["sector"] == "abc", "sector"] = "ABC" # without the column label, whole rows would be overwritten

• np.nan represents NaN (a missing value)
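A few things worth remembering about NaN, sketched on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull())        # True only at the NaN position
print(np.nan == np.nan)  # False: NaN never compares equal, even to itself
print(s.sum())           # 4.0, pandas skips NaN by default
```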

## Mission 292 - Exploring Data with pandas

import pandas as pd
import numpy as np

• pandas uses NumPy objects behind the scenes to store the data.
• .loc[] vs .iloc[]
• loc: label based selection
• iloc: integer position based selection
• Select:
• an element: df.iloc[2,0]
• row: df.iloc[1]
• column: df.iloc[:, 1]
• slices: df.iloc[1:5], df.iloc[[1,3], 1:5]
• list: df.iloc[:,[1,2,5]]
• Slice:
• with .loc[], the end of the slice is included
• with .iloc[], the end of the slice is not included
• The same for series but don’t forget that series is 1-D
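The difference in slice endpoints, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

print(df.loc["a":"c"])  # rows a, b, c: the label slice includes the end
print(df.iloc[0:2])     # rows a, b: the integer slice excludes the end
```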
• Remove the index name (the text in the first line, first column):

f500.index.name = None

• Sort the rows of f500 by the column employees (this returns a new dataframe; it does not modify f500 itself)

sorted_emp = f500.sort_values(by=["employees"], ascending=False)

• s.str.contains("<str>") : check whether <str> appears in each element of s
• s.str.contains("<str>", regex=False) if we want <str> treated as a plain string
• s.str.endswith("<str>") : check whether each element of s ends with <str>
• s.str.startswith("<str>")
• s.isnull() or s.notnull() : check element-wise whether the values of s are NaN/null

rev_change_null = f500[f500["revenue_change"].isnull()]
print(rev_change_null[["company","country","sector"]])
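The vectorized string methods return boolean series that can be used for filtering; a sketch on made-up values:

```python
import pandas as pd

s = pd.Series(["Motor City", "Big Apple", "Windy City"])

print(s.str.contains("City"))     # [True, False, True]
print(s.str.endswith("Apple"))    # [False, True, False]
print(s[s.str.contains("City")])  # filter with the boolean series
```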


previously_ranked["rank","revenues"] # error: not a valid column label
previously_ranked[["rank","revenues"]] # two columns (dataframe)
previously_ranked.loc["rank","revenues"] # single element: row label "rank", column "revenues"
previously_ranked.loc[:, ["rank","revenues"]] # two columns (dataframe)

previously_ranked.loc["rank"] # the row labelled "rank" (series)
previously_ranked["rank"] # the column "rank" (series)

previously_ranked[["rank"]] # one column (dataframe)
previously_ranked["rank"] - previously_ranked["prev_rank"] # elementwise difference of two columns (series)

• Using boolean operators

over_265 = f500_sel["revenues"] > 265000
china = f500_sel["country"] == "China"
combined = over_265 & china
final_cols = ["company", "revenues"]
result = f500_sel.loc[combined, final_cols]


or in just one line

result = f500_sel.loc[(f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China"), final_cols]

• Comparison operators: ==, != (not equal, equivalent to ~(a == b))
• Pandas index alignment: if a dataframe food and a series colors have the same index labels (even in a different order), they can be combined with
• food["color"] = colors
• Any items in the series whose index doesn't match the dataframe are discarded.
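Index alignment in action, on made-up data (the labels match but the order differs):

```python
import pandas as pd

food = pd.DataFrame({"name": ["apple", "kiwi"]}, index=["a", "k"])
colors = pd.Series({"k": "green", "a": "red"})  # same labels, different order

food["color"] = colors  # values are aligned by label, not by position
print(food.loc["a", "color"])  # red
```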
• Loops in a df: looping over a dataframe yields the column labels, not the rows as we might expect.
• s.unique()returns an array of unique values from any series
• Find the average revenue for each unique country in f500

# Create an empty dictionary to store the results
avg_rev_by_country = {}

# Create an array of unique countries
countries = f500["country"].unique()

# Use a for loop to iterate over the countries
for c in countries:
    # Use boolean comparison to select only rows that
    # correspond to a specific country
    selected_rows = f500[f500["country"] == c]
    # Calculate the mean revenue for just those rows
    mean = selected_rows["revenues"].mean()
    # Assign the mean value to the dictionary, using the
    # country name as the key
    avg_rev_by_country[c] = mean


## Mission 294 - Guided Project: Exploring Ebay Car Sales Data

• s.sort_index(ascending=False) to sort index of series
• s.sort_values() : to sort values
• See the first 10 characters of each value in a column: autos["ad_created"].str[:10]
• df[(df["col"] > x) & (df["col"] < y)] selects the values of column "col" strictly between x and y; df[df["col"].between(x, y)] is similar, but between includes the bounds x and y.
• s.value_counts().sort_index(ascending=True) : sort the counts by value (the index) for easier examination.
• autos = autos[autos["price"].between(1,351000)]: only take values between an amount.
• element-wise logical comparison: & (and), == (equal), | (or), ~ (not)
• number of current examined elements: autos.shape[0]
• Find the fraction of values falling outside a range:
(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]

• Many ways to select rows in a dataframe that fall within a value range for a column.
autos = autos[autos["registration_year"].between(1900, 2016)]

• A series prints as two columns: the (sometimes hidden) index and the values. Use s.index to show the index of a series.
• Combine the data from both series objects into a single dataframe
• Quickly create a series or a dataframe:

# pandas series:
s = pd.Series([True, True, False, True])
# or from a dictionary abc
s = pd.Series(abc) # key in dict becomes the index in series

# pandas dataframe:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
# or from a series
df = pd.DataFrame(s) # or from a series; the column name defaults to 0 (or to s.name if set)

• Add several series to a df: convert one series to a dataframe, then add the other series to this df as new columns.
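A sketch of that last point, with made-up series and column names:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "b"])

df = pd.DataFrame(s1)   # convert one series to a dataframe (column named 0)
df.columns = ["first"]  # give the column a proper name
df["second"] = s2       # add the other series as a new column
print(df)
```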