First, let's practice loading the libraries:
# 1.Load libraries #
import pandas as pd
import numpy as np
file_dir = "https://raw.githubusercontent.com/zhendata/Medium_Posts/master/City_Zhvi_1bedroom_2018_05.csv"
# read csv file into a Pandas dataframe
raw_df = pd.read_csv(file_dir)
# check first 5 rows of the file
# use raw_df.tail(5) to see last 5 rows of the file
raw_df.head(5)
To save a file, use dataframe.to_csv(). If you don't want to save the index, use dataframe.to_csv(index=False).
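To see what index=False changes, here is a minimal sketch. The toy frame and its values are hypothetical stand-ins for raw_df, and the CSV is written to a string rather than to a file so nothing touches the disk.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for raw_df
df = pd.DataFrame({"RegionName": ["A", "B"], "2018-01": [100.0, 200.0]})

# to_csv() returns the CSV text when no path is given
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# With the index, the header gains an unnamed leading column
print(with_index.splitlines()[0])     # ,RegionName,2018-01
print(without_index.splitlines()[0])  # RegionName,2018-01

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```

Saving without the index avoids an extra "Unnamed: 0" column when the file is read back later.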
How many rows and columns does this dataset have?
raw_df.shape
# the result is a tuple: (# of rows, # of cols)
# Get the number of rows
print(raw_df.shape[0])
# the number of columns is raw_df.shape[1]
What are the data types in the dataset, and how many columns are numeric?
# Check the data types of the entire table's columns
raw_df.dtypes
# Check the data type of a specific column
raw_df['RegionID'].dtypes
# result: dtype('int64')
To be more specific about the data, use select_dtypes() to include or exclude data types. Question: what if I only want to look at the 2018 data?
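One possible answer to the 2018 question, sketched on a hypothetical miniature table: since the monthly columns are named like '2018-01', filtering on the column names is enough. The frame and the choice to keep 'RegionName' alongside the prices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the table: one id column plus monthly price columns
df = pd.DataFrame({
    "RegionName": ["A", "B"],
    "2017-12": [1.0, 2.0],
    "2018-01": [1.1, 2.1],
    "2018-02": [1.2, 2.2],
})

# Keep the columns whose name starts with "2018-"
cols_2018 = [c for c in df.columns if c.startswith("2018-")]
df_2018 = df[["RegionName"] + cols_2018]
print(df_2018.columns.tolist())

# DataFrame.filter does a substring match on column labels
prices_2018 = df.filter(like="2018-")
print(prices_2018.columns.tolist())
```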
Select columns by data type:
# if you only want to include columns of float data
raw_df.select_dtypes(include=['float64'])
# Or to get numerical columns by excluding objects (non-numeric)
raw_df.select_dtypes(exclude=['object'])
# Get a list of all numerical column names #
num_cols = raw_df.select_dtypes(include=[np.number]).columns.tolist()
For example, num_cols above lists only the float and integer columns. To select or drop columns by name instead:
# select a subset of columns by names
raw_df_info = raw_df[['RegionID', 'RegionName', 'State', 'Metro', 'CountyName']]
# drop columns by names
raw_df_sub = raw_df_info.drop(['RegionID','RegionName'],axis=1)
raw_df_sub.head(5)
What if I don't like the column names? How do I rename them? For example, change 'State' to 'state_' and 'City' to 'city_':
# Change column names #
raw_df_renamed1 = raw_df.rename(columns={'State': 'state_', 'City': 'city_'})
# If you need to change a lot of columns: this makes it easy to map the old names to the new ones
old_names = ['State', 'City']
new_names = ['state_', 'city_']
raw_df_renamed2 = raw_df.rename(columns=dict(zip(old_names, new_names)))
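A small sketch of the zip-based mapping on a hypothetical two-column frame. It also shows that, by default, rename() silently skips names in the mapping that are not present in the frame ('City' is deliberately absent here), and that it returns a new frame rather than changing the original.

```python
import pandas as pd

# Hypothetical frame; note there is no "City" column
df = pd.DataFrame({"State": ["NY"], "Metro": ["New York"]})

old_names = ["State", "City"]
new_names = ["state_", "city_"]

# dict(zip(...)) builds {"State": "state_", "City": "city_"}
renamed = df.rename(columns=dict(zip(old_names, new_names)))

print(renamed.columns.tolist())  # ['state_', 'Metro']
print(df.columns.tolist())       # ['State', 'Metro'] (original untouched)
```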
# 1. For each column, are there any NaN values?
raw_df.isnull().any()
# 2. For each column, how many rows are NaN?
raw_df.isnull().sum()
# the results for 1&2 are shown in the screenshot below this block
# 3. How many columns have NaNs?
raw_df.isnull().any(axis=0).sum()
# axis=0 is the default and works down the rows of each column, so raw_df.isnull().any().sum() yields the same result
# note: raw_df.isnull().sum(axis=0).count() counts every column (271 here), whether or not it contains NaNs
# 4. Similarly, how many rows have NaNs?
raw_df.isnull().any(axis=1).sum()
# note: raw_df.isnull().sum(axis=1).count() counts every row (1324 here)
(Screenshots: output of isnull().any() and isnull().sum())
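To make the difference between counting columns and counting columns that contain NaNs concrete, here is a sketch on a hypothetical 3x3 frame. any() flags a column or row that has at least one NaN, and summing those flags gives the count.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: columns "a" and "c" contain NaNs, "b" does not
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, np.nan, 3.0],
})

# Columns with at least one NaN
cols_with_nan = df.isnull().any(axis=0).sum()

# Rows with at least one NaN
rows_with_nan = df.isnull().any(axis=1).sum()

# sum().count() counts all columns, because every column has a (possibly zero) NaN total
all_cols = df.isnull().sum(axis=0).count()

print(cols_with_nan, rows_with_nan, all_cols)  # 2 2 3
```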
raw_df_metro = raw_df[pd.notnull(raw_df['Metro'])]
# If we want to take a look at what cities have null metros
raw_df[pd.isnull(raw_df['Metro'])].head(5)
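(Screenshot: rows where Metro is N/A)
Select the subset of rows with no nulls in the year-2000 columns:
If you want to select the July data, you need to find the columns that contain '-07'. To check whether a string contains a substring, use `substring in string`, which evaluates to True or False.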
# Drop NA rows based on a subset of columns: for example, drop the rows if it doesn't have 'State' and 'RegionName' info
df_regions = raw_df.dropna(subset = ['State', 'RegionName'])
# Get the date columns for the year 2000: '2000-' in x is a substring test (x.startswith('2000-') is a stricter alternative) #
cols_2000= [x for x in raw_df.columns.tolist() if '2000-' in x]
raw_df.dropna(subset=cols_2000).head(5)
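The list comprehension above matches only the columns for the year 2000. If the goal is every monthly column from 2000 onward, one way is to compare the year prefix numerically. This is a sketch on a hypothetical frame and assumes the date columns are named 'YYYY-MM'.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one id column and three monthly columns
df = pd.DataFrame({
    "RegionName": ["A", "B", "C"],
    "1999-12": [np.nan, 1.0, 1.0],
    "2000-07": [2.0, np.nan, 2.0],
    "2001-07": [3.0, 3.0, 3.0],
})

# `substring in string` evaluates to True or False
july_cols = [c for c in df.columns if "-07" in c]

# Columns whose first four characters are a year >= 2000
cols_from_2000 = [
    c for c in df.columns
    if c[:4].isdigit() and int(c[:4]) >= 2000
]

# Drop the rows that have a NaN in any of those columns
complete = df.dropna(subset=cols_from_2000)
print(complete["RegionName"].tolist())  # ['A', 'C']
```

Row A survives even though its 1999-12 value is missing, because that column is outside the subset.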
Select the rows that have at least 50 non-NA values, regardless of which columns they are in:
# Drop the rows where at least one column is NA.
# Method 1:
raw_df.dropna()
# It's the same as raw_df.dropna(axis='index', how='any')
# Method 2:
raw_df[raw_df.notnull().all(axis=1)]
# Keep only the rows that have at least 50 non-NA values
not_null_50_df = raw_df.dropna(axis='index', thresh=50)
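The axis argument decides whether thresh filters rows or columns, which is easy to mix up. The sketch below uses a hypothetical 3x3 frame with thresh=2 so the effect is visible at a glance. It also shows that indexing with a boolean frame masks values instead of dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has one NaN, row 2 has two
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# Keep rows with at least 2 non-NA values
rows_kept = df.dropna(axis="index", thresh=2)

# Keep columns with at least 2 non-NA values
cols_kept = df.dropna(axis="columns", thresh=2)

# Default dropna(): drop every row that has any NaN
complete_rows = df.dropna()

# A boolean frame used as an indexer only masks; the shape is unchanged
masked = df[df.notnull()]

print(rows_kept.index.tolist())      # [0, 1]
print(cols_kept.columns.tolist())    # ['b', 'c']
print(complete_rows.index.tolist())  # [0]
print(masked.shape)                  # (3, 3)
```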
Fill or replace (impute) NA values:
#fill with 0:
raw_df.fillna(0)
#fill NA with string 'missing':
raw_df['State'].fillna('missing')
#fill with mean or median:
raw_df['2018-01'] = raw_df['2018-01'].fillna(raw_df['2018-01'].mean())
# assigning the result back is the reliable form; calling fillna(..., inplace=True) on a selected column
# may not update raw_df in recent pandas versions
# fill values with conditional assignment by using np.where
# syntax: df['column_name'] = np.where(condition, A, B) #
# the value is A if the condition is True, otherwise it's B #
# np.where works element-wise and takes no axis argument
# (for pandas methods that do take one: axis='columns' is the same as axis=1, axis='index' is the same as axis=0)
raw_df['2018-02'] = np.where(raw_df['2018-02'].notnull(), raw_df['2018-02'], raw_df['2017-02'].mean())
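Here is the np.where pattern on a hypothetical two-column frame, so the result can be checked by eye. The column names mirror the ones above, but the values are made up. For this particular case fillna gives the same answer, and np.where becomes more useful when the condition is something other than "is null".

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the middle 2018-02 value is missing
df = pd.DataFrame({
    "2017-02": [100.0, 200.0, 300.0],
    "2018-02": [110.0, np.nan, 330.0],
})

# Fallback value: the mean of the previous year's column (200.0 here)
fallback = df["2017-02"].mean()

# Equivalent result via fillna, computed before the column is overwritten
via_fillna = df["2018-02"].fillna(fallback)

# Where 2018-02 is present keep it, otherwise use the fallback
df["2018-02"] = np.where(df["2018-02"].notnull(), df["2018-02"], fallback)

print(df["2018-02"].tolist())  # [110.0, 200.0, 330.0]
```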
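Before aggregating or joining data, we need to make sure there are no duplicate rows.
We want to see whether there are any duplicated cities or regions. We need to decide which unique ID (city and region) to use in the analysis.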
# Check duplicates #
raw_df.duplicated()
# output True/False values for each row
raw_df.duplicated().sum()
# for raw_df it's 0, meaning there's no duplication
# Check if there's any duplicated values by column, output is True/False for each row
raw_df.duplicated('RegionName')
# Select the duplicated rows to see what they look like
# keep = False marks all duplicated values as True so it only leaves the duplicated rows
raw_df[raw_df['RegionName'].duplicated(keep=False)].sort_values('RegionName')
Drop the duplicated values.
The combination of 'CountyName' and 'SizeRank' is already unique, so we use these columns only to demonstrate the syntax of drop_duplicates.
# Drop duplicated rows #
# syntax: df.drop_duplicates(subset=[list of columns], keep='first' | 'last' | False)
unique_df = raw_df.drop_duplicates(subset = ['CountyName','SizeRank'], keep='first')
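Because raw_df has no duplicates on those columns, the call above returns everything unchanged. To see what the keep options actually do, here is a sketch on a hypothetical frame in which one region name appears twice.

```python
import pandas as pd

# Hypothetical frame: "Springfield" appears in two different states
df = pd.DataFrame({
    "RegionName": ["Springfield", "Springfield", "Salem"],
    "State": ["IL", "MA", "OR"],
})

# No fully identical rows, so this is 0
full_dupes = df.duplicated().sum()

# Duplicates judged on one column: the second "Springfield" is flagged
name_dupes = df.duplicated("RegionName").tolist()

# keep=False flags every member of a duplicated group
all_name_dupes = df[df["RegionName"].duplicated(keep=False)]

# keep='first' retains the first row of each group
first_only = df.drop_duplicates(subset=["RegionName"], keep="first")

# keep=False drops every row that has a duplicate
none_kept = df.drop_duplicates(subset=["RegionName"], keep=False)

print(full_dupes)                     # 0
print(name_dupes)                     # [False, True, False]
print(all_name_dupes.index.tolist())  # [0, 1]
print(first_only["State"].tolist())   # ['IL', 'OR']
print(none_kept["State"].tolist())    # ['OR']
```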
GitHub: https://gist.github.com/zhendata/5d73068e5b31b616938af51bedf65382