Steps to visit a website

1. IMPORT packages, 2. get CHROME driver, 3. enter the site with DRIVER.GET('link')

driver.find_elements(By.XPATH, '//div') # returns list row.find_element # returns single element

element.get_attribute('href') # or 'class'

initialize dictionary

data = {'Col1': [], 'Col2': []}

for row in driver.find_elements( By.XPATH, //div[@class=='item']']:

df['Price'].str.replace('$', '')

df['Col'] = pd.to_numeric(df['Col'])

df['new_col'] = df['sales'] * 0.34

df.fillna(['Col': 'value'}) OR df['Col'] = df['Col'].fillna(df['Col'].median())

Help

Options

focusNode

Didn't know it?
click below

Knew it?
click below

Don't Know

Remaining cards (0)

Know

retry

shuffle

restart

0:00

Embed Code - If you would like this activity on your web page, copy the script below and paste it into your web page.

Normal Size Small Size show me how

MIS 665 - MT1

Midterm 1 MIS 665 Business Analytics

Term	Definition
importing pandas package	import pandas as pd
Read csv file	df = pd.read_csv('data/file.csv')
save to csv	df.to_csv('data/output.csv')
Show first 5 rows	df.head()
show column names	df.columns
show count of (rows, columns)	df.shape
basic statistics	df.describe()
descriptive statistics (mean, median, etc.)	df['col'].mean()
select single column	df['col']
select multiple columns	df[['col1', 'col2']]
filter rows	df.loc[df['col'] == 'value'] # (wrap in len() for count)
sort values	d.sort_values('col')
count values (freq. table)	df['col'].value_counts()
group and count values	df.groupby('col').size()
convert to list	df.values.tolist()
Steps to visit a website	1. IMPORT packages, 2. get CHROME driver, 3. enter the site with DRIVER.GET('link')
xpath by class	//div[@class=='example']
xpath by attribute	//td[@date-stat='player]
find elements	driver.find_elements(By.XPATH, '//div') # returns list row.find_element # returns single element
get url or class	element.get_attribute('href') # or 'class'
initialize dictionary	data = {'Col1': [], 'Col2': []}
loop through rows	for row in driver.find_elements( By.XPATH, //div[@class=='item']']:
convert to dataframe	df = pd.DataFrame(data)
try / except	try / except
check data types	df.info()
count missing	df.isnull().sum()
drop missing rows	df.dropna()
remove $	df['Price'].str.replace('$', '')
to numeric	df['Col'] = pd.to_numeric(df['Col'])
create new columns	df['new_col'] = df['sales'] * 0.34
3 blocks	import --> driver --> get URL
ETL	Extract, Transform, Load
fill missing values	df.fillna(['Col': 'value'}) OR df['Col'] = df['Col'].fillna(df['Col'].median())
remove columns (1 means columns)	df.drop('Col1', axis=1)
check for duplicates	df.duplicated().sum()
drop duplicates	df.drop_duplicates()
reset index after removing	df.reset_index(drop=True)
remove whitespace	str.strip()
replace chars	str.replace(,)
search text	str.contains()
merge data	pd.merge(df1, df2, on='key')
join only matching rows	how='inner'
join all from left + matches	how='left'
join all from right+ matches	how='right'
join all rows from both	how='outer'
what to check first	df.info() and df.isnull.sum()
how to find inconsistent text values	value_counts()
advanced filtering and	&
advanced filtering or	\|
advanced filtering not	~
check if value in list	.isin
check between ranges	.between
query SQL-like syntax	df.query('Dept == "Sales" and Age > 25')
count rows per group	df.groupby('Col').size()
aggregations	.agg(['sum', 'mean', 'min'])
pivot table	df.pivot_table(index='Row', columns='Col', values='Data', aggfunc='sum')
time series - set date as index	df.set_index('date'), df.sort_index()
resample by time period	df.resample('ME').sum() # Monthly, yearly is YE
filter by date range	df.loc['2026-06']
count row per group	groupby().size()
multiple aggregations at once	.agg()
numerical data chart (distribution)	df['col'].plot.hist() OR .box()
categorical data (counts)	df.groupby('col').size().plot.bar()
Two numerical columns (relationship)	df.plot.scatter(x=, y=)
Categorical + numerical (mean values)	df.groupby('cat')['num'].mean().plot.bar()
count by category chart	df.groupby('').size().plot.bar()
use .unstack() when	doing a multi-column groupby chart
seaborn violin plot	sns.violinplot(x=, y=, data=)
correlation matrix	df.corr(numeric_only=True)
correlation with target column	df.corr()['Target'].sort_values()
charts for numerical distribution	histogram / boxplot
chart for categorical counts / comparisons	bar chart
use .unstack() after groupby for grouped bars	use .unstack() after groupby for grouped bars
stacked=True for stacked bar charts	stacked=True for stacked bar charts
Divide by row totals for proportion charts	Divide by row totals for proportion charts
correlation overview with seaborn	sns.heatmap(df.corr())
statistically significant p-value	p-value < 0.05
What is SQL	Structured Query Language
steps to connect to database in python	import package, connect with create_engine, .get_table_names(), pd.read_sql()
steps in time series	set date time index, assign date time index as index, sort index
SQL Query Structure	SELECT [DISTINCT] <columns> FROM <table> [WHERE <condition>] [GROUP BY <columns>] [HAVING <condition>] [ORDER BY <columns>] [LIMIT <n>]
SELECT * FROM table	df
SELECT col1, col2 FROM table	df[['col1', 'col2']]
DISTINCT	.unique()
UPPER()	.str.upper()
WHERE clause	filtering
AND	&
OR	\|
IS NULL	.isna()
IS NOT NULL	.notna()
COUNT(*), SUM(), AVG(), MAX(), MIN()	SQL aggregate functions
SELECT col, COUNT(*) FROM t GROUP BY col	df.groupby('col').size()
GROUP BY HAVING	.groupby() # filter after groupby
ORDER BY col ASC	.sort_values()
LIMIT 5	.head()
SQL: SELECT * FROM t1 INNER JOIN t2 ON t1.key = t2.key	Pandas: pd.merge(df1, df2, on='key', how='')
AS new_name	.rename(columns={})
# display the people from 'Kansas'	df.loc[df['location'] == 'Kansas'].head()

Created by: lexi.welte

"Know" box contains:
Time elapsed:
Retries: