click below
click below
Normal Size Small Size show me how
MIS 665 - MT1
Midterm 1 MIS 665 Business Analytics
| Term | Definition |
|---|---|
| importing pandas package | import pandas as pd |
| Read csv file | df = pd.read_csv('data/file.csv') |
| save to csv | df.to_csv('data/output.csv') |
| Show first 5 rows | df.head() |
| show column names | df.columns |
| show count of (rows, columns) | df.shape |
| basic statistics | df.describe() |
| descriptive statistics (mean, median, etc.) | df['col'].mean() |
| select single column | df['col'] |
| select multiple columns | df[['col1', 'col2']] |
| filter rows | df.loc[df['col'] == 'value'] # (wrap in len() for count) |
| sort values | d.sort_values('col') |
| count values (freq. table) | df['col'].value_counts() |
| group and count values | df.groupby('col').size() |
| convert to list | df.values.tolist() |
| Steps to visit a website | 1. IMPORT packages, 2. get CHROME driver, 3. enter the site with DRIVER.GET('link') |
| xpath by class | //div[@class=='example'] |
| xpath by attribute | //td[@date-stat='player] |
| find elements | driver.find_elements(By.XPATH, '//div') # returns list row.find_element # returns single element |
| get url or class | element.get_attribute('href') # or 'class' |
| initialize dictionary | data = {'Col1': [], 'Col2': []} |
| loop through rows | for row in driver.find_elements( By.XPATH, //div[@class=='item']']: |
| convert to dataframe | df = pd.DataFrame(data) |
| try / except | try / except |
| check data types | df.info() |
| count missing | df.isnull().sum() |
| drop missing rows | df.dropna() |
| remove $ | df['Price'].str.replace('$', '') |
| to numeric | df['Col'] = pd.to_numeric(df['Col']) |
| create new columns | df['new_col'] = df['sales'] * 0.34 |
| 3 blocks | import --> driver --> get URL |
| ETL | Extract, Transform, Load |
| fill missing values | df.fillna(['Col': 'value'}) OR df['Col'] = df['Col'].fillna(df['Col'].median()) |
| remove columns (1 means columns) | df.drop('Col1', axis=1) |
| check for duplicates | df.duplicated().sum() |
| drop duplicates | df.drop_duplicates() |
| reset index after removing | df.reset_index(drop=True) |
| remove whitespace | str.strip() |
| replace chars | str.replace(,) |
| search text | str.contains() |
| merge data | pd.merge(df1, df2, on='key') |
| join only matching rows | how='inner' |
| join all from left + matches | how='left' |
| join all from right+ matches | how='right' |
| join all rows from both | how='outer' |
| what to check first | df.info() and df.isnull.sum() |
| how to find inconsistent text values | value_counts() |
| advanced filtering and | & |
| advanced filtering or | | |
| advanced filtering not | ~ |
| check if value in list | .isin |
| check between ranges | .between |
| query SQL-like syntax | df.query('Dept == "Sales" and Age > 25') |
| count rows per group | df.groupby('Col').size() |
| aggregations | .agg(['sum', 'mean', 'min']) |
| pivot table | df.pivot_table(index='Row', columns='Col', values='Data', aggfunc='sum') |
| time series - set date as index | df.set_index('date'), df.sort_index() |
| resample by time period | df.resample('ME').sum() # Monthly, yearly is YE |
| filter by date range | df.loc['2026-06'] |
| count row per group | groupby().size() |
| multiple aggregations at once | .agg() |
| numerical data chart (distribution) | df['col'].plot.hist() OR .box() |
| categorical data (counts) | df.groupby('col').size().plot.bar() |
| Two numerical columns (relationship) | df.plot.scatter(x=, y=) |
| Categorical + numerical (mean values) | df.groupby('cat')['num'].mean().plot.bar() |
| count by category chart | df.groupby('').size().plot.bar() |
| use .unstack() when | doing a multi-column groupby chart |
| seaborn violin plot | sns.violinplot(x=, y=, data=) |
| correlation matrix | df.corr(numeric_only=True) |
| correlation with target column | df.corr()['Target'].sort_values() |
| charts for numerical distribution | histogram / boxplot |
| chart for categorical counts / comparisons | bar chart |
| use .unstack() after groupby for grouped bars | use .unstack() after groupby for grouped bars |
| stacked=True for stacked bar charts | stacked=True for stacked bar charts |
| Divide by row totals for proportion charts | Divide by row totals for proportion charts |
| correlation overview with seaborn | sns.heatmap(df.corr()) |
| statistically significant p-value | p-value < 0.05 |
| What is SQL | Structured Query Language |
| steps to connect to database in python | import package, connect with create_engine, .get_table_names(), pd.read_sql() |
| steps in time series | set date time index, assign date time index as index, sort index |
| SQL Query Structure | SELECT [DISTINCT] <columns> FROM <table> [WHERE <condition>] [GROUP BY <columns>] [HAVING <condition>] [ORDER BY <columns>] [LIMIT <n>] |
| SELECT * FROM table | df |
| SELECT col1, col2 FROM table | df[['col1', 'col2']] |
| DISTINCT | .unique() |
| UPPER() | .str.upper() |
| WHERE clause | filtering |
| AND | & |
| OR | | |
| IS NULL | .isna() |
| IS NOT NULL | .notna() |
| COUNT(*), SUM(), AVG(), MAX(), MIN() | SQL aggregate functions |
| SELECT col, COUNT(*) FROM t GROUP BY col | df.groupby('col').size() |
| GROUP BY HAVING | .groupby() # filter after groupby |
| ORDER BY col ASC | .sort_values() |
| LIMIT 5 | .head() |
| SQL: SELECT * FROM t1 INNER JOIN t2 ON t1.key = t2.key | Pandas: pd.merge(df1, df2, on='key', how='') |
| AS new_name | .rename(columns={}) |
| # display the people from 'Kansas' | df.loc[df['location'] == 'Kansas'].head() |