ACADEMIC CATALOG

STAT411 STATISTICAL DATA MINING

Course Code:	2460411
METU Credit (Theoretical-Laboratory hours/week):	4 (3.00 - 2.00)
ECTS Credit:	6.0
Department:	Statistics
Language of Instruction:	English
Level of Study:	Undergraduate
Course Coordinator:	Prof.Dr. CEYLAN YOZGATLIGİL
Offered Semester:	Fall and Spring Semesters.

Course Objectives

This course aims to equip students with a robust understanding of the data mining pipeline, from raw data to actionable insights. By the end of the course, students should be able to:

Master data preparation techniques: Learn to identify and address common data issues such as missing values, noisy data, and outliers.
Apply advanced data analysis methods: Gain proficiency in exploratory data analysis (EDA) and use modern visualization tools to uncover patterns and relationships in data.
Utilize feature engineering and selection: Understand and apply various data transformation, encoding, and feature selection methods to prepare data for machine learning models.
Implement regularization and dimension reduction: Apply techniques like Ridge, LASSO, and Elastic Net to handle multicollinearity and prevent overfitting. Additionally, learn to use methods like PCA and t-SNE to reduce data dimensionality for better visualization and model performance.
Build and evaluate predictive models: Develop a strong foundation in a variety of modeling techniques, including regression, and understand the trade-offs involved in model building, such as the bias-variance trade-off.
Explore advanced data mining applications: Gain exposure to specialized topics like association rules and recommendation systems, understanding their underlying principles and practical applications.
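
As an illustration of the data preparation objective above, the following is a minimal sketch of median imputation for missing values followed by outlier flagging with the interquartile range (IQR) rule. It is not course material: the function names `impute_median` and `iqr_outliers`, the sample values, and the multiplier `k=1.5` are assumptions chosen for this example.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with two missing entries and one extreme value.
raw = [12.0, 14.5, None, 13.2, 98.0, 12.8, None, 13.9]
clean = impute_median(raw)     # missing entries filled with the median
flagged = iqr_outliers(clean)  # values outside the IQR fences
```

The median is used as the fill value here because, unlike the mean, it is not pulled toward the extreme value that the second step flags.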

Course Content

Descriptive and predictive mining. Data preprocessing: cleaning, transformation, outlier detection, missing data imputation. Dimension reduction, Principal Component Analysis (PCA). Sampling, oversampling. Exploratory data analysis (EDA). Clustering methods: partitioning, hierarchical, density-based, model-based. Predictive modeling. Regression. Variable selection. Robust and nonlinear regression. Nonparametric regression. Classifiers. Logistic regression. Decision trees. Random Forest. Model evaluation and validation. Real-life applications using recent software.

Course Learning Outcomes

Upon successful completion of this course, students will be able to:

Perform Exploratory Data Analysis (EDA): Independently conduct EDA on a given dataset, use appropriate visualization techniques (including newly developed methods), and summarize key findings.
Preprocess and clean messy data: Apply various data cleaning techniques to handle missing values, remove duplicates, and correct inconsistencies. Identify and manage issues like multicollinearity, confounding, and interaction effects.
Transform and engineer features: Select and apply suitable data transformation methods (e.g., scaling, binarization, encoding) and feature selection techniques to optimize data for machine learning algorithms.
Use Regularization for Model Tuning: Implement and compare the performance of Ridge, LASSO, and Elastic Net regression to build more stable and generalized models.
Reduce Data Dimensionality: Apply Principal Component Analysis (PCA) for linear dimension reduction and use non-linear techniques like t-SNE and UMAP for visualization of high-dimensional data. Explain the differences and appropriate use cases for each method.
Apply Clustering: Implement clustering algorithms like K-Means, Breathing K-Means, Hierarchical Clustering, DBSCAN, and GMM to discover natural groupings within data.
Handle Missing Data and Evaluate Models: Employ different strategies to handle missing data, and apply cross-validation techniques to evaluate and compare model performance accurately, with an understanding of the bias-variance trade-off.
Discover Association Rules and Build Recommendation Systems: Analyze transactional data to discover association rules (e.g., using the Apriori algorithm) and understand the fundamentals of building a recommendation system, such as collaborative filtering.
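
To make the association rule outcome above concrete, here is a minimal sketch of Apriori-style frequent itemset mining and rule generation in plain Python. It is an illustration under stated assumptions, not the course's implementation: the function names `apriori` and `association_rules`, the sample baskets, and the thresholds are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Count support and keep only the frequent candidates at this level.
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                antecedent = frozenset(subset)
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]
frequent = apriori(baskets, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.9)
```

In this sample, "eggs" appears in only one of five baskets and is discarded at the first level, so no larger itemset containing it is ever counted. That pruning is the main idea behind Apriori.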

Program Outcomes Matrix

Level of Contribution
#	Program Outcomes	0	1	2	3
1	Applying the knowledge of statistics, mathematics and computer to statistical problems and developing analytical solutions.				✔
2	Defining, modeling and solving real life problems that involve uncertainty, and interpreting results.				✔
3	Deciding on the data collection technique and applying it through experiment, observation, questionnaire or simulation.				✔
4	Analysing small and big volumes of data and interpreting results.				✔
5	Utilizing up-to-date techniques, computer hardware and software required for statistical applications; developing software programs and numerical solutions for specific problems when necessary.				✔
6	Taking part in intradisciplinary and interdisciplinary teamwork, using time efficiently, taking leadership responsibilities and being entrepreneurial.				✔
7	Taking responsibility in individual work and offering authentic solutions.				✔
8	Following contemporary developments and publications in statistical science, conducting research, being open to novelty and thinking critically.				✔
9	Efficiently communicating in Turkish and English to define and analyze statistical problems and to interpret the results.			✔
10	Having a professional and ethical sense of responsibility.				✔
11	Developing computational solutions to statistical problems that cannot be solved analytically.				✔
12	Having theoretical background and developing new theories in statistics, building relations between theoretical and practical knowledge.			✔
13	Serving the society with the expertise in the field.				✔

0: No Contribution 1: Little Contribution 2: Partial Contribution 3: Full Contribution
