STAT411 STATISTICAL DATA MINING
Course Code: | 2460411 |
METU Credit (Theoretical-Laboratory hours/week): | 4 (3.00 - 2.00) |
ECTS Credit: | 6.0 |
Department: | Statistics |
Language of Instruction: | English |
Level of Study: | Undergraduate |
Course Coordinator: | Prof.Dr. CEYLAN YOZGATLIGİL |
Offered Semester: | Fall and Spring Semesters. |
Course Objectives
This course aims to equip students with a robust understanding of the data mining pipeline, from raw data to actionable insights. By the end of the course, students should be able to:
- Master data preparation techniques: Learn to identify and address common data issues such as missing values, noisy data, and outliers.
- Apply advanced data analysis methods: Gain proficiency in exploratory data analysis (EDA) and use modern visualization tools to uncover patterns and relationships in data.
- Utilize feature engineering and selection: Understand and apply various data transformation, encoding, and feature selection methods to prepare data for machine learning models.
- Implement regularization and dimension reduction: Apply techniques like Ridge, LASSO, and Elastic Net to handle multicollinearity and prevent overfitting. Additionally, learn to use methods like PCA and t-SNE to reduce data dimensionality for better visualization and model performance.
- Build and evaluate predictive models: Develop a strong foundation in a variety of modeling techniques, including regression, and understand the trade-offs involved in model building, such as the bias-variance trade-off.
- Explore advanced data mining applications: Gain exposure to specialized topics like association rules and recommendation systems, understanding their underlying principles and practical applications.
Course Content
Descriptive and predictive mining. Data preprocessing: cleaning, transformation, outlier detection, missing data imputation. Dimension reduction, Principal Component Analysis (PCA). Sampling, oversampling. Exploratory data analysis (EDA). Clustering methods: partitioning, hierarchical, density-based, model-based. Predictive modeling. Regression. Variable selection. Robust and nonlinear regression. Nonparametric regression. Classifiers. Logistic regression. Decision trees. Random Forest. Model evaluation and validation. Real-life applications using currently available software.
Course Learning Outcomes
Upon successful completion of this course, students will be able to:
- Perform Exploratory Data Analysis (EDA): Independently conduct EDA on a given dataset, use appropriate visualization techniques (including newly developed methods), and summarize key findings.
- Preprocess and clean messy data: Apply various data cleaning techniques to handle missing values, remove duplicates, and correct inconsistencies. Identify and manage issues like multicollinearity, confounding, and interaction effects.
- Transform and engineer features: Select and apply suitable data transformation methods (e.g., scaling, binarization, encoding) and feature selection techniques to optimize data for machine learning algorithms.
- Use Regularization for Model Tuning: Implement and compare the performance of Ridge, LASSO, and Elastic Net regression to build more stable and generalized models.
- Reduce Data Dimensionality: Apply Principal Component Analysis (PCA) for linear dimension reduction and use non-linear techniques like t-SNE and UMAP for visualization of high-dimensional data. Explain the differences and appropriate use cases for each method.
- Apply Clustering: Implement clustering algorithms such as K-Means, Breathing K-Means, Hierarchical Clustering, DBSCAN, and GMM to discover natural groupings within data.
- Handle Missing Data and Evaluate Models: Employ different strategies to handle missing data, and apply cross-validation techniques to evaluate and compare model performance accurately, taking the bias-variance trade-off into account.
- Discover Association Rules and Build Recommendation Systems: Analyze transactional data to discover association rules (e.g., using the Apriori algorithm) and understand the fundamentals of building a recommendation system, such as collaborative filtering.
Program Outcomes Matrix
Level of Contribution | |||||
# | Program Outcomes | 0 | 1 | 2 | 3 |
1 | Applying the knowledge of statistics, mathematics and computer to statistical problems and developing analytical solutions. | ✔ | |||
2 | Defining, modeling and solving real life problems that involve uncertainty, and interpreting results. | ✔ | |||
3 | Deciding on the data collection technique and applying it through experiment, observation, questionnaire or simulation. | ✔ | |||
4 | Analysing small and big volumes of data and interpreting results. | ✔ | |||
5 | Utilizing up-to-date techniques, computer hardware and software required for statistical applications; developing software programs and numerical solutions for specific problems when necessary. | ✔ | |||
6 | Taking part in intradisciplinary and interdisciplinary teamwork, using time efficiently, taking leadership responsibilities and being entrepreneurial. | ✔ | |||
7 | Taking responsibility in individual work and offering authentic solutions. | ✔ | |||
8 | Following contemporary developments and publications in statistical science, conducting research, being open to novelty and thinking critically. | ✔ | |||
9 | Efficiently communicating in Turkish and English to define and analyze statistical problems and to interpret the results. | ✔ | |||
10 | Having a professional and ethical sense of responsibility. | ✔ | |||
11 | Developing computational solutions to statistical problems that cannot be solved analytically. | ✔ | |||
12 | Having theoretical background and developing new theories in statistics, building relations between theoretical and practical knowledge. | ✔ | |||
13 | Serving the society with the expertise in the field. | ✔ |
0: No Contribution; 1: Little Contribution; 2: Partial Contribution; 3: Full Contribution