STAT411 STATISTICAL DATA MINING

Course Code: 2460411
METU Credit (Theoretical-Laboratory hours/week): 4 (3.00 - 2.00)
ECTS Credit: 6.0
Department: Statistics
Language of Instruction: English
Level of Study: Undergraduate
Course Coordinator: Prof. Dr. CEYLAN YOZGATLIGİL
Offered Semester: Fall and Spring Semesters.

Course Objectives

This course aims to equip students with a robust understanding of the data mining pipeline, from raw data to actionable insights. By the end of the course, students should be able to:

  • Master data preparation techniques: Learn to identify and address common data issues such as missing values, noisy data, and outliers.

  • Apply advanced data analysis methods: Gain proficiency in exploratory data analysis (EDA) and use modern visualization tools to uncover patterns and relationships in data.

  • Utilize feature engineering and selection: Understand and apply various data transformation, encoding, and feature selection methods to prepare data for machine learning models.

  • Implement regularization and dimension reduction: Apply techniques like Ridge, LASSO, and Elastic Net to handle multicollinearity and prevent overfitting. Additionally, learn to use methods like PCA and t-SNE to reduce data dimensionality for better visualization and model performance.

  • Build and evaluate predictive models: Develop a strong foundation in a variety of modeling techniques, including regression, and understand the trade-offs involved in model building, such as the bias-variance trade-off.

  • Explore advanced data mining applications: Gain exposure to specialized topics like association rules and recommendation systems, understanding their underlying principles and practical applications.


Course Content

Descriptive and predictive mining. Data preprocessing: cleaning, transformation, outlier detection, missing data imputation. Dimension reduction, Principal Component Analysis (PCA). Sampling, oversampling. Exploratory data analysis (EDA). Clustering methods: partitioning, hierarchical, density-based, model-based. Predictive modeling. Regression. Variable selection. Robust and nonlinear regression. Nonparametric regression. Classifiers. Logistic regression. Decision trees. Random Forest. Model evaluation and validation. Real-life applications using current software.


Course Learning Outcomes

Upon successful completion of this course, students will be able to:

  • Perform Exploratory Data Analysis (EDA): Independently conduct EDA on a given dataset, use appropriate visualization techniques (including newly developed methods), and summarize key findings.
  • Preprocess and clean messy data: Apply various data cleaning techniques to handle missing values, remove duplicates, and correct inconsistencies. Identify and manage issues like multicollinearity, confounding, and interaction effects.
  • Transform and engineer features: Select and apply suitable data transformation methods (e.g., scaling, binarization, encoding) and feature selection techniques to optimize data for machine learning algorithms.
  • Use regularization for model tuning: Implement Ridge, LASSO, and Elastic Net regression and compare their performance to build more stable models that generalize better.
  • Reduce data dimensionality: Apply Principal Component Analysis (PCA) for linear dimension reduction and use non-linear techniques like t-SNE and UMAP to visualize high-dimensional data. Explain the differences between these methods and the appropriate use cases for each.
  • Cluster data: Implement clustering algorithms like K-Means, Breathing K-Means, Hierarchical Clustering, DBSCAN, and Gaussian mixture models (GMM) to discover natural groupings within data.
  • Handle missing data and evaluate models: Employ different strategies to handle missing data, and apply cross-validation techniques to evaluate and compare model performance accurately, with an understanding of the bias-variance trade-off.
  • Discover association rules and build recommendation systems: Analyze transactional data to discover association rules (e.g., using the Apriori algorithm) and understand the fundamentals of building a recommendation system, such as collaborative filtering.

Program Outcomes Matrix

Level of Contribution
#   Program Outcomes   0   1   2   3
1   Applying the knowledge of statistics, mathematics and computer to statistical problems and developing analytical solutions.
2   Defining, modeling and solving real life problems that involve uncertainty, and interpreting results.
3   Deciding on the data collection technique and applying it through experiment, observation, questionnaire or simulation.
4   Analysing small and big volumes of data and interpreting results.
5   Utilizing up-to-date techniques, computer hardware and software required for statistical applications; developing software programs and numerical solutions for specific problems when necessary.
6   Taking part in intradisciplinary and interdisciplinary teamwork, using time efficiently, taking leadership responsibilities and being entrepreneurial.
7   Taking responsibility in individual work and offering authentic solutions.
8   Following contemporary developments and publications in statistical science, conducting research, being open to novelty and thinking critically.
9   Efficiently communicating in Turkish and English to define and analyze statistical problems and to interpret the results.
10  Having a professional and ethical sense of responsibility.
11  Developing computational solutions to statistical problems that cannot be solved analytically.
12  Having theoretical background and developing new theories in statistics, building relations between theoretical and practical knowledge.
13  Serving the society with the expertise in the field.

0: No Contribution; 1: Little Contribution; 2: Partial Contribution; 3: Full Contribution