1️⃣ What is Data Mining?
- Data mining: 🌟 Discover valid, novel, useful, and understandable patterns in massive datasets.
Cross Disciplines
- Databases
- Machine learning
- Statistics
- Neural networks
- Parallel / Distributed computing
2️⃣ Characteristics of Data Mining
- Massive dataset
- Automatically searching for interesting patterns from historical data
- Fast
- Scalable
- Easy to update
- Practical
- Decision support
3️⃣ The main tasks of Data Mining
Clustering
Classification
Anomaly Detection
Association Rules
Social Network Analysis
Recommender Systems
Sequence Mining
🌬 Association Rule Mining
Detect sets of attributes or items that frequently co-occur across database records, together with the rules that hold among them.
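As a minimal sketch of "frequently co-occurring items", the snippet below counts every itemset of up to two items by brute force and keeps those whose support meets a threshold. The transactions, the function name `frequent_itemsets`, and the `min_support` and `max_size` parameters are all illustrative assumptions, not part of these notes; practical miners avoid this exhaustive enumeration.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items; keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    # Support = fraction of transactions that contain the itemset.
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, min_support=0.6)
for itemset, supp in sorted(frequent.items()):
    print(itemset, supp)
```

On this toy data, ("beer", "diapers") is reported as frequent while ("beer", "bread") is not, since the latter appears in only two of the five transactions.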
🔥 Classification
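Build a model of the classes from a training dataset, then use it to assign each new record to one of several predefined classes.
A simple classification model: the decision tree.
Classification uses prior knowledge (experience, labeled samples) to assign a class label to data that has not yet been labeled.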
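To make "build a model, then assign new records" concrete, here is a rough sketch of the smallest possible decision tree: a single split on one numeric feature (often called a decision stump). The income values, the labels, and the names `train_stump` and `predict` are hypothetical; the sketch assumes exactly two classes and at least two distinct feature values.

```python
from itertools import permutations

def train_stump(values, labels):
    """Pick the threshold and label order with the fewest training errors."""
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    classes = sorted(set(labels))
    best = None
    for threshold in candidates:
        for low_label, high_label in permutations(classes, 2):
            errors = sum(
                (low_label if v <= threshold else high_label) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    return best[1:]  # (threshold, label at or below it, label above it)

def predict(model, value):
    """Assign a new record to one of the predefined classes."""
    threshold, low_label, high_label = model
    return low_label if value <= threshold else high_label

# Hypothetical training data: yearly income (in thousands) and a loan decision.
incomes = [20, 25, 30, 45, 50, 60]
decisions = ["reject", "reject", "reject", "approve", "approve", "approve"]

model = train_stump(incomes, decisions)
print(model, predict(model, 28), predict(model, 55))
```

A full decision tree applies the same idea recursively, splitting each branch again until the records in a leaf mostly share one class.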
🍦 Clustering
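Partition the dataset into groups such that elements within a group are highly similar to one another (high intra-group similarity) and dissimilar to elements of other groups (low inter-group similarity).
The core of clustering: given unlabeled data, use a model to split it into clusters with the largest possible differences between clusters and the smallest possible differences within each cluster.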
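One common way to pursue this goal is k-means. The notes do not name a specific algorithm, so treat the following as one illustrative choice: a minimal 1-D version with a deterministic start. The data and the names `assign` and `kmeans_1d` are assumptions made for the example.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans_1d(points, k, iterations=20):
    """Alternate between assigning points and recomputing cluster means."""
    ordered = sorted(points)
    # Deterministic start: k evenly spaced values from the sorted data.
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centroids = [ordered[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, k=2)
print(centroids, clusters)
```

Each update moves a centroid to the mean of its members, which reduces the spread inside each cluster; no labels are used at any point.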
🐍 Anomaly Detection
Anomalies: 🌟 A set of objects that are considerably dissimilar from the remainder of the data.
Given a set of n objects and k, the number of expected anomalies, find the top k objects that are considerably dissimilar from or inconsistent with the remaining data.
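The top-k formulation can be sketched with a simple distance-based score. Here "dissimilar" is taken to mean "large mean Euclidean distance to all other objects", which is only one of many possible definitions and is an assumption of this example, as are the data points and the name `top_k_anomalies`. It relies on `math.dist`, available from Python 3.8.

```python
import math

def top_k_anomalies(points, k):
    """Rank objects by mean distance to all other objects; return the top k."""
    def score(i):
        others = [q for j, q in enumerate(points) if j != i]
        return sum(math.dist(points[i], q) for q in others) / len(others)

    ranked = sorted(range(len(points)), key=score, reverse=True)
    return [points[i] for i in ranked[:k]]

# Hypothetical 2-D data: five nearby points and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0), (1.0, 1.2)]
outliers = top_k_anomalies(points, k=1)
print(outliers)
```

With k = 1 the point (8.0, 8.0) is returned, because it lies far from every other object.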
📘 Sequence Mining
Given a set of sequences, find the complete set of frequent subsequences, that is, every subsequence that occurs in at least a minimum number of the sequences.
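A brute-force sketch of this task: enumerate every subsequence (order preserved, gaps allowed) up to a maximum length and count how many input sequences contain it. The sequences, the name `frequent_subsequences`, and the `min_support` and `max_length` parameters are illustrative assumptions; this enumeration is only workable for tiny inputs.

```python
from collections import Counter
from itertools import combinations

def frequent_subsequences(sequences, min_support, max_length=3):
    """Return subsequences contained in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for length in range(1, max_length + 1):
            # Index combinations are increasing, so item order is preserved.
            for idx in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idx))
        # A sequence contributes at most once to each pattern's count.
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

# Hypothetical sequences of events, one character per event.
sequences = ["abcd", "acd", "abd", "bcd"]
patterns = frequent_subsequences(sequences, min_support=3)
for pattern, count in sorted(patterns.items()):
    print("".join(pattern), count)
```

On this data, "ad" is frequent (it occurs in three sequences) while "ab" is not (it occurs in only two).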
4️⃣ The kinds of Data
- Relational Databases
- In plain terms: structured data, stored in the form of relational tables
- Data Warehouses
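- As the name suggests, a data warehouse is a very large collection of stored data, created for enterprise analytical reporting and decision support; it filters and integrates diverse business data.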
- Transactional Databases
- Spatial Data
- Time Series
- Text Databases
- Multimedia databases
- Data Streams
- Biomedical Data
- World-Wide Web
- Graph
5️⃣ Knowledge Discovery Process
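In plain terms: data from the databases is cleaned and integrated, the contents of the data warehouse are then selected and transformed, data mining is applied, the resulting patterns are evaluated, and the final outcome is usable knowledge.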
Key Steps 🔑
1️⃣ Learning the application domain
- Relevant prior knowledge and the goals of the application
2️⃣ Creating a target dataset
3️⃣ Data cleaning and preprocessing (may take 60% of the effort)
4️⃣ Data reduction and transformation
5️⃣ Choosing the mining algorithms to search for patterns of interest
6️⃣ Pattern evaluation and knowledge presentation
7️⃣ Use of discovered knowledge
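These steps can be summarized as three stages:
- User profiling stage
- Data processing stage
- Evaluation and application stage
In the user profiling stage, we learn the knowledge of the user's domain and create the target data resource.
In the data processing stage, we carry out data preprocessing, data reduction and transformation, and data mining to find the patterns of interest.
In the evaluation and application stage, we evaluate the patterns obtained and use them to present and apply knowledge.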
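The flow from raw data to usable knowledge can be sketched as a chain of small functions. Everything here is a toy assumption made for illustration: the records, the age bands, the counting-based "mining" step, and the function names `clean`, `transform`, `mine`, and `evaluate`.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, purchased_item).
# None marks a missing value.
raw = [
    (1, 23, "laptop"),
    (2, None, "phone"),
    (3, 35, "laptop"),
    (4, 41, None),
    (5, 29, "laptop"),
    (6, 52, "phone"),
    (7, 38, "phone"),
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if None not in r]

def transform(records):
    """Reduction and transformation: drop the id, discretize age into bands."""
    return [
        ("under_30" if age < 30 else "30_plus", item)
        for _, age, item in records
    ]

def mine(rows):
    """Mining: count how often each (age band, item) pair occurs."""
    return Counter(rows)

def evaluate(patterns, min_count):
    """Evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

knowledge = evaluate(mine(transform(clean(raw))), min_count=2)
print(knowledge)
```

Each function corresponds to one of the key steps listed above; in a real project the cleaning and preprocessing step is usually far larger than the mining step itself.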
6️⃣ Interesting Patterns
Measures
🌟 A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or if it validates some hypothesis that a user seeks to confirm.
Objective or Subjective
Objective: based on the statistics and structure of patterns, e.g., support, confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
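The two objective measures named above have standard definitions for a rule X → Y: support is the fraction of records that contain every item of an itemset, and confidence is support(X ∪ Y) / support(X). A minimal sketch follows; the transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"bread", "jam"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here the rule bread → butter holds in two of five transactions (support 0.4), and in two of the four transactions that contain bread (confidence 0.5).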
All or only interesting patterns
All: Completeness
Only: An optimization problem, and a challenging one
- Can a data mining system find only the interesting patterns?
- Approaches
- First generate all the patterns and then filter out the uninteresting ones.
- Guide and constrain the discovery process.
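The two approaches can be compared on frequent itemsets, taking "interesting" to mean "appears in at least `min_count` transactions" (an assumption made for this sketch). The first function enumerates every itemset and filters afterwards; the second grows itemsets level by level and only extends those already known to be frequent, in the spirit of Apriori-style pruning. The data and all names are hypothetical.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
items = sorted(set().union(*transactions))

def support_count(itemset):
    """Number of transactions that contain the whole itemset."""
    return sum(itemset <= t for t in transactions)

def generate_then_filter(min_count):
    """Approach 1: enumerate every non-empty itemset, then filter."""
    candidates = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
    ]
    frequent = {c for c in candidates if support_count(c) >= min_count}
    return frequent, len(candidates)

def constrained_search(min_count):
    """Approach 2: extend only itemsets that are already frequent."""
    frequent, evaluated = set(), 0
    level = [frozenset([i]) for i in items]
    while level:
        evaluated += len(level)
        kept = [c for c in level if support_count(c) >= min_count]
        frequent.update(kept)
        size = len(level[0]) + 1
        # Join two frequent itemsets that differ in exactly one item.
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune candidates that have an infrequent subset one item smaller.
        level = [c for c in joined if all(c - {i} in frequent for i in c)]
    return frequent, evaluated

all_frequent, n_all = generate_then_filter(min_count=3)
pruned_frequent, n_pruned = constrained_search(min_count=3)
print(len(all_frequent), n_all, n_pruned)
```

Both functions return the same set of frequent itemsets, but the constrained search evaluates far fewer candidates, which is the motivation for guiding the discovery process rather than filtering afterwards.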
7️⃣ Research Issues in Data Mining
- Mining methodology
- User interaction
- Applications and social impacts