Spark MLlib: Classifying Evergreen Web Pages

Published: 2025-05-12

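I. Lab Objectives

1. Learn to write user-defined functions (UDFs) in Spark SQL.

2. Learn the feature-engineering tools OneHotEncoder and VectorAssembler.

3. Understand how the decision tree algorithm works and be able to write programs with the Spark MLlib library.

4. Learn the evaluation methods for binary classification problems.

5. Be able to use train-validation split (TrainValidationSplit) and cross-validation (CrossValidator) to find the best model.

6. Understand how the random forest algorithm works.

7. Learn to use Spark MLlib to solve practical problems.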
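
As a rough illustration of objective 2, the sketch below shows in plain Python what one-hot encoding a categorical column and assembling the result with numeric columns into a single feature vector amounts to. It is not the Spark API: the helper names `one_hot` and `assemble`, the category list, and the sample row are all hypothetical, and only the column names come from the dataset description. Note also that Spark's OneHotEncoder drops the last category by default, whereas this sketch keeps every category for readability.

```python
# Plain-Python illustration only; these helpers are NOT Spark MLlib calls.

def one_hot(value, categories):
    """Return a 0/1 list with a single 1.0 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def assemble(*parts):
    """Concatenate scalars and lists into one flat feature vector."""
    vector = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            vector.extend(float(x) for x in part)
        else:
            vector.append(float(part))
    return vector


# Hypothetical category values and row, chosen only for illustration.
categories = ["business", "recreation", "sports"]
row = {"alchemy_category": "recreation", "avglinksize": 2.5, "commonLinkRatio_1": 0.4}

features = assemble(
    one_hot(row["alchemy_category"], categories),  # categorical column -> 0/1 slots
    row["avglinksize"],                            # numeric columns pass through
    row["commonLinkRatio_1"],
)
print(features)  # [0.0, 1.0, 0.0, 2.5, 0.4]
```

A classifier such as a decision tree is then trained on vectors of this shape, one per web page.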

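II. Lab Requirements

StumbleUpon is a personalized search engine that recommends web pages to users based on their interests, page ratings, and similar records. Some pages are ephemeral, such as news stories, which matter to readers only for a limited time; others are evergreen, holding readers' interest over the long term.

The goal of this lab is to analyze the StumbleUpon dataset with decision tree binary classification, predict whether a web page is ephemeral or evergreen, and tune the parameters to find the best combination and improve prediction accuracy. The dataset has 7,395 rows and 27 columns.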

url	string	Url of the webpage to be classified
urlid	integer	StumbleUpon's unique identifier for each url
boilerplate	json	Boilerplate text
alchemy_category	string	Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score	double	Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize	double	Average number of words in each link
commonLinkRatio_1	double	# of links sharing at least 1 word with 1 other link / # of links
commonLinkRatio_2	double	# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3	double	# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4	double