Abstract
Data mining is an approach to discovering knowledge from large data sets. Pollutant forecasting is an important problem in the environmental sciences. This paper applies data mining methods to forecast the concentration level of fine particles (PM2.5) in a new town in a rural area of Hong Kong. Several classification algorithms are available in data mining, such as Artificial Neural Network (ANN), Boosting, and k-Nearest Neighbours (k-NN). All of them are popular machine learning algorithms in various data mining areas, including environmental, educational, and financial data mining. This paper builds predictive models of the PM2.5 concentration level based on ANN, Boosting (i.e. AdaBoost.M1), and k-NN using R packages. The data set comprises meteorological data and PM2.5 data for the period 2009 to 2011. The PM2.5 concentration is divided into two levels: low and high. The critical point is 25 μg/m^3 (24-hour mean). The parameters of the models are selected by multiple rounds of cross validation. Over 100 replications of 10-fold cross validation, the testing accuracy of AdaBoost.M1 ranges from about 0.846 to 0.868, which is the best result among the three algorithms in this paper.
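The evaluation protocol summarized above (a two-level label cut at 25 μg/m^3, scored by replicated 10-fold cross validation) can be illustrated with a minimal sketch. This is not the authors' R code: it is written in Python with only the standard library, it uses a plain k-NN classifier as a stand-in for any of the three models, and the data are synthetic. The function names, the two made-up predictors, and the choice to label a value of exactly 25 as "low" are all assumptions made for illustration.

```python
import random
from collections import Counter

THRESHOLD = 25.0  # ug/m^3, 24-hour mean: the critical point stated in the abstract


def pm25_level(concentration):
    """Map a 24-hour mean PM2.5 value to a class.

    Assumption: values strictly above the threshold are "high".
    """
    return "high" if concentration > THRESHOLD else "low"


def make_synthetic_data(n=200, seed=1):
    """Synthetic stand-in for the real data set (hypothetical predictors)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x1, x2 = rng.uniform(0, 10), rng.uniform(0, 10)
        pm25 = 10 + 3 * x1 + rng.gauss(0, 4)  # made-up relationship
        xs.append((x1, x2))
        ys.append(pm25_level(pm25))
    return xs, ys


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def repeated_cv_accuracy(xs, ys, k=5, folds=10, replications=100, seed=0):
    """Return one pooled test accuracy per replication of `folds`-fold CV."""
    rng = random.Random(seed)
    n = len(xs)
    accuracies = []
    for _ in range(replications):
        order = list(range(n))
        rng.shuffle(order)  # a fresh random partition for each replication
        correct = 0
        for f in range(folds):
            test_idx = order[f::folds]  # every `folds`-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_x = [xs[i] for i in train_idx]
            train_y = [ys[i] for i in train_idx]
            correct += sum(
                knn_predict(train_x, train_y, xs[i], k) == ys[i]
                for i in test_idx
            )
        accuracies.append(correct / n)  # each sample is tested exactly once
    return accuracies


if __name__ == "__main__":
    xs, ys = make_synthetic_data()
    accs = repeated_cv_accuracy(xs, ys, replications=10)
    print(f"accuracy range over replications: {min(accs):.3f} to {max(accs):.3f}")
```

Reporting the minimum and maximum over replications mirrors the way the abstract quotes a range of testing accuracies rather than a single number; the values printed here depend entirely on the synthetic data and say nothing about the paper's results.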