Applying text and data mining techniques to forecasting the trend of petitions filed to e-People


Expert Systems with Applications 37 (2010) 7255–7268


Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Jong Hwan Suh a,*, Chung Hoon Park b, Si Hyun Jeon c

a Information and Communication Technology Research and Development Center, SAMSUNG SDS, 159-1 Samseong-dong, Gangnam-gu, Seoul 135-798, Republic of Korea
b Consulting Division, SAMSUNG SDS, 707-19 Yeoksam-dong, Gangnam-gu, Seoul 135-918, Republic of Korea
c Anti-corruption and Civil Rights Commission (ACRC), Uijuro 81, Seodaemun-gu, Seoul 125-705, Republic of Korea

Keywords: Text mining; Data mining; Petition; Keyword extracting; Document clustering; Forecasting; e-Government; Open Innovation; e-People

Abstract

As the Internet has become a virtual place where citizens unite and their opinions quickly turn into action, two-way communication between the government and citizens has become more important among e-Government activities. Hence, the Anti-corruption and Civil Rights Commission (ACRC) in the Republic of Korea has constructed an online petition portal system named e-People. In addition, the nation’s Open Innovation through e-People has gained increasing attention, because e-People can serve as a virtual space where citizens participate in improving national law and policy simply by filing petitions as the voice of the nation. However, problems and challenging issues must be solved before e-People can function as the virtual space for the nation’s Open Innovation based on petitions collected from citizens. First, there is no objective and systematic method for analyzing the large number of petitions filed to e-People without a great deal of manual work by petition inspectors. Second, e-People is required to forecast the trend of petitions more accurately and quickly than petition inspectors can, in order to support better decisions on national law and policy strategy. Therefore, in this paper, we propose a framework that applies text and data mining techniques not only to analyze the large number of petitions filed to e-People but also to predict the trend of petitions. In detail, we apply text mining techniques to the unstructured data of petitions to elicit keywords and to identify groups of petitions with the elicited keywords. Moreover, we apply data mining techniques to the structured data of the identified petition groups to forecast the trend of petitions. Our approach reduces the time-consuming manual work of reading and classifying a large number of petitions, and contributes to increasing accuracy in evaluating the trend of petitions. Eventually, it helps petition inspectors pay more attention to detecting and tracking important groups of petitions that may grow into nationwide problems. Further, the petitions ordered by their petition groups’ trend values can be used as a baseline for making better decisions on national law and policy strategy.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Owing to the development of information and communication technology (ICT), particularly the Internet, governments around the world have tried to transform themselves into electronic government (e-Government), a.k.a. digital government. The construction of e-Government aims at providing citizens with services quickly and accurately, making government work more effective, innovating by redesigning work processes, and raising national competitiveness by improving productivity (Lee & Jung, 2004). Therefore, the anticipated benefits of e-Government include more efficiency, greater convenience, improved services, better accessibility of public services, less corruption, more transparency, revenue growth, and cost reductions (Atkinson & Castro, 2008). According to Palvia and Sharma (2007), the various activities of e-Government can be summarized into four categories with respect to interaction domains. The first type pushes information over the Internet, e.g. regulatory services, issue briefs, and notifications. The second type aims at improving two-way communication between the government agency and a citizen, a business, or another government agency. In this type of model, users can engage in dialogue with government agencies and post problems, comments, or requests to them. Third, e-Government helps conduct transactions such as lodging tax returns and applying for services and grants. The fourth type is based on governance, e.g. online polling, voting, and campaigning.

* Corresponding author. Tel.: +82 10 3087 1229; fax: +82 70 7016 0029. E-mail address: [email protected] (J.H. Suh).
0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.04.002


J.H. Suh et al. / Expert Systems with Applications 37 (2010) 7255–7268

Among these four types of e-Government activities, the second has recently become more important, especially in the Republic of Korea. This is because the Internet has started to be used as a virtual space where citizens unite and their opinions quickly turn into action such as a mass rally. Therefore, the Anti-corruption and Civil Rights Commission (ACRC), one of the government bodies in the Republic of Korea, decided to strengthen its collection of public opinion by building an online petition portal system. In June 2006, ACRC launched the online petition portal system called e-People¹ to hear the voice of the nation by merging the scattered online channels that had collected civil complaints and petitions. As a result, ACRC gained recognition for e-People by winning the Best Demonstration Stand Award at the eChallenge Conference and Exhibition 2008 held in Stockholm, Sweden (ACRC, 2008). Nevertheless, we expect that e-People will face another challenge as Open Innovation gains increasing attention as a new paradigm for innovation in business. In short, the concept behind Open Innovation is that companies cannot rely entirely on their own research and should instead buy or license processes or inventions from other companies (Chesbrough, 2003). Similarly, governments will also need to consider putting the concept of Open Innovation into action, for example by innovating national law and policy not only from the inside, i.e. themselves, but also from the outside, i.e. citizens. Thus, we predict that the concept of Open Innovation is going to affect e-Government activities sooner or later.
As a result, we agreed that e-People should evolve into a virtual space where the boundaries between citizens and government dissolve, so that citizens can conveniently take part in innovating the government system by filing petitions to e-People. However, there are currently problems and challenging issues to resolve before e-People can realize the concept of Open Innovation in the government sector and raise the e-Government of the Republic of Korea to the next level. They can be summarized in two matters as follows. First, there is no objective and systematic method to analyze the large number of petitions filed to e-People without a great deal of manual work by petition inspectors. Petitions in e-People are collected from citizens all over the Republic of Korea, so the number of petitions to be read is beyond human capacity. Besides, more than half of the data in a petition is text-based, which makes it difficult for petition inspectors to understand petitions at a glance through manual work. This makes it hard to grasp the voice of the nation on the basis of petitions filed to e-People. Therefore, we need to take advantage of text mining techniques. If keywords are elicited from the text of petitions and the petitions are clustered into petition groups with the elicited keywords, petition inspectors will be able to focus on analyzing the trend of petitions and consequently on grasping the voice of the nation, with much less manual work spent reading a great number of petitions. However, our literature survey showed that little research has applied text mining techniques to petitions to overcome these problems, so such research is needed. Second, it is required to forecast the trend of petitions more accurately and quickly by using e-People rather than petition inspectors, for a better national law and policy strategy.
e-People currently recognizes the importance of petitions only after they have actually become serious, because they are evaluated through manual analysis by petition inspectors. So it takes a lot of time to find important petitions that might grow into nationwide problems, which delays planning and carrying out the related national law and policy strategy. However, if prediction models are built by applying data mining techniques to petitions, we will be able to predict the trend of petitions more accurately and quickly by using e-People rather than petition inspectors. In the end, the predicted trend of petitions will help the government make better decisions on national law and policy strategy. Therefore, applying data mining techniques to forecast the trend of petitions is the challenging issue for us to solve.

Hence, we propose a framework that applies text and data mining techniques to petitions filed to e-People to solve the problems and challenging issues stated above. In other words, we apply text mining techniques to unstructured data to elicit keywords from petitions and identify groups of petitions with the elicited keywords. Moreover, we apply data mining techniques to the structured data of the identified petition groups to forecast the trend of petitions. To sum up the contributions of applying text and data mining techniques to petitions filed to e-People, we provide an objective and systematic method for analyzing a large number of petitions and predicting their trend while reducing the manual work of petition inspectors. Consequently, we help the government make better decisions on national law and policy strategy on the basis of the trend of petitions forecasted by our approach.

The rest of the paper is structured as follows. In Section 2, we introduce the taxonomy of e-Government, and we review related work on keyword extracting and document clustering in text mining, and on forecasting models with data mining techniques.

¹ www.epeople.go.kr
In Section 3, we explain the framework of our methodology in three subsections: eliciting keywords from petitions; identifying petition groups; and forecasting the trend of petitions. In Section 4, we apply the methodology suggested in Section 3 to 8 groups of petitions filed to e-People, i.e. the online petition portal system constructed by ACRC of the Republic of Korea. In detail, we perform 8-fold validation on the prediction models based on RBFNs and C5.0: the 8 petition groups are divided 8 times, each time into 7 petition groups for the training set and the remaining petition group for the test set. We then discuss the implications of the performance validation results. Finally, in Section 5, we conclude the paper with a discussion of contributions and further research.
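The validation scheme described above, in which each petition group serves once as the test set while the other seven form the training set, can be sketched as follows. This is a minimal illustration only: the function name `leave_one_group_out`, the record layout, and the toy data are assumptions made for the example and do not come from the paper.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_group_out(
    groups: Dict[str, List[dict]],
) -> Iterator[Tuple[str, List[dict], List[dict]]]:
    """Yield (held_out_name, training_records, test_records) once per group."""
    for held_out in groups:
        # The held-out petition group becomes the test set.
        test = list(groups[held_out])
        # All records from the remaining groups become the training set.
        train = [
            record
            for name, records in groups.items()
            if name != held_out
            for record in records
        ]
        yield held_out, train, test


# Hypothetical toy data: 8 petition groups with 3 records each.
toy_groups = {
    f"group_{g}": [{"group": g, "week": w} for w in range(3)]
    for g in range(1, 9)
}

folds = list(leave_one_group_out(toy_groups))
for name, train, test in folds:
    print(name, len(train), len(test))
```

With 8 groups this produces 8 folds, so every model is always evaluated on a petition group it has never seen during training.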

2. Literature reviews

2.1. e-Government and its cases in the Republic of Korea

e-Government refers to the delivery of national or local government information and services via the Internet or other digital means to citizens, businesses, or other governmental agencies (UN & ASPA, 2002; Evans & Yen, 2006). Its cases can be categorized into three groups according to Marchionini, Sanan, and Brabdt (2003).

First, the government to citizen (G2C) service of e-Government provides one-stop, on-line access to information and services for citizens. G2C applications enable citizens to ask questions and receive answers about filing income taxes, paying taxes, renewing driver licenses, petitions, and so on. In the case of the Republic of Korea, the online civil service portal site named government for citizen (G4C²) enables integrated one-site delivery of a wide array of information provided by various administrative agencies. Through the G4C portal site, citizens are able to obtain support related to over 5,000 types of public services. In particular, G4C not only describes the process of each public service but also gives information on the relevant agencies in charge, commissions, necessary materials, and relevant laws and regulations.

Second, the government to business (G2B) service of e-Government deals with businesses, e.g. suppliers, using the Internet and other information and communication technologies. In fact, G2B includes two two-way interactions and transactions: government to business, and business to government (B2G). Two key areas in G2B are e-Procurement and the auctioning of government surpluses. In the Republic of Korea, the public procurement service (PPS), the central procurement agency, established the Korea online e-Procurement system (KONEPS³) as a single window that all public organizations, including central and local governments, can access. All procedures such as procurement requests, bids, contracts, and payments are automated through KONEPS. PPS also offers a one-stop service on bids and contracts, and is connected to 80 external systems such as those of financial institutions. Once registered on KONEPS, companies are allowed to bid for all open tenders and check related bidding information. Currently, about 36,000 public organizations and 170,000 companies use KONEPS.

Third, the government to government (G2G) service of e-Government deals with activities that take place between different government organizations and agencies. Many G2G activities are aimed at improving the efficiency and effectiveness of overall government operations by eliminating redundancy and duplication. G2G is also beneficial in terms of crime detection, homeland security, intergovernmental cooperation, development of emergency response systems, and linking of law enforcement agencies. For example, the government of the Republic of Korea introduced a government work process management system named the On-nara business process system (BPS) to its 54 central agencies.

² www.egov.go.kr
On-nara BPS allows all work processes to be handled on-line, from planning policies to making decisions and sharing the final data produced. Unlike the conventional method, which focused on people and experience, it is a new method focused on systems and knowledge, and it is based on records management and task management. Records management standardizes the process of handling records, and keeps records on all decision-making procedures to increase the clarity and accountability of administration. Task management classifies work by function and objective, and manages work systematically to promote objective-oriented work. In this paper, we are concerned with e-People, which is well known as a G2C service portal system in the Republic of Korea. In brief, the e-People system originates from the big drum called Shinmungo, which was installed by King Taejong about 600 years ago and was beaten by the public to appeal to the king about grievances. Inheriting the ancestral wisdom embodied in Shinmungo, that the public’s voice is the voice of God, the Republic of Korea created the e-People system for open dialogue with the public in order to resolve grievances and provide petition services. In particular, it integrates the petition, proposal, and policy discussion services operated by 303 governmental organizations, including central administrative organizations, local autonomous bodies, and public institutions.

³ www.koneps.go.kr

2.2. Information extraction and document clustering in text mining

Text mining is defined as the process of discovering previously unknown information by automatically extracting information from various unstructured data (Delen & Crossland, 2008). Though text mining is similar to data mining because it is considered a part of the general field of data mining, it differs from data mining mainly in that it is designed to extract information from unstructured data like text, not from structured data like categorical, ordinal, and continuous variables (Yang & Lee, 2005). The benefits of text mining are obvious in areas where a majority of data is stored in some unstructured form. Applications of text mining include information extraction, topic tracking, summarization, categorization, clustering, concept linkage, information visualization, and question answering (Fan, Wallace, Rich, & Zhang, 2006; Weng & Lin, 2003). Among these application areas, we are mainly concerned with information extraction and clustering in this paper.

Firstly, information extraction in text mining generally means term extraction and selection from the given document as the preliminary step of document representation and clustering (Wei, Yang, & Lin, 2008). In particular, term selection has been an important issue as the basis not only for information retrieval but also for further analysis (Aliguliyev, 2009; Chau, Huang, Qin, Zhou, & Chen, 2006; Delen & Crossland, 2008). The commonly used methods for keyword selection are term frequency inverse document frequency (tfidf), chi-square (χ²) statistics, information gain, and latent semantic indexing (LSI). These methods determine a score for each term, and a given number of terms with the highest scores are selected (Guzella & Caminhas, 2009; Sebastiani, 2002). In detail, tfidf is the combination of term frequency (tf) and inverse document frequency (idf), where tf is a measure of how frequently a term occurs in a document, and idf is a measure of how few other documents contain the term (Salton, Wong, & Yang, 1975; Wang, Peng, & Hu, 2006). tfidf aims at balancing the local and the global term occurrences in the document (Aliguliyev, 2009).
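To make the scoring idea concrete, the following sketch computes a tfidf score for each term and keeps a given number of top-scoring terms. It is an illustration under stated assumptions rather than the formula used later in the paper: tf is taken as the term count divided by the largest term count in the document, idf is assumed to be the common form log(N / df), and the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score every term of every tokenized document with tf * idf."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        scores.append({
            # Assumed idf form: log(N / df); other variants exist.
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring terms (ties broken alphabetically)."""
    return sorted(score, key=lambda t: (-score[t], t))[:n]


# Hypothetical toy corpus of three already-tokenized petitions.
docs = [
    ["road", "noise", "road", "repair"],
    ["tax", "refund", "delay"],
    ["road", "tax", "noise", "noise"],
]
scores = tfidf_scores(docs)
print(top_terms(scores[0], 2))
```

In the first toy document, "road" is the most frequent term but also appears in another document, so the rarer term "repair" can outrank it. This is the balance between local and global occurrences that tfidf is meant to capture.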
Information gain estimates the capability of category prediction for each term present in the training documents by calculating the number of information bits of terms in predefined document categories (Yang & Pederson, 1997). The χ² statistic estimates the membership between the terms and the predefined document categories, and terms with membership less than a prefixed threshold are removed. Unlike information gain and χ², LSI is an unsupervised algorithm: it does not consider the dependency between the feature terms and the categories, and it uses the singular value decomposition (SVD) algorithm. LSI performs linear transforms on the original feature terms and obtains new feature terms, which are linear combinations of the original feature terms (Chen, Tsai, & Chan, 2008; Scott, 1990). In this paper, we employ tfidf for keyword selection when extracting keywords from texts, because tfidf is considered the most important and powerful feature for keyword selection (Boley et al., 1999; Roussinov & Chen, 1999; Wei et al., 2008). Next, document clustering in text mining groups similar documents into clusters on the basis of their contents. A document in the resulting clusters shows maximal similarity to those in the same cluster and, at the same time, shares minimal similarity with documents in the other clusters (Wei, Yang, Hsiao, & Cheng, 2006). In detail, after feature extraction and selection, a document is represented as a feature vector jointly defined by the previously selected features, with representation methods such as the binary representation scheme (presence or absence of a keyword in a document), tf, and tfidf. The target documents are then grouped into distinct clusters on the basis of the selected keywords and their respective values in each document. Common clustering methods include partitioning-based clustering, hierarchical clustering, and the Kohonen neural network, a.k.a. self-organizing map (SOM).
A partitioning-based approach partitions a set of documents into multiple non-overlapping clusters, and the k-means clustering method is a commonly used partitioning-based approach. Given n documents, the k-means clustering method first selects k documents as the initial k clusters. It then iteratively assigns each document to the most similar cluster based on the mean value of the documents in each cluster (Boley et al., 1999; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Li, Chung, & Holt, 2008). On the other hand, a hierarchical approach builds a binary clustering hierarchy whose leaf nodes represent the documents to be clustered. A representative hierarchical clustering algorithm is the hierarchical agglomerative clustering (HAC) method, which starts with as many clusters as there are documents, i.e. each cluster contains only one document (Voorhees, 1986). On the basis of a specific intercluster similarity measure of choice (e.g. single link, complete link, group-average link, or Ward’s method), the two most similar clusters are merged to form a new cluster. This merging process continues until either a hierarchy emerges with a single cluster at the top or a termination condition holds, e.g. the intercluster similarity is less than a prespecified threshold (Roussinov & Chen, 1999; Voorhees, 1986). The SOM network is a categorization network developed by Kohonen, and it was originally designed for solving problems that involve tasks such as clustering, visualization, and abstraction (Kiang, Fisher, Chen, Fisher, & Chi, 2009; Kohonen, 1995). The main function of SOM networks is to map the input data from an n-dimensional space to a lower-dimensional plot, usually one- or two-dimensional, while maintaining the original topological relations. The physical locations of points on the map show the relative similarity between the points in the multi-dimensional space. Each node on the map can be considered a cluster itself (Kiang et al., 2009; Roussinov & Chen, 1999). Building on these clustering methods, the two-step clustering approach has recently gained a lot of attention because of its improved performance compared to single-step approaches (Kuo, An, Wang, & Chung, 2006; Kuo, Ho, & Hu, 2002; Strauch et al., 2007; Wu et al., 2006).
Therefore, as the clustering algorithm in this study, we suggest a two-step clustering approach that combines SOM and the k-means clustering method. In other words, the number of clusters, i.e. k in the k-means clustering method, is determined by SOM, and then documents are grouped into k clusters by the k-means clustering method; this is repeated iteratively until no document is left without being grouped into a cluster.
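The k-means step of this two-step approach can be sketched as below. This is only an illustration under stated assumptions: the number of clusters k is passed in directly (in the approach above it would come from SOM, which is not implemented here), squared Euclidean distance stands in for the similarity measure, and the function names and toy vectors are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def k_means(
    docs: List[Vector], k: int, seed: int = 0, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Cluster document vectors into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Select k documents as the initial cluster centers.
    centers = [list(v) for v in rng.sample(list(docs), k)]
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # Assign each document to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(doc, centers[c]))
            for doc in docs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Recompute each center as the mean of its assigned documents.
        for c in range(k):
            members = [doc for doc, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean(members)
    return labels, centers


# Hypothetical toy feature vectors forming two well-separated groups.
toy_docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(toy_docs, k=2)
print(labels)
```

Passing k as a parameter keeps the sketch independent of how the number of clusters is chosen, so a SOM-based estimate or any other estimate could be plugged in.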

2.3. Forecasting using data mining techniques

Data mining, a.k.a. knowledge discovery in databases (KDD), is the process of identifying hidden knowledge, unknown patterns, and new rules from large databases that are potentially useful and ultimately understandable for making crucial decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). It is a discipline of growing interest and importance, and an application area that can provide significant competitive advantage to an organization by exploiting the potential of large data warehouses (Bose & Mahapatra, 2001). Nowadays, data mining software vendors are integrating fundamental data mining capabilities into database engines, so that users can execute data mining tasks in parallel inside the database, which reduces response time (Zhang & Zhou, 2004). Though the two high-level primary goals of data mining in practice tend to be prediction and description, data mining can be classified mainly into four categories according to the tasks it is called upon to accomplish: association rule mining, classification and prediction, clustering analysis, and sequential pattern and time-series mining (Fayyad et al., 1996; Han & Kamber, 2001; Larose, 2005). Moreover, data mining techniques can generally be divided into two categories: statistics and machine learning (Chen, Sakaguchi, & Frolick, 2000). In contrast to statistical modeling methods, which require the data set to conform to rigid distribution criteria, pattern discovery algorithms based on machine learning techniques impose fewer restrictions and produce patterns that are easy to understand. Therefore, they are gaining wide popularity in data mining applications, and the major categories of machine learning techniques are artificial neural networks (ANNs), rule induction (RI), case-based reasoning (CBR), genetic algorithms, and inductive logic programming (ILP; Bose & Mahapatra, 2001). Among the variety of data mining techniques, we mainly focus on two commonly used techniques, i.e. ANNs and RI, for building forecasting models in this paper, because they have been used most commonly for classification and prediction (Bose & Mahapatra, 2001).

First, ANNs are computer models built to emulate the human pattern recognition function through a similar parallel processing structure of multiple inputs (Zhang & Zhou, 2004). ANNs have been found to be better than various conventional methods (Alfaro, García, Gámez, & Elizondo, 2008; Lam, 2004; Sun, Choi, Au, & Yu, 2008), and have been widely used for forecasting and analyzing real-world problems such as investigating long-term tidal predictions (Lee, 2004), forecasting the price of index futures and stock market returns (Enke & Thawornwong, 2005; Tsaih, Hsu, & Lai, 2008), improving customer satisfaction (Deng, Chen, & Pei, 2007), predicting flank wear in drills (Panda, Chakraborty, & Pal, 2007), and enhancing job completion time prediction in a semiconductor fabrication factory (Chen, 2007). Though there are numerous types of ANNs, the multi-layered perceptron (MLP) model has been the one commonly used for forecasting, prediction, and general decision making. The most popular is the multi-layer feed-forward neural network (FFNN) with the back-propagation learning algorithm (Akhlaghi & Kompany-Zareh, 2005). However, the MLP model has some drawbacks: its learning process is time-consuming and it tends to get stuck at local minima (Yu, Lai, & Wang, 2008). On the other hand, radial basis function networks (RBFNs) have recently been noted as a potential alternative because they offer advantages such as robustness to noisy data when compared with FFNNs (Derks, Pastor, & Buydens, 1995; Walczak & Massart, 2000). As a matter of fact, many pattern recognition experiments show that RBFNs are superior to other neural network approaches in the following senses, according to Sahin (1997). First, RBFNs model nonlinear data effectively. Second, they can be trained in one stage rather than through an iterative process as in MLP, and they learn the given application quickly. Third, RBFNs produce classification accuracies from 5% to 10% higher than those produced by the back-propagation algorithm. It is also indicated that RBFNs perform better than conventional kernel classifiers. Fourth, RBFNs are quite successful at identifying regions of sample data not in any known class because they use a non-monotonic transfer function based on the Gaussian density function. Therefore, we use RBFNs as the ANN approach in this paper.

Second, RI models belong to the logical, pattern distillation based approaches of data mining. Based on data sets, RI produces a set of if–then rules to represent significant patterns and create prediction models. Such models are fully transparent and provide complete explanations of their predictions (Zhang & Zhou, 2004). As one commonly used and well-known type of RI, decision tree induction creates a decision tree or a set of decision rules from training examples with a known classification, and it is usually used for classification or prediction (Chou, 1991). When a decision tree-based model is applied to data, each record flows through the tree along a path determined by a series of tests until the record reaches a leaf or terminal node of the tree. There, it is given a final result, either a discrete or a continuous value. A decision tree takes the form of a top–down tree structure where decisions are made at each node, and it may be translated into a set of rules. The resulting decision tree is then applied to a test data set to evaluate its accuracy with new examples. When a decision tree is overfitted to a training data set, its classification accuracy with new data may diminish. The tree must then be pruned to eliminate overfitting before it is deployed in a real-life application (Bose & Mahapatra, 2001). In the literature, there are various decision tree algorithms such as chi-squared automatic interactive detector (CHAID), classification and regression trees (CART), interactive dichotomizer version 3 (ID3), C4.5, and C5.0. These produce decision trees that differ from one another in the following ways: how many splits are allowed at each level of the decision tree, how those splits are chosen when the decision tree is built, and how the decision tree growth is limited to prevent overfitting (Berry & Linoff, 2000; Ture, Tokatli, & Kurt, 2009). According to Bose and Mahapatra (2001), C4.5 is the most popular of these decision tree algorithms. However, we select the C5.0 algorithm for this study because C5.0 offers improvements over C4.5, which is itself an extension and revision of the ID3 algorithm (Wu et al., 2006). C5.0 is especially effective in processing huge data sets. Since it uses a boosting method to increase modeling accuracy, it is also known as boosting trees. It is also much faster and more memory efficient than C4.5 (Chang & Chen, 2009).

3. Methodology

In order to make better decisions on national law and policy strategy by forecasting the trend of petitions as the voice of the nation, the steps of our approach of applying text and data mining techniques to petitions filed to e-People are summarized in Fig. 1. First, we elicit keywords from petition_it, which is the ith petition filed to e-People at time t. Then, we identify the petition group of petition_it either by classifying petition_it into one of the petition groups discovered previously at time (t − 1) or by searching petitions using the keywords elicited from petition_it. In the end, we forecast the trend of petition_it using the petition information of the petition group identified for petition_it.

Fig. 1. The framework of forecasting the trend of petition_it suggested in this paper. [Flowchart elements: Petition_it; Elicit keywords from petition_it; Keywords of petition_it; Identify petition group of petition_it; Petition group of petition_it; Forecasting the trend of petition_it; The trend of petition_it.]

Fig. 2. Steps to elicit keywords from petition_it. [Flowchart elements: Petition_it; Petitions (XML); Collect words from … and … of petition_it; W0_it, words appearing in petition_it; Filter out stop words from W0_it; Stop words dictionary; W1_it, words in petition_it without stop words; Select words with high scores, i.e. tfidf > TFIDF and ths > THS, from W1_it; W2_it, words with high scores; Refine words from W2_it into the set of keywords for petition_it by petition investigators’ reviews; K_it, keywords of petition_it; Topic map of keywords (XML).]

3.1. Elicit keywords from petition_it

Once an applicant files a petition against the government sectors to e-People, the filed petition is stored in the database in XML format. To identify the petition group of the filed petition, and subsequently to forecast its trend using the petition information of the identified petition group, the keywords of the filed petition are required first of all. Fig. 2 shows the four steps to elicit the keywords of petition_it, the ith petition filed by an applicant at the time of t. First, all words appearing in the <title>…</title> element and in … of petition_it (XML) are collected into W0_it, the set of words appearing in petition_it. Second, stop words are filtered out of W0_it on the basis of the stop words dictionary, and the remaining words of petition_it form W1_it. Third, scores for the words in W1_it are calculated using tfidf and ths, and the words in W1_it with high scores, i.e. tfidf > TFIDF and ths > THS, are selected into W2_it. Although there are different formulas for tfidf, the one used in this paper is adopted from Wang et al. (2006). Hence, the tfidf of the jth word appearing in the <title>…</title> element and in … of petition_it, i.e. word_j ∈ W1_it, is defined as

\[ tfidf_{ijt} = tf_{ijt} \times idf_{ijt}. \tag{1} \]

Here, tf_ijt is the normalized frequency of word_j ∈ W1_it, and it is defined as

\[ tf_{ijt} = \frac{n_{ijt}}{\max_{j \in W1_{it}} n_{ijt}}, \tag{2} \]

where n_ijt is the frequency of word_j ∈ W1_it. In addition, idf_ijt is the inverse document frequency of word_j ∈ W1_it, and it is defined as

\[ idf_{ijt} = \log\left(\frac{q_t}{c_{ijt}}\right), \tag{3} \]

where q_t is the total number of petitions stored in the database (XML) at the time of t and c_ijt is the number of those petitions in which word_j ∈ W1_it appears. On the other hand, ths_ijt indicates whether word_j ∈ W1_it exists in <title>…</title> of petition_it, and it is defined as

\[ ths_{ijt} = \begin{cases} 1, & \text{if } word_j \in W1_{it} \text{ appears in } \langle\text{title}\rangle \ldots \langle/\text{title}\rangle \text{ of } petition_{it} \\ 0, & \text{otherwise.} \end{cases} \tag{4} \]

In the end, the high-scoring words in W2_it are refined into K_it, i.e. the set of keywords for petition_it, through petition investigators' reviews with reference to the keywords predefined in

Fig. 3. Discovering petition groups at the time of (t − 1), i.e. PG(t−1): Petitions(t−1) (XML) and the topic map of keywords (XML) yield the matrix A(a_ij); SOM gives the number of clusters and k-means clustering produces petition clusters; consistent petition clusters are selected, while petitions from inconsistent clusters form the matrix A′(a_ij) and are clustered again until the number of clusters equals 1; the consistent clusters are compared with PG(t−2) to discover PG(t−1).
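The scoring in Eqs. (1)–(4) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used in this study: each petition is assumed to be a dictionary with the hypothetical keys "title" and "body" holding already tokenized words, the logarithm is taken as the natural logarithm, and the stop words and thresholds are invented for the example.

```python
import math
from collections import Counter

def score_words(petitions, index, stop_words):
    """Return {word: (tfidf, ths)} for the words of petitions[index]."""
    target = petitions[index]
    # Steps 1-2: collect all words (W0) and drop stop words (W1).
    w1 = [w for w in target["title"] + target["body"] if w not in stop_words]
    counts = Counter(w1)
    if not counts:
        return {}
    max_count = max(counts.values())
    q_t = len(petitions)  # total number of stored petitions
    scores = {}
    for word, n in counts.items():
        tf = n / max_count  # Eq. (2)
        # c: number of petitions that contain the word (the target always does)
        c = sum(1 for p in petitions if word in p["title"] or word in p["body"])
        idf = math.log(q_t / c)  # Eq. (3), natural log assumed
        ths = 1 if word in target["title"] else 0  # Eq. (4)
        scores[word] = (tf * idf, ths)  # Eq. (1)
    return scores

def select_candidates(scores, tfidf_min, ths_min):
    """Step 3: keep the words with tfidf > TFIDF and ths > THS (W2)."""
    return {w for w, (tfidf, ths) in scores.items()
            if tfidf > tfidf_min and ths > ths_min}

# Hypothetical toy data, invented for this example.
petitions = [
    {"title": ["beef", "import"], "body": ["the", "beef", "from", "us", "is", "unsafe"]},
    {"title": ["school", "health"], "body": ["a", "health", "subject", "in", "school"]},
    {"title": ["school", "founding"], "body": ["the", "international", "school"]},
]
stop_words = {"the", "from", "is", "a", "in"}
scores = score_words(petitions, 0, stop_words)
w2 = select_candidates(scores, tfidf_min=0.6, ths_min=0)
print(sorted(w2))  # ['beef']
```

With ths restricted to 0 or 1, a threshold THS = 0 simply requires a candidate word to appear in the title; the final refinement into K_it remains a manual review step and is not modeled here.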

Fig. 4. Identifying the petition group of petition_it: the keywords of petition_it are used to search the petition groups in PG(t−1) under petition investigators' review; if a matching petition group exists, petition_it is added to it; otherwise, petitions similar to petition_it are searched in Petitions(t−1) (XML) by keyword and integrated with petition_it as a new petition group.

Fig. 5. Steps to forecast the trend of petition_it. Phase 1, training phase at the time of (t − 1): feature values (from petitions) and trend values (from daily reports) of the petition groups in PG(t−1) form the matrix B, which is used to train prediction models with RBFNs and C5.0. Phase 2, forecasting phase at the time of t: feature values of the petition group of petition_it form the matrix C, from which the prediction models output tv_kt ∈ {0, 1}, the predicted trend value of petition_it.

Table 1. Formulas to get feature values at the time of t for the kth petition group in PG_t. The index 1, 2, or 3 in each feature name denotes the short-term period s, the midterm period m, or the long-term period l, respectively (a).

Feature | Short-term period, s | Midterm period, m | Long-term period, l
Absolute quantity (b) | \( aq_{k1t} = \sum_{i=1}^{s} h_{k(t-i+1)} \) (6) | \( aq_{k2t} = \sum_{i=1}^{m} h_{k(t-i+1)} \) (7) | \( aq_{k3t} = \sum_{i=1}^{l} h_{k(t-i+1)} \) (8)
Relative quantity (c) | \( rq_{k1t} = aq_{k1t} / \sum_{k=1}^{r} aq_{k1t} \) (9) | \( rq_{k2t} = aq_{k2t} / \sum_{k=1}^{r} aq_{k2t} \) (10) | \( rq_{k3t} = aq_{k3t} / \sum_{k=1}^{r} aq_{k3t} \) (11)
Absolute increase and decrease of quantity | \( ad_{k1t} = aq_{k1t} - aq_{k1(t-1)} \) (12) | \( ad_{k2t} = aq_{k2t} - aq_{k2(t-1)} \) (13) | \( ad_{k3t} = aq_{k3t} - aq_{k3(t-1)} \) (14)
Relative increase and decrease of quantity | \( rd_{k1t} = aq_{k1t} / aq_{k1(t-1)} \) (15) | \( rd_{k2t} = aq_{k2t} / aq_{k2(t-1)} \) (16) | \( rd_{k3t} = aq_{k3t} / aq_{k3(t-1)} \) (17)
Recency | \( re_{k1t} = ( \sum_{i=1}^{s} i \times h_{k(t-i+1)} ) / aq_{k1t} \) (18) | \( re_{k2t} = ( \sum_{i=1}^{m} i \times h_{k(t-i+1)} ) / aq_{k2t} \) (19) | \( re_{k3t} = ( \sum_{i=1}^{l} i \times h_{k(t-i+1)} ) / aq_{k3t} \) (20)
Frequency of appearance (d) | \( fa_{k1t} = \sum_{i=1}^{s} v_{k(t-i+1)} / s \) (21) | \( fa_{k2t} = \sum_{i=1}^{m} v_{k(t-i+1)} / m \) (22) | \( fa_{k3t} = \sum_{i=1}^{l} v_{k(t-i+1)} / l \) (23)
Variance of quantity | \( vq_{k1t} = \sum_{i=1}^{s} ( h_{k(t-i+1)} - re_{k1t} )^{2} / (s-1) \) (24) | \( vq_{k2t} = \sum_{i=1}^{m} ( h_{k(t-i+1)} - re_{k2t} )^{2} / (m-1) \) (25) | \( vq_{k3t} = \sum_{i=1}^{l} ( h_{k(t-i+1)} - re_{k3t} )^{2} / (l-1) \) (26)
Ages' weight in applicants (e), for j = 1, …, 7 | \( aw^{(j)}_{k1t} = n(Age^{1}_{jkt}) / aq_{k1t} \) if \( aq_{k1t} \neq 0 \), 0 otherwise (27) | \( aw^{(j)}_{k2t} = n(Age^{2}_{jkt}) / aq_{k2t} \) if \( aq_{k2t} \neq 0 \), 0 otherwise (28) | \( aw^{(j)}_{k3t} = n(Age^{3}_{jkt}) / aq_{k3t} \) if \( aq_{k3t} \neq 0 \), 0 otherwise (29)
Male's weight in applicants (f) | \( mw_{k1t} = n(Male^{1}_{kt}) / aq_{k1t} \) if \( aq_{k1t} \neq 0 \), 0 otherwise (30) | \( mw_{k2t} = n(Male^{2}_{kt}) / aq_{k2t} \) if \( aq_{k2t} \neq 0 \), 0 otherwise (31) | \( mw_{k3t} = n(Male^{3}_{kt}) / aq_{k3t} \) if \( aq_{k3t} \neq 0 \), 0 otherwise (32)
Location's weight in applicants (g), for j = 1, …, 7 | \( lw^{(j)}_{k1t} = n(Location^{1}_{jkt}) / aq_{k1t} \) if \( aq_{k1t} \neq 0 \), 0 otherwise (33) | \( lw^{(j)}_{k2t} = n(Location^{2}_{jkt}) / aq_{k2t} \) if \( aq_{k2t} \neq 0 \), 0 otherwise (34) | \( lw^{(j)}_{k3t} = n(Location^{3}_{jkt}) / aq_{k3t} \) if \( aq_{k3t} \neq 0 \), 0 otherwise (35)
Institutions' weight in petitions (h), for j = 1, …, N_I | \( iw^{(j)}_{k1t} = n(Institution^{1}_{jkt}) / aq_{k1t} \) if \( aq_{k1t} \neq 0 \), 0 otherwise (36) | \( iw^{(j)}_{k2t} = n(Institution^{2}_{jkt}) / aq_{k2t} \) if \( aq_{k2t} \neq 0 \), 0 otherwise (37) | \( iw^{(j)}_{k3t} = n(Institution^{3}_{jkt}) / aq_{k3t} \) if \( aq_{k3t} \neq 0 \), 0 otherwise (38)
Foreigner's weight in applications (i) | \( fw_{k1t} = n(Foreigner^{1}_{kt}) / aq_{k1t} \) if \( aq_{k1t} \neq 0 \), 0 otherwise (39) | \( fw_{k2t} = n(Foreigner^{2}_{kt}) / aq_{k2t} \) if \( aq_{k2t} \neq 0 \), 0 otherwise (40) | \( fw_{k3t} = n(Foreigner^{3}_{kt}) / aq_{k3t} \) if \( aq_{k3t} \neq 0 \), 0 otherwise (41)

(a) In this paper, s, m, and l are set as 2, 4, and 8 days, respectively.
(b) h_kt is the number of petitions filed at the time of t that the kth petition group in PG_t contains.
(c) r is the number of petition groups in PG_t.
(d) v_kt = 1 if h_kt > 0, and 0 otherwise.
(e) Age^i_jkt is the set of petitions that belong to the kth petition group in PG_t as well as the jth age group for the times from (t − s + 1) to t for i = 1, from (t − m + 1) to t for i = 2, and from (t − l + 1) to t for i = 3. In this paper, petitions are divided according to applicants' ages into seven groups: ages ≤ 10 for j = 1, 10 < ages ≤ 20 for j = 2, 20 < ages ≤ 30 for j = 3, 30 < ages ≤ 40 for j = 4, 40 < ages ≤ 50 for j = 5, 50 < ages ≤ 60 for j = 6, and 60 < ages for j = 7.
(f) Male^i_kt is the set of male applicants' petitions that belong to the kth petition group in PG_t for the times from (t − s + 1) to t for i = 1, from (t − m + 1) to t for i = 2, and from (t − l + 1) to t for i = 3.
(g) Location^i_jkt is the set of petitions that belong to the kth petition group in PG_t as well as the jth location group for the times from (t − s + 1) to t for i = 1, from (t − m + 1) to t for i = 2, and from (t − l + 1) to t for i = 3. In this paper, the locations of applicants are divided into seven groups by referring to the first three numbers in the zip codes included in petitions: Seoul city for j = 1, Gangwon province for j = 2, Daejeon and Chungchung province for j = 3, Incheon and Kyunggi province for j = 4, Gwangjoo and Jeonbook province for j = 5, Pusan city, Ulsan city, Kyungnam province, and Jeju island for j = 6, and Daegu city and Kyungsang province for j = 7.
(h) Institution^i_jkt is the set of petitions that belong to the kth petition group in PG_t as well as the jth institution group for the times from (t − s + 1) to t for i = 1, from (t − m + 1) to t for i = 2, and from (t − l + 1) to t for i = 3. In this paper, the institutions to which petitions are related are divided into N_I groups, thereby j = 1, …, N_I. We consider fifty Korean institutions, i.e. N_I = 50.
(i) Foreigner^i_kt is the set of foreigners' petitions that belong to the kth petition group in PG_t for the times from (t − s + 1) to t for i = 1, from (t − m + 1) to t for i = 2, and from (t − l + 1) to t for i = 3.
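As a worked illustration of the quantity-based rows of Table 1, the following Python sketch computes aq, ad, rd, re, fa, and vq for one petition group and one period length, plus rq across groups. It is a sketch under assumptions, not the code used in this study: the daily counts are held in a 0-based list, the zero-denominator guards for rd and re are additions that Table 1 does not specify, and vq subtracts the recency value exactly as the table is printed. The function names and the toy counts are hypothetical.

```python
def window(h, t, length):
    """Counts h_k(t-i+1) for i = 1..length, most recent day first."""
    return [h[t - i + 1] for i in range(1, length + 1)]

def group_features(h, t, length):
    """Quantity-based features of Table 1 for one group and one period length.

    h is a list of daily petition counts of the group (0-based day index).
    """
    if t < length or length < 2:
        raise ValueError("need length >= 2 and length + 1 days of history up to t")
    cur = window(h, t, length)
    prev = window(h, t - 1, length)
    aq = sum(cur)                              # Eqs. (6)-(8)
    aq_prev = sum(prev)
    ad = aq - aq_prev                          # Eqs. (12)-(14)
    rd = aq / aq_prev if aq_prev else 0.0      # Eqs. (15)-(17), zero guard assumed
    re = (sum(i * x for i, x in enumerate(cur, 1)) / aq) if aq else 0.0  # Eqs. (18)-(20)
    fa = sum(1 for x in cur if x > 0) / length  # Eqs. (21)-(23)
    vq = sum((x - re) ** 2 for x in cur) / (length - 1)  # Eqs. (24)-(26), as printed
    return {"aq": aq, "ad": ad, "rd": rd, "re": re, "fa": fa, "vq": vq}

def relative_quantity(aqs):
    """Eqs. (9)-(11): share of each group in the total quantity of all groups."""
    total = sum(aqs)
    return [aq / total if total else 0.0 for aq in aqs]

# Hypothetical daily counts of one petition group over six days.
h = [0, 1, 3, 2, 0, 4]
features = group_features(h, t=5, length=2)  # short-term period s = 2
print(features)
print(relative_quantity([4, 12]))
```

Calling the same function with length 4 and 8 would give the midterm and long-term variants; the demographic weights (aw, mw, lw, iw, fw) are simple ratios of subset sizes to aq and are omitted here.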

the topic map of keywords (see Fig. 8 for an example of the topic map of keywords). Afterward, keywords newly found while constructing K_it for petition_it are added to the topic map of keywords.

3.2. Identify the petition group of petition_it

Before identifying the petition group of petition_it, we need the set of petition groups previously identified at the time of (t − 1), i.e. PG(t−1), as a preliminary. Fig. 3 describes how PG(t−1) is constructed from Petitions(t−1) = {petition_i(t−1) | 1 ≤ i ≤ p, where p is the number of petitions that are filed to e-People and stored in the database in XML format at the time of (t − 1)}, taking into account the petition groups in PG(t−2) previously identified at the time of (t − 2). In detail, the steps in Fig. 3 are as follows. With the keywords elicited from the petitions in Petitions(t−1) and their topic map, we count the number of times each keyword appears in

Fig. 6. The matrix B.

Fig. 7. The matrix C.

Fig. 8. Topic map of keywords suggested in this paper and its example entry, i.e. beef, with associations labeled English, dis/abbreviation, synonym, and singular/plural.

the <title>…</title> element and in … of petition_i(t−1). This leads to the matrix A defined as

\[ A(a_{ij})_{(p \times q)} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pq} \end{bmatrix}, \tag{5} \]

where p is the number of petitions in Petitions(t−1), q is the number of keywords elicited from the petitions in Petitions(t−1), and a_ij is the number of times the jth keyword appears in the ith petition of Petitions(t−1), i.e. petition_i(t−1). With the matrix A, the petitions in Petitions(t−1) are grouped into clusters by the k-means clustering method after the number of clusters has been determined by SOM. Among the resulting petition clusters, petition inspectors select the consistent ones as candidate petition groups to investigate. The inconsistent petition clusters are dissolved, and their petitions form the matrix A′, whose elements are defined in the same way as those of the matrix A. The matrix A′ is then subjected to the clustering analyses again. In this way, consistent petition clusters are obtained by iterating between SOM and the k-means clustering method until SOM finds no further clusters, i.e. the number of clusters equals 1 (see Fig. 3). PG(t−1) is then constructed by petition inspectors, who compare the consistent petition clusters with the petition groups in PG(t−2) previously identified at the time of (t − 2). In other words, if petition inspectors judge a consistent petition cluster to be a newly found petition group at the time of (t − 1), it is added to PG(t−2) as a new petition group; otherwise, it is integrated into the petition group in PG(t−2) to which it belongs.

With the keywords of petition_it and the petition groups of PG(t−1), we then identify the petition group of petition_it through the steps described in Fig. 4. In detail, we use the keywords in K_it to search PG(t−1) for petition groups to which petition_it possibly belongs. If no petition group in PG(t−1) matches petition_it, petitions related to petition_it are searched directly in the database of petitions (XML) using the keywords of petition_it.
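The construction of the matrix A and one pass of k-means clustering over its rows can be sketched as follows. This is a simplified illustration, not the Clementine™ workflow of this study: the number of clusters k is passed in directly (here the SOM step that determines it is not modeled), the first k rows are taken as initial centroids, and the petitions and keywords are invented toy data. The consistency check by petition inspectors is manual and is not represented.

```python
def build_matrix_a(petitions, keywords):
    """Row i, column j: number of times keyword j appears in petition i (Eq. 5)."""
    return [[words.count(k) for k in keywords] for words in petitions]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(rows, k, iterations=20):
    """Plain Lloyd's algorithm; the first k rows serve as initial centroids."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(r, centroids[c]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical tokenized petitions and keywords.
keywords = ["beef", "school", "health"]
petitions = [
    ["beef", "import", "beef"],
    ["school", "health", "subject", "health"],
    ["us", "beef", "safety"],
    ["health", "school"],
]
matrix_a = build_matrix_a(petitions, keywords)
labels = k_means(matrix_a, k=2)
print(matrix_a)  # [[2, 0, 0], [0, 1, 2], [1, 0, 0], [0, 1, 1]]
print(labels)    # [0, 1, 0, 1]
```

Taking the first k rows as starting centroids keeps the example deterministic; a practical run would need a more careful initialization, since identical starting rows would leave a cluster empty.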
The petitions found by searching with the keywords of petition_it are integrated with petition_it as a new petition group. Thus, the petition group of petition_it is identified, and the petition information of this group is later used for forecasting the trend of petition_it in Section 3.3.

3.3. Forecasting the trend of petition_it

In Section 3.3, we start from the assumption that the feature values of the petition group of petition_it are predictive of its trend values. As described in Fig. 5, we suggest two phases to forecast the trend of petition_it: a training phase at the time of (t − 1) and a forecasting phase at the time of t.

In the training phase at the time of (t − 1), prediction models are built by training RBFNs and C5.0. To build these prediction models, training data sets are collected from the petitions stored in the database (XML) for each petition group in PG(t−1). The training data sets are composed of two parts: feature values as input variables and a trend value as the output variable. To explain the two parts in detail, let us assume that the filing times of the petitions in all petition groups of PG(t−1) range from t0 to (t − 1) and that l is the long-term period. First, the feature values over the times from ts = (t0 + l − 1) to (t − 1) for the kth petition group in PG(t−1), k = 1, …, r, are calculated on the basis of the formulas in Table 1; ts is the first time at which a full long-term window of l days is available. In addition, the trend values over the times from ts to (t − 1) for the kth petition group in PG(t−1), k = 1, …, r, are collected from the daily reports made by petition inspectors. The trend value at the time t′ ∈ {ts, …, (t − 1)} for the kth petition group, i.e. tv_kt′, is equal to 'H', which indicates that the kth petition group is highly likely to become a nationwide matter, if any petition in the kth petition group appears in the daily report written at the time of t′. Otherwise, tv_kt′ is equal to 'L'. The resulting sets of feature values and trend values for each petition group in PG(t−1) over the times from ts to (t − 1) form the matrix B shown in Fig. 6. The matrix B is used as the training data set to build prediction models based on RBFNs and C5.0 that forecast the trend of petition_it at the time of t. In the forecasting phase at the time of t, with the petition information of the petition group identified for petition_it in Section 3.2 and the prediction models built using the matrix B in the training

Table 3. The result of 8 fold validation.

Petition group put into testing phase (a) | Estimated accuracy in training phase (%), RBFNs | Estimated accuracy in training phase (%), C5.0 | Estimated accuracy in testing phase (%), RBFNs | Estimated accuracy in testing phase (%), C5.0
Group 1: founding an international middle school | 83.87 | 97.65 | 69.41 | 72.94
Group 2: founding a health subject in schools | 84.87 | 93.32 | 62.35 | 50.59
Group 3: the illegal websites speculating gambling spirit | 82.69 | 97.48 | 84.71 | 62.35
Group 4: beef imported from US | 84.71 | 97.48 | 77.65 | 75.29
Group 5: the restriction on selling real estate in the metropolitan area | 83.36 | 94.96 | 69.41 | 78.82
Group 6: the alternative-day-no-driving system | 84.37 | 96.97 | 79.35 | 69.57
Group 7: the responsible surveillance system | 83.35 | 97.98 | 86.96 | 86.96
Group 8: melamine | 82.69 | 97.98 | 85.88 | 85.88
Maximum estimated accuracy | 84.87 | 97.98 | 86.96 | 86.96
Minimum estimated accuracy | 82.69 | 93.32 | 62.35 | 50.59
Average | 83.74 | 96.73 | 76.96 | 72.80

(a) One among 8 petition groups is selected for the testing phase while the others are used to build up prediction models in the training phase.

Table 2. The outline of 8 petition groups.

Petition group | Representative keywords in Korean | The number of online petitions
Group 1: founding an international middle school | … | 13
Group 2: founding a health subject in schools | … | 383
Group 3: the illegal websites speculating gambling spirit | … | 30
Group 4: beef imported from US | … | 8
Group 5: the restriction on selling real estate in the metropolitan area | … | 1250
Group 6: the alternative-day-no-driving system | … | 257
Group 7: the responsible surveillance system | … | 6
Group 8: melamine | … | 130

phase at the time of (t − 1), we predict the trend value of petition_it. To do so, we form the matrix C shown in Fig. 7 by getting the feature values of the petition group of petition_it at the time of t. We then apply the matrix C as input variables to the prediction models built using the matrix B in the training phase at the time of (t − 1). As a result, the trend value of the petition group of petition_it is decided between 'L' and 'H' as the output variable, and this is taken as the trend of petition_it. Afterward, petition inspectors mention petition_it in the daily report at the time of t if the predicted trend value for its petition group is 'H'. Moreover, after the time of t, petition inspectors compare the prediction for the petition group of petition_it with its actual trend value and correct the predicted trend value if necessary. This feedback improves the precision of the prediction models, because the feature values and the corrected trend value of the petition group of petition_it at the time of t are used as training data when the prediction models are rebuilt after the time of t.

Fig. 9. (a) Actual trend values for the 8 petition groups from daily reports by petition managers, and estimated trend values for the 8 petition groups produced by prediction models based on (b) RBFNs and (c) C5.0.
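The two phases of Section 3.3 can be sketched in Python as below. Note the substitution: neither RBFNs nor C5.0 is implemented here; a one-rule decision stump stands in for the prediction model purely to show how rows of the matrix B (feature values plus a trend label) are used for training and how a row of the matrix C is then scored. The labeling rule follows the text ('H' if any petition of the group appears in the daily report of that day). All names, petition identifiers, and feature values are hypothetical.

```python
def trend_label(group_petition_ids, daily_report_ids):
    """'H' if any petition of the group appears in that day's report, else 'L'."""
    return "H" if set(group_petition_ids) & set(daily_report_ids) else "L"

def train_stump(matrix_b):
    """Stand-in model: choose the (feature, threshold) rule with fewest errors.

    matrix_b is a list of (feature_vector, label) rows. The rule predicts 'H'
    when the chosen feature exceeds the threshold, otherwise 'L'.
    """
    best = None
    for f in range(len(matrix_b[0][0])):
        for threshold in sorted({x[f] for x, _ in matrix_b}):
            errors = sum(("H" if x[f] > threshold else "L") != y
                         for x, y in matrix_b)
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]

def predict(stump, features):
    """Apply the trained rule to one row of the matrix C."""
    f, threshold = stump
    return "H" if features[f] > threshold else "L"

# Phase 1 (training at t - 1): hypothetical rows of matrix B with two
# features per row, e.g. an absolute quantity and a frequency of appearance.
matrix_b = [
    ([1, 0.5], "L"),
    ([2, 0.5], "L"),
    ([6, 1.0], "H"),
    ([9, 1.0], "H"),
    ([3, 1.0], "L"),
]
stump = train_stump(matrix_b)

# Phase 2 (forecasting at t): one hypothetical row of matrix C.
print(stump)                       # (0, 3)
print(predict(stump, [7, 1.0]))    # H
```

The feedback loop described in the text would correspond to appending the row of the matrix C, with its label corrected by petition inspectors, to the matrix B before the next training run.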

4. Application to petitions filed to e-People

For this section, we collected 4,217 petitions filed to e-People over the 92 days from July 1, 2008 to September 30, 2008. After setting July 1, 2008 as t = 1 and September 30, 2008 as t = 92, we applied the three subsections of Section 3 to these petitions as follows. Following Section 3.1, we went through the four steps and elicited 46 keywords from the 4,217 petitions filed from t = 1 to t = 92, using the topic map described in Fig. 8. Then, we identified 8 petition groups by going through the steps explained in Section 3.2 with the help of Clementine™ as the analysis tool. These 8 petition groups are outlined in Table 2. On the basis of the petition information of the 8 identified petition groups, we performed the 8 fold validation on the prediction models constructed through the steps suggested in Section 3.3. In detail, we first computed the feature values and collected the trend values for 7 petition groups over the times from t = 8 to t = 92, because the long-term feature values at the time of t require the 7 preceding days according to Table 1. We used them to build prediction models based on RBFNs and C5.0 using Clementine™. Here, the prediction models based on RBFNs were set to prevent overtraining and to stop after 1000 cycles. C5.0 was set with 10 trial boosting, 10 fold cross validation, global pruning, windowing attributes, and stopping on 1000 cycles. Thereupon, we applied the constructed prediction models to forecast the trend value of the remaining petition group at each time t over the 85 days from t = 8 to t = 92. These processes were repeated for each petition group. The estimated accuracies resulting from the 8 fold validation with both types of prediction models, i.e. RBFNs and C5.0, are put together in Table 3. According to Table 3, the prediction models based on RBFNs reached estimated accuracies of 84.87% at the maximum, 82.69% at the minimum, and 83.74% on average in the training phase, and 86.96% at the maximum, 62.35% at the minimum, and 76.96% on average in the testing phase. The prediction models based on C5.0 reached 97.98% at the maximum, 93.32% at the minimum, and 96.73% on average in the training phase, while their estimated accuracies on the testing data sets ranged from 50.59% to 86.96% with an average of 72.80%. In addition, the trend values predicted over the times from t = 8 to t = 92 for the 8 petition groups are plotted in Fig. 9 together with the actual trend values taken from the daily reports by petition inspectors. From Fig. 9, we found that for three petition groups, i.e. petition groups 2, 3, and 5, the prediction models turned the predicted trend value from 'L' to 'H' earlier than petition inspectors manually changed their evaluation from 'L' to 'H'.
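The validation scheme described above, in which each petition group is held out once while the models are trained on the other groups, can be sketched as follows. The sketch only shows the hold-out loop and the accuracy computation; the majority-label model is a placeholder assumption and not RBFNs or C5.0, and the group names and rows are invented.

```python
def leave_one_group_out(data, train, predict):
    """Hold each group out once and report its testing accuracy in percent.

    data maps a group name to a list of (features, label) rows; train and
    predict are the model-specific functions supplied by the caller.
    """
    accuracies = {}
    for held_out, test_rows in data.items():
        train_rows = [row for group, rows in data.items()
                      if group != held_out for row in rows]
        model = train(train_rows)
        hits = sum(predict(model, x) == y for x, y in test_rows)
        accuracies[held_out] = 100.0 * hits / len(test_rows)
    return accuracies

def train_majority(rows):
    """Placeholder model: remember the most frequent training label."""
    labels = [y for _, y in rows]
    return max(sorted(set(labels)), key=labels.count)

def predict_majority(model, features):
    """Placeholder prediction: always return the remembered label."""
    return model

# Hypothetical rows for three petition groups.
data = {
    "group A": [([1], "L"), ([2], "L"), ([1], "L"), ([5], "H")],
    "group B": [([1], "L"), ([1], "L")],
    "group C": [([7], "H"), ([8], "H"), ([1], "L")],
}
accuracies = leave_one_group_out(data, train_majority, predict_majority)
print(accuracies)
```

Because whole groups are held out, the testing accuracy measures how well a model transfers to a petition group it has never seen, which is a stricter setting than splitting the rows of all groups at random.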
In detail, for petition group 2, i.e. the petitions related to founding a health subject in schools, the prediction model based on C5.0 turned the trend value from 'L' to 'H' 19 days earlier than petition inspectors did manually, according to Fig. 10. For petition group 3, i.e. the petitions related to the illegal websites speculating gambling spirit, Fig. 11 shows that the prediction model based on C5.0 changed the trend value from 'L' to 'H' 71 days earlier than petition inspectors

Fig. 10. The trend value of petition group 2 changed from 'L' to 'H' at the time of (a) t = 36 according to the actual trend values from daily reports by petition managers, (b) t = 59 according to the trend values estimated by prediction models based on RBFNs, and (c) t = 17 according to the trend values estimated by prediction models based on C5.0.

Fig. 11. The trend value of petition group 3 changed from 'L' to 'H' at the time of (a) t = 79 according to the actual trend values from daily reports by petition managers, (b) t = 80 according to the trend values estimated by prediction models based on RBFNs, and (c) t = 8 according to the trend values estimated by prediction models based on C5.0.

Fig. 12. The trend value of petition group 5 changed from 'L' to 'H' at the time of (a) t = 51 according to the actual trend values from daily reports by petition managers, (b) t = 37 according to the trend values estimated by prediction models based on RBFNs, and (c) t = 8 according to the trend values estimated by prediction models based on C5.0.

investigated manually. Lastly, Fig. 12 shows that for petition group 5 the prediction models based on RBFNs and C5.0 turned the trend value from 'L' to 'H' 14 and 43 days earlier, respectively, than the daily reports by petition inspectors. Thus, some of our prediction models forecasted the time when the trend of a petition group turns from 'L' to 'H' 14–71 days earlier than the daily reports by petition inspectors. This implies that our prediction models can be used to perceive in advance the moment when the trend value of a petition group turns from 'L' to 'H'. It gains time to prepare for nationwide social problems by improving the national law and policy related to the petition groups whose trend value has turned to 'H', before those problems get worse and eventually become fatal to the country.

5. Conclusion and further works

In this paper, we put forward a framework for applying text and data mining techniques to petitions filed to e-People in order to solve the problems and challenging issues introduced in Section 1. To sum up the framework explained in the three subsections of Section 3: first, we suggested steps in which keywords are elicited from unstructured data, i.e. the <title>…</title> element and … of petitions (XML). Second, we proposed how petition groups are identified from consistent clusters of petitions after clustering analyses on petitions with the elicited keywords, and how newly filed petitions are classified into one of the petition groups. Third, we suggested forecasting the trend of petitions filed to e-People by adopting two data mining techniques, RBFNs and C5.0, with the formulas in Table 1 that produce feature values from the structured data of petitions for each petition group. As the application in Section 4, we collected 4,217 petitions filed to e-People over the 92 days from July 1, 2008 to September 30, 2008. Using the 46 keywords elicited from the 4,217 petitions, we identified 8 petition groups (see Table 2). On the basis of the petition information of the 8 identified petition groups, we performed the 8 fold validation on the constructed prediction models. Their estimated accuracies in the testing phase were 62.35–86.96% for the prediction models based on RBFNs and 50.59–86.96% for those based on C5.0 (see Table 3). The trend values predicted for the 8 petition groups were plotted in Fig. 9. There we found that for three petition groups, i.e. petition groups 2, 3, and 5, the prediction models turned the predicted trend value from 'L' to 'H' earlier than petition inspectors manually changed their evaluation from 'L' to 'H' (see Figs. 10–12). In this way, our methodology transformed a great number of petitions into petition groups analyzable by petition inspectors by applying text mining techniques to the unstructured data of petitions, i.e. through identifying petition groups by clustering

petitions with the elicited keywords. We expect this to reduce the time-consuming manual work of reading and classifying petitions, so that petition inspectors can concentrate more efficiently on the daily analysis of continually filed petitions. Furthermore, by applying data mining techniques to the structured data of petitions, we could predict the trend of petitions with average testing accuracies of 76.96% (RBFNs) and 72.80% (C5.0), and we expect that our forecasting models based on RBFNs and C5.0 will be able to replace petition inspectors' decision making on the trend of petitions. We also found that our forecasting models based on RBFNs and C5.0 can predict the moment when the trend value turns to 'H' earlier than the petition inspectors. This will help the government sectors to concentrate on improving the related national law and policy strategy by saving the time spent finding and tracking significant groups of petitions that might grow into nationwide problems. Moreover, if petitions are prioritized with respect to their petition groups' trend values, this will lead to priorities among the related laws and policies to be improved by government sectors. Eventually, these contributions will evolve e-People into a virtual space where the boundaries between citizens and government sectors dissolve, so that citizens can easily participate in innovating national law and policy simply by filing petitions to e-People as the voice of the nation.

As further work, we would like to improve the performance of keyword elicitation from petitions by adding visualization methods based on semantic networks. This can benefit our approach because visualization methods are known to be suitable for representing unstructured data and its analysis results, and semantic networks consider the relationships among keywords. In addition, we wish to introduce additional formulas for new feature values and to study how to find priorities among the feature values used as input variables for the prediction models. Finally, we plan to evolve the e-People system by implementing our framework of applying text and data mining techniques to petitions filed to e-People. This will play an important role as a reference model in realizing Open Innovation in e-Government by enabling citizens to participate conveniently in innovating the national system by filing petitions to e-People.

Acknowledgement

This paper is based on the result of the Information Strategy Planning (ISP) that was performed by the SAMSUNG SDS consortium to improve e-People of the Anti-corruption and Civil Rights Commission (ACRC) over the 10 months from April 1, 2008 to January 31, 2009.

References

ACRC (2008). e-People: Online petition & discussion portal. In Demonstration contest session of e-Challenge conference and exhibition 2008. Stockholm, Sweden: European Commission.

Akhlaghi, Y., & Kompany-Zareh, M. (2005). Comparing radial basis function and feed-forward neural networks assisted by linear discriminant or principal component analysis for simultaneous spectrophotometric quantification of mercury and copper. Analytica Chimica Acta, 537(1-2), 331–338.
Alfaro, E., García, N., Gámez, M., & Elizondo, D. (2008). Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45(1), 110–112.
Aliguliyev, R. M. (2009). Clustering of document collection – A weighting approach. Expert Systems with Applications, 36(4), 7904–7916.
Atkinson, R. D., & Castro, D. D. (2008). Digital quality of life: Understanding the personal and social benefits of the information technology revolution. Information Technology and Innovation Foundation (ITIF).
Berry, M. J., & Linoff, G. S. (2000). Mastering data mining. New York: John Wiley & Sons.
Boley, D., Gini, M., Gross, R., Han, E., Hastings, K., Karypis, G., et al. (1999). Partitioning-based clustering for web document categorization. Decision Support Systems, 27(3), 329–341.
Bose, I., & Mahapatra, R. K. (2001). Business data mining – a machine learning perspective. Information & Management, 39(3), 211–225.
Chang, C.-L., & Chen, C.-H. (2009). Applying decision tree and neural network to increase quality of dermatologic diagnosis. Expert Systems with Applications, 36(2), 4035–4041.
Chau, M., Huang, Z., Qin, J., Zhou, Y., & Chen, H. (2006). Building a scientific knowledge portal: The NanoPort experience. Decision Support Systems, 42(2), 1216–1238.
Chen, T. (2007). Incorporating fuzzy c-means and a back-propagation network ensemble to job completion time prediction in a semiconductor fabrication factory. Fuzzy Sets and Systems, 158(19), 2153–2168.
Chen, L.-D., Sakaguchi, T., & Frolick, M. N. (2000). Data mining methods, applications, and tools. Information Systems Management, 17(1), 1–6.
Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.
Chesbrough, H. W. (2003). Open Innovation: The new imperative for creating and profiting from technology. Boston: Harvard Business School Press.
Chou, P. A. (1991). Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4), 340–350.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Delen, D., & Crossland, M. D. (2008). Seeding the survey and analysis of research literature with text mining. Expert Systems with Applications, 34(3), 1707–1720.
Deng, W. J., Chen, W. C., & Pei, W. (2007). Back-propagation neural network based importance–performance analysis for determining critical service attributes. Expert Systems with Applications, 34(2), 1115–1125.
Derks, E. P. P. A., Pastor, M. S. S., & Buydens, L. M. C. (1995). Robustness analysis of radial base function and multi-layered feed-forward neural network models. Chemometrics and Intelligent Laboratory Systems, 28(1), 49–60.
Enke, D., & Thawornwong, S. (2005). The use of data mining and neural networks for forecasting stock market returns. Expert Systems with Applications, 29(4), 927–940.
Evans, D., & Yen, D. C. (2006). E-Government: Evolving relationship of citizens and government, domestic, and international development. Government Information Quarterly, 23(1), 207–235.
Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2006). Tapping the power of text mining. Communications of the ACM, 49(9), 76–82.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.
Guzella, T. S., & Caminhas, W. M. (2009). A review of machine learning approaches to Spam filtering. Expert Systems with Applications. doi:10.1016/j.eswa.2009.02.037.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.
Kiang, M. Y., Fisher, D. M., Chen, J.-C. V., Fisher, S. A., & Chi, R. T. (2009). The application of SOM as a decision support tool to identify AACSB peer schools. Decision Support Systems. doi:10.1016/j.dss.2008.12.010.
Kohonen, T. (1995). Self-organizing maps. Springer.
Kuo, R. J., An, Y. L., Wang, H. S., & Chung, W. J. (2006). Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. Expert Systems with Applications, 30(2), 313–324.
Kuo, R. J., Ho, L. M., & Hu, C. M. (2002). Integration of self-organizing feature map and K-means algorithm for market segmentation. Computers & Operations Research, 29(11), 1475–1493.
Lam, M. (2004). Neural network techniques for financial performance prediction: Fundamental and technical analysis. Decision Support Systems, 32(4), 567–581.
Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. Hoboken, New Jersey: John Wiley & Sons.

Lee, T. L. (2004). Back-propagation neural network for long-term tidal predictions. Ocean Engineering, 31(2), 225–238.
Lee, J. A., & Jung, J. W. (2004). Strategy for implementing high level e-Government based on customer relationship management. Korean National Computerization Agency.
Li, Y., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381–404.
Marchionini, G., Samet, H., & Brandt, L. (2003). Digital government. Communications of the ACM, 46(1), 25–27.
Palvia, S. C. J., & Sharma, S. S. (2007). Foundations of e-Government. In A. Agarwal & V. V. Ramana (Eds.), e-Government and e-Governance: Definitions/domain framework and status around the world (pp. 1–12). India: International Congress of e-Government.
Panda, S. S., Chakraborty, D., & Pal, S. K. (2007). Flank wear prediction in drilling using back-propagation neural network and radial basis function network. Application Soft Computing, 29(4), 927–940.
Roussinov, D., & Chen, H. (1999). Document clustering for electronic meetings: An experimental comparison of two techniques. Decision Support Systems, 27(1-2), 67–69.
Sahin, F. (1997). A radial basis function approach to a color image classification problem in a real time industrial application. In Technical report ETD-6197223641. MS. Thesis. Blacksburg, USA: Electrical Engineering, Virginia Tech.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Scott, D. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Strauch, M., Supper, J., Spieth, C., Wanke, D., Kilian, J., Harter, K., et al. (2007). A two-step clustering for 3-D gene expression data reveals the main features of the Arabidopsis stress response. Journal of Integrative Bioinformatics, 4(1), 54–66.
Sun, Z.-L., Choi, T.-M., Au, K.-F., & Yu, Y. (2008). Sales forecasting using extreme learning machine with applications in fashion retailing. Decision Support Systems, 46(1), 411–419.
Tsaih, R., Hsu, Y., & Lai, C. C. (2008). Forecasting S&P 500 stock index futures with a hybrid AI system. Decision Support Systems, 23(2), 161–174.
Ture, M., Tokatli, F., & Kurt, I. (2009). Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients. Expert Systems with Applications, 36(2), 2017–2026.
United Nations (UN) & American Society for Public Administration (ASPA) (2003). Benchmarking e-government: A global perspective. NY: UN Publications.
Voorhees, E. M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing & Management, 22(6), 465–476.
Walczak, B., & Massart, D. L. (2000). Local modelling with radial basis function networks. Chemometrics and Intelligent Laboratory Systems, 50(2), 179–198.
Wang, J., Peng, H., & Hu, J. S. (2006). Automatic keyphrases extraction from document using neural network. In Advances in machine learning and cybernetics. Lecture notes in artificial intelligence (Vol. 3930, pp. 633–641). Berlin, Heidelberg, Germany: Springer. ICMLC 2005.
Wei, C., Yang, C. S., Hsiao, H. W., & Cheng, T. H. (2006). Combining preference- and content-based approaches for improving document clustering effectiveness. Information Processing & Management, 42(2), 350–372.
Wei, C.-P., Yang, C. C., & Lin, C.-M. (2008). A latent semantic indexing-based approach to multilingual document clustering. Decision Support Systems, 45(3), 606–620.
Weng, S.-S., & Lin, Y.-J. (2003). A study on searching for similar documents based on multiple concepts and distribution of concepts. Expert Systems with Applications, 25(3), 355–368.
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., et al. (2006). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Wu, H., Zhang, X., Li, X., Liao, P., Li, W., Li, Z., et al. (2006). Studies on acute toxicity of model toxins by proton magnetic resonance spectroscopy of urine combined with two-step cluster analysis. Chinese Journal of Analytical Chemistry, 34(1), 21–25.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In The 14th international conference on machine learning (pp. 412–420).
Yang, H.-C., & Lee, C.-H. (2005). A text mining approach for automatic construction of hypertexts. Expert Systems with Applications, 29(4), 723–734.
Yu, L., Lai, K. K., & Wang, S. (2008). Multistage RBF neural network ensemble learning for exchange rates forecasting. Neurocomputing, 71(16-18), 3295–3302.
Zhang, D., & Zhou, L. (2004). Discovering golden nuggets: Data mining in financial application. IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews, 34(4), 513–522.