Full Research Information in the University Data Repository

Research Title (Papers / Research Title)


Data Construction using Genetic Programming


Author / Editor / Publisher

 
عباس محسن عبد الحسين البكري

Citation Information


عباس محسن عبد الحسين البكري, Data Construction using Genetic Programming, Time 6/14/2011 6:55:27 AM: College of Information Technology

Abstract Description (Abstract)


Data Construction using Genetic Programming

Full Description (Full Abstract)



Data Construction using Genetic Programming Method to Handle Data Scarcity Problem

 
Abbas M. AL-Bakary, Samaher Hussein Ali*

Computer Technology College, Babylon University
 
Computer Science Department, Babylon University

 Iraq.abbasmoh67@yahoo.com

 Iraq.samaher_hussein@yahoo.com

  
 
ABSTRACT:-

 
The Genetic Programming Data Construction Method (GPDCM) is used in this work to handle one of the key problems in supervised learning: the insufficient size of the training dataset. The methodology consists of four stages. First, each record in the small dataset is represented as a decision tree (DT), and the collection of these trees forms the population of the Genetic Programming algorithm (GPA). Second, a numerical value (the Gain Information Ratio) is attached to each node of those trees; these values represent the fitness of the nodes. Third, the small population is expanded by applying, in parallel, three different types of GPA crossover to each pair of parents. Fourth, the classes of the new samples generated by GPDCM are forecast using a back propagation neural network (BPNN), and ROC graphs are then applied as a measure of robustness evaluation. The work takes all the important variables into account, because it starts by collecting DTs, and it is applied to five different datasets (iris, weather, heart, soybean and lymphography). For theoretical and practical validity, we compare the proposed method with other applied methods. As a result, we find that GPDCM is a promising technique for expanding extremely small datasets and extracting useful knowledge.

Keywords: insufficient size of dataset, decision tree, Genetic Programming, Gain Information Ratio, BPNN.
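The second and third stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each record is a flat attribute-value mapping rather than a full decision tree, it uses the standard gain ratio (information gain divided by split information) as the fitness value, and it shows only one hypothetical crossover (a one-point swap), since the abstract does not specify the three crossover types. The function names and the toy weather-style records are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(records, labels, attr):
    """Information gain of `attr` divided by its split information."""
    total = len(records)
    groups = {}
    for rec, lab in zip(records, labels):
        groups.setdefault(rec[attr], []).append(lab)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    if split_info == 0:
        return 0.0  # attribute takes a single value, so it carries no information
    return (entropy(labels) - remainder) / split_info

def one_point_crossover(parent_a, parent_b, order, cut):
    """Hypothetical crossover: swap the values of the attributes in order[cut:]."""
    child_a, child_b = dict(parent_a), dict(parent_b)
    for attr in order[cut:]:
        child_a[attr], child_b[attr] = parent_b[attr], parent_a[attr]
    return child_a, child_b

# Toy records, made up for this sketch (not taken from the paper's datasets).
records = [
    {"outlook": "sunny", "humidity": "high", "windy": "no"},
    {"outlook": "sunny", "humidity": "normal", "windy": "yes"},
    {"outlook": "rainy", "humidity": "high", "windy": "no"},
    {"outlook": "rainy", "humidity": "normal", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

# Fitness of each attribute, then attributes ordered from most to least informative.
fitness = {a: gain_ratio(records, labels, a) for a in ("outlook", "humidity", "windy")}
order = sorted(fitness, key=fitness.get, reverse=True)

# Cross the first and last records after the second attribute to get two new records.
child_a, child_b = one_point_crossover(records[0], records[3], order, cut=2)
```

In this toy data `outlook` separates the classes perfectly, so it receives the highest fitness, and the two children are attribute combinations that do not occur among the four original records. Their class labels are unknown at this point, which is why the fourth stage uses a classifier to forecast them.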
 
 
 INTRODUCTION:-

 
Data scarcity is one of the main problems of machine learning and data mining, because an insufficient amount of data is very often responsible for poor learning performance; how to extract significant information for inference is therefore a critical issue. It is well known that one of the basic theorems in statistics is the Central Limit Theorem [1,16]. This theorem asserts that when the sample size is large (conventionally, at least 30), the distribution of the sample mean (x-bar) is approximately normal regardless of the population distribution. Therefore, when the number of samples is less than 30, it is considered an insufficient sample size for intelligent analysis. There are two possible ways to overcome the data scarcity problem: one is to collect more data, and the other is to design techniques that can deal with extremely small data sets. One major contribution to this issue was made by [2], who developed a methodology for integrating different kinds of “hints” (prior knowledge) into the usual learning-from-examples procedure. In this way, the “hints” can be represented by new examples, generated from the existing data set by applying transformations that are known to leave the function to be learned invariant. Then, [3] modified “hints” into “virtual samples” and applied them to improve the learning performance of artificial neural networks such as Back-Propagation and Radial Basis Function networks. In fact, it is evident that generating more samples resembling those in the small training set can make the learning tools perform well. [4] proposed using a neural network ensemble to preprocess the training data for a rule learning approach. They first used the original training data set to generate an ensemble. Then, they randomly generated new instances and passed them to the ensemble for classification. The outputs of the ensemble were regarded as the labels of these instances. By combining the predicted labels and the training inputs, new examples were obtained and used to enlarge the training data set. The enlarged training data set was finally used by a rule learning approach. This approach has been applied to gene expression data and obtained interesting results.
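The ensemble-based enlargement described for [4] can be sketched roughly as follows. This is only an illustration under stated assumptions: a small ensemble of bootstrap-trained 1-nearest-neighbour voters stands in for the neural network ensemble of [4], new instances are drawn uniformly within the range of each feature, and the names `build_ensemble`, `ensemble_predict` and `enlarge` as well as the toy data are hypothetical.

```python
import random
from collections import Counter

def nearest_label(sample, train):
    """Label of the training point closest to `sample` (squared Euclidean distance)."""
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(sample, p[0])))[1]

def build_ensemble(train, n_members, rng):
    """Each ensemble member is a bootstrap resample of the original training set."""
    return [[rng.choice(train) for _ in train] for _ in range(n_members)]

def ensemble_predict(ensemble, sample):
    """Majority vote over the members' 1-nearest-neighbour predictions."""
    votes = Counter(nearest_label(sample, member) for member in ensemble)
    return votes.most_common(1)[0][0]

def enlarge(train, n_new, n_members=5, seed=0):
    """Return the training set plus `n_new` random instances labelled by the ensemble."""
    rng = random.Random(seed)
    ensemble = build_ensemble(train, n_members, rng)
    dims = len(train[0][0])
    lows = [min(x[i] for x, _ in train) for i in range(dims)]
    highs = [max(x[i] for x, _ in train) for i in range(dims)]
    virtual = []
    for _ in range(n_new):
        # Assumption: new instances are sampled uniformly inside the observed feature ranges.
        sample = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        virtual.append((sample, ensemble_predict(ensemble, sample)))
    return train + virtual

# Toy two-class data, made up for this sketch.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((3.0, 3.1), "b"), ((2.9, 3.3), "b")]
enlarged = enlarge(train, n_new=6)
```

The enlarged set keeps the four original examples and appends six ensemble-labelled ones; in the approach of [4] such a set would then be passed to the rule learner.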


