English Abstract
DNA mutations are an important source of human genome variability. Single nucleotide polymorphisms (SNPs) are changes in the DNA sequence that are present in more than one percent of the population. DNA comprises two regions: coding DNA and noncoding DNA. Coding DNA refers specifically to DNA that encodes proteins. Much of the DNA, however, does not encode proteins and is called junk DNA. A nonsynonymous SNP falls inside a coding region of the DNA and changes the protein sequence. Such an SNP may alter protein function and protein stability and may have future health consequences. Most SNPs have no effect or function within the organism, while others affect the body, for example by changing the appearance, shape, or color of cells.
During the last decade, over 3 billion nucleotides from human genomes have been released, accompanied by other data from the HapMap consortium and the human variation project, which together allow the identification of tens of millions of SNPs in different regions. The current Single Nucleotide Polymorphism database (dbSNP) is the most comprehensive database of genetic variability; it contains about 51.8 million SNPs, depending on position and occurrence.
Computational tools analyze the effects of SNPs. A predictor tool classifies an SNP as either disease-causing or neutral. There are three main types of tools: sequence-based tools, structure-based tools, and tools that combine both. Sequence-based tools, such as SIFT and PANTHER, use features extracted from the amino acid sequence. Structure-based tools use features extracted from the three-dimensional structure of the protein. The third type, which includes SNPs&GO and PolyPhen-2, uses both protein sequence and protein structure and so combines the strengths of both approaches. However, most tools rely on a single machine learning classification algorithm to predict the effect of an SNP on protein function.
This research implements a new prediction tool that classifies the effect of a nonsynonymous SNP as disease-related or neutral. It uses three well-known machine learning algorithms to classify the effect of coding SNPs: support vector machine, random forest, and artificial neural network. It then combines these algorithms using two ensemble techniques: greedy best selection and stacking. The prediction tool is evaluated using three performance scores: accuracy, sensitivity, and specificity. Finally, the prediction tool is compared with other studies, and the experiments show that it achieves better performance.
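
The three performance scores named above can be illustrated with a minimal sketch. It assumes a binary labeling in which "disease" is the positive class and "neutral" is the negative class; the function names and the toy labels are hypothetical and are not taken from the tool described here.

```python
def confusion_counts(y_true, y_pred, positive="disease"):
    """Count true/false positives and negatives for a binary task.

    Assumption: any label other than `positive` is treated as the
    negative (neutral) class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # disease SNP correctly flagged
            else:
                fp += 1  # neutral SNP wrongly flagged as disease
        else:
            if truth == positive:
                fn += 1  # disease SNP missed
            else:
                tn += 1  # neutral SNP correctly passed
    return tp, fp, tn, fn


def performance_scores(y_true, y_pred, positive="disease"):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        # fraction of all SNPs classified correctly
        "accuracy": (tp + tn) / total if total else 0.0,
        # fraction of disease SNPs that were detected
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # fraction of neutral SNPs that were left unflagged
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


if __name__ == "__main__":
    # Toy labels for illustration only, not experimental data.
    truth = ["disease"] * 3 + ["neutral"] * 5
    preds = ["disease", "disease", "neutral",
             "neutral", "neutral", "neutral", "disease", "neutral"]
    print(performance_scores(truth, preds))
```

Reporting sensitivity and specificity alongside accuracy matters when disease and neutral SNPs are unevenly represented, since accuracy alone can look high for a predictor that mostly outputs the majority class.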