Show simple item record
dc.contributor.advisor | Huber, Manfred | |
dc.creator | Basak, Suryoday | |
dc.date.accessioned | 2020-06-12T20:21:21Z | |
dc.date.available | 2020-06-12T20:21:21Z | |
dc.date.created | 2020-05 | |
dc.date.issued | 2020-06-05 | |
dc.date.submitted | May 2020 | |
dc.identifier.uri | http://hdl.handle.net/10106/29088 | |
dc.description.abstract | K-Nearest Neighbors (KNN) remains one of the most popular methods for supervised machine learning tasks. However, its performance often depends on the characteristics of the dataset and on appropriate feature scaling. This thesis explores the characteristics that make a dataset suitable for use with KNN. As part of this, two new measures of dataset dispersion, mean neighborhood target variance (MNTV) and mean neighborhood target entropy (MNTE), are developed to help predict the performance expected of KNN regressors and classifiers, respectively. It is demonstrated empirically that these dispersion measures can be indicative of the performance of KNN regression and classification. This idea is then extended to learning feature weights that improve the accuracy of KNN classification and regression. It is argued that MNTV and MNTE, when used as objectives for learning feature weights, cannot be optimized with traditional gradient-based methods, so optimization strategies based on metaheuristic methods, namely genetic algorithms and particle swarm optimization, are developed instead. The feature-weighting method is evaluated in both regression and classification settings on publicly available datasets, and the results indicate that appropriate feature weighting improves the performance of KNN relative to the unweighted baseline. In a separate branch of the work, the ideas behind MNTV and MNTE are used to develop a sample-weighting algorithm that assigns a sampling probability to each instance in a training set. This, too, is evaluated in both regression and classification with subsamples drawn according to these probabilities, and the performance is compared to KNN trained on the full, unsubsampled training set. | |
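The abstract describes MNTV and MNTE only by name; the exact formulations are given in the thesis itself. A plausible minimal sketch, assuming MNTV is the variance of the target values within each point's k-nearest neighborhood averaged over the training set, and MNTE the analogous mean Shannon entropy of the neighborhood class labels, might look like this (function names and the choice of Euclidean distance are illustrative, not taken from the thesis):

```python
import numpy as np

def mean_neighborhood_target_variance(X, y, k=5):
    """Sketch of MNTV: average, over all training points, of the
    variance of the continuous target within each point's k-neighborhood."""
    variances = []
    for i in range(len(X)):
        # Squared Euclidean distances from point i to every point.
        d = np.sum((X - X[i]) ** 2, axis=1)
        # Indices of the k nearest neighbors, excluding the point itself.
        nbrs = np.argsort(d)[1:k + 1]
        variances.append(np.var(y[nbrs]))
    return float(np.mean(variances))

def mean_neighborhood_target_entropy(X, y, k=5):
    """Sketch of MNTE: average Shannon entropy (bits) of the class
    labels within each point's k-neighborhood."""
    entropies = []
    for i in range(len(X)):
        d = np.sum((X - X[i]) ** 2, axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        _, counts = np.unique(y[nbrs], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-np.sum(p * np.log2(p)))
    return float(np.mean(entropies))
```

Under this reading, a smooth target (low MNTV) or locally pure class labels (low MNTE) suggest KNN will do well; both quantities are piecewise constant in any feature weights applied inside the distance, which is consistent with the abstract's claim that gradient-based optimization is unsuitable and metaheuristics are needed.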
dc.format.mimetype | application/pdf | |
dc.language.iso | en_US | |
dc.subject | Machine learning | |
dc.subject | Data analysis | |
dc.subject | Metaheuristic optimization | |
dc.subject | K-nearest neighbors | |
dc.subject | Classification | |
dc.subject | Regression | |
dc.title | Randomized and Evolutionary Approaches to Dataset Characterization, Feature Weighting, and Sampling in K-Nearest Neighbors | |
dc.type | Thesis | |
dc.degree.department | Computer Science and Engineering | |
dc.degree.name | Master of Science in Computer Science | |
dc.date.updated | 2020-06-12T20:21:21Z | |
thesis.degree.department | Computer Science and Engineering | |
thesis.degree.grantor | The University of Texas at Arlington | |
thesis.degree.level | Masters | |
thesis.degree.name | Master of Science in Computer Science | |
dc.type.material | text | |
dc.creator.orcid | 0000-0002-1982-1787 | |
Files in this item
- Name:
- BASAK-THESIS-2020.pdf
- Size:
- 5.752Mb
- Format:
- PDF
This item appears in the following Collection(s)