A tree based categorical variable encoding strategy in supervised learning tasks

dc.contributor: Graduate Program in Computational Science and Engineering.
dc.contributor.advisor: Baydoğan, Mustafa Gökçe.
dc.contributor.author: Gazioğlu, Mine.
dc.date.accessioned: 2023-10-15T06:40:59Z
dc.date.available: 2023-10-15T06:40:59Z
dc.date.issued: 2022
dc.description.abstract: Categorical variables are present in most real-world datasets and often have a large number of levels, in which case they are referred to as high-cardinality categorical variables. Most machine learning algorithms have no innate mechanism for dealing with categorical variables, so encoding them is necessary. Categorical variable encoding is the general term for the conversion of nominal independent variables to a numerical format. Many encoding strategies exist, and they are discussed in this thesis. This thesis presents a novel encoding strategy, categorical split encoding, and also provides an analysis of existing encoding methods. Categorical split encoding uses primary and surrogate split information as the vector representation for categorical variables. Through a tree-based algorithm, the method outputs binary columns for each categorical variable, making use of target information. Missing values are imputed using surrogate information, and similar values are clustered together based on the path they take through the decision tree. Various existing encoding strategies are benchmarked for comparison with the proposed strategy. The performance of categorical split encoding and the other encoding methods is compared using three different machine learning algorithms (generalized linear models, random forest and XGBoost) on datasets from regression, binary classification and multiclass classification settings. The datasets used are made publicly available for replication purposes. The results show that categorical split encoding is competitive with existing encoding strategies on various datasets.
dc.format.pages: xiii, 59 leaves
dc.identifier.other: CSE 2022 G38
dc.identifier.uri: https://digitalarchive.library.bogazici.edu.tr/handle/123456789/19691
dc.publisher: Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022.
dc.subject.lcsh: Categories (Mathematics)
dc.subject.lcsh: Supervised learning (Machine learning)
dc.title: A tree based categorical variable encoding strategy in supervised learning tasks
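
Illustrative sketch (not taken from the thesis): the abstract describes deriving binary columns for a categorical variable from tree splits that use target information. The Python below shows only the simplest piece of that idea, under loud assumptions: a single tree node, a numeric target, and the common CART shortcut of ordering levels by mean target before searching for the best two-way split. Surrogate splits, missing-value imputation and deeper trees are not covered. The function names (sse, best_binary_split, encode) and the toy data are hypothetical.

```python
from collections import defaultdict


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(categories, targets):
    """Return the set of levels sent to the left child by the best single split.

    Levels are ordered by their mean target, and every prefix of that
    ordering is tried as the left group. This is an assumed, simplified
    stand-in for a tree's primary split, not the thesis's algorithm.
    """
    by_level = defaultdict(list)
    for level, y in zip(categories, targets):
        by_level[level].append(y)

    ordered = sorted(by_level, key=lambda lv: sum(by_level[lv]) / len(by_level[lv]))

    best_left, best_cost = set(), float("inf")
    for cut in range(1, len(ordered)):
        left = [y for lv in ordered[:cut] for y in by_level[lv]]
        right = [y for lv in ordered[cut:] for y in by_level[lv]]
        cost = sse(left) + sse(right)  # total within-child squared error
        if cost < best_cost:
            best_left, best_cost = set(ordered[:cut]), cost
    return best_left


def encode(categories, left_levels):
    """Binary column: 1 if the level goes to the left child, else 0."""
    return [1 if level in left_levels else 0 for level in categories]


# Hypothetical toy data: a four-level categorical column and a numeric target.
cities = ["A", "B", "C", "A", "B", "C", "D", "D"]
prices = [10.0, 11.0, 50.0, 12.0, 9.0, 52.0, 49.0, 51.0]

left_levels = best_binary_split(cities, prices)
encoded = encode(cities, left_levels)
```

On this toy data, levels A and B have similar target values and end up in the same group, so the four-level column collapses into one binary column. A fuller encoder in the spirit of the abstract would presumably repeat this over several splits and add surrogate information, which this sketch does not attempt.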

Files

Original bundle
Name: b2777665.037472.001.PDF
Size: 1.34 MB
Format: Adobe Portable Document Format
Name: b2777665.037494.001.zip
Size: 37.81 KB
Format: Unknown data format