2024 | Original Paper | Book Chapter

SAWTab: Smoothed Adaptive Weighting for Tabular Data in Semi-supervised Learning

Authors: Morteza Mohammady Gharasuie, Fengjiao Wang, Omar Sharif, Ravi Mukkamala

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

Self-supervised and semi-supervised learning (SSL) on tabular data is an understudied topic. Despite several attempts, two major challenges remain: (1) the imbalanced nature of tabular datasets, and (2) the one-hot encoding used in these methods, which becomes inefficient for high-cardinality categorical features. To address these challenges, we propose SAWTab, which uses a target encoding method, Conditional Probability Representation (CPR), for an efficient input-space representation of categorical features. We improve this representation by incorporating unlabeled samples through pseudo-labels. Furthermore, we propose a Smoothed Adaptive Weighting mechanism in the target encoding to mitigate the effect of noisy and biased pseudo-labels. Experimental results on various datasets and comparisons with existing frameworks show that SAWTab yields the best test accuracy on all datasets. We find that pseudo-labels can help improve the input-space representation in the SSL setting, which enhances the generalization of the learning algorithm.
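
To make the encoding idea concrete, below is a minimal Python sketch of conditional-probability target encoding in which pseudo-labeled rows contribute through a smoothed, confidence-dependent weight. The function names, the Laplace smoothing, and the confidence**beta weight are illustrative assumptions for this sketch, not the authors' SAWTab implementation.

import numpy as np
import pandas as pd

def cpr_encode(categories, targets, weights, n_classes):
    """Conditional-probability-style target encoding (illustrative sketch).
    Each category value is mapped to a weighted estimate of P(class | category);
    `weights` lets pseudo-labeled rows contribute less than labeled rows."""
    df = pd.DataFrame({"cat": categories, "y": targets, "w": weights})
    # Weighted class counts per category value.
    counts = df.pivot_table(index="cat", columns="y", values="w",
                            aggfunc="sum", fill_value=0.0)
    counts = counts.reindex(columns=list(range(n_classes)), fill_value=0.0)
    # Laplace smoothing keeps rare categories away from hard 0/1 probabilities.
    return (counts + 1.0).div(counts.sum(axis=1) + n_classes, axis=0)

def smooth_weight(confidence, beta=2.0):
    # Stand-in for a smoothed adaptive weight: down-weights low-confidence
    # pseudo-labels; the exponent beta is a hypothetical hyperparameter.
    return confidence ** beta

cats   = np.array(["a", "a", "b", "b", "c", "c"])
labels = np.array([0, 1, 1, 1, 0, 0])
# First four rows are labeled (weight 1.0); the last two carry pseudo-labels
# whose contribution is scaled by their model confidence.
w = np.array([1.0, 1.0, 1.0, 1.0, smooth_weight(0.9), smooth_weight(0.6)])
print(cpr_encode(cats, labels, w, n_classes=2))

In such a scheme, each categorical value in the data would then be replaced by its row of the encoded table, i.e. its (pseudo-label-informed) class-probability vector, before being fed to the downstream model.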

Metadata
Title
SAWTab: Smoothed Adaptive Weighting for Tabular Data in Semi-supervised Learning
Authors
Morteza Mohammady Gharasuie
Fengjiao Wang
Omar Sharif
Ravi Mukkamala
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2259-4_24
