The Ministry of University, Research and Innovation has funded a research project carried out by the Department of Computer Engineering at the University of Cádiz that has produced REDIBAGG, a method that accelerates the training of artificial intelligence models by up to 70% by using less data, without losing accuracy. The technique can be applied to analyze large volumes of information in fields as diverse as medicine, industry, and finance.
The tool is designed to work with large volumes of information in classification tasks, in which algorithms must assign each case to one of a set of predefined categories. In healthcare, for example, it could speed up automatic diagnostic systems without sacrificing reliability; in industry, it would help detect failures in real time with lower resource consumption; and in finance, it could process large records in less time to prevent fraud or analyze risks.
As explained in an article published in the journal ‘Engineering Applications of Artificial Intelligence’, the system performs well in different contexts. "It is not a method geared to particular types of data; it is very versatile and robust on any dataset with a large number of features or instances," points out Juan Francisco Cabrera, co-author of the study.
Another advantage of the tool is its simplicity of implementation. It can be applied easily in common artificial intelligence environments such as the Python programming language and standard machine learning libraries such as Scikit-learn, which facilitates its adoption by researchers, companies, and institutions.
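As an illustration of the kind of workflow the authors describe, the sketch below builds a conventional ‘bagging’ ensemble with Scikit-learn. REDIBAGG itself is not publicly released, so the classes and dataset used here are only the standard Scikit-learn API the new method is designed to sit alongside, not the authors' code.

```python
# Minimal sketch of the standard Scikit-learn bagging workflow that the
# article says REDIBAGG is designed to fit into. This is the conventional
# API only; it does not implement the REDIBAGG resampling scheme.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classic bagging: each tree is trained on a bootstrap sample of the same
# size as the training set, then the 50 predictions are combined by voting.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```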
REDIBAGG is a variant of ‘bagging’ (short for ‘bootstrap aggregating’), a model combination method widely used in artificial intelligence to improve classifier accuracy. The technique creates multiple subsets from the original data sample; each subset is used to train a base classifier, and the predictions are then combined to make more reliable decisions. The resampling method used by ‘bagging’ is the ‘bootstrap’, a statistical technique that generates random sub-samples with replacement: new data collections are created by randomly selecting examples from the original set, so some examples may be repeated while others are left out.
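The bootstrap step described above can be illustrated in a few lines of NumPy: indices are drawn with replacement, so some examples appear several times in a sub-sample while others are left out entirely. This is a generic illustration of the statistical technique, not code from the study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10  # size of a toy data set

# A bootstrap sub-sample draws n indices from the original set *with replacement*.
indices = rng.integers(0, n, size=n)

print("bootstrap indices:", sorted(indices.tolist()))
print("examples repeated:", n - len(set(indices.tolist())))
print("examples left out:", sorted(set(range(n)) - set(indices.tolist())))
```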
Although ‘bagging’ is effective, its main drawback is its high computational cost. Each model is trained on a sub-sample of the same size as the original set, which slows down learning and multiplies resource consumption. To address this limitation, the researchers implemented a new resampling scheme that generates smaller but still representative subsets.
From these sub-samples, they trained several independent models and combined their predictions just as in classic ‘bagging’. "In the era of big data, where we work with very large volumes of data, methods that reduce learning times are welcome, especially if they can cut them by up to 70% compared with the original method," emphasizes Esther Lydia Silva, the study's lead author.
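The article does not give the exact resampling formula, but the general idea of training each base model on a reduced sub-sample can be conveyed with the standard Scikit-learn ‘max_samples’ parameter. The sketch below is only an analogy built from off-the-shelf tools, not the authors' REDIBAGG implementation.

```python
# Sketch of the general idea only: conventional bagging versus an ensemble
# whose base models see reduced sub-samples (via Scikit-learn's max_samples).
# This is NOT the REDIBAGG resampling scheme described in the paper.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=40, random_state=0)

# max_samples=1.0 reproduces classic bagging; max_samples=0.3 trains each of
# the 50 trees on a much smaller subset, which shortens training time.
for label, fraction in [("full-size bootstrap", 1.0), ("reduced sub-samples", 0.3)]:
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=fraction, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{label}: {time.perf_counter() - start:.1f} s to train")
```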
To validate its effectiveness, the team tested it on 30 real data sets using Urania, the supercomputer at the University of Cádiz, covering areas as diverse as medicine, biology, physics, and the social sciences. The method was also applied with various types of classification algorithms, such as decision trees, neural networks, support vector machines, and Bayesian models.
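To illustrate how different base classifiers can be plugged into the same ensemble scheme, the sketch below wraps the four algorithm families mentioned in the article in Scikit-learn's standard bagging class. Again, the standard class stands in for, and is not, the REDIBAGG method, and the dataset is an arbitrary built-in example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# The four families of base classifiers mentioned in the article, each
# wrapped in the same standard bagging ensemble.
base_models = {
    "decision tree": DecisionTreeClassifier(),
    "neural network": MLPClassifier(max_iter=300),
    "support vector machine": SVC(),
    "naive Bayes": GaussianNB(),
}

for name, base in base_models.items():
    ensemble = BaggingClassifier(base, n_estimators=10, random_state=0)
    scores = cross_val_score(ensemble, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```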
In all cases, the new approach achieved accuracy comparable to that of the original method. On average, training time was reduced by 35%, with reductions of up to 70% on very large data sets. "By working with less complex models, training hours and storage costs are reduced, making the method much more efficient," the researcher explains.

Future Objectives
The researchers now aim to release the method for use by the scientific community. They also plan to study how the tool could be applied to other machine learning systems beyond ‘bagging’ and its variants, to combine it with feature selection techniques to obtain even more efficient models, and to explore its adaptation to regression tasks, where numerical values are predicted instead of categories.
The work was funded by the Ministry of University, together with FEDER funds.