I recently graduated in Computer Engineering, so now I’m free! My final project, “System for text classification in niche fields with few tagged data”, was graded with honors and can be found here.
The related code and datasets can be found here.
Abstract
The exponential growth of content generated on the Internet has forced the automation of management tasks previously carried out by humans, which in turn has driven significant development of Artificial Intelligence techniques. These new tools help moderate dangerous content spread on social networks, such as the promotion of eating disorders.
In this work, we collaborate with the APE Foundation to develop a text classifier that detects the promotion of anorexia and bulimia in Twitter messages. This classifier will be integrated into a software system that monitors interactions on social networks to detect the dissemination of such content in real time. The main goal of this work is to study the most relevant open-source tools available for this text classification task and to compare them in the specific context at hand (detection of messages promoting eating disorders).
Additionally, a corpus of texts labeled as promoting or not promoting eating disorders has been generated by expanding a pre-existing corpus with messages collected from the Internet. With it, text classifiers based on five different natural language processing tools have been trained: FastText, SpaCy, Transformers, Custom_BoW and Custom_TF-IDF. The last two were implemented as baselines for the comparison. Various forms of text pre-processing have also been applied, including an original spell checker, to reduce noise in the samples.
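To give a rough idea of what a bag-of-words baseline can look like, here is a minimal sketch in plain Python. It is not the Custom_BoW implementation from the project: the choice of a multinomial Naive Bayes classifier, the `BowNaiveBayes` name and the toy sentences are all assumptions made for illustration. The neutral labels stand in for the real promoter / non-promoter classes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Light preprocessing only: lowercase and split on whitespace.
    return text.lower().split()


class BowNaiveBayes:
    """Hypothetical baseline: multinomial Naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data, unrelated to the real corpus.
train_texts = [
    "great sunny weather today",
    "lovely sunny morning walk",
    "server crashed again tonight",
    "database server error again",
]
train_labels = ["weather", "weather", "tech", "tech"]

model = BowNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("sunny walk today")
```

A baseline like this trains in milliseconds on a small corpus, which is what makes it a useful reference point when judging whether a heavier model is worth its cost.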
The results show a clear superiority of the Transformers and FastText tools, which exceeded an F1-score of 0.95, outperforming the other classifiers studied. FastText is considered the most appropriate model for this case study because of its excellent balance between response time and quality of results. The most consistent results were achieved with minimally intrusive text preprocessing techniques. The use of spell checkers is discouraged: they increase response time without notably improving classification quality.
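For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class. Whether the project reports a binary or an averaged F1 is not stated here, so the binary variant, the `f1_score` helper and the toy labels are assumptions for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = positive class), made up for this example.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

Because it combines precision and recall, the F1-score penalizes a classifier that does well on only one of the two, which matters when the classes are unevenly represented.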
In conclusion, this work shows that it is feasible to classify natural-language text with a small corpus of examples, without dedicated hardware or extensive knowledge of Artificial Intelligence. Future work is needed to improve the labeling quality of the generated corpus, to investigate whether similar results hold in other text classification tasks, and to study the use of Machine Learning as a Service, as offered by OpenAI with GPT-3.
Keywords
- anorexia
- bulimia
- eating disorders
- social networks
- machine learning
- natural language processing
- neural networks