Create Tomorrow - Our Data Science and Imperial College London Partnership
Topic modelling is an area of machine learning that finds topics in documents based on the terms mentioned. It can be crucial when trying to categorise large amounts of text data quickly.
However, when you already have some categories, it is tough to tell whether a newly found topic is genuinely new or just an extension of one you already have. The downside of automated techniques like topic modelling is that they know nothing about a consumer credit context. In practice, this means they will typically suggest too few or too many topics.
Subject matter experts (SMEs) can use their mental map of the problem to link issues that may seem unrelated. However, given the dynamic nature of our business and the size of our customer base, it can be challenging for experts to keep up with the volume of data.
This summer NewDay Data Science worked with Imperial College London to combine the strengths of a data-driven approach to topic modelling with the SME view. A novel and interesting solution to this challenging problem was developed by an Imperial Masters student for their thesis project.
We sat down with the student who undertook this project to hear about his experience trying to tackle this problem:
Briefly introduce yourself and tell us about your background and what areas interest you?
My name is Billy Ogier and I’ve recently completed the MSc in Statistics at Imperial College London. After doing an undergraduate project in Reinforcement Learning, I was drawn into the exciting world of Machine Learning, leading me to specialise in a variety of AI-centred modules in the course. From Deep Learning to Ethics in AI, I enjoyed learning the skills to effectively harness the vast amounts of data available in today’s world.
Describe your MSc project, and how NewDay helped make the project possible.
The MSc thesis, in partnership with NewDay, centred on applying topic modelling to their chatbot data to explore the intents of customers. Topic modelling is the automated process of finding the themes present within a large corpus. Although topic modelling is usually completely unsupervised, we incorporated the domain knowledge of subject matter experts (SMEs) into the topic model through prior labels on the chatbot utterances. I worked closely with Josh and Donal, members of NewDay's Data Science team, whose guidance and collaborative efforts were instrumental in the development of the project.
Were there any aspects of your project that you found most rewarding or enjoyable? Anything surprising?
Imperial place substantial emphasis on developing novel ideas in our work. Although daunting for a three-month project, it made the work feel particularly satisfying. The innovation was in the development of a novel clustering method called K-SMEans. This technique incorporated the SME labels into the traditional K-Means clustering approach, allowing the user to control the influence of these labels during the clustering. K-SMEans was integrated into a recently developed topic modelling framework called BERTopic. Seeing the benefits of this novel approach, such as reduced redundancy and increased topic coverage within the chatbot data, was really rewarding.
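To give a feel for the idea of letting expert labels influence K-Means, here is a minimal sketch. It is not the K-SMEans algorithm from the thesis, whose details are not given in this post. It is a simple label-penalised variant of K-Means, similar in spirit to seeded or constrained K-Means: assigning a labelled point to a cluster other than its SME label costs an extra `weight`. The function name `label_weighted_kmeans`, the `weight` parameter and the toy data are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def label_weighted_kmeans(
    points: List[Point],
    labels: List[Optional[int]],
    k: int,
    weight: float,
    iters: int = 20,
) -> Tuple[List[int], List[List[float]]]:
    """Illustrative label-penalised K-Means (hypothetical, not K-SMEans).

    labels[i] is an SME label in 0..k-1, or None if the point is unlabelled.
    weight = 0 ignores the labels during assignment; a large weight
    effectively pins labelled points to their labelled cluster.
    """
    # Seed each centroid with the mean of the points carrying that label.
    centroids = []
    for c in range(k):
        seeded = [p for p, l in zip(points, labels) if l == c]
        if not seeded:
            raise ValueError(f"no labelled point for cluster {c}")
        centroids.append(mean(seeded))

    assign = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: distance plus a penalty for contradicting a label.
        new_assign = []
        for p, l in zip(points, labels):
            costs = [
                sq_dist(p, centroids[c])
                + (weight if l is not None and l != c else 0.0)
                for c in range(k)
            ]
            new_assign.append(min(range(k), key=costs.__getitem__))
        if new_assign == assign:
            break  # converged
        assign = new_assign
        # Update step: recompute centroids of non-empty clusters.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return assign, centroids


# Toy data: two groups, plus one point that sits near group 1
# but that an expert has labelled as belonging to group 0.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (9, 9)]
labels = [0, None, None, 1, None, None, 0]

ignore_labels, _ = label_weighted_kmeans(points, labels, k=2, weight=0.0)
trust_labels, _ = label_weighted_kmeans(points, labels, k=2, weight=1000.0)
```

In this toy setup the last point follows its nearest centroid when `weight` is zero and follows the expert label when `weight` is large, which is the kind of control over label influence described above.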
What were some of the challenges you faced during the project, and how did you overcome them?
An early challenge was exploring the vast body of literature available. Trying to read and understand state-of-the-art research from places like Google was demanding and required considerable patience. Another challenge was applying topic modelling to short-form chatbot utterances, which can lack the context and depth needed to extract meaningful topics. To overcome this, we used the BERTopic framework, which leverages the BERT language model. BERT is pre-trained on an incredibly large text dataset to form information-rich text embeddings that can be used in a topic model. As a result, BERTopic was much more effective on short text than other topic models, such as Bayesian generative models.
What did you learn from the project that you think will be valuable for you moving forward in your career?
One of the great benefits of working with NewDay was learning to use their modern tools for machine learning. Using Amazon SageMaker, an AWS cloud-based machine learning service, allowed smooth research, collaboration, and model development. It also allowed me to take advantage of high-performance computing resources by using GPU-accelerated Python packages. More generally, being able to speak with such a variety of intelligent people across NewDay, who not only had a genuine interest in my work but also provided insightful input, showed me the real power of collaborative workforces.
Thank you, Billy, for sharing your experience and for your excellent work in developing K-SMEans!
This work and thesis were supervised by Dr. Ioanna Papatsouma - Senior Teaching Fellow in Statistics at Imperial College London, Dr. Joshua Plasse - Lead Data Scientist at NewDay, and Dr. Donal Simmie - Head of Data Science at NewDay.
Thank you to the Department of Mathematics at Imperial College London for facilitating the partnership. Thank you also to the following NewDay colleagues who helped make this partnership possible: Fengrui Wang - Senior Principal Security Architect, Idelkys QuintanaRamirez - ML Ops Engineer, Marcin Winnik - Data Engineer, Jason Wan - Senior Infrastructure Engineer and Elvis Joubert - End User Computing Specialist.
We look forward to integrating K-SMEans into the next iteration of our conversational AI product suite with our colleagues from the Digital Products team: Jess Smith - Digital Product Owner, and Russ Martin - Head of Digital Products, Web and Mobile.