If you ask Saheed Azeez about the difficulty of creating Naijaweb, a dataset of 230 million GPT2 tokens based on Nairaland, he'll tell you it is easy. "All you need is to know web scraping and data cleaning," he said.
However, when he explained how he created Naijaweb, most of what he said flew over my head. For a final-year Mechanical Engineering student at the University of Lagos, Nigeria, Naijaweb is an impressive achievement, even if Azeez doesn't completely agree.
Naijaweb is not another ChatGPT or a version of Ijemma Onwuzulike's IgboSpeech; it is a dataset that can be used to train a large language model (LLM), like the one that powers ChatGPT.
Feeding data into LLMs sounds like a walk in the park, but every step, from scraping the pages to converting the text into GPT tokens, takes web scraping and data-cleaning skills.
Azeez first picked up the web scraping and data-cleaning skills behind Naijaweb in a Python class in 2019.
The only reason he joined the class was the mention of machine learning, which he thought meant teaching robots or machines how to learn. As a mechanical engineering student, he thought it could come in handy, but he soon realised that there were no physical machines involved in machine learning.
Rather than be disappointed, he got even more interested; thanks to the COVID-19 pandemic in 2020, he had time to take it seriously.
He started with a lot of machine learning competitions he found on Zindi. He lost most of them. But while they were important for learning, they weren't enough to help him build Naijaweb. "I needed to learn how to build from scratch."
Building Naijaweb
In 2022, Azeez began his first attempt at web scraping Nairaland. "I heard people talking a lot about the amount of value Nairaland possesses, so I decided to try web scraping it."
Unfortunately, this first attempt didn't go very well. "The script I used back then didn't support synchronous programming."
Synchronous programming means tasks are completed one after another in a set order, with each step waiting for the previous one to finish. When he tried again this year, he figured it out, and he credits Hugging Face, an open-source platform for ML and data science, for providing an easy-to-use library.
The next step is to use Naijaweb to train an LLM, but that might not happen.
While 230 million GPT2 tokens seems like a lot, in today's AI age it is not nearly enough. But what exactly are these tokens?
LLMs understand numbers, not words. Tokenisation is the process of converting words into numbers that LLMs can understand.
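To make that definition concrete, here is a minimal Python sketch of synchronous scraping. It is not the script Azeez used: the function names are hypothetical, and a short sleep stands in for the network delay of downloading a real page.

```python
import time


def fetch_page(page_number):
    """Hypothetical stand-in for downloading one page.

    The sleep simulates network delay; no real request is made.
    """
    time.sleep(0.1)
    return f"contents of page {page_number}"


def scrape_synchronously(page_numbers):
    """Fetch pages one after another, in the order given."""
    results = []
    for page_number in page_numbers:
        # This call blocks: the loop cannot move on to the next page
        # until the current one has finished.
        results.append(fetch_page(page_number))
    return results


start = time.perf_counter()
pages = scrape_synchronously([1, 2, 3])
elapsed = time.perf_counter() - start

print(pages)
print(f"Fetched {len(pages)} pages in {elapsed:.2f} seconds")
```

Because each fetch waits for the one before it, the total time is roughly the sum of the individual delays, which adds up quickly on a site with a very large number of pages.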
"If we were to tokenise the word CALCULATED, for example, we could split it into four tokens, CAL-CU-LA-TED. A number will be assigned to each of these tokens."
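Azeez's example can be sketched in a few lines of Python. This is a toy illustration only: the four-piece vocabulary and its ID numbers are made up, and the real GPT2 tokeniser learns its pieces from data and may well split the word differently.

```python
# Toy vocabulary: the pieces and their IDs are invented for illustration.
# A real tokeniser has a vocabulary of many thousands of learned pieces.
TOY_VOCAB = {"CAL": 0, "CU": 1, "LA": 2, "TED": 3}


def tokenise(word, vocab):
    """Split `word` into known pieces, taking the longest match each time."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece in the vocabulary matches at this position.
            raise ValueError(f"no known piece at position {start} of {word!r}")
    return pieces


pieces = tokenise("CALCULATED", TOY_VOCAB)
token_ids = [TOY_VOCAB[piece] for piece in pieces]

print(pieces)     # ['CAL', 'CU', 'LA', 'TED']
print(token_ids)  # [0, 1, 2, 3]
```

The list of numbers on the last line is what a model would actually receive. A dataset's size in tokens is simply the count of such numbers across all of its text.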
This is not the first complex project Azeez has taken on. In 2022, he built a screenshot bot on X called Tweet Shot. According to him, it was his most viral creation, with 170,000 followers.
Azeez said the bot has been acquired by "an Indian man", although he declined to share for how much.
What is next?
Azeez currently works as a Machine Learning Engineer with HelpMum, a non-profit AI startup dedicated to building solutions that support maternal and infant healthcare. Between school and his job, he barely has enough time to do the required research to take his AI skills to the next level.
However, these are not the only things that stand in his way. Building AI projects requires a lot of computing power and constant electricity that he does not have. The dataset he created, for example, required him to keep his laptop running for days.
Using the dataset he has created to train LLMs would require a very powerful graphics processing unit (GPU). While he could use a service like Google Colab to get access to high-end GPUs, he'd still need a good laptop and constant electricity for weeks.
But when it comes to building LLMs, Azeez says that is not the job of one man; it requires a team of highly skilled machine learning engineers, some of whom he says Nigeria has.
"There are Nigerians that are very skilled in these things who have gone to do their PhD abroad. I know a UNILAG graduate who built a small LLM one time."
Azeez even revealed that there is a thriving community of AI enthusiasts in Nigerian universities, a group of super-smart people who are passionate about AI and ML. Data Science Nigeria is doing a good job of feeding their passions, he said, but with intermittent power supply and a shortage of GPUs, will these passions amount to anything?