Where to get Chatbot Training Data (and what it is)
In a break from my usual 'only speak human' efforts, this post is going to get a little geeky. We are going to look at how chatbots learn over time, what chatbot training data is and some suggestions on where to find open source training data.
So, let's get a wiggle on.
What is chatbot training data?
On a fundamental level, a chatbot turns raw data into a conversation. This data is usually unstructured (sometimes called unlabelled data; basically, it is a right mess) and comes from lots of different places. A chatbot needs data for two main reasons: to know what people are saying to it, and to know what to say back.
Consider a simple customer service bot. The chatbot needs a rough idea of the type of questions people are going to ask it, and then it needs to know what the answers to those questions should be. It takes data from previous questions, perhaps from email chains or live-chat transcripts, along with data from previous correct answers, maybe from website FAQs or email replies. All of this data, in this case, is training data.
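To make that a bit more concrete, here is a minimal sketch of the customer service idea. The question and answer pairs and the `best_answer` helper are made up for illustration. The matching is plain word overlap, standing in for the machine learning a real chatbot would use.

```python
import re

# Hypothetical training data: past customer questions paired with the
# answers that resolved them (e.g. from live-chat transcripts and FAQs).
TRAINING_PAIRS = [
    ("where is my order", "You can track your order from the link in your confirmation email."),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("what are your opening hours", "We are open 9am to 5pm, Monday to Friday."),
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def best_answer(message, pairs=TRAINING_PAIRS, threshold=0.2):
    """Return the answer whose question overlaps most with the message.

    Overlap is the Jaccard similarity between the two word sets. Returns
    None when no question scores at or above the threshold, so the bot
    can fall back to a human instead of guessing.
    """
    words = tokenize(message)
    best_score, best_reply = 0.0, None
    for question, answer in pairs:
        q_words = tokenize(question)
        union = words | q_words
        score = len(words & q_words) / len(union) if union else 0.0
        if score > best_score:
            best_score, best_reply = score, answer
    return best_reply if best_score >= threshold else None


print(best_answer("I forgot my password, how do I reset it?"))
print(best_answer("Tell me a joke"))
```

The first call lands on the password answer. The second returns `None` because nothing in the training data looks like it, which is exactly the gap more (and better) training data is meant to close.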
The importance of training and good data
We have seen tens of thousands of chatbots developed. Most of them are poor quality because they either do no training at all or use bad (or very little) training data.
You see, the thing about chatbots is that a poor one is easy to make. Any nooby developer can connect a few APIs and smash out the chatbot equivalent of 'hello world'. The difficulty in chatbots comes from implementing machine learning technology to train the bot, and very few companies in the world can do it 'properly'. Chatbots are only as good as the training they are given. Knowing how to train them (and then training them) isn't something a developer, or company, can do overnight.
Finding the training data
Ok, that is a brief overview of what chatbot training is. So, where do we get the data to train the chatbot?
The first, and most obvious, is the client for whom the chatbot is being developed. With the customer service chatbot as an example, we would ask the client for every piece of data they can give us. It might be spreadsheets, PDFs, website FAQs, access to help@ or support@ email inboxes or anything else. We turn this unlabelled data into nicely organised and chatbot-readable labelled data. The chatbot then has a basic idea of what people are saying to it and how it should respond.
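For a rough feel of that unlabelled-to-labelled step, here is a small sketch. The intent names, keywords and sample messages are all invented for illustration. Simple keyword rules stand in for what is normally a mix of human review and machine learning.

```python
import re
from collections import defaultdict

# Hypothetical intents and the keywords that hint at each one.
INTENT_KEYWORDS = {
    "delivery": {"order", "delivery", "parcel", "tracking"},
    "billing": {"invoice", "refund", "charge", "payment"},
    "account": {"password", "login", "username"},
}


def label_message(message):
    """Return the intent with the most keyword hits, or 'unknown' if none hit."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    scores = {intent: len(words & keywords) for intent, keywords in INTENT_KEYWORDS.items()}
    intent, hits = max(scores.items(), key=lambda item: item[1])
    return intent if hits > 0 else "unknown"


def build_labelled_data(raw_messages):
    """Group raw, unlabelled messages under the intent assigned to each."""
    labelled = defaultdict(list)
    for message in raw_messages:
        labelled[label_message(message)].append(message)
    return dict(labelled)


# Made-up lines of the kind you might find in a support@ inbox.
raw = [
    "Where is my parcel? The tracking page is empty.",
    "I was charged twice, can I get a refund?",
    "I can't remember my password.",
    "Do you sell gift cards?",
]

for intent, messages in build_labelled_data(raw).items():
    print(intent, messages)
```

Anything that ends up under `unknown` is worth a human look, since it usually points at a question nobody planned for.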
Client data aside, there are a number of other places training data can come from. Here's a handy swipe list:
- Enron email archive (500,000 emails made public due to the scandal).
- MNIST, built from U.S. National Institute of Standards and Technology (NIST) data, is 70,000 handwritten digits.
- U.S. Fundamentals Archive is a database of five years of financial data on over 5,000 U.S. companies.
- Sports Statistics Database is historical data on lots of USA professional and college sport.
- UCI Machine Learning Repository is the go-to place for data sets spanning over 350 subjects.
- Kaggle Datasets has over 100 topics covering more random things like PokemonGo spawn locations.
- data.gov is the U.S. government's open data portal, hosting public datasets across many subjects, social sciences included.
An on-going process
Just to finish up, I want to talk briefly about how a chatbot's training never stops. When we develop a chatbot for a client we tend to train the bot in five stages:
- Warm-up training - this is where we use client data to get the chatbot as smart as we can to get started (and this is where most companies either stop or don't even start!)
- Real-time training - this is where we analyse incoming conversations in real-time. It tells us what people are saying/asking the bot, rather than what we think they will say/ask.
- Sentiment training - this is where we train language and functions based on how people are talking to the chatbot. As a basic example, if we detect a grumpy user we use different responses from those we use for a happy user.
- Effectiveness training - this is where we look at the result of conversations to train how to get people to that result. Sounds cryptic, I know. Ultimately, it means we learn what a good result of a conversation is and change language, features and responses to get more people to that result faster.
- Continuous improvement - this is where we do all of the previous training in a feedback loop. We continuously learn from user interactions, results and feedback and improve a chatbot's language, conversation flows and functions. The more complex a chatbot, the more investment there is in iteration and continuous improvement.
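As a basic illustration of the sentiment stage in the list above, here is a minimal sketch. The word lists, the reply wording and the `detect_mood` and `reply_for` helpers are all hypothetical. A production chatbot would more likely use a trained sentiment model than a hand-made word list.

```python
import re

# Hypothetical mood lexicons; real systems learn these signals from data.
GRUMPY_WORDS = {"angry", "annoyed", "terrible", "useless", "worst", "ridiculous"}
HAPPY_WORDS = {"thanks", "great", "love", "brilliant", "perfect", "awesome"}

# Hypothetical response variants for the same request, keyed by mood.
RESPONSES = {
    "grumpy": "I'm sorry about that. Let me get this sorted for you right away.",
    "happy": "Glad to hear it! Here's what you need.",
    "neutral": "Sure, here's what you need.",
}


def detect_mood(message):
    """Classify a message as 'grumpy', 'happy' or 'neutral' by counting mood words."""
    words = re.findall(r"[a-z']+", message.lower())
    grumpy = sum(word in GRUMPY_WORDS for word in words)
    happy = sum(word in HAPPY_WORDS for word in words)
    if grumpy > happy:
        return "grumpy"
    if happy > grumpy:
        return "happy"
    return "neutral"


def reply_for(message):
    """Pick the response variant that matches the detected mood."""
    return RESPONSES[detect_mood(message)]


print(reply_for("This is the worst service, totally useless!"))
print(reply_for("Thanks, that was great"))
print(reply_for("What time do you open?"))
```

The same request gets a different tone depending on the mood detected, and the word lists are the part that keeps being retrained as real conversations come in.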
So there we have it.
That was a run-through of what training a chatbot is, where to get chatbot training data and a little bit of insight into how ubisend builds world-leading chatbots, in part because of its ability to train them.
As usual, questions, comments or thoughts to my Twitter or LinkedIn.