In this week’s SlatorPod, Happy Scribe CEO André Bastié joins us to talk about building a unified platform for transcription and subtitling.
André discusses the journey to co-founding Happy Scribe during his studies where he accidentally came across the challenge of transcription and built a first prototype with his flatmate and now CTO, Marc Assens Reina.
The CEO shares how their product development has evolved, from initially deploying the Google Speech-to-Text API to connecting to various off-the-shelf systems to, now, building their own custom models. He talks about how being a bootstrapped company forces them to focus on producing results with limited resources.
André touches on the different customized features that allow users to create a vocabulary list, build their own dictionary, and adjust the number of characters per line for subtitling projects.
He gives his take on what’s driving the popularity of subtitles in short-form content and how subtitling differs between TikTok and long-form entertainment.
Subscribe on YouTube, Apple Podcasts, Spotify, Google Podcasts, and elsewhere
The CEO talks about the positives and negatives of Whisper, OpenAI’s open-source ASR model, and its impact on the AI space. The pod rounds off with Happy Scribe’s roadmap for 2023, including some interesting changes to pricing.
Transcript (Produced by Happy Scribe)
Florian: Tell us a bit more about your background and the initial origin story behind Happy Scribe.
André: My professional background is going to be rather short. I come from the Southwest of France where I went to high school, and then I went to Paris. I studied business-related topics, and after that, I went to Ireland, where I did a Masters in Ecommerce. It was half technology and business. I did one internship in London, where I was working for Intuit, which is a big US fintech. Happy Scribe started in the middle of that Masters. I was doing this qualitative research around social entrepreneurs in Ireland and I was looking at their motivation and aspiration. I would go every once in a while and interview one about what they would be doing. Once I did twelve interviews, I went back to the university and the researcher in charge came to me and said, that is great, you did all these interviews, but now you have got to transcribe them. I had never done this in my life. I go back home and I start transcribing. I did the first ten minutes and I realize this is literally going to take me ages to do it. It takes me around six, seven hours to do a one-hour interview. It is impossible that I spend 40, 50 hours transcribing interviews for this Masters. I did not have the time. It was the end of the year, final projects and everything.
I go into the living room and I start talking with my flatmate, who is now the CTO of the company, Marc. He was doing his bachelor’s thesis on computer vision and artificial intelligence. We started talking about this and we tried to find tools online, things that could help, and we could not find anything. We were students, so we could not pay external transcribers to do the job and so we ended up playing a bit with Google Speech-to-Text API that was released on the exact same day. It was not amazing, especially at that time compared to the results we are getting today, but the speech-to-text technology could help me and my research group save between 30 and 50% of the time we would spend transcribing interviews. I started using it, my research group started using it and one, two days after we started receiving emails from PhD students, researchers at the university. They were like, we heard about what you were doing with this speech-to-text, can you come into my lab and install it? Marc and I were business-oriented. We were loving the ecosystem in Dublin and everything. We locked ourselves in for 48 hours, and we started building the first version of what would become Happy Scribe with a web interface and all the back end behind it. The month after, we launched. 10, 20 people were using it because we had posted the link on some forums and everything. Maybe two days after we get an email from this journalist in the US that said, I saw your tool, it is a game changer, I am going to write an article about it. The article goes viral. We receive more than 50,000 views on the website and the website keeps crashing all the time and that was a pivotal moment. That is when we decided, okay, there is something much bigger to build from this. This is when we took the decision to start Happy Scribe and from that moment we went full on.
Florian: That is insane. That has compressed the entire startup journey in a couple of weeks. Notice the problem, find in a sense an alpha version, build it, product-market fit, and go viral with an article.
André: It was so insane and so intense at the same time. We were answering customer tickets during the night and doing our exams during the day. We learned so much in this early stage because we were seeing this industry with a completely new eye, without any previous experience in the market. We were not trying to build a transcription tool. We were trying to build a tool that will help journalists save time.
Florian: Give us the three, four, five key milestones over the past five, six years since you have been online.
André: After this happened, it took Marc and me one and a half years to understand the industry. In the first one, two years of Happy Scribe, we were in full opportunistic mode. We were attacking every single project that was coming to us. We had some TV broadcasters coming to us and asking us to do subtitles in one language with human resources. We had never done that before, and we were saying yes to everything, and we were learning, learning, learning. After these two years, we are like, doing this opportunistic approach is nice, we are learning a lot, but what do we want to do with Happy Scribe? Where do we want to bring this? What do we envision for the future? This is when we decided to focus a lot on the product, to work out the speech-to-text technology, to be able to work on the fundamental problem, which is the AI itself, the model itself. Why? We realized that we were in a position that is quite exceptional, where Happy Scribe is a product for the users, but at the same time a data giant that is processing a lot of information. The latest model of Whisper brought that back. To get very good accuracy, data is one of the most important assets that you can have. That is the strategy we took and on the product part, it was more about how we can build an all-in-one platform for language needs. We saw companies that would have transcription, translation, subtitle needs, both with human and AI. We see a lot of companies using the two services and so in the last five years, there have been different key points, but it is about building this unified platform for language needs, this one-stop shop for language services.
Florian: In terms of the market you are looking at, is it still mostly individual journalists or creators or are you also branching into more SMEs and the enterprise?
André: We started targeting individuals and recently we started to focus much more on companies and teams. We want to build a platform where collaboration happens around languages. What we realized is all our customers are not individuals, they are working as part of a team and the transcription work is not used by just one person, it is used by multiple people. Same goes for subtitling or translation. We are bringing the collaboration part in the different services that we are providing.
Florian: That must be pretty hard on the pricing side, adding seats and things like that? Is that a hard transition?
André: We are still doing the transition. Until today, our pricing, for example, was very transactional. We would charge per minute, per hour because from our knowledge, that is what makes sense looking at the industry. Now we are going to try a few things in the coming weeks including a more per-seat type of pricing for these teams.
Florian: Have you guys been bootstrapped since day one or have you raised any money from investors?
André: No, we have been bootstrapped since day one.
Florian: That is what the idea of product market fit gives you. The freedom to bootstrap.
André: Gives you the freedom to bootstrap. When you are VC funded, you build a company in a totally different way. When you are bootstrapped it forces you to focus on what brings results because you have limited resources both in terms of money and people. You never invest too much time in something if you are not guaranteed that it is going to bring results; for us, that mostly means revenue.
Florian: In a 2017 interview you said that you were using the Google Cloud Speech API. Are you still working mostly off the Google Speech API and how much secret sauce and customization goes into it? Tell us a bit more about how you add the custom layer on top of the API.
André: This is old, 2017. Things have changed since. We started with Google Speech API. At the time that was pretty much the only option apart from Amazon Transcribe. One of the points that I described as important is the multilingual aspect. We started running frequent benchmarks of all the providers that were available and started learning about more providers as well. We have worked with Speechmatics, Assembly. We have worked with a lot of them and we have benchmarked pretty much everyone. We got to a point where we were working with five different providers. Our goal was to provide the best solution for our users, not to get one partnership with one of these companies. Today we are taking a step further and we are now also working on our own models that are already available in some languages. This is going to be more the future of where it is going to come from, the speech-to-text.
Florian: Do you allow customers or subscribers to customize certain things, jargon, language? Maybe if there is noisy audio, how do you adjust for that?
André: Yes, we have a few things. Not many, but we allow people to add their own vocabularies. Let us say you are doing interviews about a given topic and there is this jargon or this company name or these acronyms that are mentioned back and forth. You can add them on this piece of product and the machine will do its best to recognize them on future files, so that is the AI part. We also have this human-made transcription service, so this more blind marketplace that we provide to our users. In there we build customer dictionaries. Over time our transcribers are storing speaker names or acronyms, anything that could be very specific to that user so we have more consistency over time. Then on the subtitling side of things, we also enable people to customize the CPS, the number of characters per second, the number of characters per line, the space between the captions, and have this on a project basis. The reason why we want to work on our own AI is that it enables us to provide much more customization to our users. One of the things that we are looking to release in the coming months is being able to store and save the parameters of any user’s voice. Let us say you upload a file, you enter on that specific file this is Florian. On a future file that you upload on Happy Scribe, because we know that this corresponds to the voice of Florian, you will not have to retype that speaker name anymore and over time that makes our users save a lot of time.
Florian: Is there a particular metric that you are using or a set of metrics in transcription for quality? If yes, what are those?
André: First, on the AI part, you have the word error rate. We are going away from it recently because it is not as accurate and it is not as representative as it used to be. In the early days, the progress was so big that you could really see the progress being made on the word error rate. Now, because we are getting into this last mile of making ASR work perfectly, you need something that is much more subjective. You need to be able to capture the difficulty of the audio. You need to be able to capture any subjectivity in the file. Anything that could be subject to interpretation. With word error rate, if the number three is in numerals and the machine is going to transcribe it in letters, it is going to count it as wrong, but it is the same thing. This is why now we are using the word correction rate. We are looking at the time that our transcribers spend and the number of words that our internal transcribers are correcting on this transcript.
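The numeral example can be made concrete with a minimal sketch of how word error rate is usually computed: the word-level edit distance between a reference transcript and the machine output, divided by the number of reference words. The function name and the sample sentences are illustrative assumptions, not Happy Scribe's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: "3" and "three" mean the same thing,
# yet the mismatch is counted as one substitution out of four words.
reference = "we interviewed 3 founders"
hypothesis = "we interviewed three founders"
print(word_error_rate(reference, hypothesis))  # 0.25
```

A transcript that a human reader would accept as correct still scores 25% here, which is the kind of mismatch that makes the raw metric less representative once systems are already accurate.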
Florian: It is similar to machine translation. There used to be this metric called BLEU score, which everybody used, but it is almost at a level where I am told it is becoming a little bit meaningless.
André: In the transcription world, you have a lot of frameworks that you use to analyze quality. What we have been working on for our human service is a framework derived from MQM, the Multidimensional Quality Metrics, to evaluate that. We stay on top of quality.
Florian: How do you think about machine translation or even human translation and what technology are you using there? In our view, that is its own giant universe of options you can plug in and 20 different APIs you can call on.
André: For machine translation, we are working with two providers that are pretty common, which are DeepL and Google Translate. We give priority to DeepL and any language that DeepL does not do is going to Google Translate. At the moment it is as simple as this. We have not planned to work our own model on this or dive much deeper into the problem. Translation is still very early in the services that we provide. We are seeing more value in foreign subtitles than in translating files or text files or transcription. We might be doing some work this year, but right now we prefer keeping it simple, so the user can select the language and get the results.
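The routing rule described here (DeepL first, Google Translate for anything DeepL does not cover) can be sketched in a few lines. This is only an illustration under stated assumptions: the language set below is a hypothetical placeholder rather than DeepL's real coverage, the function name is invented, and no real translation API is called.

```python
# Hypothetical placeholder set: NOT DeepL's actual list of supported languages.
DEEPL_LANGUAGES = frozenset({"de", "fr", "es"})


def choose_mt_provider(target_language: str,
                       deepl_languages: frozenset = DEEPL_LANGUAGES) -> str:
    """Return the provider to use for a target language code.

    DeepL gets priority; any language outside its (assumed) coverage
    falls back to Google Translate.
    """
    code = target_language.strip().lower()
    return "deepl" if code in deepl_languages else "google_translate"


print(choose_mt_provider("fr"))  # covered by the placeholder set
print(choose_mt_provider("sw"))  # not covered, so it falls back
```

Keeping the coverage set as a parameter means the fallback rule stays the same when a provider adds languages; only the data changes.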
Florian: There is this boom in video and audio content. This is driving a huge boom in transcription and subtitling. When people are consuming content on Twitter, YouTube, TikTok, some cannot listen to it because they are on public transport and they do not have their headphones on, so subtitles become important for having a piece go viral. How do you see that? Is that driving some of the sign-ups to your platform?
André: This is huge. First, it is not that they cannot listen to the audio. On a lot of these platforms, the default mode is audio off. If you want to capture the attention of someone on a video without having the audio, the only thing that you can try is captions. There are a couple of metrics that I was looking at and they are quite interesting. Around 92% of people right now watch videos with subtitles and with no audio. 69% of people watch videos with the sound off in public spaces. Every time you look at your phone on a mobile in a public space, there are 69% that will have the sound off. What we saw, over the last three years, is a growing interest from influencer agencies that are coming to us and that are subtitling almost all their content creators. The last thing is about accessibility. Sadly, I do not think it is the number one reason at the moment, but it is having a big impact to be able to subtitle for that.
Florian: How does the subtitling differ for a 30-second TikTok to a five-minute YouTube to a high-end Netflix show? What are some of the differences?
André: It is completely different because you have different formats. The videos are vertical or horizontal, so you need to follow guidelines that are very different. On a horizontal video, you have 40 characters per line, for example. On a vertical video, you have 25 to 30 characters per line. You have the format, but it all starts with your audience. When you do subtitles, you do not use the same guidelines if it is content for kids or if it is content for adults. Kids will need a lower number of characters per second, whereas adults will be able to read more characters per second. Subtitling is more an art than a science and that is where all the complexity is. What we are seeing at Happy Scribe now is that a lot of people who come to us do not understand the complexity of subtitles or think that you get the transcript and you break it randomly. You need a very deep language understanding to be able to do subtitles. You need to understand the structure of a sentence to know where to break it. It is much more complex. Doing subtitles is easy. Doing subtitles that are readable is tough.
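The two constraints mentioned here, characters per line and characters per second, can be illustrated with a naive sketch. It breaks purely on length using Python's standard `textwrap` module, which is exactly the "break it randomly" approach being criticized: it knows nothing about sentence structure. The function names and the four-second caption duration are illustrative assumptions.

```python
import textwrap


def break_into_lines(text: str, max_chars_per_line: int) -> list[str]:
    """Greedy, length-only line breaking (no linguistic awareness)."""
    return textwrap.wrap(
        text,
        width=max_chars_per_line,
        break_long_words=False,   # never split inside a word
        break_on_hyphens=False,
    )


def characters_per_second(text: str, start_s: float, end_s: float) -> float:
    """Reading speed a caption demands: characters shown per second on screen."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("caption must end after it starts")
    return len(text) / duration


caption = "Doing subtitles is easy. Doing subtitles that are readable is tough."

# 40 characters per line for horizontal video, 25 for vertical (figures from the interview).
for limit in (40, 25):
    print(limit, break_into_lines(caption, limit))

# Assumed duration: the caption stays on screen for four seconds.
print(characters_per_second(caption, start_s=0.0, end_s=4.0))
```

Every line respects the limit, but the breaks land wherever the count runs out rather than at a clause boundary, which is why length checks alone do not produce readable subtitles.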
Florian: You also offer what you call human-made translation and transcription. How much of the business is that already? How do you manage or think about going forward managing that complexity of expanding into an agency type of model versus being the pure SaaS type of business?
André: We do not manage it as an agency at all. The human service that we started providing came from the fact that we had this list. We have always received a lot of applications from transcribers, subtitlers, and translators. I created this form where people could store their information, with the message that maybe we would get back to them if we launched that service one day. What surprised me is that, when asked which company they had worked for, a third of them said Happy Scribe. I started contacting them and getting in touch. What was happening is that we had users that would do the AI transcript on Happy Scribe and then take the link, share it on Upwork and get someone to proofread it. We started launching it and we saw a lot of traction. We have customers who sometimes need to subtitle or transcribe maybe 20,000 to 30,000 minutes a month. In that case, it is very hard to proofread subtitles on your own, and so that is where we are bringing value. Now, on the way we provide the service: from the time the customer uploads the file, to the transcription, to the way we pay the transcribers, everything is automated, including the entire review process. We very rarely work on special projects, meaning that we have our set of guidelines that transcribers need to follow and we do not change them for a client. We aim to provide a solution that covers 70, 80% of what people usually need, not the 20, 30% that are specialized. To give you an example, right now we exclusively do clean-read transcripts. We do not do verbatim. For subtitling, we do 41 characters per line, we do not do less, we do not do more. We go for the industry standard and provide a solution that allows them to save time. The prioritization of the work and the queueing system is fully automated, meaning right now we have around 700 transcribers and we have one person that manages the queueing system internally.
Florian: It is enabling your customers to DIY this and giving them access to a pool of transcribers. Would they still send some protected links and have people work there?
André: The files need to be uploaded on the platform to select the language. Everything is happening on Happy Scribe, meaning that we aim to serve customers that want to access the full product suite. We have some clients that have 30 employees and all these employees are on Happy Scribe and have their workspace and different folders based on different creators they are working on. They use human transcription or AI, depending on what they need. That is where Happy Scribe is interesting in terms of business offering, is that it is this one place where they can do a request for human or do a request for AI if they do not have time and go on. It is standardized, meaning that we guarantee 99% accuracy, you get your job reworked or you are refunded if we are not able to reach it. We guarantee you 24 hours turnaround and for extra, we can even get a seven, eight hour turnaround on transcription. We want to build a service that is predictable and that is accessible 24/7.
Florian: That is where you differ very much from some of the other competitors that are either purely API or very much in this enterprise space and also heavily funded. Is that something you think long term might be part of your strategy or do you want to build this fully automated SaaS DIY platform?
André: We have a few enterprise customers, it is not a big focus. As for now, we focus on teams and more SMBs. If we go for enterprise, we would not be going for enterprise on agency mode, meaning that we would not customize models for customers. We would provide enterprise more on the product level. Happy Scribe is a deeply product company. We want to build something that is scalable.
Florian: What are your thoughts around Whisper, given that you are now building your own models? Generally, do you think it will have a lasting impact on transcription or is it another iteration in which every other week we have a new thing launching?
André: Whisper is a massive change. It is a big change of paradigm in the AI sphere. First, it is showing that having a lot of data makes a difference. Second, it is the multilingual aspect that is impressive. We work on our own model, but our own models are based on Whisper and what we add after this is the fine-tuning. One concern with Whisper is that it has been trained on so many things that were not necessarily curated. The machine started having a lot of hallucinations, would lose track of what it was transcribing and go completely nuts on some things. That is where the work that we have been doing at Happy Scribe comes in: being able to improve this model based on our data, which is much cleaner and which we have full control over. The main problem is probably going to be more about how you build a model that is aware of the current news. In the transcription space what we have seen multiple times is all these models breaking when there was a major world event. When COVID happened, it was impossible to transcribe COVID for four months, so the user had to proofread COVID by themselves. Same with the war in Ukraine. All these names that started appearing, none of the AI models were able to transcribe them. This is going to be a very interesting point: to build a model that is usable in the newsrooms or able to train constantly on data. It is also one of the reasons why we work on our own AI because for our customers this is a big pain they are having. We need to be able to provide them with an AI that is aware of the latest jargon in the world. The final thing is that in the AI space, there is still some work to do on building an unbiased model. These models are still quite biased towards men, towards not understanding minority accents and that is something where these models can make some big improvements.
Florian: How open-source is Whisper?
André: The Whisper model is completely open source. Facebook released one two years ago on speech-to-text and that was quite good. As we started working on them, there was another one by a group of researchers. These models come up frequently.
Florian: What do you think about this whole generative AI hype? How fundamental is it? How much hype versus actually something you think is going to change your business in the next 12 to 24 months is in this current hype cycle? What are your thoughts about that?
André: I am super excited about this. Say you have, I do not know, 50 transcriptions for a research project. In the past, with qualitative research, you would listen to and transcribe every single interview, and then you would go through and tag by theme what people are talking about. With ChatGPT you can get to a point where you can say, what are the main themes that people have been talking about in this interview, and can you give me the top five quotes about it? That is a huge time saver. The hype right now is on things that we will never use in the future and asking questions that no one cares about. Once you pair this with actual content that can be transcribed, I think the impact can be very big. A lot of people are transcribing meetings, and you can say, can you summarize this meeting in five bullet points? Honestly, I tried it a few times with some of our content and it is impressive. Now think of ChatGPT paired with a perfect speech-to-text engine, a working Alexa that we have all been waiting for.
Florian: For you, it is particularly exciting because your specialty is almost building the front end to this and that is what a lot of people are now scrambling to get done. It has taken a leap. Now everyone is trying to do a proper front end, on which you have a five-year head start.
André: Our top skill is building great products, building great UX, making transcription, subtitles, and translation easy for anyone. You have a communication team in a big company, you can get your subtitles in five minutes. You do not need to learn anything about it.
Florian: Tell us more about whatever you can share on the roadmap for this year. Any exciting launches you can disclose for our listeners?
André: Yes. There are probably going to be some changes in the pricing, although it is going to take a bit of time. That is going to be something that is going to be interesting. Then in products, things are being defined right now in another room. At Happy Scribe, we believe that people do not care about transcription, subtitles. It is the core of everything we do. People care about what they are going to do with their transcript and what they are going to do with their subtitles. On a product part, the vision is really to look at what people are doing one step before they get the transcription and look at what people are doing one step after they get the transcription and build this product within Happy Scribe. We work a lot with production companies, and they need to know what is going to go in the final document. How can we enable this on the platform? Same for researchers that I mentioned before. They do this analysis. How can we help you make this analysis faster? By directly working on your transcript with Happy Scribe. This is where we are looking to go.