How do you shift your data project from Prototyping to Production successfully?
By Walid Benbihi, Data Scientist at Ponicode
There are many steps between the first idea and the final prototype. To answer this question, we have to embrace not just all of those steps, but also the different levels at which prototyping happens: project management just as much as codebase management.
The AI industry is facing a critical issue: an estimated 80% of new projects are aborted before they even reach the proof-of-concept stage. This systemic failure to move prototypes into production means a massive waste of resources across the economy. We should distribute our resources better to bring innovation and wealth to companies, and we need to make sure that Data Scientists are able to reduce that waste and bring the concepts they design to life.
In this article, I will try to explain some weaknesses in Data Scientists' prototype-to-production workflow. I wish to open a discussion that pushes best practices forward in our industry. I hope we can build a comprehensive reflection on how to bring more maturity to AI projects and, ultimately, help researchers and project managers succeed systematically in their work.
The great majority of AI projects face some kind of design-to-production issue damaging their ROI. It happens at every level from the small startup to the Fortune 500 company, from AI-centric businesses to AI solution providers.
Why does this happen?
The biggest weakness is that, while designing, Data Scientists have little to no visibility into the production environment. They perform their research with a poor sense of the reality ahead of them, preferring to explore their idea in the hope of testing it rather than properly analysing, ahead of time, the limitations they might face in production.
One very common issue is related to the data quality dimension of a project. Typically, Data Scientists use datasets in the prototype that are not usable in production. Another dimension worth mentioning is the cost of production: perfectly well-designed models whose implementation is not financially feasible.
The third biggest dimension creating issues is the compliance side of a project. I have personally experienced this in large corporate environments, having previously worked in consulting within the retail industry (though I think the point holds across all company sizes, as I have heard quite a few testimonials of similar experiences in SMEs).
When I was a Project Manager, I very often faced refusals from clients when the Data Science team had designed very efficient models that were not GDPR-compliant. When such a situation occurs, you have to step back into research and redo the work. This is a direct consequence of poor forecasting.
Another example would be security flaws. Since Data Scientists are not trained as Software Engineers, they don't always have a sense of IT requirements. Further down the road, they might have their projects put on hold because they do not fit the client's security framework, and they consequently have to rework the model to improve its security level.
Every level and dimension can impact a project's capacity to go into production, and too many resources are wasted in AI projects today. It is a Data Scientist's duty to take all of these aspects into consideration and to partner with people who have the right expertise in order to guarantee the success of the project.
Which solutions are available to us?
Different new approaches and workflows are emerging to solve this, each with its own strengths and weaknesses. There is still no consensus, and to each their own: teams usually pick an approach according to their end goal.
Well, maybe you can identify some best practices regardless of the approach?
A few best practices are now becoming standard in terms of project management. In the data exploration phase, we tend not to take a feature-focused approach, but rather ask whether we will actually be able to use a given feature in production. This enables us to make better choices along the way.
As soon as you start prototyping, you should think about production. This means Data Scientists have to ask whether they will have access to the data in production. To do that, they must understand its origin at a broader level and take the time to understand the data well beyond its statistical power. Finally, they must make sure that the data is not only significant, but also accessible on the IT side and compliant with the law.
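The questions above can be turned into an explicit checklist run before prototyping begins. The sketch below is only an illustration of that idea, not an established tool; the criteria and feature names are hypothetical stand-ins for whatever a real team would track.

```python
from dataclasses import dataclass

@dataclass
class FeatureCheck:
    """Production-readiness checklist for one candidate feature."""
    name: str
    available_in_production: bool  # will the live system actually receive this data?
    collection_is_automated: bool  # or does it depend on manual entry?
    compliant: bool                # e.g. GDPR: is the data lawful to process?

    def usable(self) -> bool:
        # A feature is prototype-worthy only if it passes every criterion.
        return (self.available_in_production
                and self.collection_is_automated
                and self.compliant)

def screen_features(checks):
    """Keep only the features that survive every production constraint."""
    return [c.name for c in checks if c.usable()]

# Hypothetical candidates from an exploration phase:
candidates = [
    FeatureCheck("purchase_history", True, True, True),
    FeatureCheck("manual_ad_scores", True, False, True),   # manual Excel entry
    FeatureCheck("raw_email_content", True, True, False),  # compliance risk
]
print(screen_features(candidates))  # only "purchase_history" survives
```

Running the screen before modelling starts makes the "can we use this in production?" conversation happen at the cheapest possible moment.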
We find in the AI and Data Science world the same communication issue we already had between Developers and DevOps. This is where Data Scientist Ops begins.
It’s possibly even more of an issue since the knowledge gap between data science and deployment is much bigger than the gap between software engineering and deployment.
It is also a more complex issue than DevOps (check out our article on DevOps here 😉). DevOps is a complex topic, but code follows a very deterministic logic: we can identify a bug in deployment and trace it to its source, whereas the non-deterministic nature of Machine Learning makes the same exercise much harder. A Machine Learning mistake also tends to have a bigger negative financial impact, both because of that non-determinism and because of the longer development cycle.
Moreover, Data and Machine Learning projects have an exploratory phase that is a great source of frustration. This phase can sometimes feel endless, since we don't always know whether the problem we are trying to solve even has a solution. Hence the necessity of setting clear goals and holding to them throughout the exploratory phase.
One simple way to avoid this infinite loop, where researchers endlessly search for a solution, is to deploy anything above zero. Even a solution that does not meet the goal 100%, or even 10%, is still a solid starting point and should be pushed to production in order to set a baseline from which you can iterate. This technique is a good way to avoid a long, wasteful exploratory phase.
You also have to set the level beyond which research is no longer worth it. Let me explain: research can be infinite, because you can always push things further for one tiny extra point of performance. But at some point, the revenue or cost-saving effect of that extra point no longer balances the resources put into the research. When the value you bring is less than the resources you used to reach it, you are bringing negative value to the company. That is the line you should not cross, and it is important to set these boundaries to frame the research environment.
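This stopping rule is just a marginal cost-benefit comparison, and it can be made explicit. The sketch below is a deliberately simplified illustration; the figures and the assumption that value scales linearly per accuracy point are hypothetical, not taken from the article.

```python
def marginal_value_positive(value_per_point: float,
                            expected_gain_points: float,
                            research_cost: float) -> bool:
    """Decide whether the next research iteration is still worth funding.

    value_per_point: revenue or saving one extra accuracy point brings.
    expected_gain_points: realistic accuracy gain from the next iteration.
    research_cost: fully loaded cost of that iteration (salaries, compute).
    """
    return value_per_point * expected_gain_points > research_cost

# Early on, accuracy points are cheap to win and worth funding:
print(marginal_value_positive(50_000, 2.0, 30_000))   # True: keep researching
# Near the performance ceiling, each point costs more than it returns:
print(marginal_value_positive(50_000, 0.2, 30_000))   # False: ship and iterate
```

The hard part in practice is estimating `expected_gain_points` honestly; the rule only frames the research environment, it does not remove the judgment call.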
As I said previously, I believe that once you reach a satisfying level of performance, you should put the model into production, then incrementally improve it and gradually bring more and more value to the company.
In order to do that, you also need to be cautious about the foundations supporting your research and development. Making the right organisational choices and choosing the right tools early is essential to being able to iterate fast and efficiently.
I believe it is this cautious, step-by-step approach that our data projects deserve.
Why is this not the case today?
Well, the issue is that Data Scientists' work is driven by clients and end users who often require a minimum performance level of 85% or 90% on the prototype being delivered. It is a misunderstanding between users and makers about the workflow that brings the most value. And that is how research ends up going in a loop, because it cannot reach the performance level required at delivery.
Is this just a matter of pedagogy across departments then?
Maybe. I think many Heads of Data or Business Managers are too KPI-driven to see the greater good behind lowering the entry performance bar. But, in their defence, there certainly are cases where anything less than a 90% performance level is unacceptable.
Let's take an example: we are building a marketing model for a marketing agency, and we reach an 80% performance level. That could mean that for every 100 clients we lose 20 of them. Damaging, but not critical, right?
Now take the same performance level on a supply chain model that is supposed to detect defective parts. This is unacceptable: we cannot ship cars, trains, planes or housewares with that level of quality control.
I think the line is blurred between the cases where a lower performance level is an acceptable price for bringing great models to life, and the cases where we simply cannot compromise on performance.
The lack of ambition in research and development can jeopardise innovation when it comes to data projects.
But there is also a matter of financial losses there, don’t you think?
Of course! I mentioned Marketing as less critical than Supply Chain, but we can definitely imagine a case where a marketing employee performs a task at a cost of 15 euros per user. If you don't give that information to the Data Science team, they could deliver a model with better accuracy than a human but at a cost of 30 euros per user, which goes directly against the goal of bringing value to the company. Hence the necessity, throughout the research process, for the Data Science team to stay in close touch with the reality in which the new model will live.
I have personally learned from such a mistake on a mission with a Supply Chain team, for whom we were trying to build models that predict product sales. We found some significant data in the Marketing department that we could use to improve accuracy; it related to advertising performance. The only issue was that this data was collected manually by marketing employees and put into Excel sheets, which was not compatible with what we were trying to do.
In order to automate this process, we created templates and asked the marketing team to shift to a new workflow that would integrate our models. But this new workflow meant an extra workload for an already saturated marketing department. We consequently had to step back, and I had to give up a data source that had become core to the architecture of my model. I had not thoroughly considered the feasibility and consequences of collecting the data I needed. Lesson learned: never start prototyping without a clear vision of the deployment environment.
Is there a problem with code quality when it comes to bringing prototypes to production?
Of course. First of all, if the codebase is not clean and well documented, then the workload of Data Scientists onboarding onto it is going to be heavier. They may even reach the point of rewriting pieces of code themselves in order to keep maintainability, legacy and compliance of the codebase under control. The financial cost of this can be considerable.
Does the industry have any leads on how to solve it?
Data Science is a young area of expertise, and the issues we are discussing are just as recent. We now talk about MLOps, and a range of tools has appeared recently to help with it. Some of them focus on helping Data Scientists build prototypes that are production-friendly; I'm thinking of Dataiku, for example. I also think of recent tools that were not originally made for Data Scientists but are very useful to their work, such as Apache Airflow (for setting up data pipelines and making projects reliable and scalable).
On the other hand, some tools used by Data Scientists are unhelpful when it comes to making production-friendly prototypes. Jupyter Notebooks, a favourite of Data Scientists because of how quickly they let you prototype, make it really hard to turn a prototype into a scalable live model. They rely too much on the individual Data Scientist's quality of work and usually call for a massive refactoring. Data Scientists feel satisfied because they can build their proof of concept easily, but the tool deteriorates their capacity to shift that POC from prototype to production.
Last but not least, deep learning models are trained on graphics cards for budget and time-saving reasons, but this practice creates a gap between the training environment and the production environment. Data Science teams often fail to transition from one to the other because they did not plan ahead for this discrepancy. Or the model fails to use the full potential of the live GPU capacity, which means a financial loss caused by unused resources.
Wow, that’s a lot of information! How would you wrap this up, Walid?
I think it’s an interesting topic because there is an urgency to discuss it and face the flaws in our workflows. As I said earlier, every Data team should carry out this introspective work and build their own set of best practices to follow in order to reduce waste of resources. Let’s build our industry standards and systematically bring our prototypes to life with success.
If you want to learn more about what a day in Ponicode as a Data Scientist is like, you can check out this interview with Edmond Aouad and Hamza Sayah.