Deploy a Successful AI Strategy // Part 3: Data-Infrastructure
Every engine needs fuel to do its job reliably. Machine learning and artificial intelligence are no different: every model needs high-quality data to produce reliable results.
For this reason, it is essential to build a high-quality data infrastructure. It lets models produce better results with fewer errors. In essence, the data infrastructure should deliver data as output and, in doing so,
- eliminate or smooth missing data values (e.g. by filling in the median or arithmetic mean for numeric values)
- perform a certain amount of feature engineering (identifying which features have a large impact and which have less)
- prepare data values so that model development can focus on modeling itself. As it stands today (2020), about 80% of the work effort is spent on data preparation and only the remaining 20% on model development.
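The first requirement above, filling in missing numeric values, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommendation for a specific tool. The helper `impute_median` and the sample `records` are hypothetical and exist only for this example.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of `rows` where missing (None) values in `column`
    are replaced by the median of the observed values in that column."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"column {column!r} has no observed values")
    fill = median(observed)
    # Build new dicts so the raw input stays untouched.
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical sample data with gaps (None marks a missing value).
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]

# Impute each numeric column in turn.
cleaned = impute_median(impute_median(records, "age"), "income")
```

Returning new rows instead of modifying the input keeps the raw data available, which makes it easier to compare different imputation strategies (median versus mean, for example) later on.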
A second essential feature of an efficient infrastructure is the standardized provision of models. It is advisable to develop a guideline for the whole organization to follow. Here, too, the models must meet some essential requirements:
- Reproducibility: models are packaged as Docker containers and can therefore be deployed on diverse (cloud) infrastructures. Containers also make the model more reliable, since the same runtime environment, libraries, etc. are always used.
- Standardized provision: It is advisable to develop an enterprise standard here (e.g. REST APIs). Standards are important because they reduce technical debt and support modularity.
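To make the idea of a standardized REST interface more concrete, here is a minimal sketch of a prediction endpoint built with Python's standard library. It rests on several assumptions: the `/predict` path, the `{"features": [...]}` request format, the port, and the placeholder `predict` function are all hypothetical, and a real enterprise standard would define its own contract and most likely use a production-grade web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder for a real model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        features = json.loads(body)["features"]
        valid = (
            isinstance(features, list)
            and len(features) > 0
            and all(
                isinstance(x, (int, float)) and not isinstance(x, bool)
                for x in features
            )
        )
        if not valid:
            raise ValueError("features must be a non-empty list of numbers")
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": [1.0, 2.0]}'}
    return 200, {"prediction": predict(features)}


class ModelHandler(BaseHTTPRequestHandler):
    """Exposes the model under POST /predict (hypothetical path)."""

    def do_POST(self):
        if self.path != "/predict":
            status, result = 404, {"error": "unknown endpoint"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the server; inside a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), ModelHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because every model exposes the same request and response shape, consumers do not need to know which model or library runs behind it, which is where the modularity comes from.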
As a final point, cloud computing and data protection should not go unmentioned. The last 10 years have shown that renowned cloud providers are increasingly replacing the classic functions of a company's IT. Data storage, however, is a very sensitive area, not least because of the GDPR and CCPA. Depending on your industry, business model and type of data stored, you need to carefully weigh whether data storage in the cloud is an option or not.
Key Takeaways
- What technologies should be used to implement the data infrastructure?
- Is the expertise available within the company, or will it be necessary to turn to cloud providers?
- Does decentralization of the infrastructure play a role (legal aspects, global deployment, etc.)?