
How Much Docker Should a Data Scientist Know?


The best answers are clearly “some”, “it depends”, or begin with “Well…”. Let’s take a deeper dive and try to understand where and how Docker is being shoehorned (read: shoved) into a data scientist’s daily work, and look at how open source Buildpacks can help data scientists.

The Culprit: Titles

Before we dive into the specifics of containerization, it’s important to understand the different roles typically found among data specialists and what they usually entail. These include data scientist, data analyst, and data engineer.

A data analyst typically focuses on exploring and analyzing existing data to extract insights and communicate findings. Their work often involves data cleaning, visualization, statistical analysis, and reporting. Tools commonly include SQL, Excel, BI tools (Tableau, Power BI), and sometimes Python/R for scripting and basic modeling.

The data scientist builds models and algorithms, often using advanced statistical methods and machine learning. They are involved in the entire process, from data collection and cleaning to model building, evaluation, and sometimes deployment. Their toolkit is extensive, including Python, R, various ML frameworks (TensorFlow, PyTorch, scikit-learn), SQL, and increasingly, cloud platforms.

The data engineer is a newer role. This persona designs, builds, and maintains the infrastructure and systems that allow data scientists and analysts to access and use data effectively. This involves building data pipelines, managing databases, working with distributed systems (like Spark), and ensuring data quality and availability. Their skills lean heavily toward software engineering, databases, and distributed systems.

Are data engineers just DevOps folks in data science garb? (metamorworks/Shutterstock)

What do these titles mean? Are data engineers just DevOps folks in data science garb?

While there is certainly significant overlap, and data engineers often apply many DevOps principles and tools, it’s not entirely accurate to call them DevOps folks. Data engineers have a deep understanding of data structures, data storage and retrieval, and data processing frameworks that goes beyond typical IT operations. However, as data infrastructure has moved to the cloud and embraced principles like Infrastructure as Code and CI/CD, the skills required for data engineering have converged considerably with those of DevOps.

Lateral Shifts: The Rise of MLOps

This convergence is perhaps most evident in the emergence of MLOps.

MLOps can be seen as the intersection of machine learning (ML), DevOps, and data engineering. It is about applying DevOps principles and practices to the machine learning lifecycle.

MLOps is about putting data science artifacts into production. These can be models, pipelines, inference endpoints, and more. The goal is to reliably and efficiently deploy, monitor, and maintain machine learning models in production environments.

In addition to typical DevOps tooling, MLOps has a specific focus and requires several additional tools. It is like a new vertical industry where DevOps tools are applied. While MLOps leverages core DevOps concepts like CI/CD, monitoring, and automation, it also introduces tools and practices specific to machine learning, such as model registries, feature stores, and tools for tracking experiments and model versions. This represents a specialization within the broader DevOps landscape, tailored to the unique challenges of deploying and managing ML models.
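To make the idea of experiment tracking concrete, here is a minimal, hypothetical sketch of the kind of record an experiment tracker or model registry manages at scale. It uses only the Python standard library; the function and field names are illustrative, not any particular tool’s API.

```python
# Minimal sketch of experiment tracking: each training run is logged as
# a record of parameters in and metrics out, with a stable run id.
import hashlib
import json

def log_run(params, metrics, registry):
    """Append one training run to an in-memory 'registry' list."""
    # Derive a stable run id from the hyperparameters so the same
    # configuration always maps to the same id.
    run_id = hashlib.sha1(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:8]
    registry.append({"run_id": run_id, "params": params, "metrics": metrics})
    return run_id

registry = []
run_id = log_run({"lr": 0.01, "epochs": 10}, {"accuracy": 0.93}, registry)

# A registry makes questions like "which run was best?" trivial to answer.
best = max(registry, key=lambda r: r["metrics"]["accuracy"])
```

Real MLOps tools add storage, artifact versioning, and UIs on top of this core idea, but the record being tracked looks much like the dictionary above.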

Enter Kubernetes

Over the past few years, Kubernetes has become an integral part of cloud-native computing and the gold standard for container orchestration at scale. It provides a robust and scalable way to manage containerized applications.

(Mia-Stendal/Shutterstock)

Kubernetes is the mainstay of the DevOps world. It offers significant advantages in terms of scalability, resilience, and portability, making it a popular choice for modernizing infrastructure. This adoption, driven by the engineering and operations side, inevitably affects other roles that interact with deployed applications.

This forces knowledge of containers, Docker, and a whole lot of other tools onto data scientists. As ML models are increasingly deployed as microservices within containerized environments managed by Kubernetes, data scientists need to understand the basics of how their models will run in production. This usually starts with understanding containers, and Docker is the most prevalent containerization tool.
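For illustration, a minimal Dockerfile for packaging a Python model service might look like the sketch below. The file names, base image, and port are assumptions for the sketch, not a prescription:

```dockerfile
# Sketch: package a Python inference service into an image.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached across
# rebuilds when only the source code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Hypothetical model artifact and serving script.
COPY model.pkl serve.py ./

EXPOSE 8080
CMD ["python", "serve.py"]
```

Such an image would typically be built with `docker build -t model-service .` and run with `docker run -p 8080:8080 model-service`. Even this short file assumes familiarity with layer caching, image registries, and port mapping, which is exactly the learning curve in question.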

How does learning a new DevOps tool compare to learning, say, Microsoft Excel? It’s a vastly different beast. Learning Excel is about mastering a user interface and a set of functions for data manipulation and analysis within a structured environment. Learning a DevOps tool like Docker, or understanding Kubernetes, involves grasping concepts related to operating systems, networking, distributed systems, and deployment workflows. It is a significant step into the world of infrastructure and software engineering practices.

Let’s look at the stages of an ML pipeline and where containers fit in:

  • Data Preparation (collection, cleaning/pre-processing, feature engineering): These steps can often be containerized to ensure consistent environments and dependencies.
  • Model Training (model selection, architecture, hyperparameter tuning): Training jobs can be run in containers, making it easier to manage dependencies and scale training across different machines.
  • CI/CD: Containers are fundamental to CI/CD pipelines for ML, allowing automated building, testing, and deployment of models and associated code.
  • Model Registry (storage): While the registry itself might not be containerized by the data scientist, the process of pushing and pulling model artifacts often integrates with containerized workflows.
  • Model Serving: This is a major use case for containers. Models are typically served inside containers (e.g., using Flask, FastAPI, or dedicated serving frameworks) for scalability and isolation.
  • Observability (usage load, model drift, security): Monitoring and logging tools often integrate with containerized applications to provide insight into their performance and behavior.
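To make the model-serving stage concrete, here is a minimal sketch of an inference endpoint using only the Python standard library. The weights and request shape are placeholders; production teams would typically use Flask, FastAPI, or a dedicated serving framework inside the container.

```python
# Minimal sketch of a model-serving endpoint. The "model" is a
# placeholder linear scorer standing in for a real trained artifact.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHTS = [0.4, 0.6]  # hypothetical learned weights

def predict(features):
    """Score a feature vector with the placeholder model."""
    return sum(w * x for w, x in zip(WEIGHTS, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and return a JSON score.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# Inside a container, the server would bind the port exposed in the
# image, e.g.:
#   HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

Wrapping this script in an image is what turns a notebook experiment into a scalable, isolated microservice that Kubernetes can schedule and replicate.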

A Whole Sea of Non-Containerized Workloads


Despite the push toward containerization, it’s important to recognize that there exists a whole sea of non-containerized workloads in data science. Not every task or tool immediately benefits from or requires containerization.

These could be tools or entire platforms, often running locally, but also in production.

Some concrete examples of non-containerized workloads in a data science pipeline are:

  • Initial data exploration and ad-hoc analysis: Often done locally in a Jupyter notebook or IDE without any need for containerization.
  • Using desktop-based statistical software: Tools like SPSS or SAS, while powerful, are not typically run in containers for interactive analysis.
  • Working with large datasets on a shared cluster without container orchestration: Some organizations still rely on traditional cluster computing, where jobs are submitted and run without explicit containerization by the end user.
  • Simple scripts for data extraction or reporting that run on a schedule: For simple tasks without complex dependencies, a plain script executed by a scheduler may suffice without container overhead.
  • Older legacy systems or tools: Not all existing data infrastructure is container-native.

The Problem

With such a spread of non-containerized options available, and convenient, data scientists tend to gravitate toward them. Containers represent a cognitive overload: another technology to study, another mastery to pursue.

That said, containers can improve several things for data science teams. Inconsistencies between environments, which can be a large source of toil, can be ironed out. Containers can prevent dependency conflicts between different environments, local or staging. Reproducible, portable builds and served models are capabilities data scientists would love to have.

Not all data teams can afford to have large, competent, or economical operations teams at their beck and call. The Iron Triangle all over again.

Cloud Native Buildpacks: A Clean Solution to a Messy Problem

Data scientists frequently use diverse toolchains involving languages like Python or R together with a myriad of libraries, leading to complex dependency management challenges. Operationalizing these artifacts often requires deftness and container acrobatics in the form of manually stitching together and maintaining intricate Dockerfiles.

Buildpacks really change the game here. They help assemble the required build-time and run-time dependencies and create OCI-compliant images without explicit Dockerfile instructions. This automation reduces the operational burden on data scientists, not to mention the cognitive liberation, allowing them to concentrate on analytical tasks.
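As a sketch of the workflow, building a Python project with the `pack` CLI requires no Dockerfile at all. The application name and the Paketo builder shown here are illustrative choices, not the only options:

```shell
# Build an OCI image straight from source; the buildpack detects the
# language and assembles build and runtime dependencies automatically.
pack build model-service --builder paketobuildpacks/builder-jammy-base

# The result runs like any other container image.
docker run -p 8080:8080 model-service
```

The builder selection is the main decision left to the team; everything the Dockerfile used to encode (base image, dependency installation, entrypoint detection) is handled by the buildpacks.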

Cloud Native Buildpacks is a CNCF incubating project. The open source tool is maintained by a team spread across several organizations and finds tremendous use in the MLOps space. Check out the list of adopters in adopters.md and get started from the GitHub repo.

About the author: Ram Iyengar, developer advocate for the Cloud Foundry Foundation (part of the Linux Foundation), is an engineer by practice and an educator at heart. Along his journey as a developer, Ram transitioned into technology evangelism and hasn’t looked back. He enjoys helping engineering teams around the world discover new and creative ways to work.

Related Items:

Is Kubernetes Really Necessary for Data Science?

Kubernetes Best Practices: Blueprint for Building Successful Applications on Kubernetes

Is Kubernetes Overhyped?

 

 
