Big Data: The qualities of a good data juggler
Posted: Wed Jan 08, 2025 4:24 am
Senior Data Scientist and co-director of the Master of Data Science at KSchool
A data scientist who wants to be autonomous in handling Big Data must be at least part Data Engineer. In other words: it is not enough to know or learn statistical and machine learning techniques; you also need to be a bit of a system administrator. Knowing how to handle Linux, the basics of cryptography, and how to connect to remote machines is the foundation of both cloud and on-premises Big Data systems.
With this foundation, the data scientist can take on the more specialized knowledge she needs. The key is to understand how distributed computing integrates the work of multiple unreliable machines into a harmonious whole that can recover from occasional failures, and the trade-offs this implies. The CAP theorem is one illustration: a distributed data store cannot guarantee consistency, availability, and partition tolerance all at once.
The MapReduce computing paradigm is another concept that illustrates how the move to distributed systems requires a change of mindset. At KSchool we prepare our students for this by teaching them to use both Amazon Web Services and Google Cloud Platform to run jobs on Spark, the reference Big Data tool.
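To make the MapReduce idea concrete, here is a minimal single-machine sketch of the classic word count in plain Python. It is not Spark or Hadoop code, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are made up for illustration. On a real cluster the map and reduce calls would run on different machines and the framework would handle the shuffle and any failures; here everything runs in one process just to show the shape of the computation.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped independently, on a different machine.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could likewise be reduced independently.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, invented for this example.
documents = ["big data needs big clusters", "data scientists handle data"]
counts = word_count(documents)
print(counts)
```

The change of mindset is that neither map_phase nor reduce_phase shares any state with other calls, which is what lets a framework spread them over many machines and simply rerun a piece if one machine fails.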