Data scientists must be able to build and run code in order to create models. The most popular programming languages among data scientists are open source tools that include or support pre-built statistical, machine learning and graphics capabilities. These languages include:
R: An open source programming language and environment for statistical computing and graphics, R is one of the most popular programming languages among data scientists. R provides a broad variety of libraries and tools for cleansing and prepping data, creating visualizations, and training and evaluating machine learning and deep learning algorithms. It’s also widely used among data science scholars and researchers.
Python: Python is a general-purpose, object-oriented, high-level programming language that emphasizes code readability through its distinctive use of significant white space (indentation defines code blocks). Several Python libraries support data science tasks, including NumPy for handling large multidimensional arrays, pandas for data manipulation and analysis, and Matplotlib for building data visualizations.
For a deep dive into the differences between these languages, check out "Python vs. R: What's the Difference?"
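As a rough illustration of how the three Python libraries named above divide the work, here is a minimal sketch. The measurements, the column names (x, y, ratio) and the variable names are made up for the example and do not come from any real dataset; the non-interactive "Agg" backend is chosen only so that the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: hold made-up measurements in a 2-D array (rows = samples, columns = features).
measurements = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.2],
])

# pandas: wrap the array in a labeled table, derive a column, and summarize.
df = pd.DataFrame(measurements, columns=["x", "y"])
df["ratio"] = df["y"] / df["x"]
summary = df[["x", "y"]].mean()

# Matplotlib: draw a scatter plot and render it to an in-memory PNG.
fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"])
ax.set_xlabel("x")
ax.set_ylabel("y")
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

In practice the array would usually be loaded from a file or database rather than typed in, and the figure would be saved to disk or shown in a notebook; the split of responsibilities between the libraries stays the same.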
Data scientists need to be proficient in the use of big data processing platforms, such as Apache Spark and Apache Hadoop. They also need to be skilled with a wide range of data visualization tools, including the simple graphics tools included with business presentation and spreadsheet applications, purpose-built commercial visualization tools like Tableau and Microsoft Power BI, and open source tools like D3.js (a JavaScript library for creating interactive data visualizations) and RAWGraphs.