Data Science & Machine Learning: Where to Start with Python
In this blog, we will talk about how you can use Python for practical applications of Data Science and Machine Learning. If you are looking for basic machine learning concepts, I would suggest reading this previous introductory blog and then coming back here.
A lesson in history
Today the whole world is talking about AI and machine learning. But we must remember that although these terms are used far more now than in the past, the underlying ideas have existed for over 50 years.
So what exactly is the current hype about, if we’ve had the math, the logic, the field of data science, the concepts of artificial intelligence and machine learning, and relevant programming languages like Python for several decades?
Timing is everything!
At the inception of these concepts and technologies, the problems we faced as a civilization were different, and they required different attention and treatment. The internet as we know it didn’t exist, and our entire economy wasn’t based on bits and bytes as it is today. As a result, few gave these topics serious consideration. There were many questions about the processing power required to make them work, and frankly they didn’t align with the needs of businesses at the time. What has driven the change is (1) the digitization of the modern global economy, (2) the amount of data being produced, (3) corresponding advancements in processing power thanks to GPUs and Moore’s law, and (4) easy, low-cost access to enormous infrastructure with many worker nodes thanks to cloud computing.
So while Python is changing the data science landscape, it is really a broader coming together of economic evolution with the ease and power of computing.
Python is a versatile programming language: it helps you solve small mathematical problems, and it also scales up to advanced data science computation. Before we talk about advanced examples, here are some simple problems it handles well: creating a mathematical series, computing the remainder of a division, joining two text strings, or parsing a CSV file. Future articles on this topic will delve into the details. Here I will keep things at a beginner level and also offer some food for thought, so you know what is in store for future blogs.
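To make those four simple problems concrete, here is a minimal sketch that uses only what ships with Python. The numbers, the strings, and the tiny CSV text are made up for illustration.

```python
import csv
import io

# A mathematical series: the first five square numbers.
squares = [n * n for n in range(1, 6)]

# The remainder of a division, using the modulo operator.
remainder = 17 % 5

# Joining two text strings.
phrase = "data" + " " + "science"

# Parsing CSV text. io.StringIO stands in for a real file opened with open().
csv_text = "day,kwh\nMon,12.5\nTue,9.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(squares)
print(remainder)
print(phrase)
print(rows[0]["day"], rows[0]["kwh"])
```

Note that the csv module reads every value as text, so a number such as 12.5 has to be converted with float() before you do arithmetic on it.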
Python comes with a rich set of libraries for a wide variety of needs. Here is how you would install the common libraries you would need for data science and machine learning:
- pip install numpy
- pip install pandas
- pip install scikit-learn
- pip install tensorflow
- pip install matplotlib
It is worth learning these libraries: once they are installed, you have the tools you need to do data science with Python.
Know your library and the problems it can solve
I will only briefly mention what these toolkits or libraries do. A more exhaustive treatment is left for you to google, or you can wait for further articles.
numpy is an array and matrix manipulation toolkit. It can be used for random integer generation and for the array and matrix arithmetic that we often use in data science. We also deal with series a lot, since a mathematical sequence of numbers is fundamental to data science logic.
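As a rough sketch of the numpy features just mentioned (random integers, array arithmetic, matrix arithmetic, and a simple series), something like the following would work. The values and the seed are arbitrary choices of mine.

```python
import numpy as np

# A seeded generator makes the "random" integers reproducible.
rng = np.random.default_rng(seed=0)
dice = rng.integers(low=1, high=7, size=5)  # five rolls, each from 1 to 6

# Element-wise array arithmetic, with no explicit loop.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
total = a + b

# Matrix arithmetic: the @ operator is matrix multiplication.
m = np.array([[1, 2], [3, 4]])
identity = np.eye(2, dtype=int)
product = m @ identity  # multiplying by the identity leaves m unchanged

# A series: the even numbers from 0 up to (but not including) 10.
evens = np.arange(0, 10, 2)

print(dice)
print(total)
print(product)
print(evens)
```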
One important subset of data science is time series data: visualizing, say, the power consumption of your home over the past month. Did you use more electricity on weekends or on weekdays? What was the minimum usage when nobody was at home? Or you could look at the goals scored by a certain football (soccer) player over the last year in all international matches. In all such problems we put time on the X axis, and the Y axis changes based on what we plot. In our examples, it is kWh (a unit of energy consumed) or the number of goals.
pandas is a very powerful Python library for data manipulation. It helps you work with date and time ranges, so you can handle data analysis problems like the ones in the example above without much trouble.
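Here is a small sketch of how the weekday versus weekend electricity question could be answered with pandas. The fourteen daily readings are invented for illustration, and the variable names are my own.

```python
import pandas as pd

# Hypothetical readings in kWh: one value per day for two weeks.
# 2024-01-01 is a Monday.
days = pd.date_range("2024-01-01", periods=14, freq="D")
readings = [10, 11, 9, 10, 12, 18, 20, 10, 9, 11, 10, 12, 19, 21]
usage = pd.Series(readings, index=days, dtype=float)

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = usage.index.dayofweek >= 5

weekday_mean = usage[~is_weekend].mean()
weekend_mean = usage[is_weekend].mean()
lowest_day = usage.idxmin()  # the date with the minimum usage

print(weekday_mean, weekend_mean)
print(lowest_day.date(), usage.min())
```

In this made-up data the weekend average comes out higher than the weekday average, which is the kind of pattern the questions above are probing for.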
matplotlib is a plotting library commonly used alongside Python and numpy. Its pyplot module is the most relevant one for our purposes, and you can use it to visualize the examples above.
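A minimal pyplot sketch for the electricity example might look like this. The readings are hypothetical, and I select the non-interactive "Agg" backend so the script runs even without a display. In a notebook or desktop session you could drop that line and call plt.show() instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical daily readings for one week, in kWh.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
kwh = [10, 11, 9, 10, 12, 18, 20]

fig, ax = plt.subplots()
ax.plot(days, kwh, marker="o")  # time on the X axis, usage on the Y axis
ax.set_xlabel("Day")
ax.set_ylabel("Energy used (kWh)")
ax.set_title("Home electricity usage")

# Write the chart as a PNG into memory.
# Passing a filename to savefig would save it to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```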
Sklearn, aka scikit-learn, is a machine learning project in Python that is finding many applications in training computers to learn from data. Concepts like supervised and unsupervised learning are key to using sklearn. Sklearn is used for traditional machine learning and works well for many scenarios. A typical example is a system that recommends movies to you based on your watch history and likes.
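To show the difference between supervised and unsupervised learning, here is a toy sketch with sklearn. The movies, their two scores, and the "liked" labels are all invented. The choice of a nearest-neighbors classifier and k-means clustering is mine, picked only because both are short to demonstrate. This is not how a production recommender is built.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical movies described by two made-up scores: [action, romance].
movies = [[9, 1], [8, 2], [7, 1], [1, 9], [2, 8], [1, 7]]
liked = [1, 1, 1, 0, 0, 0]  # 1 = the viewer liked it, 0 = did not

# Supervised learning: the labels in `liked` guide the training.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(movies, liked)
prediction = classifier.predict([[8, 1], [2, 9]])

# Unsupervised learning: no labels, the algorithm groups similar movies itself.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = clusterer.fit_predict(movies)

print(prediction)
print(groups)
```

The classifier predicts a like for the action-heavy movie and a dislike for the romance-heavy one. The clusterer separates the same two kinds of movie without ever seeing the labels, although which group gets number 0 or 1 is arbitrary.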
You might have heard the term neural networks. A neural network is a technique, most often used for supervised learning, that attempts to mimic layers of human neurons, with the aim of making computers learn the way humans do. Though sklearn supports simple neural networks, most practical scenarios require deep learning, which is neural networks organized in many layers to get good results. Do you want your self-driving car to distinguish between a green light and a red light? Yes, I think you do, and I do as well! But for deep learning, instead of sklearn you will need another library called tensorflow, introduced below. While we are on machine learning buzzwords, reinforcement learning is another advanced technique that is becoming quite popular. Sklearn does not cover it, so you would reach for a library such as tensorflow instead. A reinforcement learning example: what if you want your computer to start playing tic-tac-toe from scratch and learn a strategy to beat you?
Tensorflow is a toolkit by Google for sophisticated machine learning problems that use deep learning, meaning problems that require a fair degree of computation and large amounts of data, and that are somewhat complex in nature. A tensor is a multidimensional array, and flow signifies that the inputs flow through a graph of operations before we arrive at the output tensors.
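The two halves of the name can be seen in a very small sketch. It assumes a TensorFlow 2.x installation with the bundled Keras API. The layer sizes are arbitrary and the network is untrained, so only the shapes are meaningful here, not the output values.

```python
import tensorflow as tf

# A tensor is a multidimensional array; this one has 2 rows and 3 columns.
inputs = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# A tiny, untrained network: the input flows through two layers in turn.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),  # 3 inputs -> 4 hidden values
    tf.keras.layers.Dense(1),                     # 4 hidden values -> 1 output
])

# Calling the model runs the forward pass. The weights are random at this point.
outputs = model(inputs)

print(inputs.shape)
print(outputs.shape)
```

Each of the two input rows produces one output value, so the output tensor has 2 rows and 1 column.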
Tensorflow has major differences from sklearn. For now, suffice it to say that tensorflow starts where sklearn stops. In Python, tensorflow can easily be used along with the other mathematical toolkits we saw above, like numpy and pandas. It is also one of the most popular toolkits for machine learning, and you will find that GitHub has plenty of projects built around tensorflow.
Python, Data Science and Me
My first exposure to Python was two decades ago, in my CS undergrad days. In the following 15 years I stepped away from it and programmed only in Java, C and C++. In the last 5 years I switched back to Python because I started working on Data Science and Machine Learning to implement User and Entity Behavior Analytics (UEBA) at my previous startup. Then last year I founded Query.AI, where we have implemented an NLP (natural language processing) interface so users can ask data questions in plain English and get the answers they want in a much more intuitive way. Python was again a great choice!
Hopefully you found this a good introduction on Python’s relevance and how you can start to use it for data science and machine learning. We talked about key libraries like sklearn and tensorflow, and various other helper libraries that can help solve data science and machine learning problems from your everyday life.
Python has advanced programming features like generators, iterators, and lambdas, but future articles will mostly target the intermediate-level programmer: neither too basic nor too advanced.
Data science is the precursor to machine learning, so initially we will mostly concern ourselves with problems and techniques, without going into advanced coding. Our approach will use a modicum of math and logic. Nothing too fancy. Our future articles will include code samples for both sklearn and tensorflow.
Have questions, comments, or suggestions for other topics you’d like us to cover? Please leave them below. We’d love to hear from you!
Don’t forget to visit www.query.ai and subscribe to our updates.