Data analysis in Python: a step-by-step approach

Python is a flexible, general-purpose language that has steadily gained credit in the data analysis community over the years. Unlike languages such as R, Matlab or Julia, Python was not conceived specifically for data analysis or for scientific and numerical tasks in general. This can be considered an advantage, though, because with Python you can do just about anything.

Statistics show that in 2020 around 66% of data scientists were using Python on a daily basis and 84% used it as their main language. It is also worth noting that a huge and very active community has grown up around Python, so if you have a problem or want to collaborate, it’s easy to find someone to work with. But how do you perform data analysis in Python? Is there something specific (apart from Python itself, obviously) that you should master? Let’s go through it step by step in this quick guide.

The basics first: if you don’t know any Python and/or any data science, start here

Of course, if you don’t know any Python but you do know how to program, you should dedicate some time to learning the basics of the language. Python is quite an easy language to pick up: its syntax is not complicated, and if you have some coding background you can learn it very quickly.

Because Python is so widely used, there are plenty of tutorials, exercises, books (including free ebooks) and videos that you can use to learn what you need. Bear in mind that, to do data science with Python, you don’t need to be a Python pro: unless you need it for other purposes, you won’t have to dig into its more obscure corners. The following are some basic courses and resources to learn all the Python you need:

 

You also need to build up your competences in data science itself, because otherwise it would be like having a tool and not knowing what to do with it. So you’ll have to develop some statistics and data visualisation skills, and gather a certain amount of knowledge about the domain you are going to mine and analyse.

If you need a primer in statistics and data analysis (not tied to any programming language), try the course on Probability Theory, Statistics and Exploratory Data Analysis by the HSE University.
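To give a flavour of the kind of statistical reasoning involved, here is a minimal sketch that uses only Python’s built-in `statistics` module. The numbers are invented for illustration; the point is simply that the mean and the median can tell different stories when a dataset contains an outlier.

```python
import statistics

# Invented sample: daily page views for one week, with one unusual day (410).
views = [120, 135, 128, 410, 131, 125, 138]

mean_views = statistics.mean(views)      # pulled upwards by the outlier
median_views = statistics.median(views)  # middle value, robust to the outlier
spread = statistics.stdev(views)         # sample standard deviation

print(f"mean={mean_views:.1f} median={median_views} stdev={spread:.1f}")
```

Knowing which of these summaries to trust in a given situation is exactly the kind of skill a statistics primer helps you build.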

Python libraries: the essential ones

You should think of libraries as sets of ready-to-use tools that someone else has developed to make certain coding tasks easier. Instead of having to write a function that performs a certain operation yourself, you can simply turn to a library and use a ready-made one. The wonderful thing about Python is that, because it is so widespread in the data analysis community, there are really powerful dedicated libraries that you can use for your data analysis problems. Furthermore, each library comes with a lot of documentation. The main libraries for data science are:

–       NUMPY

NumPy stands for “Numerical Python”. It offers fast, pre-compiled functions for numerical routines, built around multi-dimensional arrays.

–       PANDAS

Pandas is perfect for data analysis, manipulation and quick visualisation. It provides high-level data structures, such as the DataFrame, and tools to manipulate them.

–       MATPLOTLIB

Excellent for data visualisation. It can export graphics and other images to vector formats.

–       SCIPY

SciPy is for scientific computing: linear algebra, statistics, optimisation and more.

–       SEABORN

Seaborn is focused on statistical data visualisation. It is built on top of Matplotlib and works well with both NumPy and Pandas.
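As a rough illustration of how two of these libraries are typically used together, the sketch below builds a small NumPy array and a Pandas DataFrame. The values, the city labels and the column names (`city`, `temp_c`) are made up for the example, and it assumes NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applied to a whole array at once, with no explicit loop.
temps_c = np.array([18.5, 21.0, 19.5, 25.0])  # invented temperatures in Celsius
temps_f = temps_c * 9 / 5 + 32                # element-wise conversion to Fahrenheit

# Pandas: a labelled table (DataFrame) whose columns are backed by arrays.
df = pd.DataFrame({
    "city": ["Milan", "Milan", "Rome", "Rome"],  # hypothetical labels
    "temp_c": temps_c,
})

# Group the rows by city and compute the average temperature of each group.
mean_by_city = df.groupby("city")["temp_c"].mean()

print(temps_f)
print(mean_by_city)
```

Writing the same conversion and grouping by hand with loops and dictionaries is possible, but the library versions are shorter and usually faster.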

The main data science libraries can all be used from the Jupyter Notebook (distributions such as Anaconda ship them together with it). Jupyter is a really useful tool that you can also use for collaboration, since it is a web application. You can use it to create (and share) documents that contain text, code, its documentation, equations and graphics. So learning how to use the Jupyter Notebook may be a smart move.

Now you need to practise a little on real datasets. Fortunately, there are various repositories on the internet (like Kaggle or Dataquest) where you can find and freely download datasets and learn how to manipulate data.
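A first exploration of a downloaded dataset often looks something like the sketch below. To keep it self-contained, a small in-memory CSV stands in for the downloaded file; its column names and values are invented. With a real dataset you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file; all values are invented for illustration.
csv_text = """name,age,score
Ann,34,88.5
Bob,,72.0
Cleo,29,91.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# First look: size of the table, missing values, basic summary statistics.
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
summary = df.describe()

# One simple (and debatable) way to handle a gap: fill it with the column median.
df["age"] = df["age"].fillna(df["age"].median())

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
print(df)
```

Filling with the median is only one possible choice; depending on the dataset, dropping the incomplete rows or leaving the gap in place may be more appropriate.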

Useful courses and other resources

After you’ve learnt the basics, you can take a course specifically dedicated to using Python for data science, or read some useful books and tutorials on the topic. You can find many excellent courses on the internet (on Coursera or Udemy, for example), but if you really want to give your career a boost, the best option is to enrol in a full master’s programme, which also gives you some follow-up support after the course itself has finished.

Talent Garden, for example, offers a Data Science and AI Master that, as the name suggests, doesn’t stop at learning Python for data analysis but goes further, into AI and Machine Learning technologies. It also offers help in developing a portfolio, assessing your skills against the demands of the labour market and even writing your CV and cover letter.

If, on the other hand, you want to study data analysis with Python on your own, the internet is full of resources. You can start with the excellent Python Data Science Handbook, which is thorough and available for free.
