
Python + Google Colab Tutorial for Data Analysis

September 18, 2023

Introduction

Data analysis is essential for designing public policies and for making existing ones more efficient.
Here, we look at how numbers and data can support public policy-making.

Public policies are, in short, government plans to improve society. They can cover health, education, finance, or even culture, and citizens sometimes take part in shaping them!

The idea is that these policies follow the rules written in the 1988 Constitution, which is like the manual of laws in Brazil. But how do we know what to do and where to invest public money? That's where data comes in.

Data works like clues that help us understand what is happening in society:

  • How much people earn,
  • Whether they have access to services like health and education,
  • And if opportunities are fairly distributed.

For example, the Brazilian Institute of Geography and Statistics (IBGE) collects a wide range of information, from how many people live in a city to how long it takes them to get to work.

Transparency is crucial here. Everyone should be able to access and understand public data, because open data lets society check whether decisions are fair. Brazilian law supports this: the Access to Information Law (LAI) guarantees access to public information, and the General Data Protection Law (LGPD) protects personal data.


Tutorial: Simple Data Analysis with Python + Google Colab

We’ll perform a simple Data Analysis using Python, Pandas, Matplotlib, and Google Colab 🚀

1. Access Google Colab

  • Open Google Colab (https://colab.research.google.com).
  • You need a Google account.
  • It will open a new page with a blank notebook.

2. Rename the notebook

  • At the top, rename the file to lesson1.ipynb.
  • You can save it on Google Drive, GitHub, or locally.

3. Project setup

  • On the left sidebar, click the folder icon to view project files.
  • You can upload datasets directly, or mount Google Drive.

To mount Google Drive, create a new code cell and run:

from google.colab import drive
drive.mount('/content/drive')

Authorize access when prompted. Recent versions of Colab show a permission dialog, while older versions asked you to follow a link and paste a generated token.
Once complete, you’ll see:

Mounted at /content/drive

Now you can upload datasets to your Google Drive and access them in Colab.

Example:

import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Datasets/imdb-reviews-pt-br.csv')
df.head()
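Before reading a file from Drive, it can help to confirm what is actually in the folder. The sketch below is one way to do that with the standard library. The helper name find_csv_files is hypothetical, and the folder path simply mirrors the example above, so adjust it to your own Drive layout.

```python
from pathlib import Path

def find_csv_files(folder):
    """Return the CSV files directly inside `folder`, sorted by name."""
    folder = Path(folder)
    if not folder.is_dir():
        # Return an empty list instead of raising if the folder is missing
        # (for example, when Drive has not been mounted yet).
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".csv")

# Hypothetical folder, mirroring the path used in the example above.
datasets_dir = "/content/drive/My Drive/Datasets"
for csv_path in find_csv_files(datasets_dir):
    print(csv_path.name)
```

If nothing is printed, the folder is either empty or not mounted, which is easier to diagnose than a FileNotFoundError from read_csv.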

4. Download open data (INEP/ENEM)

We’ll use data from INEP (the National Institute for Educational Studies and Research):

INEP Open Data - ENEM

  • Download the .zip file and extract it.
  • Identify the datasets we want to analyze (we’ll start with microdata).
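If you prefer to extract the archive inside Colab rather than on your computer, Python's built-in zipfile module can do it. This is a minimal sketch: the helper name extract_zip and the paths in the usage comment are placeholders, not the real names of the INEP files.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, target_dir):
    """Extract `zip_path` into `target_dir` and return the extracted CSV paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # The archive may keep its files in subfolders, so search recursively.
    return sorted(target.rglob("*.csv"))

# Hypothetical usage in Colab (file and folder names are placeholders):
# csv_files = extract_zip("/content/enem.zip", "/content/enem")
# for path in csv_files:
#     print(path)
```

The returned list makes it easy to see which CSV files are available before choosing one to load.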

5. Import data with Pandas

import pandas as pd

# Example with CSV file
microdata = pd.read_csv("path-to-file.csv", sep=";", encoding="ISO-8859-1")
microdata.head()

You’ll see the first rows of the dataset displayed as a Pandas DataFrame (a table with rows and columns).
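The microdata file can be large, so it may be worth loading only the columns and rows you need at first. The sketch below uses the usecols and nrows parameters of read_csv on a tiny made-up sample (not real ENEM data) so that it runs anywhere. With the real file you would pass its path and the same sep and encoding as above.

```python
import io
import pandas as pd

# Tiny made-up sample in the same semicolon-separated layout; not real ENEM data.
sample_csv = io.StringIO(
    "NU_INSCRICAO;NO_MUNICIPIO_PROVA;TP_FAIXA_ETARIA;TP_SEXO\n"
    "1;Recife;3;F\n"
    "2;Manaus;2;M\n"
    "3;Recife;5;F\n"
)

columns = ["NO_MUNICIPIO_PROVA", "TP_FAIXA_ETARIA", "TP_SEXO"]

# usecols keeps only the listed columns; nrows stops after the first 2 data rows.
preview = pd.read_csv(sample_csv, sep=";", usecols=columns, nrows=2)
print(preview)
```

Once the preview looks right, you can drop nrows to load every row while still limiting the columns.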


6. Exploring the dataset

Check the column names:

microdata.columns.values

Select a few relevant columns to analyze:

selected = microdata.filter(items=["NO_MUNICIPIO_PROVA", "TP_FAIXA_ETARIA", "TP_SEXO"])
selected.head()
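After selecting columns, a quick check of data types and missing values can prevent surprises in the counts that follow. This is a small sketch on made-up rows. The name selected_demo is a stand-in for the selected DataFrame above.

```python
import pandas as pd

# Made-up rows for illustration only; None marks a missing value.
selected_demo = pd.DataFrame({
    "NO_MUNICIPIO_PROVA": ["Recife", "Manaus", None, "Recife"],
    "TP_FAIXA_ETARIA": [3, 2, 5, 3],
    "TP_SEXO": ["F", "M", "F", None],
})

# Data type of each column (text columns show up as object or string).
print(selected_demo.dtypes)

# Number of missing values per column.
missing = selected_demo.isna().sum()
print(missing)
```

Note that value_counts() skips missing values by default, so a column with many gaps will show totals smaller than the number of rows.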

7. Simple analysis examples

Count students per municipality:

selected["NO_MUNICIPIO_PROVA"].value_counts()

Count by age group:

selected["TP_FAIXA_ETARIA"].value_counts()

Count by gender:

selected["TP_SEXO"].value_counts()

8. Data visualization with Matplotlib

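import matplotlib.pyplot as plt

# Age distribution: age groups are stored as category codes,
# so draw one bar per code instead of a binned histogram
selected["TP_FAIXA_ETARIA"].value_counts().sort_index().plot(kind="bar")
plt.title("Age Distribution of ENEM Students")
plt.xlabel("Age Group")
plt.ylabel("Count")
plt.show()

# Gender distribution: values are the categories M and F
selected["TP_SEXO"].value_counts().plot(kind="bar")
plt.title("Gender Distribution of ENEM Students")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()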

⚠️ Note: ENEM only records binary gender (M/F). This limitation highlights the importance of public policies to include broader gender options.


Conclusion

This was a basic tutorial showing how to use open public data with Python and Google Colab.
We explored:

  • How to load data from Google Drive,
  • How to explore datasets with Pandas,
  • And how to visualize results with Matplotlib.

Data analysis is essential for effective public policies. It allows governments, NGOs, and communities to make informed decisions aligned with real needs, and to evaluate results after implementation.


✨ Keep exploring, ask new questions, and share your insights. Data has the power to transform society!
