Python for Data Science: A Beginner's Tutorial With Hands-On Examples | CodeGym

By the end of this tutorial you'll have loaded a real public dataset, cleaned the messy bits, asked a real question of it, and shipped a chart that answers the question. Total time: about 60-90 minutes from a blank screen, with no installation required if you use Google Colab. Pandas, NumPy, and matplotlib do all the heavy work, and we'll use the Palmer Penguins dataset because it's friendly, modern, and small enough to play with. For the bigger picture of where data science fits into your Python career path, see our complete Python beginner guide.

Key Takeaways

You can run the entire tutorial in Google Colab without installing anything; a Chromebook is enough.
Real data work is roughly 60% cleaning, 30% analysis, and 10% charts. Most online tutorials skip cleaning, so most beginners skip the part that matters most.
Pandas, NumPy, and matplotlib cover 90% of what an entry-level Python data analyst touches daily. Scikit-learn comes next, once you want predictions instead of descriptions.
Per the Stack Overflow Developer Survey 2024, Python is the most popular language among people learning to code and sits at the top of the survey's "most desired" list (Stack Overflow, 2024).
Senior data engineers and data scientists using Python earn 20-40% above the all-Python median in 2026; see our Python developer salary expectations guide for the breakdown.

What Will You Build in This Tutorial?

A small but complete data analysis of the Palmer Penguins dataset, ending in a chart that answers a single sharp question: do the three penguin species in the dataset cluster by body mass, and if so, by how much? The work runs in five steps. Step 1 loads the data. Step 2 cleans the missing values. Step 3 asks the question with pandas. Step 4 draws the chart. Step 5 writes a three-sentence story about what you found. Each step takes roughly 5 to 20 minutes the first time and shrinks on later attempts. You can run every line below in Google Colab without installing a thing. If you prefer a local setup, install Python 3.12 plus pandas, numpy, and matplotlib through pip, then open Jupyter. The code is identical either way.

60-90 minutes total, blank screen to chart
5 steps from load to story
3 libraries cover most beginner work
0 install steps if you use Colab

Why Python for Data Science in 60 Seconds?

Three reasons keep Python on top of data work in 2026. First, the core stack (pandas for tables, NumPy for arrays, matplotlib and seaborn for charts, scikit-learn for machine learning) is mature, free, and well-documented. Second, the same language scales from a notebook on your laptop to a production deployment, so analysts and engineers share a stack. Third, the interactive feedback loop in Jupyter notebooks rewards exploration: write a line, see the result, write the next line. R is still excellent for statistics-heavy work, and SQL remains essential for anything with a database. But if you only learn one language for data work, Python is the right one. Job-market data backs the choice. Per the Stack Overflow Developer Survey 2024, Python sits at the top of the "most desired technologies" list and is the most popular language among people learning to code. Senior Python roles in data engineering and machine learning routinely clear $200K total compensation in major US markets; our Python developer salary guide covers the breakdown in detail.

Laptop on a casual workspace showing statistics and charts on screen during a self-paced data analysis session

What Setup Do You Need?

Use Google Colab (Zero Install, Recommended)

Open colab.research.google.com, sign in with a Google account, and click "New notebook." That's the whole setup. Pandas, NumPy, and matplotlib are already installed in every Colab notebook. The free tier gives you a Linux container with about 12 GB of RAM and a couple of CPU cores, more than enough for this tutorial and most beginner analyses.

Use Local Jupyter (If You Prefer Working Offline)

Install Python 3.12 or later from python.org, then in a terminal run pip install jupyter pandas numpy matplotlib and start the server with jupyter notebook. Creating a separate virtual environment for the project is even better, but it's not required for this tutorial. To avoid the most common beginner setup snags, browse our 10 beginner Python mistakes first.

Step 1: How Do You Load Your First Dataset? (5 Minutes)

Open a new notebook and paste the following into the first cell:

import pandas as pd

url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
df = pd.read_csv(url)
df.head()

Run the cell. You should see a table with eight columns: species, island, bill length, bill depth, flipper length, body mass, sex, and year. df.head() shows the first five rows by default. The variable df is a convention for "data frame," which is pandas's name for a labeled, two-dimensional table. Now run two more inspection commands. They're the first-look trio that experienced analysts type without thinking:

df.info()       # column types and missing-value counts
df.describe()   # summary statistics for numeric columns

info() reports 344 rows total, plus the data type of each column and how many non-null values it has. describe() shows count, mean, standard deviation, min, max, and quartiles for the numeric columns. Notice the count line for some columns sits below 344. That's the data telling you there are missing values, and Step 2 deals with them.
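To see why a count below the row total points to missing values, here is a minimal sketch on a made-up four-row table read from an in-memory string. The toy_csv contents and the toy variable are assumptions for illustration only; they are not taken from the Palmer Penguins file.

```python
import io

import pandas as pd

# Hypothetical toy data: one body_mass_g value is deliberately left blank.
toy_csv = """species,body_mass_g
Adelie,3750
Adelie,
Gentoo,5000
Gentoo,5200
"""

toy = pd.read_csv(io.StringIO(toy_csv))

total_rows = len(toy)                                     # counts every row
non_null = toy["body_mass_g"].count()                     # count() skips missing values
described_count = toy["body_mass_g"].describe()["count"]  # same number, as a float

print(total_rows, non_null, described_count)
```

The gap between total_rows and non_null is the number of missing entries in that column, which is the same signal the count line gives you on the real dataset.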

Step 2: How Do You Clean Messy Data? (15 Minutes)

Real data is never tidy on arrival. The cleaning step is what separates a working analysis from a misleading one, and it's where most tutorials wave their hands. We'll cover the four moves you'll use every day.

Find Missing Values

df.isna().sum()

This prints the count of missing entries per column. In Palmer Penguins, sex has 11 missing values and four numeric columns are missing 2 each.
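Raw counts are easier to judge as a share of the table. The sketch below, on a made-up four-row frame (not the penguins data), shows one way to get both views; taking the mean of a boolean mask gives the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 3800.0],
    "sex": ["male", None, None, "female"],
})

missing_counts = toy.isna().sum()        # missing cells per column
missing_share = toy.isna().mean() * 100  # the same information as a percentage

print(missing_counts)
print(missing_share.round(1))
```

On this toy frame, body_mass_g is 25% missing and sex is 50% missing, which would push you toward filling rather than dropping.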

Decide: Drop or Fill

For 2-11 missing rows out of 344, dropping is fine because the loss is under 4% and the missingness pattern looks random. For a real production analysis you'd investigate why, but for a beginner tutorial we drop:

df = df.dropna()
print(len(df))    # 333 rows remain

When missingness is higher (say, 30%), dropping rows starves your analysis. In that case you'd fill with the column mean for numbers (df["x"].fillna(df["x"].mean())) or with the most frequent category for strings. The choice between drop and fill is a judgment call that depends on how missing the data is and why.
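As a rough sketch of the fill option, the example below fills a numeric column with its mean and a text column with its most frequent value. The toy frame is made up for illustration; mean and mode are only two of several reasonable fill strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "body_mass_g": [3000.0, np.nan, 5000.0, 4000.0],
    "sex": ["male", "female", None, "male"],
})

filled = toy.copy()  # keep the original untouched for comparison

# Numeric column: replace the gap with the mean of the observed values.
filled["body_mass_g"] = filled["body_mass_g"].fillna(filled["body_mass_g"].mean())

# Text column: mode() ignores missing values; take the first (most frequent) entry.
most_common_sex = filled["sex"].mode()[0]
filled["sex"] = filled["sex"].fillna(most_common_sex)

print(filled)
```

Filling keeps every row but invents values, so it is worth noting in your write-up which columns were filled and how.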

Normalize Column Names

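Column names with spaces, hyphens, or mixed case are easy to mistype and rule out attribute-style access such as df.body_mass_g. Convert them to lowercase snake_case once and you save typing for the rest of the analysis. The penguins columns already follow this convention, so the line below leaves them unchanged, but the habit pays off on messier files:

# lowercase everything, then turn spaces and hyphens into underscores
df.columns = df.columns.str.lower().str.replace(" ", "_").str.replace("-", "_")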
df.columns
# Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
#        'flipper_length_mm', 'body_mass_g', 'sex', 'year'])
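Messier files often mix several kinds of separators. The sketch below shows one possible clean-up using a regular expression; the column names are invented for illustration, and the pattern is an assumption about what counts as "messy" rather than a universal rule.

```python
import pandas as pd

# Hypothetical messy headers: leading space, mixed case, hyphen, parentheses.
toy = pd.DataFrame(columns=["Species", " Bill Length-mm", "Body Mass (g)"])

toy.columns = (
    toy.columns
    .str.strip()                                   # drop surrounding whitespace
    .str.lower()                                   # lowercase everything
    .str.replace(r"[^a-z0-9]+", "_", regex=True)   # any run of other characters -> "_"
    .str.strip("_")                                # remove leading/trailing underscores
)

print(list(toy.columns))
# ['species', 'bill_length_mm', 'body_mass_g']
```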

Convert Column Types Where Needed

The penguins data is already typed correctly. In real datasets you'll often need pd.to_datetime(df["date_col"]) for strings that should be dates, or df["count"].astype(int) for numbers that came in as floats. Note that astype(int) raises an error if the column still contains missing values, so handle those first. Get into the habit of running df.info() after every type change to confirm the conversion took.
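Here is a small sketch of both conversions on a made-up frame. The column names date_col and count mirror the placeholders in the text; the nullable "Int64" dtype is one option (an assumption about what you want) for keeping a gap in an integer column instead of dropping or filling it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: dates stored as text, counts stored as floats with a gap.
toy = pd.DataFrame({
    "date_col": ["2024-01-15", "2024-02-01", "2024-03-10"],
    "count": [3.0, np.nan, 7.0],
})

# Text -> datetime, which unlocks the .dt accessor (month, weekday, and so on).
toy["date_col"] = pd.to_datetime(toy["date_col"])

# Plain astype(int) would fail on the NaN; the nullable "Int64" dtype
# stores whole numbers and keeps the gap as <NA>.
toy["count"] = toy["count"].astype("Int64")

print(toy.dtypes)
print(toy["date_col"].dt.month.tolist())
```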

Step 3: How Do You Ask Questions of Your Data? (20 Minutes)

This is the analysis step. The original question was: do the three penguin species cluster by body mass? Let's answer it.

Group and Aggregate

by_species = df.groupby("species")["body_mass_g"].mean().round(0)
print(by_species)

Output:

species
Adelie       3706.0
Chinstrap    3733.0
Gentoo       5092.0
Name: body_mass_g, dtype: float64

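The answer is already visible: Adelie and Chinstrap penguins weigh nearly the same on average, around 3,700 grams, while Gentoos are noticeably heavier at just over 5,000 grams, a gap of roughly 1,360 to 1,390 grams. With 333 birds in the cleaned sample, a gap that large is unlikely to be noise, though a formal test would be needed to confirm it.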
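A mean on its own hides spread and sample size. As a sketch, agg can return several statistics per group in one call, and subtracting two means puts a number on "by how much". The four-row frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame, two birds per species.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3600.0, 3800.0, 5000.0, 5200.0],
})

# One row per species, one column per statistic.
summary = toy.groupby("species")["body_mass_g"].agg(["mean", "std", "count"])

# Difference between two group means, in grams.
gap = summary.loc["Gentoo", "mean"] - summary.loc["Adelie", "mean"]

print(summary.round(1))
print(f"Gentoo minus Adelie: {gap:.0f} g")
```

Running the same agg call on the penguins df would show whether the species differ in spread as well as in average mass.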

Filter Rows With Boolean Masks

Boolean masks let you slice a subset of rows that match a condition. The syntax reads as "rows where this column equals that value":

gentoos = df[df["species"] == "Gentoo"]
print(gentoos["body_mass_g"].describe())

Combine masks with & (and) and | (or), keeping parentheses around each clause. For example, heavy Gentoos: df[(df["species"] == "Gentoo") & (df["body_mass_g"] > 5500)].
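Naming each mask before combining it can make longer filters easier to read. The sketch below uses a made-up four-row frame and also shows ~ (not) and isin, two tools the text above does not cover but that follow the same pattern.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700, 5600, 5000, 3750],
})

is_gentoo = toy["species"] == "Gentoo"   # boolean Series, one value per row
is_heavy = toy["body_mass_g"] > 5500

heavy_gentoos = toy[is_gentoo & is_heavy]   # both conditions must hold
not_gentoo = toy[~is_gentoo]                # ~ flips True and False
small_species = toy[toy["species"].isin(["Adelie", "Chinstrap"])]

print(len(heavy_gentoos), len(not_gentoo), len(small_species))
```

The parentheses matter when you write the conditions inline because & binds more tightly than == and > in Python; storing each mask in a variable sidesteps the issue.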

Sort to Find Top-N

heaviest = df.nlargest(5, "body_mass_g")
print(heaviest[["species", "island", "body_mass_g"]])

nlargest(n, col) returns the top-N rows ranked by one column. There's also nsmallest. Both are faster and cleaner than sort_values + head when you only need the extremes.
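To make the comparison concrete, this sketch checks that nlargest and the sort-then-head approach return the same rows on a made-up frame without ties. With ties the two can pick different rows, so the equality below is specific to this toy data.

```python
import pandas as pd

# Hypothetical toy frame with distinct body masses (no ties).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3700, 5600, 5000, 3750, 3400],
})

top2 = toy.nlargest(2, "body_mass_g")
top2_sorted = toy.sort_values("body_mass_g", ascending=False).head(2)
lightest = toy.nsmallest(1, "body_mass_g")

print(top2.equals(top2_sorted))
print(lightest)
```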

Quick Correlation Check

numeric_cols = ["bill_length_mm", "bill_depth_mm",
                "flipper_length_mm", "body_mass_g"]
df[numeric_cols].corr().round(2)

You'll see flipper length and body mass correlate at about 0.87, which is very strong. That makes intuitive sense (larger penguins have longer flippers), and it'll show up clearly in the scatter plot in Step 4.
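A pooled correlation can be driven by differences between groups rather than by the relationship inside each group. The sketch below uses deliberately contrived numbers (species "A" and "B" are hypothetical, and the real penguins need not behave this way) to show how to compare the pooled value with per-group values.

```python
import pandas as pd

# Contrived toy data: within each group mass falls as flippers get longer,
# but group B is larger on both measurements.
toy = pd.DataFrame({
    "species": ["A"] * 3 + ["B"] * 3,
    "flipper_length_mm": [180, 185, 190, 210, 215, 220],
    "body_mass_g": [3500, 3400, 3300, 5200, 5100, 5000],
})

# Correlation across all rows, ignoring species.
pooled_r = toy["flipper_length_mm"].corr(toy["body_mass_g"])

# Correlation computed separately inside each species.
within_r = {
    name: group["flipper_length_mm"].corr(group["body_mass_g"])
    for name, group in toy.groupby("species")
}

print(round(pooled_r, 2))
print({name: round(r, 2) for name, r in within_r.items()})
```

In this toy data the pooled correlation is strongly positive while each group on its own is negative. Running the same comparison on the penguins df is a quick way to see how much of the 0.87 comes from differences between species.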

Step 4: How Do You Visualize What You Found? (15 Minutes)

Matplotlib draws the chart; pandas wraps it for you. Three plots cover most beginner needs.

Bar Chart: Mean Body Mass per Species

import matplotlib.pyplot as plt

ax = by_species.plot(kind="bar", color=["#2196F3", "#FF9800", "#4CAF50"])
ax.set_ylabel("Mean body mass (g)")
ax.set_title("Penguin body mass by species (Palmer Penguins)")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Source: Palmer Penguins dataset (Horst, Hill, Gorman), aggregated in pandas for this tutorial.

Histogram: Distribution of Body Mass

df["body_mass_g"].hist(bins=30, edgecolor="white")
plt.xlabel("Body mass (g)")
plt.ylabel("Count")
plt.title("Body mass distribution across all penguins")
plt.tight_layout()
plt.show()

The histogram should show a bimodal distribution: a tall cluster around 3,500-4,000 grams (Adelies and Chinstraps stacked) and a second cluster around 5,000 grams (Gentoos).

Scatter Plot: Flipper Length vs Body Mass

colors = {"Adelie": "#2196F3", "Chinstrap": "#FF9800", "Gentoo": "#4CAF50"}
for species, group in df.groupby("species"):
    plt.scatter(group["flipper_length_mm"], group["body_mass_g"],
                label=species, color=colors[species], alpha=0.7)
plt.xlabel("Flipper length (mm)")
plt.ylabel("Body mass (g)")
plt.title("Flipper length vs body mass, colored by species")
plt.legend()
plt.tight_layout()
plt.show()

The scatter plot makes the 0.87 correlation visible. Each point is a penguin; the upward slope shows the relationship. Gentoos cluster in the top-right (long flippers, heavy body), while Adelies and Chinstraps overlap in the bottom-left.
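If you want to share a chart outside the notebook, one option is to save it as a PNG. The sketch below uses matplotlib's figure and axes objects directly and writes into an in-memory buffer; the four data points are made up, and the dpi and figsize values are arbitrary choices rather than recommendations.

```python
import io

import matplotlib.pyplot as plt

# Hypothetical points standing in for the real scatter data.
flipper_mm = [180, 190, 210, 220]
mass_g = [3500, 3700, 5000, 5200]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(flipper_mm, mass_g, alpha=0.7)
ax.set_xlabel("Flipper length (mm)")
ax.set_ylabel("Body mass (g)")
ax.set_title("Flipper length vs body mass (toy data)")
fig.tight_layout()

# Render the figure as PNG bytes; pass a file path instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)  # free the figure once it has been saved

png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

To write a file instead, replace buffer with a path such as "penguins_scatter.png" (a hypothetical filename). In Colab the file appears in the Files sidebar, where you can download it.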

Step 5: How Do You Turn Numbers Into a Story? (10 Minutes)

The technical work is done. The remaining skill, and the one recruiters actually test for, is summarizing what you found in three plain sentences. Open a markdown cell in your notebook and write:

Finding 1. Penguin body mass clusters by species. Gentoo penguins are noticeably heavier (mean 5,092 g) than Adelie (3,706 g) and Chinstrap (3,733 g) penguins, while the two smaller species overlap in body mass.
Finding 2. Flipper length is a strong predictor of body mass across the pooled data (r = 0.87). Penguins with longer flippers tend to be heavier, although part of that relationship reflects the size gap between Gentoos and the two smaller species.
Finding 3. 11 of 344 rows had at least one missing value: sex was missing in all 11, and 2 of those rows were also missing the four numeric measurements. Dropping them reduced the dataset by 3.2%, which is small enough that the conclusions above should hold.

Notice what the story does: it answers the original question, quantifies the effect, names the strongest predictor, and flags a limitation. That's the structure every data analyst uses in real reports. The pandas syntax was a means to this end. Master the structure and the syntax becomes muscle memory in a couple of months.
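It helps to compute the numbers in your story rather than type them from memory. The short sketch below reproduces the arithmetic behind the findings using the figures reported earlier in this tutorial (344 rows loaded, 333 after dropna, and the rounded species means); the variable names are just illustrative.

```python
# Figures taken from the earlier steps of this tutorial.
total_rows = 344
rows_after_dropna = 333
mean_mass_g = {"Adelie": 3706, "Chinstrap": 3733, "Gentoo": 5092}

dropped = total_rows - rows_after_dropna
dropped_pct = dropped / total_rows * 100

# How much heavier Gentoos are than Adelies, in grams and in percent.
gap_g = mean_mass_g["Gentoo"] - mean_mass_g["Adelie"]
gap_pct = gap_g / mean_mass_g["Adelie"] * 100

print(f"{dropped} rows dropped ({dropped_pct:.1f}%)")
print(f"Gentoo vs Adelie: +{gap_g} g ({gap_pct:.0f}% heavier)")
# 11 rows dropped (3.2%)
# Gentoo vs Adelie: +1386 g (37% heavier)
```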

What's Next After This Tutorial?

The single biggest next step is to repeat the same 5-step workflow on three more datasets. Variety teaches you what stays the same (the workflow) and what changes (the cleaning calls). Good free options to repeat the exercise:

NYC taxi trips (cleaned subset on Kaggle): datetime parsing practice.
Iris flower measurements: the smallest, classical dataset.
Titanic passenger list: categorical cleaning practice.
Any CSV you already work with from your job or hobby.

Once three more analyses feel routine, learn scikit-learn for prediction tasks (classification, regression), which is the bridge from analyst to junior data scientist. To plan the full path, see our Python learning roadmap. If you want practice projects that combine pandas with other skills, our Python data science practice projects guide covers four hands-on starting points.

Which Pandas Pitfalls Trip Up Beginners?

Three pandas-specific traps catch nearly every beginner in the first month. None requires advanced concepts to fix.

SettingWithCopyWarning. When you do something like filtered = df[df["x"] > 5]; filtered["y"] = 10, pandas warns that the assignment might modify a copy rather than the original. The fix is to make a real copy upfront with filtered = df[df["x"] > 5].copy(), then assign freely. Treat the warning as the signal to add .copy(), not as something to ignore.
Iterating row by row instead of vectorizing. Writing for index, row in df.iterrows(): works, but it is often tens to hundreds of times slower than the vectorized equivalent. Reach for plain arithmetic on whole columns first, then df[col].map() or df.apply() when arithmetic alone won't do. Loop only when nothing else fits, which is rarer than newcomers expect.
Forgetting to reset the index after groupby. The result of df.groupby("col").agg(...) uses the grouped column as the index, which can surprise you when you try to merge or reshape later. Calling .reset_index() at the end converts it back to a regular column.

The full beginner-pitfall list lives in our 10 common Python mistakes guide.
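The three fixes can be sketched on a tiny made-up frame. The names df_demo, x, and group are hypothetical; whether the warning in the first case appears at all depends on your pandas version, but the .copy() pattern is safe either way.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df_demo = pd.DataFrame({"x": [3, 6, 9], "group": ["a", "b", "b"]})

# 1. Take an explicit copy before assigning to a filtered subset.
filtered = df_demo[df_demo["x"] > 5].copy()
filtered["y"] = 10            # changes only the copy, never df_demo

# 2. Vectorize: arithmetic on the whole column instead of an iterrows() loop.
df_demo["x_doubled"] = df_demo["x"] * 2

# 3. reset_index() turns the group labels back into a regular column.
totals = df_demo.groupby("group")["x"].sum().reset_index()

print(filtered)
print(df_demo)
print(totals)
```

After step 3, totals has ordinary group and x columns, so it can be merged with other tables without any index handling.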

Practice Data Science With Auto-Graded Tasks

CodeGym's Python track covers Data Processing as a dedicated module, with hands-on pandas and NumPy tasks graded by an AI validator in seconds. When you get stuck, the AI mentor explains where your code went wrong. The track has 800+ practical tasks across 62 levels in total, and the first level is free; the full plan is on the pricing page.

Frequently Asked Questions

Do I need to learn statistics before pandas?

No. Start with pandas, then pick up statistics as specific questions arise. Mean, median, and correlation are enough math for the first three months. Statistical inference and hypothesis testing matter later, when you start making predictions, but they would slow you down today if you tried to learn them first.

Is Python better than R for data science in 2026?

For most jobs, yes. Python wins on machine learning, deployment, and engineering integration. R remains strong for academic statistics, biostatistics, and exploratory analysis with ggplot2. Per the Stack Overflow Developer Survey 2024, Python ranks above R in both popularity and admired-language scores, and the gap continues to widen in industry hiring.

How long until I can call myself a data analyst?

Three to six months of focused practice for an entry-level analyst portfolio: 8-12 finished analyses with clear narratives, plus comfort with SQL alongside pandas. The technical floor is reachable in 90 days; the storytelling layer takes longer and is what actually wins interviews.

Can I do this tutorial on a Chromebook?

Yes, in Google Colab. No installation needed; the notebook runs entirely in your browser with Python, pandas, and matplotlib pre-installed. A Chromebook handles the full tutorial without any local Python setup, which is one of the reasons we recommend Colab as the starting point.

The Bottom Line: Five Steps, One Dataset, One Real Skill

Five steps (load, clean, ask, visualize, narrate) cover roughly 80% of an entry-level Python data analyst's daily workflow. Repeat the loop on three more datasets and the syntax becomes background noise; what's left is the analytical thinking that recruiters actually pay for. For the full Python career arc from first print to senior data engineer, start with our complete Python beginner guide, and pair it with the salary expectations for what each layer of skill is worth.

Sources

pandas, User Guide: Getting Started, retrieved 2026-05-18, https://pandas.pydata.org/docs/user_guide/index.html
matplotlib, Pyplot Tutorial, retrieved 2026-05-18, https://matplotlib.org/stable/tutorials/pyplot.html
Allison Horst, Alison Hill, and Kristen Gorman, Palmer Penguins dataset (GitHub), retrieved 2026-05-18, https://github.com/allisonhorst/palmerpenguins
Google Research, Google Colab: Welcome notebook, retrieved 2026-05-18, https://colab.research.google.com/
Project Jupyter, Jupyter Notebook documentation, retrieved 2026-05-18, https://docs.jupyter.org/en/latest/
Stack Overflow, 2024 Developer Survey: Technology, retrieved 2026-05-18, https://survey.stackoverflow.co/2024/technology
Python Software Foundation, The Python Tutorial, retrieved 2026-05-18, https://docs.python.org/3/tutorial/index.html
CodeGym, Python Track Data Processing Module Outcomes, 2024-2026, retrieved 2026-05-18, internal platform data, https://codegym.cc/courses/python

This article was last reviewed on 2026-05-18.