IPython and Jupyter aren’t IDEs—and that’s exactly why I use them for data science


Lots of people will use an IDE like VS Code or a regular editor like Vim, but for my work in data science and statistics, I need something different. Here’s why I use IPython and Jupyter notebooks for exploring datasets.

Exploratory programming

IPython and Jupyter let you explore data

Jupyter notebook showing a line plot of the usage share of several operating systems.

IPython and Jupyter offer something different than the standard scripting or IDE workflow. They’re interactive programming tools. You can type in some code and see instantly what will happen. You don’t have to write a script or program and run it.

This opens the door to a new style of development. Instead of having a goal in mind, you can try different approaches. One reason that statistics and data science have taken up notebooks is that these disciplines are suited to the exploratory style that IPython and Jupyter encourage. If you’re examining a dataset, you’ll probably not have an idea of what it contains. Once you can summarize it and graph it, what you can do with it becomes much clearer.

IPython

A better interactive Python interpreter

Creating an array of random numbers and taking the mean with NumPy in an iPython session in the Linux terminal.

While the standard Python interpreter is helpful for testing out ideas and learning the language, if you try to make heavy use of it, you run into its limitations. One big thing missing from the standard Python interpreter is tab completion. It’s also difficult to re-run code you’ve already run.

IPython is a big help. It includes tab completion. You just hit the Tab key and IPython will fill in things like functions or variable names. You can also zip back and forth in your history, and search your previously typed commands. It works the same way as modern Linux shells do, using the GNU Readline library.

Another handy feature is built-in “magic” commands. These are prefaced with a percent sign (%). One useful magic command is the “timeit” command.

I’ll demonstrate by first generating a 10 x 3 array in NumPy, which will be X, and a randomly generated array of 10 numbers, which I’ll designate y.

import numpy as np

rng = np.random.default_rng()

X = rng.random((10,3))

y = rng.random(10)

Then I’ll compute the least-squares solution and time it:

%timeit np.linalg.lstsq(X,y)

The time it took to execute will be returned:

Timeing the results of a least-squares computation in IPython using the %timeit magic command.

In this case, it took about 12 microseconds, which is pretty fast, though this is a small matrix. A bigger one might take longer:

X = rng.random((500,3))
y = rng.random(500)

%timeit np.linalg.lstsq(X,y)

The result is approximately 22 microseconds. This is still fast for a large linear system.

Jupyter

Mix text, code, and graphics

While IPython is a great interactive terminal program for Python, Jupyter takes it to the next level. Jupyter is an interactive notebook program. Jupyter lets you mix code, text, and inline plots. It’s a form of “literate programming,” a term coined by legendary computer scientist Donald Knuth.

Jupyter notebooks are similar to the notebook interfaces in programs like Mathematica. A notebook is built out of cells that can contain code or Markdown text. While Jupyter was originally an offshoot of IPython, it also allows you to use other programming languages like R, Julia, or Scala.

Here’s a screencast by Rob Mulla demonstrating how to create a Jupyter notebook:

The best feature of a Jupyter notebook is its persistence. I can explore a dataset with Python, and when I come back to it, I can remember what I did.

When would I use either

Exploration vs. persistence

There are some clear use cases for both IPython and Jupyter. For quick experimentation, I’ll turn to IPython. I’ll often leave it running in a background terminal. I’ve created a Pixi environment with NumPy, SymPy, and other mathematical Python libraries to give me the ultimate desk calculator.

While IPython is handy for quick calculations and throwaway computations that I’m likely not going to need to refer to later, Jupyter is useful for data exploration that I’m going to want to come back to or share with others. I’ve already uploaded a handful of my own statistical explorations in Jupyter notebooks to my GitHub account.;

Vim is still my editor of choice for regular scripts and tweaking configuration files.

A typical stats workflow

Putting it all together

I’ll demonstrate briefly by opening up a Jupyter notebook. I’ve already installed Jupyter in a directory called “stats.” An environment in Pixi is simply a directory. I want to demonstrate this in a reproducible environment. I’ve uploaded my notebook to my GitHub repository.

I’ll start up the Jupyter server:

jupyter notebook

I’ll create a new notebook. I’m going to examine a predefined data set of data taken by a waiter in a New York City restaurant, who recorded the total bill, the tip, the number of diners in the party, and whether any of them were smokers.

I’ll create a cell that includes the libraries I want to use:

import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme()
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline

This imports NumPy, pandas, Seaborn, the stats submodule from SciPy, statsmodels, and its formula API, and tells matplotlib to insert any plots into the Jupyter notebook instead of opening them in a separate window.

With the libraries imported, I can load the tips database from Seaborn, which has some built-in datasets mainly for plotting. This will be stored as a pandas DataFrame.

tips = sns.load_dataset('tips')

I’ll look at the top of the dataset.

tips.head()
Printing the first few lines of the Seaborn tips dataset in a Jupyter notebook.

Then I’ll take descriptive stats of the numerical columns. This includes the mean, the standard deviation, the minimum, the lower quartile (25th percentile), the middle value or median, and the upper quartile (75th percentile).

tips.describe()
Descriptive statistics using pandas of a restaurant tips dataset in a Jupyter notebook.

Next, I want to look at the distribution of the total bill:

sns.displot(x='total_bill',data=tips)
Histogram of total bill in a Jupyternotebook.

And the tips:

sns.displot(x='total_bill',data=tips)
Histogram of restaurant tips plotted in a Jupyter notebook.

Now I’ll look at the scatterplot of tip vs bill:

sns.relplot(x='total_bill',y='tip',data=tips)
Jupyter tip vs. bill regression line over a scatterplot in a Jupyter notebook.

I can plot a regression line over the scatterplot:

sns.regplot(x='total_bill',y='tip',data=tips)
Jupyter tip vs. bill regression line over a scatterplot in a Jupyter notebook.

There seems to be a positive linear relationship, since the line slopes upward. I’ll need to use statsmodels to get the values to plug into the classic equation y = mx + b using a formula notation popularized by R.

results = smf.ols('tip ~ total_bill',data=tips).fit()
results.summary()
Stress vs. screen time regression results in a Jupyter notebook.

The values for the y-intercept and the slope (m), in this case, the total bill, are listed in the left-most column in the table.


Building a program bottom-up

The best thing about interactive programming is that you can start with nothing and end up with a complete analysis. It’s a kind of bottom-up programming where you build a program through exploration, and then you can share your results with others.



Source link

Leave a Reply

Subscribe to Our Newsletter

Get our latest articles delivered straight to your inbox. No spam, we promise.

Recent Reviews


Serials have become the backbone of the streaming era, especially on Netflix. Serialized television is when a show’s plot unfolds in sequential order over the course of a season. It’s long-form storytelling that typically works best with dramas—Stranger Things, The Crown, etc. Watching the episodes in release order matters. Often, these shows are binged because the complex character arcs and cliffhangers encourage streaming multiple episodes at once.

Serial shows can feel like homework, especially when you fall behind on an episode and need to catch up. That always happens to me, and it leads to anxiety I didn’t want. Thankfully, Netflix offers shows where viewers can jump at any time and not feel lost. These episodic series are perfect for jumping around and picking the episodes you want to watch. One of the most famous comedies ever fits the criteria of an episodic sitcom. Anthology shows, including a Netflix sci-fi classic, are also ideal for watching episodes out of order.

Black Mirror

Welcome to your worst nightmare

Black Mirror wants to scare you. Charlie Brooker’s sci-fi anthology series has been warning humanity about the dangers of technology since 2011. It seems like ages ago that Rory Kinnear had sexual intercourse with a pig in the first episode. Apologies for the spoiler, but the media’s role in the spread of misinformation has never been more relevant.

Black Mirror features self-contained episodes with a beginning, middle, and an end. There has only been one direct sequel: USS Callister: Into Infinity, a season 7 episode that continues the events of season 4’s USS Callister. Otherwise, feel free to jump around and check out the best episodes of each season. Since most episodes feature bleak endings, I’ll leave you with one that ends on an upbeat note: San Junipero.

Seinfeld

Greatest comedy ever?

Comedies are the perfect vehicle for episodic storytelling. While having an overarching plot throughout a season helps attract viewers, many comedy fans are just looking for a few laughs. Write a self-contained story with numerous jokes over 20 to 30 minutes, and you’re ready to go. Seinfeld, aka the show about nothing, is the ideal escape from serialized dramas.

Seinfeld stars Jerry Seinfeld as a fictionalized version of himself as he navigates the comedic scene in New York City. The show revolves around Jerry’s interactions with his friends George (Jason Alexander), Elaine (Julia Louis-Dreyfus), and Kramer (Michael Richards). The gang faces a problem, hilarity ensues, and the episode ends. That’s really all you need to know. Enjoy the laughs.

Guillermo del Toro’s Cabinet of Curiosities

The genre maestro curates new horror stories

There’s a reason why Guillermo del Toro is considered the “King of the Monsters.” The genre expert is as elite as it comes when dealing with mythology and creating new worlds. The Oscar winner relied on his horror expertise in the anthology series Guillermo del Toro’s Cabinet of Curiosities.

I hate referring to episodes of television as “mini-movies.” However, that’s how I would describe the eight episodes of Cabinet of Curiosities. Each director puts their own signature style on a story and brings audiences into their terrifying creation. Del Toro wrote two of the episodes, including one about a demon being summoned. Some are scarier than others, but horror fans will feel right at home with this series. ​​​​​​​

Beat Bobby Flay

Bobby brings the heat

As I’ve gotten older, the Food Network has become one of my favorite channels. I mean, who doesn’t love food? I love eating my (average) home-cooked meal while watching contestants duke it out in the kitchen on my favorite show, Beat Bobby Flay. The competition breaks down into two rounds. In the first round, two chefs have 20 minutes to construct a meal using a secret ingredient. The winner advances to the main event, where they face off against Bobby Flay.

The challenger gets to pick the dish for the final round, so Bobby has a disadvantage. However, Bobby is an award-winning chef with a few tricks up his sleeves. He can handle making a version of your grandmother’s lasagna. With episodes available on Netflix, be prepared to learn why Bobby always throws chiles into his dishes.​​​​​​​

S.W.A.T.

Broadcast TV still knows how to make entertaining programs

The procedural is a genre best produced on broadcast television. Name a cop, doctor, or law drama—chances are it’s a procedural on broadcast TV. While the way we watch television has changed, people still love these types of shows on CBS, NBC, Fox, and ABC. Law & Order, NCIS, and Criminal Minds are procedurals that gained a bigger following thanks to streaming.

S.W.A.T. is cut from the same cloth as Chicago P.D. and CSI. Sergeant Daniel “Hondo” Harrelson (Shemar Moore) is tasked with leading a new S.W.A.T. unit in the LAPD. This action-packed show utilizes a “case of the week” formula in which the team must solve a dangerous situation, such as active shooters and hostage situations. You’re in and out in 44 minutes. What’s better than that?​​​​​​​


Netflix has more content coming your way

After you’re done watching these shows, stay on Netflix for more top-notch content. Netflix has an entire section dedicated to thrillers, and this week, The Guilty and El Camino are two of the section’s best. Keep an eye out for new movies, like Alan Ritchson’s War Machine, which is currently in the streamer’s top 10.

Subscription with ads

Yes, $8/month

Simultaneous streams

Two or four




Source link