You’re using Excel wrong if you’re still manually cleaning data—Python does it for you in seconds


If you’ve got a messy Excel spreadsheet with invalid values, blank entries, duplicates, or other problems, you might think you have to spend hours cleaning it up. You can use Python to automate these steps instead. Here’s how.

Setting up a Python environment

Installing the necessary packages

If you don’t already have a Python environment set up, it’s easy to do. If you’re on Windows, I recommend using Windows Subsystem for Linux, or WSL. Python tutorials tend to assume a Linux environment or at least a Unix-like environment, and it’ll be easier to follow along with other tutorials since you won’t have to translate things like pathnames.

If you don’t have WSL installed, you’ll need to install it first.

You’ll also want another kind of environment on top of your system. While Python is included in many systems, including almost all major Linux distros, it’s meant more for running scripts and other included software rather than for your own programs. Depending on how quickly an OS updates its software, your version of Python might be older.

You’ll also need to install some libraries on top of Python.

There are several environments that let you install Python packages, but my favorite happens to be Pixi.

In a Linux, macOS, or WSL terminal window, type:

curl -fsSL https://pixi.sh/install.sh | sh

This will install Pixi on your machine.

With Pixi installed, you can either create a per-project environment for your packages or install them globally. Installing globally will probably be the better option for this project, since you’ll have these tools at your fingertips from any directory.

The primary package we’ll use is pandas, but we’ll also want some others for this project. NumPy is the foundation for numerical computing with Python. We’ll also want to install Jupyter notebooks, which give us a graphical way to run Python code and keep it around to examine later. IPython is similar, but it works from the terminal.

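Let’s install these tools into a single global environment. By default, pixi global install gives each package its own isolated environment, so the --environment flag is used here to keep them together where Jupyter and IPython can import pandas. The environment name is arbitrary, and openpyxl is added because pandas relies on it to read and write .xlsx files:

pixi global install --environment cafe-cleaning --expose jupyter --expose ipython numpy pandas openpyxl jupyter ipython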

Getting the data into Python

With our environment installed, we can now start cleaning our messy spreadsheet.

For this example, I’m going to be using a modified version of a dataset I found on Kaggle that’s specifically for learning to clean messy data. It’s got lots of problems, such as missing data, and a lot of “ERROR” or inconsistent terms. It was originally a .csv file, but I saved it out as an Excel file using LibreOffice to demonstrate how pandas can handle Excel files.

I’ll launch Jupyter:

jupyter notebook

This will open up a web browser, or it would if I weren’t using WSL on Windows. I have to modify the command line to get rid of the error message:

jupyter notebook --no-browser

Jupyter prints a few URLs in the terminal, and I’ll open one of them in my web browser.

I usually use a shell alias to bypass all of this.

Next, I’ll create a new notebook and use Python as the kernel.

I like to put a header and an explanatory note as Markdown cells.

In the first code cell, I’ll import the libraries I’m going to use:

import numpy as np
import pandas as pd

Now I’ll read in the messy dataset:

cafe = pd.read_excel('/path/to/messy_data.xlsx')

I can examine the dataset so you can get a good idea of the problems:

cafe.head()
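Beyond eyeballing the first few rows, it can help to count the problems before fixing them. The following is a minimal sketch, assuming a small made-up DataFrame in place of the real cafe data (the rows and the choice of columns are hypothetical). It counts the missing cells, the “ERROR”/“UNKNOWN” placeholders, and the fully duplicated rows:

```python
import pandas as pd

# Hypothetical stand-in for the cafe data; the real file has more columns and rows.
sample = pd.DataFrame({
    "Item": ["Coffee", "Tea", "ERROR", None, "Coffee"],
    "Quantity": [2, "UNKNOWN", 1, 3, 2],
    "Payment Method": ["Cash", "Card", None, "ERROR", "Cash"],
})

# Missing cells per column.
missing_per_column = sample.isna().sum()

# Cells per column that hold one of the placeholder strings.
placeholder_per_column = sample.isin(["ERROR", "UNKNOWN"]).sum()

# Fully duplicated rows (every occurrence after the first).
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(placeholder_per_column)
print(duplicate_rows)
```

Running the same three checks on cafe gives a baseline to compare against after each cleaning step.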

Dropping missing values

You won’t miss them

The easiest way to clean up a dataset is to remove any rows with missing values. Pandas DataFrames have a built-in method for that, dropna. It returns a new DataFrame without the incomplete rows, so assigning the result back to cafe replaces the original:

cafe = cafe.dropna()
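To see what dropna actually does, here is a short sketch on a made-up three-row DataFrame (the data is hypothetical; only the column names echo the cafe file). It also shows the subset parameter, which may be useful if you only want to drop rows that are missing particular columns:

```python
import pandas as pd

# Hypothetical miniature dataset; column names mirror the cafe file.
sample = pd.DataFrame({
    "Item": ["Coffee", None, "Tea"],
    "Location": ["In-store", "Takeaway", None],
})

# dropna() returns a new DataFrame; `sample` itself is unchanged.
all_complete = sample.dropna()

# subset= limits the check to the named columns, so a missing
# Location no longer causes the row to be dropped.
item_present = sample.dropna(subset=["Item"])

print(len(sample), len(all_complete), len(item_present))
```

Dropping every incomplete row is the bluntest option. Whether subset is a better fit depends on which columns your analysis actually needs.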

Drop duplicate entries

Don’t double up on data

Now we will want to drop any duplicate entries. Pandas has a built-in method called drop_duplicates that works similarly to dropna:

cafe = cafe.drop_duplicates()

Assigning the result back to cafe again replaces our DataFrame, this time with a version that has no duplicate rows.
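If you want to know how many duplicates there are before removing them, the duplicated method can count them. This sketch uses invented rows, and “Transaction ID” is an assumed column name used only for illustration. It also shows the subset parameter for cases where rows should count as duplicates when only some columns match:

```python
import pandas as pd

# Hypothetical rows; "Transaction ID" is an assumed column name.
sample = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T2", "T3"],
    "Item": ["Coffee", "Tea", "Tea", "Tea"],
})

# duplicated() flags each repeat of an earlier row.
repeat_count = int(sample.duplicated().sum())

# Default: rows must match in every column to count as duplicates.
deduped = sample.drop_duplicates()

# subset= compares only the listed columns; the first occurrence is kept.
one_per_item = sample.drop_duplicates(subset=["Item"])

print(repeat_count, len(deduped), len(one_per_item))
```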

Modifying columns

Get rid of invalid values

We still have some problems. A lot of entries in these columns have something like “ERROR” or “UNKNOWN.” We probably won’t want them.

It’s easy to get rid of them. We’ll first create a list of the columns we want to filter:

columns = ['Item','Quantity','Price Per Unit','Total Spent','Payment Method', 'Location', 'Transaction Date']

Then we’ll create a for loop that will go through the columns and remove the rows that contain “ERROR” or “UNKNOWN”:

for i in columns:
    cafe = cafe[cafe[i] != "ERROR"]
    cafe = cafe[cafe[i] != "UNKNOWN"]

Statements inside a Python loop must be indented, and four spaces per level is the convention.

For each column, the loop keeps only the rows whose value isn’t “ERROR” or “UNKNOWN” and assigns the filtered DataFrame back to cafe.
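The same filter can also be written without a loop by building one Boolean mask with isin. This is an alternative sketch rather than the method used above, and the sample rows are made up. It runs both versions side by side to check that they keep the same rows:

```python
import pandas as pd

# Hypothetical sample using two of the article's column names.
sample = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", "Cake"],
    "Location": ["In-store", "Takeaway", "UNKNOWN", "In-store"],
})
columns = ["Item", "Location"]
bad_values = ["ERROR", "UNKNOWN"]

# Loop version, as in the article.
looped = sample
for col in columns:
    looped = looped[looped[col] != "ERROR"]
    looped = looped[looped[col] != "UNKNOWN"]

# Single-mask version: True for rows where any listed column holds a bad value.
has_bad_value = sample[columns].isin(bad_values).any(axis=1)
masked = sample[~has_bad_value]

# Both approaches should keep exactly the same rows.
print(masked.equals(looped))
```

The mask version makes it easy to add more placeholder strings later, since they only need to be added to bad_values.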

Be sure to examine your DataFrame with the head or tail methods to see the effect of your change. If it did something you didn’t want, you can reload in the original data and try again.
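One way to make that check and retry cheaper is to keep an untouched copy of the DataFrame in memory, so you can compare row counts and restore the data without rereading the file. A minimal sketch, assuming a tiny made-up DataFrame in place of the loaded spreadsheet:

```python
import pandas as pd

# Hypothetical data standing in for the loaded spreadsheet.
cafe = pd.DataFrame({"Item": ["Coffee", "ERROR", "Tea"]})

# Keep an untouched copy before experimenting.
original = cafe.copy()

cafe = cafe[cafe["Item"] != "ERROR"]
rows_removed = len(original) - len(cafe)
print(f"Removed {rows_removed} of {len(original)} rows")

# If the result is not what you wanted, restore from the copy.
cafe = original.copy()
```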

This saves a lot of the time you would otherwise spend going through all of these in Excel, even with a search-and-replace operation.

You can run the cafe.head() command again to see the results.

Putting the data back into the Excel spreadsheet

It’s easy to convert a pandas DataFrame back into Excel

With the data now cleaned, you can save it back to an Excel spreadsheet. You can use the DataFrame’s to_excel method:

cafe.to_excel('/path/to/cleaned_data.xlsx')
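Once the steps work in a notebook, they can be collected into a function you can reuse on the next messy workbook. This is a sketch under a few assumptions: the names clean and clean_workbook are hypothetical, and the filter checks every column for “ERROR” or “UNKNOWN” rather than a hand-picked list. It also passes index=False to to_excel, which keeps the DataFrame’s row labels from being written as an extra column:

```python
import pandas as pd

BAD_VALUES = ["ERROR", "UNKNOWN"]  # placeholder strings named in the article


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the article's three steps and return a new DataFrame."""
    df = df.dropna().drop_duplicates()
    has_bad_value = df.isin(BAD_VALUES).any(axis=1)
    return df[~has_bad_value].reset_index(drop=True)


def clean_workbook(in_path: str, out_path: str) -> None:
    """Hypothetical helper: read a workbook, clean it, write it back out."""
    cleaned = clean(pd.read_excel(in_path))
    # index=False leaves the DataFrame's row labels out of the sheet.
    cleaned.to_excel(out_path, index=False)


# Small in-memory check with made-up rows.
demo = pd.DataFrame({
    "Item": ["Coffee", "Coffee", "ERROR", None],
    "Quantity": [1, 1, 2, 3],
})
result = clean(demo)
print(result)
```

Calling clean_workbook with the input and output paths would then run the whole process in one line. Reading and writing .xlsx files this way relies on the openpyxl package being installed alongside pandas.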

It’s easy to clean Excel data in Python

A little time spent learning Python and a bit of pandas can save hours that could be wasted individually editing spreadsheets. Python and spreadsheets like Excel make a good pair: Excel for editing and formatting data, and Python for cleaning and extracting more powerful insights.
