Parse Dates Efficiently
29/05/2018
import os
import numpy as np
import pandas as pd
Import data from a TSV file
df = pd.read_csv("flight_edges.tsv", sep='\t', header=None)
df.columns = [
'Origin','Destination',
'Origin City', 'Destination City',
'Passengers','Seats','Flights','Distance','Fly Date',
'Origin Population', 'Destination Population'
]
df.head()
| | Origin | Destination | Origin City | Destination City | Passengers | Seats | Flights | Distance | Fly Date | Origin Population | Destination Population |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | MHK | AMW | Manhattan, KS | Ames, IA | 21 | 30 | 1 | 254.0 | 200810 | 122049 | 86219 |
1 | EUG | RDM | Eugene, OR | Bend, OR | 41 | 396 | 22 | 103.0 | 199011 | 284093 | 76034 |
2 | EUG | RDM | Eugene, OR | Bend, OR | 88 | 342 | 19 | 103.0 | 199012 | 284093 | 76034 |
3 | EUG | RDM | Eugene, OR | Bend, OR | 11 | 72 | 4 | 103.0 | 199010 | 284093 | 76034 |
4 | MFR | RDM | Medford, OR | Bend, OR | 0 | 18 | 1 | 156.0 | 199002 | 147300 | 76034 |
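As a possible shortcut, the column names can be handed to `read_csv` through its `names` parameter instead of being assigned afterwards. The sketch below is self-contained: it assumes an in-memory string (`raw`, a hypothetical stand-in for flight_edges.tsv) built from the first two rows of the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for flight_edges.tsv: two tab-separated rows
# copied from the df.head() output above.
raw = (
    "MHK\tAMW\tManhattan, KS\tAmes, IA\t21\t30\t1\t254.0\t200810\t122049\t86219\n"
    "EUG\tRDM\tEugene, OR\tBend, OR\t41\t396\t22\t103.0\t199011\t284093\t76034\n"
)

columns = [
    'Origin', 'Destination',
    'Origin City', 'Destination City',
    'Passengers', 'Seats', 'Flights', 'Distance', 'Fly Date',
    'Origin Population', 'Destination Population',
]

# header=None says the file has no header row; names= supplies the labels.
flights = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=columns)
```

With the real file, `io.StringIO(raw)` would simply be replaced by the path "flight_edges.tsv".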
Parse Dates (efficiently)
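# pd.datetime has been removed from recent pandas releases; use the standard library instead
from datetime import datetime
date_parser = lambda x: datetime.strptime(str(x), '%Y%m')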
# Slow way (30 seconds)
df['Date'] = df['Fly Date'].apply(date_parser)
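# Faster way (1.5 seconds): parse each distinct 'Fly Date' value only once
df['date_index'] = df['Fly Date']
# One row per distinct date, so date_parser runs once per month rather than once per row
dates = df.groupby(['date_index']).first()['Fly Date'].apply(date_parser)
# Assigning a Series aligns on the index, so each parsed date is copied to all of its rows
df = df.set_index(['date_index'])
df['Date'] = dates
df = df.reset_index()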
df.head()
| | date_index | Origin | Destination | Origin City | Destination City | Passengers | Seats | Flights | Distance | Fly Date | Origin Population | Destination Population | Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200810 | MHK | AMW | Manhattan, KS | Ames, IA | 21 | 30 | 1 | 254.0 | 200810 | 122049 | 86219 | 2008-10-01 |
1 | 199011 | EUG | RDM | Eugene, OR | Bend, OR | 41 | 396 | 22 | 103.0 | 199011 | 284093 | 76034 | 1990-11-01 |
2 | 199012 | EUG | RDM | Eugene, OR | Bend, OR | 88 | 342 | 19 | 103.0 | 199012 | 284093 | 76034 | 1990-12-01 |
3 | 199010 | EUG | RDM | Eugene, OR | Bend, OR | 11 | 72 | 4 | 103.0 | 199010 | 284093 | 76034 | 1990-10-01 |
4 | 199002 | MFR | RDM | Medford, OR | Bend, OR | 0 | 18 | 1 | 156.0 | 199002 | 147300 | 76034 | 1990-02-01 |
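The same idea (parse each distinct value once, then spread the result over the rows) can also be sketched without the temporary index, and pandas' own `pd.to_datetime` with an explicit `format` is another option worth timing. This is an illustrative sketch on a small made-up frame (`sample` is hypothetical, with values taken from the output above), not a benchmark of the original data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 'Fly Date' column (YYYYMM integers).
sample = pd.DataFrame({'Fly Date': [200810, 199011, 199012, 199010, 199002, 199011]})

# Option 1: parse every distinct value once, then map the results onto the rows.
unique_values = sample['Fly Date'].drop_duplicates()
parsed = pd.to_datetime(unique_values.astype(str), format='%Y%m')
lookup = pd.Series(parsed.values, index=unique_values.values)
sample['Date_mapped'] = sample['Fly Date'].map(lookup)

# Option 2: let pandas parse the whole column in one call with an explicit format.
sample['Date_vectorized'] = pd.to_datetime(sample['Fly Date'].astype(str), format='%Y%m')
```

Both columns should hold the first day of each month. Which option is faster depends on the pandas version and on how many distinct dates the column contains, so it is worth measuring on the real data.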
Author: Andrea Barbon