How to find Airline Data for your Data Science project?
This article will help you use the latest airline data for your data science projects.
Airline travel is interesting to study because it involves movement in both space and time. It is also fascinating to study an industry that keeps evolving. However, much of the data used in data science projects is outdated and does not represent the current state of travel. In this article I will provide some links to public data that can be used to work on projects. I will also list some projects that can be done using these data sources. I will update this page if I am informed of any other data sources.
I find this source useful. OurAirports is another data source, with cross-referenced runway data. Latitude and longitude are the most important fields. It is useful to have code that computes the distance between two locations. Please make sure that the computed distance closely matches the distance recorded in the operational data (next section).
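As a minimal sketch of such a distance computation, the haversine formula gives the great-circle distance between two latitude/longitude pairs. The function name, the coordinates for JFK and LAX, and the conversion to statute miles are my own assumptions for illustration; check the unit of the distance column in the operational data before comparing.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
KM_PER_MILE = 1.609344       # statute mile


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate airport coordinates, used here only as illustrative inputs.
jfk = (40.6413, -73.7781)
lax = (33.9416, -118.4085)

distance_miles = haversine_km(*jfk, *lax) / KM_PER_MILE
print(round(distance_miles))
```

The haversine formula treats the Earth as a sphere, so a small difference from the recorded distance is expected. A large difference usually means swapped latitude and longitude or a unit mismatch.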
The Bureau of Transportation Statistics site has all operational data recorded in the United States of America. The data is published with a lag of 30–60 days and has to be downloaded month by month.
Some key points to note if you want to use this data:
- Many of the columns are computed fields. Before trying to compute any number, please check whether it is already present in one of the columns (delay minutes, departure delay minutes, etc.). Many data warehouses charge you by the columns you query, so my recommendation is to load either all columns or only the columns up to the first diversion (skip all columns after div2).
- For departure and arrival times there are both planned and actual times. Columns for planned times begin with “CRS”.
- The Air_Time column is the time spent in the air. The Taxi_Out column is the time spent going from gate to runway at the origin. The Taxi_In column is the time spent going from runway to gate at the destination. The Actual_Elapsed_Time column is the sum of the air time and the taxi times.
- We have to be careful about flights that have been diverted. For most projects it is better to filter them out using the Diverted column.
- Only the tail number of the aircraft is recorded. This is a unique number given to an individual aircraft when it is registered. If we need information about the aircraft type, this number can be cross-referenced with the aircraft data in the next section.
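The points above can be sketched with the standard library alone. The example below drops diverted flights and checks that the elapsed time equals air time plus taxi times. The sample rows are made up, and the exact column spellings are an assumption; adjust them to match the headers in the file you download.

```python
import csv
import io

# Made-up sample rows; real files have many more columns and records.
SAMPLE = """Tail_Number,Diverted,Air_Time,Taxi_Out,Taxi_In,Actual_Elapsed_Time
N101AA,0,120,15,5,140
N202BB,1,,20,,
N303CC,0,60,10,8,78
"""

TIME_COLUMNS = ["Air_Time", "Taxi_Out", "Taxi_In", "Actual_Elapsed_Time"]


def load_non_diverted(handle):
    """Return non-diverted rows that have all time columns filled, parsed as floats."""
    rows = []
    for row in csv.DictReader(handle):
        if float(row["Diverted"]) != 0:
            continue  # skip diverted flights
        if any(row[col] == "" for col in TIME_COLUMNS):
            continue  # skip rows with missing times
        for col in TIME_COLUMNS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def elapsed_matches(row, tolerance=0.5):
    """Check that Actual_Elapsed_Time is the sum of air time and both taxi times."""
    total = row["Air_Time"] + row["Taxi_Out"] + row["Taxi_In"]
    return abs(total - row["Actual_Elapsed_Time"]) <= tolerance


flights = load_non_diverted(io.StringIO(SAMPLE))
mismatches = [f["Tail_Number"] for f in flights if not elapsed_matches(f)]
print(len(flights), mismatches)
```

Running a check like this on one month of data is a quick way to confirm that you understand the columns before building anything on top of them.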
Airlines register their aircraft with various authorities. For FAA-registered aircraft you can find the tail number along with equipment information in this database. We can cross-reference the tail number to find the details of the aircraft type for a particular aircraft. Please note that this information is not available for aircraft not registered with the FAA. If you use the operational data from the previous section, most of the tail numbers are registered.
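A rough sketch of this cross-reference is a dictionary lookup keyed on a normalized tail number. The field names (`n_number`, `model`), the sample records, and the assumption that the registry stores the number without the leading "N" are all hypothetical; verify them against the files you actually download.

```python
def normalize_tail(tail_number):
    """Upper-case, trim, and drop a leading 'N' so both sources share one key."""
    tail = tail_number.strip().upper()
    return tail[1:] if tail.startswith("N") else tail


# Made-up registry records standing in for the aircraft database.
registry_rows = [
    {"n_number": "101AA", "model": "EXAMPLE-MODEL-1"},
    {"n_number": "303CC", "model": "EXAMPLE-MODEL-2"},
]
registry = {normalize_tail(r["n_number"]): r for r in registry_rows}


def aircraft_for(tail_number):
    """Return the registry record for a tail number, or None if it is not registered."""
    return registry.get(normalize_tail(tail_number))


# Tail numbers as they might appear in the operational data (made up).
flight_tails = ["N101AA", "n303cc ", "XA-ABC"]
matched = [t for t in flight_tails if aircraft_for(t) is not None]
match_rate = len(matched) / len(flight_tails)
print(matched, match_rate)
```

Tracking the match rate is worthwhile: tail numbers that do not match are either registered elsewhere or recorded inconsistently, and both cases deserve a look before you analyze aircraft types.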
Working with Data
Aircraft and airport data are small enough that they can be loaded into data frames in R or Python. Operational data for flights has about 7 million records per year (for US domestic flights). It is possible to work with partial data. To work on trends across years and months, I would highly recommend loading the data into a data warehouse. I have used BigQuery on GCP to work on nearly 50 million records. Very complex queries that need data across many years can be executed within a few seconds. By combining BigQuery with Python, I was able to generate the data required for many projects within a few minutes.
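To illustrate the kind of query this enables, here is a small sketch that aggregates flights by year and month. It uses an in-memory SQLite database from the Python standard library as a stand-in for a warehouse such as BigQuery, so it runs anywhere. The table name, column names, and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (year INTEGER, month INTEGER, carrier TEXT, dep_delay_minutes REAL)"
)

# Made-up records; a real table would hold millions of rows loaded month by month.
rows = [
    (2022, 1, "AA", 10.0),
    (2022, 1, "BB", 20.0),
    (2022, 2, "AA", 0.0),
    (2023, 1, "BB", 30.0),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)

# Flight count and average departure delay per month, across years.
query = """
    SELECT year, month, COUNT(*) AS flights, AVG(dep_delay_minutes) AS avg_delay
    FROM flights
    GROUP BY year, month
    ORDER BY year, month
"""
monthly = conn.execute(query).fetchall()
for year, month, flights, avg_delay in monthly:
    print(year, month, flights, avg_delay)
```

The same GROUP BY pattern carries over to a warehouse; only the connection and the way results are fetched change.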
Here are some project ideas:
- Market trends for traffic, aircraft types, delays, etc.
- Compare historical performance of various airlines
- Flight traffic between airports shown on a map
- Animation of flights shown on a map
- Delay prediction (there are many notebooks available online, but they use outdated data)
I have described the public data sources available for domestic US airline travel. Data for international travel is not easily available. Please drop me a note if you are aware of any public data for any other travel markets.