A common myth I hear very frequently is that you can’t work with more than 1 million records in Excel. Actually, the right myth should be that you can’t use more than 1,048,576 rows, since this is the number of rows on each sheet; but even this one is false.
In this post I’ll debunk this myth by creating a PivotTable from 50 million records in Excel.
To make things more interesting, I’ll import data from 20 different text files (.csv) with 2.5 million records each.
To accomplish this, I’ll use two Excel tools: Power Pivot and Power Query. Power Query is also known as ‘Get and Transform’ in Excel 2016.
For this post I’ll be using sales records with the following fields: Region, Product, Date, and Sales. The desired goal is to be able to analyze the sales performance by year and region.
If you want to follow along, please download the files from this link.
If you don’t have Power Query on your computer, you can download it from here: Power Query Download.
The process I’ll follow is:
Data import and cleaning
As mentioned before, the data are contained in 20 text files. Therefore, the first step is to import and append the information from these files.
Note: I’ll use Excel 2016, however, the steps are the same on previous Excel versions. If you have Excel 2010/2013, go to the Power Query tab instead of the Data tab.
Step 1: Import the data into Excel using Power Query.
Go to Data New Query From File From Folder
Click on ‘Browse’ and browse for the folder that contains the files, then click OK.
Another option (the one I generally use), is to copy the path of the folder and paste it on the folder path box.
Once you click OK, press Edit on the next window.
Then expand the content by clicking on the double arrow button
Once the data is imported it will look like this:
Step 2: Remove the headers from each file
The files will be imported with headers, so you must remove them. For this you can go to any of the columns and remove the column name from the options. For example, go to the ‘Region’ column and setup a filter to exclude the word ‘Region’.
Step 3: Load the data into the Power Pivot Data Model.
After removing the headers, you just need to load the data into the Power Pivot Data Model. To do this go to File Close & Load To…
On the ‘Load To’ dialog box, select ‘Only Create Connection’, then click on the checkbox ‘Add this data to the Data Model’ and click on Load.
After you click Load, you’ll be able to use the data within Power Pivot.
Modify the Power Pivot Data Model
To make modifications to the Data Model, such as adding other columns, you can open the Power Pivot window.
To Open Power Pivot, go to the Power Pivot tab and click on Manage.
If the Power Pivot tab is not visible follow the instructions on this link to enable it.
For this example, I’ll add a column called Year to calculate the year of the date column. To add a column, go to the rightmost column and double-click the header, then type the desired name.
Then on the first row of the new column type the formula ‘=YEAR([Date])’ and press enter. The years will be calculated after pressing Enter.
Important: Another way of adding the Year column is to do it in Power Query. In this way, you don’t have to open the Power Pivot window to modify the Data Model since the Year would already be part of the source data.
Creating the PivotTable
Once the Data Model is ready, you can create the PivotTable by clicking on the PivotTable button on the Home Tab of the Power Pivot Window.
Then select the location of the PivotTable (New worksheet or Existing worksheet) and click OK.
Once you click OK, the PivotTable Fields List will appear. In this example, drag the Region field to the Columns Area, the Year field to the Rows area, and the Sales field to the Values area.
After these steps, you should get the following PivotTable with the Sales by Region and Year from 50 MILLION records.
You can take this even further and create PivotChart from the existing PivotTable.
Click on any cell of within the PivotTable and go to Insert PivotChart.
Then go to Line Line with Markers
You should get a chart like this (After a few formatting tweaks, such as adding Axis Labels, Title, …)
Finally, you can add visual filters (Slicers and timelines). Slicers allow you to filter by categorical fields; timelines allow you to filter dates.
Add a Slicer
To add a Slicer for the PivotChart, select the PivotChart and in the Analyze tab click on ‘Insert Slicer’.
Select the field to be filtered (e.g. Region) and click ‘OK’.
Add a timeline
To add a Timeline for the PivotChart, select the PivotChart and in the Analyze tab click on ‘Insert Timeline’.
Select the date field to be filtered (e.g. Date) and click ‘OK’.
Voila!
The end result should be a dynamic chart with filtering capabilities as shown below. Again, you’re dynamically visualizing millions of records.
You can download the final file from this link.
Please share this post with other people so they can benefit as well.
If you want to get notified when new posts become available. Subscribe for free to Master Data Analysis!!
View Comments
thank you for sharing, how long does it take to do this process( if its possible step by step)? and this is relevant with your hardware and excel version, can you give detail about your pc ?
Thanks
Hi Taner,
Thanks for your interest.
Doing the whole process took me 10 mins approximately. Most of the time was importing the data (~ 5min), then changing the data model (~ 1 min) and creating and formatting the PivotChart with the slicers (~ 4 min).
I have Excel 2016 (64-bit), however, if you have one of the latest versions of Excel 2013/2016 (32 bit) you should have no problems, read this: https://support.microsoft.com/en-us/kb/3160741
I did this in a fairly standard laptop (2.5GHz Processor, 8GB RAM), however you don't need this capacity to perform the analysis shown in this post.
Please let me know if you have any further questions.
Hey There,
Thanks so much for a great article! I am having a little trouble though and I was hoping you could help.
When I get to step three and try to link the data, the box at the bottom you say to check "add this data to the Data Model" does not exists.
I still get the query acknowledging the data connection exists, but when I go to create a pivot table and link them together, it says "connecting to datasource" in hte obttom right corner. Then it says "Reading Data:" and start to count up to 8 million. When it gets there, a popup says "Problem Obtaining Data."
I am using Excel 2010. Might this be the problem?
Thanks so much for any help!
Hi Joey,
Thanks for asking questions.
Your problem might be with your Excel version (32bit vs 64bit). If you're working with Excel 2010 64 bit you should be able to accomplish this.
With respect to the 'Add this data to the Data Model' option, you're right, this was a feature integrated in the version Excel 2013. Therefore, you won't have it on Excel 2010.
This post covers how to load data from Power Query to Power Pivot on Excel 2010:
https://businessintelligist.com/2014/02/07/how-to-load-power-query-directly-into-power-pivot/
Great post, nice explanation, able to download files and did it my self. Beautifully worked.
Hi Vijay,
Thanks so much for your words! These posts take a lot of time and effort to write, so it is nice to hear they are useful.
Great article. Are you limited to linking to 20 files? Can you mix linking to excel files, csv files and access databases?
Hi Alice,
Thanks for asking!
No, you're not limited to importing just 20 excel files. The second image of this post shows that there are several options to import information to Excel.
For Example, you could go to New Query -> From File -> From Folder and import all the files in the folder. Then you could go to New Query -> From Database -> From Microsoft Access and import data from a database and so on...
Once the data is in Power Query, you can clean it, append it, join it, etc