“Programmers are people with a high level of patience and tolerance.” While this may sound true, I do not think programmers are nearly as tolerant of delays and waits that have nothing to do with the problem itself, such as extremely slow hardware or a sluggish internet connection.
As a research assistant, you spend hours a day reading hundreds of papers to find the latest solutions and approaches to the problem you study, so that you can extend or improve them. As a programmer on a commercial project, you are short of time and have deadlines to meet. Either way, not only is time not on your side, but it is also extremely nerve-racking when slow processes become the bottleneck of your progress!
What keeps you waiting?
When developing deep learning code, it is very important to have an efficient way of reading data from the dataset. Sometimes your data comes as thousands of files, or it resides in a database. Either way, loading it can take a while. One might argue that a few minutes of loading is negligible compared to the training time, which may add up to several hours or even days.
Believe me, when you are developing and debugging the implementation of your model or approach, you reload the data on every run, and even 30 seconds of waiting can be very annoying! I am pretty sure all data scientists have been there!
In this blog post I share my solution to this problem.
Cache is the Answer!
In a nutshell, my solution is to store the data in a single cache file so that your program reads from only one place on storage. The alternatives are reading from thousands of places (i.e. from multiple files), which takes much longer, or querying the same data from a database every single time, possibly with some additional intermediate processing. Now that we have a general idea of the solution, let's get our hands on some code!
Show me the code!
The code for implementing this logic is straightforward. First of all, the dataset class that inherits from the abstract Dataset class of PyTorch is defined as below:
We’ll speak about the load_from_cache method later. We need to define the following functions in order to have the dataset loader work properly:
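For a map-style PyTorch dataset, the methods in question are `__len__` and `__getitem__`. Below is a minimal sketch of such a class, not the exact code from the repository. The class name `CachedDataset` is hypothetical, and the samples are passed in directly instead of being filled in by `load_from_cache`, so that the snippet runs on its own. The import fallback is only there so the sketch also runs on a machine without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so this sketch still runs without PyTorch installed.
    Dataset = object


class CachedDataset(Dataset):
    """Hypothetical map-style dataset that keeps all samples in a list."""

    def __init__(self, samples):
        # In the post, this list is filled by load_from_cache();
        # here the samples are passed in directly to keep the sketch short.
        self.dataset = list(samples)

    def __len__(self):
        # Number of samples; DataLoader uses this to build its index range.
        return len(self.dataset)

    def __getitem__(self, index):
        # Return one sample; DataLoader calls this once per index in a batch.
        return self.dataset[index]


ds = CachedDataset([(0.1, 0), (0.2, 1), (0.3, 0)])
print(len(ds), ds[1])
```

Because everything is already held in `self.dataset`, both methods are trivial. All of the expensive work happens once, when the list is populated.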
The magic happens inside the following method. It first tries to load the dataset from the cache; if no cache file is available, it loads the data from its source and then stores it in the cache:
Notice the input flag force_recache. It forces the method to create the cache file again, regardless of whether a cache file already exists.
The two methods loaddata() and cache_dataset() have self-explanatory names. The former contains the logic for loading data from multiple files or querying it from a database, and stores the data as a list in the self.dataset property.
The latter, however, simply stores the data in a cache file if no cache file is available, in the following way:
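The control flow described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the repository's code: the class name `CachedLoader` is hypothetical, `pickle` is assumed as the cache format (the post does not say which one is used), and `loaddata()` is a stand-in that fabricates a tiny list instead of reading real files. In this sketch `load_from_cache` returns whether the cache was hit, purely to make the behaviour easy to check.

```python
import os
import pickle
import tempfile


class CachedLoader:
    """Hypothetical holder for the caching logic described in the post."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.dataset = []

    def loaddata(self):
        # Stand-in for the slow path (many files or a database query).
        self.dataset = [[i, i * i] for i in range(5)]

    def cache_dataset(self):
        # Assumption: pickle is used as the cache format.
        with open(self.cache_path, "wb") as f:
            pickle.dump(self.dataset, f)

    def load_from_cache(self, force_recache=False):
        # Fast path: reuse the cache unless the caller asks for a rebuild.
        if not force_recache and os.path.exists(self.cache_path):
            with open(self.cache_path, "rb") as f:
                self.dataset = pickle.load(f)
            return True  # cache hit
        # Slow path: load from the source, then (re)write the cache.
        self.loaddata()
        self.cache_dataset()
        return False  # cache miss or forced rebuild


with tempfile.TemporaryDirectory() as tmp:
    loader = CachedLoader(os.path.join(tmp, "dataset.pkl"))
    first = loader.load_from_cache()                     # no cache file yet
    second = loader.load_from_cache()                    # cache file exists
    forced = loader.load_from_cache(force_recache=True)  # rebuild anyway
    print(first, second, forced)
```

Forcing a rebuild is useful whenever the source data or the preprocessing changes, since a stale cache would otherwise be served silently.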
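As a rough sketch of these two helpers, written as standalone functions so the snippet runs on its own: the file layout (one comma-separated record per file), the file names, and the use of `pickle` are all assumptions made for illustration, not details taken from the post.

```python
import os
import pickle
import tempfile


def loaddata(data_dir):
    """Read every file in data_dir (sorted by name) into a list of records."""
    dataset = []
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            dataset.append([float(x) for x in f.read().split(",")])
    return dataset


def cache_dataset(dataset, cache_path):
    """Write dataset to cache_path only if no cache file exists yet.

    Returns True if a file was written, False if one was already there.
    """
    if os.path.exists(cache_path):
        return False
    with open(cache_path, "wb") as f:
        pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Create three tiny hypothetical source files to load from.
    data_dir = os.path.join(tmp, "raw")
    os.makedirs(data_dir)
    for i in range(3):
        with open(os.path.join(data_dir, f"sample_{i}.csv"), "w") as f:
            f.write(f"{i},{i + 0.5}")

    cache_path = os.path.join(tmp, "dataset.pkl")
    dataset = loaddata(data_dir)
    wrote_first = cache_dataset(dataset, cache_path)
    wrote_again = cache_dataset(dataset, cache_path)  # skipped: file exists

    with open(cache_path, "rb") as f:
        restored = pickle.load(f)
    print(wrote_first, wrote_again, restored == dataset)
```

If the samples are tensors rather than plain Python lists, `torch.save` and `torch.load` would be a natural alternative to `pickle` here.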
The Code In Action!
Here is a piece of code that loads the data with both approaches (i.e. with and without the cache) and compares them. The primary method of loading data here is reading from thousands of separate files, which is the case in my own research. Since I use the MIMIC-III dataset, which requires permission to access, I have provided a fake data generator that produces data similar in format to the real data.
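A comparison of this kind might be sketched as below. Everything here is an assumption made for illustration: the generator `generate_fake_files`, the number and size of the files, and the `pickle` cache format are hypothetical and are not the MIMIC-III layout or the repository's code.

```python
import os
import pickle
import random
import tempfile
import time


def generate_fake_files(data_dir, n_files=2000, n_values=20, seed=0):
    """Write n_files small CSV-like files filled with random numbers."""
    rng = random.Random(seed)
    os.makedirs(data_dir, exist_ok=True)
    for i in range(n_files):
        path = os.path.join(data_dir, f"record_{i:05d}.csv")
        with open(path, "w") as f:
            f.write(",".join(f"{rng.random():.6f}" for _ in range(n_values)))


def load_from_files(data_dir):
    """Slow path: open and parse every file separately."""
    dataset = []
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            dataset.append([float(x) for x in f.read().split(",")])
    return dataset


with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "raw")
    cache_path = os.path.join(tmp, "dataset.pkl")
    generate_fake_files(data_dir)

    # Without cache: read thousands of separate files.
    start = time.perf_counter()
    from_files = load_from_files(data_dir)
    t_files = time.perf_counter() - start

    # Build the cache once (not timed).
    with open(cache_path, "wb") as f:
        pickle.dump(from_files, f)

    # With cache: read a single file.
    start = time.perf_counter()
    with open(cache_path, "rb") as f:
        from_cache = pickle.load(f)
    t_cache = time.perf_counter() - start

    print(f"without cache: {t_files:.4f}s, with cache: {t_cache:.4f}s")
```

The absolute timings depend on the machine, the storage device, and the operating system's own file caching, so they are best compared relative to each other.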
The code above prints the following output:
We can see a significant decrease in the time it takes to load the data.
The complete code of this blog can be found in the following repository:
In this blog post we have reviewed how using a cache can speed up loading a dataset, making life easier while developing and debugging a deep learning model in PyTorch.