Optimization: MySQL vs. Many Flat Files and HDD Utilization

My end goal is to run a machine learning algorithm for text mining: research code that is thus far unproven and unpublished. The text has already been obtained; it was scraped from WARC files obtained from the Common Crawl. I'm in the process of preparing the data for machine learning, and one desirable analysis task is IDF (Inverse Document Frequency) analysis of the corpus before launching into the ML application proper.

It's my understanding that for IDF to work, each file should represent one speaker or one idea, generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, so that each comment from the social media site in question falls into an element of an array (and in a more strongly typed language would be of type string).

From here I'm experiencing performance issues. I've let my script run all day, and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker, containing text stripped of HTML with Perl's HTML::Strip module. Looking at my system, I see that disk utilization on my local data drive is very high: there is a tremendous number of ASCII text writes, each much smaller than 1 KB, and each being crammed into a 1 KB sector of my local NTFS-formatted HDD.

Is it worthwhile to stop the run, set up a MySQL database on my home system with a text field of perhaps 500-1000 characters in maximum length, and then rerun the Perl script so that it slurps an input HTML file, splits it, strips the HTML, and then prepares and executes a string insert against a database table?

In general, will switching from an output format of a tremendous number of individual text files to a tremendous number of database inserts be easier on my hard drive and faster to write out in the long run, thanks to some caching or RAM/disk-space utilization magic in the DBMS?


A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending on the OS and file system you are using. In particular, file systems differ significantly in how access times scale with the number of files within one directory. For example, see “NTFS performance and large volumes of files and directories” and “How do you deal with lots of small files?”: “NTFS performance severely degrades after 10,000 files in a directory.”
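If the output does stay on the file system, one common way to keep any single directory small is to spread files over nested subdirectories derived from a hash of the record key. The following is only a sketch in Python (not the poster's Perl); the function name `sharded_path`, the key format, and the two-level, two-character layout are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def sharded_path(root, key, levels=2, width=2):
    """Map a record key to a nested path such as root/ab/cd/<digest>.txt.

    The directory names are taken from the leading hex characters of the
    key's SHA-1 digest, so files spread roughly evenly over the buckets.
    """
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, f"{digest}.txt")


def write_record(root, key, text):
    """Write one record to its sharded location, creating directories as needed."""
    path = sharded_path(root, key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

With two levels of two hex characters there are 65,536 leaf directories, so 2 million output files would average roughly 30 files per directory. This addresses only the directory-size effect; it does not reduce the number of files created.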

You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
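As a rough illustration of the single-file approach, here is a minimal sketch using Python's built-in `sqlite3` module rather than the poster's Perl. The table name `comments`, its columns, and the batch size are assumptions made for this example. Rows are inserted in batches with one transaction per batch, so that many small records are committed together instead of one at a time.

```python
import sqlite3


def create_schema(conn):
    """Create the (hypothetical) table holding one stripped comment per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id INTEGER PRIMARY KEY, "
        "source_file TEXT NOT NULL, "
        "body TEXT NOT NULL)"
    )


def store_comments(conn, comments, batch_size=10_000):
    """Insert (source_file, body) pairs in batches; return the number of rows stored."""
    sql = "INSERT INTO comments (source_file, body) VALUES (?, ?)"
    batch = []
    total = 0
    for row in comments:
        batch.append(row)
        if len(batch) >= batch_size:
            with conn:  # commits the whole batch as one transaction
                conn.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        with conn:
            conn.executemany(sql, batch)
        total += len(batch)
    return total
```

A caller would open the database once with `sqlite3.connect("comments.db")` (the file name is a placeholder), call `create_schema`, and then feed `store_comments` the pairs produced by the split-and-strip step. Whether this is actually faster than writing files on a given machine is something to measure rather than assume.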

On the other hand, 2 million records is not that many, which suggests that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and using a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may cut the processing time quite noticeably.
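To make the profiling suggestion concrete, here is a small sketch using Python's standard `cProfile` and `pstats` modules. The helper name `profile_call` is an assumption for this example, and the function being profiled stands in for whatever processes one batch of input files. For the original Perl script, a Perl profiler such as Devel::NYTProf would play the same role.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text).

    The report lists the `top` entries sorted by cumulative time, which
    shows whether file creation, HTML stripping, or something else dominates.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Running this on a small test workload of, say, a few hundred input files should be enough to see which calls account for most of the time before committing to a redesign.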

  • Posted on 2019-01-21 00:33