When analyzing a specific set of files and folders, there are several types of analysis one might be interested in. For example:

- the contents of the files themselves (emails, Word documents, PDF files, and so on)
- the structure of the folders and the files they contain
- the types of files present, along with their extensions
- the sizes of the files
- checksums of the files, which reveal duplicated content
For the purposes of an enterprise or system/data migration, all of the above are likely of interest. However, it is quite difficult to build a generic tool that can ingest every type of file, because data migrations span many different industries and systems. In the legal industry, for example, analyzing emails, Word documents, PDF files, and other document types might matter most; in manufacturing, analyzing engineering bills of materials and engineering diagrams would likely be far more important than the contents of emails. While different data and content migrations may require different types of analysis of the data within the files, we believe that all migrations, without exception, benefit from an analysis that describes the structure of the folders and their files, along with each file's type, extension, size, and checksum.
Knowing the answers to these questions is essential before beginning any kind of enterprise data migration. Over many years of experience, we have found that every enterprise data migration must begin with an analysis of the data source(s), and to that end we have created a tool that performs this pre-migration analysis of a data source. Note that the standard ETL process generally does not include this analysis step; feel free to check out our article on Misunderstanding ETL to learn more.
Our solution is a tool that recursively walks a data source and, for every file it finds, records the path, file name, base name, extension, size, and a checksum in a small database (building one yourself requires intermediate coding skills; alternatively, you can use our free tool below, or contact us if you need help). The tool stores its results in two MySQL tables, and a sketch of the ingestion code follows the schema:
CREATE TABLE `rootpath` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pathToRoot` varchar(500) DEFAULT NULL,  -- path of the scanned source root
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
CREATE TABLE `file` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `fileExists` tinyint(1) DEFAULT NULL,  -- whether the file was still present at scan time
  `rootpath_id` int(11) DEFAULT NULL,    -- the source root this file was found under
  `filePath` varchar(300) DEFAULT NULL,  -- folder containing the file
  `fileName` varchar(100) DEFAULT NULL,  -- name including extension
  `baseName` varchar(100) DEFAULT NULL,  -- name without extension
  `extension` varchar(10) DEFAULT NULL,
  `fileSize` int(11) DEFAULT NULL,       -- size in bytes
  `checkSum` varchar(32) DEFAULT NULL,   -- 32-char hex digest of the contents (e.g. MD5)
  `processed` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK_rp_rootpathid` (`rootpath_id`),
  KEY `rootPathId_filePath_fileName` (`rootpath_id`,`filePath`,`fileName`),
  KEY `fileName_checkSum` (`fileName`,`checkSum`),
  CONSTRAINT `FK_rp_rootpathid` FOREIGN KEY (`rootpath_id`) REFERENCES `rootpath` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
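To make the ingestion step concrete, here is a minimal Python sketch of how such a tool can populate the two tables. It is a sketch, not our production tool: the PyMySQL driver, the connection settings, and the choice of MD5 (which matches the 32-character checkSum column) are assumptions you can swap for your own.

import hashlib
import os

import pymysql  # assumption: the database is MySQL, reached via the PyMySQL driver


def md5_of(path, chunk_size=1 << 20):
    """Stream the file through MD5 so large files don't exhaust memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()  # 32 hex characters, matching varchar(32)


def ingest(root, conn):
    """Recursively walk `root` and record one row in `file` per file found."""
    with conn.cursor() as cur:
        cur.execute("INSERT INTO rootpath (pathToRoot) VALUES (%s)", (root,))
        rootpath_id = cur.lastrowid
        for dirpath, _subdirs, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                base, ext = os.path.splitext(name)
                cur.execute(
                    "INSERT INTO file (fileExists, rootpath_id, filePath, fileName,"
                    " baseName, extension, fileSize, checkSum, processed)"
                    " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
                    (1, rootpath_id,
                     os.path.relpath(dirpath, root),  # folder, relative to the root
                     name, base, ext.lstrip('.').lower(),
                     os.path.getsize(full), md5_of(full),
                     0))  # processed = 0: not yet analyzed
    conn.commit()


if __name__ == '__main__':
    # assumption: a local MySQL instance where the two tables above already exist
    conn = pymysql.connect(host='localhost', user='analyst',
                           password='secret', database='file_analysis')
    ingest('/data/source', conn)
    conn.close()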
Feel free to use the following tool to analyze your dataset. To use it, simply zip up your data and upload it. Your Zip Archive may contain both files and folders, including arbitrarily nested sub-folders; we will recursively ingest the data and present a database table explorer so you can query the results (a couple of starter queries appear below). Note that there is a limit on the upload size, so you won't be able to upload hundreds of megabytes or gigabytes of data. If you do need a larger dataset analyzed, contact us and let us know; we'd be happy to help!
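Once your data is loaded, queries like the following two (written against the schema above) answer the most common pre-migration questions: what types of files are there and how much space do they take, and which files are duplicated?

-- File count and total size per extension, largest footprint first
SELECT extension, COUNT(*) AS fileCount, SUM(fileSize) AS totalBytes
FROM file
GROUP BY extension
ORDER BY totalBytes DESC;

-- Duplicate content: the same checksum appearing under more than one file
SELECT checkSum, COUNT(*) AS copies
FROM file
GROUP BY checkSum
HAVING COUNT(*) > 1
ORDER BY copies DESC;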