Gather & Prepare Data for Data Analytics in 7 Steps

CaseWareAnalytics, Auditing, CaseWare News, Industry News, Technology

Analytics in an audit

7 Steps to Data Acquisition for an Audit or Analysis

(Written by David Stansell, Customer Service Manager of CaseWare Australia & New Zealand from information contained in “Analytics for External Auditors” – a publication with detailed information and guidance specifically for external auditors, available for sale.)

Within an External or Internal Audit engagement, knowing where, how, and what to look for within company data is crucial to meeting the audit objectives. Therefore, data gathering is a highly important activity in any audit and it is crucial to the integrity and robustness of the audit outcome to ensure the completeness and accuracy of the data used to draw conclusions.

To achieve successful Data Analytics, it is important to put together a Data Collection Plan that specifies the data required, giving greater assurance that the data provided by the client is meaningful, valid and in a condition that will require minimal manipulation before testing.

The following data collection steps provide some guidance that will enable you to capture the right data for your audits.


Step 1: Identify the Questions and the Assertions You Want to Answer

The Auditor must first understand what the objective is for the audit and what assurances are being tested, including verification that:

  • All assets, liabilities, and equity interests have been recorded and disclosed correctly.
  • Confirm the existence of assets, liabilities, and equity.
  • Valuation of Assets, liabilities, and equity are reported at the correct amount.
  • Transactions and events are recorded in the correct accounting period.


Step 2: Confirm the Data Requirements to Meet the Audit Objectives

Once the objectives and assurances have been identified, the data requirements to meet these objectives can be determined to ensure that the Auditor can draw reasonable conclusions.

Determine what data exists to meet the objectives set and how much data is needed to provide answers for objectives and assurances the Auditor is trying to gain.

In this section, rather than referring to all of the specific data requirements needed for the testing objectives in this guide, all of the relevant data requirements for each of the business processes are detailed within each of their specific sections.


Step 3: Determine the Location or Source of Data and Understand Business Processes and Procedures

This step is critical to understand where the information is stored and how information flows through the business and accompanying systems.

It is also important to identify all of the individuals within the business process who will need to be involved in the data collection process. This can include, but not be limited to:

  • Key stakeholders who may be needed to sign off on the relevant data collection.
  • Key contact who the Auditor will work with to identify the information that the audit requires.
  • An individual who will perform the data extraction. This could be a member of staff who works within the business process itself or depending on the size of the organization may be a member of the IT department/database administrator.

It is also important to note that companies invest significantly to protect their data, including multi-layered approval processes and technology safeguards; therefore, it may be a time-consuming process to obtain client approval for the required data. Understanding the business process and data handling protocols will pinpoint any data restrictions in data collection not otherwise recognized.


Step 4: Refine Data Requirements

Some of the testing objectives that would have been defined in the early stages of the audit, as well as from discussions with the client themselves, may also be based on previous experience with other similar clients or from information obtained from professional bodies or solution providers. Due to this, during the third stage of this process, it may become apparent that the client does not record or have access to certain information that is required. Depending on any regulatory requirements that require the client to provide this information or further opportunity to access this information (Outsource to a 3rd party to produce a report with the required information), you may have to reduce some of your data requirements and testing objectives.


Step 5: Determine the Format of the Data

It is recommended that you provide guidance to determine the type of file formats that the client should select when extracting data from their systems. It is worth noting that in most cases many of the common file formats have their own issues/limitations. Incorrect selection of a file format can potentially cause reporting on inaccurate data. It can also cause delays in the Data Analytics process due to the need for manual modification of the data to put it into the correct format.

This will need to be determined on a client by client basis. Some organizations use 3rd party tools to extract the data from their systems. This means that although the Auditor may have come across the accounting or ERP (Enterprise Resource Planning) software with another client, the formats available may vary.

In order to reduce any issues with requesting the wrong format, it is recommended that the Auditor works with the individual tasked with extracting the data. Based on the advice in this guide and in due course personal experience, identify the top three file formats and extract a small subset of the data (1 weeks’ worth) and import them into IDEA. This should hopefully identify any importation issues and identify the relevant file format to be used when extracting the full periods worth of data.

CaseWare IDEA has the ability to import data from a variety of file formats including AS400, dBase, Microsoft Access, Microsoft Excel, Print reports and Adobe PDF, SAP/AIS, Text and XML.

Three of the most common types of file formats available are Microsoft Excel, delimited text files, and Print reports.


Microsoft Excel

Microsoft Excel is one of the most common file types used and can be imported into IDEA.

There are some specific rules about the structure that the data must be stored within the Excel file. For example, there must be:

    • No blank rows or columns
    • No title rows
    • No sub-totals or grand-totalsThis means that the data must be a straight extract from the system without any grouping or formatting applied. The Auditor must be aware that reports in a print to view/paper format can be output to an Excel file format and will most likely require manual modification to remove groupings, blank lines, subtotals and grand totals which in itself is open to potential mistakes. One reason that the Microsoft Excel format should not potentially be the first considered format, is due to the fact that depending on the quality of the reporting software all values will be written into the Excel file in the ‘General’ format. The potential impact being that reference numbers will be treated as numbers and leading zeros will be removed. Specific examples of this would be telephone numbers and bank sort codes and account numbers. Should the Auditor want to compare these to other datasets, this could lead to valid matches not being identified.

Plain Text Files

Plain text files are commonly used to transfer large files between companies and systems. A plain text file is a file storing tabular data in plain text.

There are two main types of plain text files that IDEA can import:

Fixed Length

Common file extensions include .txt, and .fxd.

They are known as fixed length due to the fact that each record in the file is exactly the same amount  of characters and that the fields contained within the records always start in the same character position.


One of the most common types of delimited files are CSV (Comma Separated Value) files. Other common file extensions include .asc, and .del.

Unlike fixed-length files that use spaces to ensure that fields start in the same position, delimited files use special characters to define where fields start and end and is the most efficient type of text file, due to the fact that the records are only as long as the values contained within them.

It is important to note that delimited text files can sometimes be exported from core systems configured to use different delimiters. Examples include tab, colon, semicolon, and space. The choice of delimiter can, in some cases, be critical as common delimiters can also exist within the actual data itself, for example, commas can also be used in names and addresses.  

If the option to select your own delimiter is available, we recommend selecting pipe (|), tilde (~), or not sign (¬). In essence, any selected delimiter must not exist within any of the records other than being used as a delimiter.

Another option sometimes available is the inclusion of text encapsulators, the most common being quote marks (“”). By wrapping encapsulators around text values any field delimiters that occur between the encapsulators are ignored (e.g., “Smith, John”, “12, High Street”,).

Whether it be a delimited or fixed length file that is chosen, it is advised to obtain technical information about the file’s record layout. Key information which will assist when importing these files include:

    • Field Name (If not already included within the files themselves)
    • Field Type (Numeric, date, time, character)
    • Delimiter (Delimited file only)
    • Encapsulator (Delimited file only)
    • Field length (Fixed length files only)
    • Number of decimals
    • Thousands separator and decimal point characters.


Print Reports

Print reports are probably one of the easiest file types to obtain as nearly all systems have the ability to produce printed reports and are simply an electronic version of a printed document and often contains extra formatting and information. The additional formatting is used to make the printed document as clear as possible for the reader.

IDEA has the capability to import print reports using a built-in tool, Report Reader, and can extract the required data for import from the file. Importing a print report again allows for a greater degree of control, as it allows the user to select and import the fields required for analysis, add in field descriptions and amend field types prior to import.

Bear in mind that although this is a powerful functionality that allows the Auditor to convert print to view reports into structured data for easy analysis, this file type should only be considered if less time-consuming options are not available.

For more guidance on the process of importing data into IDEA, refer to the IDEA and Report Reader Tutorials located in the IDEA documentation folder. More information and guidance can also be found using the IDEA Help function (F1). Also note that if you have access to CaseWare Passport support website further information, tutorials, and videos are also available.


Step 6: Determine How the Data Will Be Transported

Careful consideration also needs to be given to how and when the Auditor will receive the required data. Ideally, all of the pre-preparation and importation of the data at a minimum will have taken place prior to the Auditor going on-site with the client.

The most efficient way of achieving this will be to use a web-based file hosting service (OneDrive Drobox etc..). The Auditor will need to ensure that the chosen service can demonstrate appropriate levels of security to satisfy clients legal requirements and redundancy against loss of data. To reduce upload times of the data and to add further levels of security, it is recommended that the client converts the data into a compressed format (.zip, .rar, .tar) and adds a secure password.

Although using web-based file hosting services are the most efficient way of sharing data, some clients will only allow the sharing of data to take place physically when the Auditor is on-site. As previously mentioned, this is not ideal as the Auditor should have already pre-prepared the data and also gained some insights to assist the audit. In this case it is worth considering, depending on the size of the client, physically collecting the data prior to the start of the audit. This will add additional cost to the audit and should only be considered if the data collected will be vast or complex in nature, or the value of audit will allow.

In addition to the above points, some clients may restrict the use of their data to only allow analysis to take place internally using their own IT equipment. This may have an impact on the audit, due to the pre-preparation process of data, and should be considered and factored in as part of the audit programme.


Step 7: Request and Verify Provided Data

The final step is to request the data from the key contacts or individuals identified in Step 3. The data can then be provided to the Auditor for verification and analysis.

An important part of the import process is to verify totals prior to any testing. Some file formats such as Microsoft Excel allows the Auditor to compare totals in the files provided by the client with its IDEA equivalent. This is assuming that the data extracted is correct. So, for completeness and accuracy control totals should be provided by the client based on information available within the user interface of the accounting software or ERP system. It is up to the Auditor whether a screenshot or a physical confirmation (Auditor views the totals when on-site with the client) is applicable.



It is important to note that regardless of the type of audit being completed an accurate and efficient data collection process is essential to maintaining the integrity and timeline of the audit or analysis. Understanding the data, the formats available, the data sources and any data restrictions in place can reduce the likelihood of errors occurring. A data collection process is essential to ensure that the data gathered is both well-defined and accurate. This provides confidence that the assurances tested are valid and true.