Merging two data files

Merge statement

The MERGE directive, described in this chapter, is designed to combine information from two files into a third file with a range of qualifiers to accomodate various scenarios. It was developed with assistance from Chandrapal Kailasanathan to replace the !MERGE qualifier (described below) which had very limited functionality. The MERGE directive is placed BEFORE the data filename lines. It is an independent part of the ASReml job in the sense that none of the files are necessarily involved in the subsequent analyses performed by the job, and there may be multiple MERGE directives. Indeed, the job may just consist of a title line and MERGE directives.

The !MERGE qualifier, on the other hand, combines information from two files into the internal data set which ASReml uses for analysis and does not save it to file. It has very limited in functionality.

The files to be merged must conform to the following basic structure:
  • the data fields must be TAB, COMMA or SPACE separated,
  • there will be one heading line that names the columns in the file,
  • the names may not have embedded spaces,
  • the number of fields is determined from the number of names,
  • missing values are implied by adjacent commas in comma delimited files. Otherwise, they are indicated by NA, * or . as in normal ASReml files.
  • the merged file will be TAB separated if a .txt file, COMMA separated if a .csv file and SPACE separated otherwise.

    Merge Syntax

    The basic merge command is MERGE file1 !WITH file2 !TO newfile .

    Typically files to be merged will have common key fields. In the basic merge, ( !KEY not specified) any fields having the same names are taken as the key fields and if the files have no fields in common, they are assumed to match on row number. Fields are referenced by name (case sensitive).

    The full command is:
    MERGE file1 [ !KEY keyfields ] [ !KEEP ] [ !SKIP fields ] !WITH file2 [ !KEY keyfields ] [ !KEEP ] [ !NODUP ] [ !SKIP fields ] !TO newfile [ !CHECK ] [ !SORT ].

    Warning: Fields in the merged file will be arranged with key fields followed by other fields from the primary file and then fields from the secondary file.

    List of MERGE qualifiers

  • !CHECK
  • requests ASReml confirm that fields having a common name have the same contents. are reported to the .asr file. If there are fields with common names which are not key fields, and !CHECK is omitted, the will be assumed different and both versions will be copied.
  • !KEY keyfields
  • names the fields which are to be used for matching records in the files. If the fields have the same name in both file headers, they need only be named in association with the primary input file. If the key fields are the only fields with common names, the !KEY qualifier may be omitted altogether. If key fields are not nominated and there are no common field names, the files are interleaved.
  • !KEEP
  • instructs ASReml to include in the merged file records from the input file which are not matched in the other input file. Missing values are inserted as the values from the other file. Otherwise, unmatched records are discarded. !KEEP may be specified with either or both input files.
  • !NODUP fields
  • Typically when a match occurs, the field contents from the second file are combined with the field contents of the first file to produce the merged file. The !NODUP qualifier, which may only be associated with the second file, causes the field contents for the nominated fields from the second file only be inserted once into the merged file. For example, assume we want to merge two files containing data from sheep. The first file has several records per animal containing fleece data from various years. The second file has one record per animal containing birth and weaning weights. Merging with !NODUP bwt wwt will copy these traits only once into the merged file.
  • !SKIP fields
  • is used to exclude fields from the merged file. It may be specified with either or both input files.
  • !SORT
  • instructs ASReml to produce the merged file sorted on the key fields. Otherwise the records are return in the order they appear in the primary file.

    Functionality

    The merging algorithm is briefly as follows: The secondary file is read in, skip fields being omitted, and the records are sorted on the key fields. If sorted output is required, the primary file is also read in and sorted. The primary file (or its sorted form) is then processed line by line and the merged file is produced. Matching of key fields is on a string basis, not a value basis. If there are no key fields, the files are merged by interleaving.

    If there are multiple records with the same key, these are severally matched. That is if 3 lines of file 1 match 4 lines of file 2, the merged file will contain all 12 combinations.

    Examples

    Key fields have different names
    !MERGE file1 !Key key1a key1b !WITH file2 !KEY key2a key2b !to newfile

    Key fields have common name and other fields are also duplicated
    !MERGE file1 !Key keya keyb !WITH file2 !to newfile !CHECK}

    !MERGE file1 !Key key !KEEP !WITH file2 !to newfile
    will discard records from file2 that do not match records in file1 but all records in file1 are retained.

    Omitting fields from the merged file
    !MERGE file1 !Key key !skip s1a s1b !WITH file2 !skip s2a s2b !to newfile

    Single insertion merging
    !MERGE adult.txt !Key ewe !KEEP !WITH birth.txt !KEEP !TO newfile !NODUP bwt

    Merge qualifier

    !MERGE c f [ !SKIP n !MATCH a b ]
    may be specified on a line following the datafile line. The purpose is to combine data fields from the (primary) data file with data fields from a secondary file (f). The effect is to open the named file (skip n lines) and then insert the columns from the new file into field positions starting at position c. If !MATCH a b is specified, ASReml checks that the field a (0< a < c) has the same value as field b. If not, it is assumed that the merged file has some missing records and missing values are inserted into the data record and the line from the !MERGE file is kept for comparison with the next record. At this stage it is expected that the lines in the !MERGE file are in the same order as the corresponding lines occur in the primary data file, and that there are no extraneous lines in the !MERGE file. (It is proposed to extend this so the orders do not need to agree and that multiple lines in the primary file could be merged with the same line of the !MERGE file.) For example, assuming the field definitions define 10 fields,
       PRIMARY.DAT   !skip 1
        !MERGE 6 SECOND.DAT  !SKIP 1  !MATCH 1 6
    
    would obtain the first five fields from PRIMARY.DAT and the next five from SECOND.DAT, checking that the first field in each file has the same value. Thus each input record is obtained by combining information from each file, before any transformations are performed.

    Return to start