Monthly Archives: March 2016

Remove BOM with the help of Python before concatenating files

One day I was struggling to concatenate files generated by 3rd party utility for Essbase upload.
Uploading did broke on every single run and I cound not find the culprit until I opened the outline file with HxD HEX editor and found that extra bytes were added between each concatenated file.
To my suprise the 3rd party utility unloaded data from database in Unicode with BOM starting each separate file.
So before concatenating those files I had to remove BOM from them.
Thanks to this SO answer it was easily achievable with this python script.

# -*- coding: utf-8 -*-
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

Oracle: Iterate over regexp matches with hierarchical query trick

begin
  for matches in ( 
    with in_data as (
      select 'v113*(v43|v42|v900)/v54' haystack
        , 'v\d+' needle
      from dual
    )
    , matches ( a_match, occ ) as ( 
      select regexp_substr( haystack, needle, 1, 1 ) a_match
        , 1 occ
      from in_data
      union all
      select regexp_substr( haystack, needle, 1, p.occ+1 ) a_match
        , p.occ + 1 as occ
      from matches p
      cross join in_data
      where p.a_match is not null
    )
    cycle a_match set cycle to 1 default 0
    select a_match
    from matches
    where a_match is not null
  ) loop
    dbms_output.put_line( matches.a_match );
  end loop;
end;