How to do multiple REGEX search and replace operations in large file...

Status
Not open for further replies.

KurtE

Senior Member+
Sorry, I know this is probably not the best place to ask this, but I was wondering about suggestions for how to reasonably manipulate a large file, to hopefully cut it down to size and maybe make it into a more usable format.

In particular, working on the USB Host code, I often use the Saleae Logic Analyzer to capture USB traffic, for maybe 20-30 seconds. For example, when trying to figure out the packets that are sent back and forth to a Bluetooth controller to maybe pair with an XBox One controller... I then have the Logic analyzer output the report, which is a CSV formatted file that might be, for example, 50000 lines long... But most of the time, I can manipulate this file down to maybe 100-1500 lines of data that I want to look through.

And I find myself doing a bunch of standard search and replace operations to start reducing this data...

Things like:

a) Remove all SOF (Start of Frame) lines: search for .*,SOF,.*\n and replace with nothing...
b) Remove all NAK sequences: .*,IN,.*\n.*,NAK,.*\n again replace with nothing
c) Remove all ACK lines: .*,ACK,.*\n

Then I often convert the IN/OUT/SETUP sequences from two lines to one line, something like:
find: .*,\b(IN|OUT|SETUP)\b,0x[0-9A-F]*,(.*)\n.*DATA[01],,(.*\n)
replaced with: $1,$2,$3

Note: this last one may not have been exact, maybe depends on how many ,,s I need to remove...

This up till now was pretty close how I did all of the postings and the like where I then imported this data into Excel and showed the data in forum posts for Bluetooth and joysticks...

If there was a simple way to do this, it would be great. I currently am using Sublime Text search and replace... Thought I would try to make a macro that does this, but their macros don't appear to want to include search/replace stuff.

I have the beginnings of using a regreplace plug in setup to do this... That uses Python script to do the first part of this: Reg_replace_rules...
Code:
{
	"format": "3.0",
	"replacements":
	{
		"remove_usb_packets_ACK":
		{
			"case": false,
			"find": "^.*,ACK,.*\\n",
			"greedy": true,
			"replace": ""
		},
		"remove_usb_packets_IN_NAK":
		{
			"case": false,
			"find": "^.*,IN,.*\\n.*,NAK.*\\n",
			"greedy": true,
			"replace": ""
		},
		"remove_usb_packets_SOF":
		{
			"case": false,
			"find": "^.*,SOF,.*\\n",
			"greedy": true,
			"replace": ""
		}
	}
}
And then command
Code:
[
  {
    "caption": "USB Logic Packet format",
    "command": "reg_replace",
    "args": {
      "replacements": [
        "remove_usb_packets_SOF",
        "remove_usb_packets_IN_NAK",
        "remove_usb_packets_ACK"
      ]
    }
  },
]
And I think it sort of worked, BUT the performance was really bad. That is, I waited for several minutes and then went outside with Annie (and Laik) and played for a bit, and it was completed when I got back in...
And again this is only a subset....

If I had something that worked well like macros, I would like to expand it to maybe do things like:
Search for lines that start with IN where the data area starts with 0x02 and insert EV_INQUIRY_RESULT, or if it is 0x17 insert EV_LINK_KEY_REQUEST
That is, assume the above had data like: IN,0x01,,0x17 0x...
Rule like: Search for: (IN,0x01,)(,0x17) replace with: $1EV_LINK_KEY_REQUEST$2

Likewise for some of the OUT lines, where again if I see something like(OUT,0x01),,(0x01 0x04) insert the word HCI_INQUIRY between the commas...
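As a rough sketch of how these annotation rules could be scripted outside the editor, the Python below looks up the leading byte(s) in a table and inserts the name between the two commas. It assumes the merged line layout used in the examples above (IN,0x01,,0x17 ... and OUT,0x01,,0x01 0x04 ...). The table holds only the three names mentioned in this post, and annotate, EVENT_NAMES and COMMAND_NAMES are made-up names for illustration.

```python
import re

# Only the names mentioned in the post; extend the tables as needed.
EVENT_NAMES = {"0x02": "EV_INQUIRY_RESULT", "0x17": "EV_LINK_KEY_REQUEST"}
COMMAND_NAMES = {"0x01 0x04": "HCI_INQUIRY"}

# Group 1 is the prefix up to the empty field, group 3 is the lookup key.
IN_RULE = re.compile(r"^(IN,0x01,)(,(0x[0-9A-F]{2}))")
OUT_RULE = re.compile(r"^(OUT,0x01,)(,(0x[0-9A-F]{2} 0x[0-9A-F]{2}))")


def annotate(line):
    """Insert an event/command name into the empty field, if one is known."""
    for rule, names in ((IN_RULE, EVENT_NAMES), (OUT_RULE, COMMAND_NAMES)):
        m = rule.match(line)
        if m and m.group(3) in names:
            # Keep the prefix, add the name, then the rest of the line unchanged.
            return m.group(1) + names[m.group(3)] + line[m.end(1):]
    return line
```

Lines whose leading bytes are not in the tables pass through untouched, so the tables can grow one entry at a time.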

Suggestions on how to setup such a script or which editor works to do something like this would be appreciated!

Kurt
 
I use UltraEdit which has macros and javascripting. The search and replace can use one of three different types of regexp: UltraEdit, UN*X or Perl. I thought I knew UN*X regexps very well but I can't get them to work. But the Perl regexp works well and essentially like what I thought was UN*X. UltraEdit can work on very large files, several GB, but I haven't tried lots of regexp on a large file. If UltraEdit interests you at all, send me one of your files and I'll try it out for you.

Pete
 
Thanks, I might consider trying ultraEdit... Earlier on I used a linux Grep to remove a lot of the excess...

Also posted up on sublime text forum, will be interesting to see if I get any responses there.
 
My language of choice is using perl for that, but then I've been programming that since 1995. For some things I might use Gnu Emacs macros and/or write Elisp macros, but that has an even higher startup time to learn than perl :)

I dislike Python as a language (due to the required indentation in the source and not using bracketing type syntax for grouping statements), but it looks like you are reading the whole thing in as a giant string, doing a lot of textual replacement within this giant string, and then writing it out. It would probably be much faster if you read the file line by line, using python/perl/whatever variables to keep track of the progress. That way the working set is much smaller. While it still needs to do replacements, etc. as you process the data, it won't have to do megabytes of string movement all at once, multiple times over.
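A minimal Python sketch of the line-by-line approach, for illustration only. It assumes the packet type (SOF, IN, NAK, ACK, ...) is always the second comma-separated field, as in the capture lines quoted in this thread; filter_packets and packet_type are made-up names. It drops SOF and ACK lines, and drops an IN line when the next kept line is a NAK, holding at most one line in memory.

```python
def packet_type(line):
    """Return the second CSV field (e.g. SOF, IN, NAK), or '' if absent."""
    fields = line.split(",", 2)
    return fields[1].strip() if len(fields) > 1 else ""


def filter_packets(lines):
    """Yield lines with SOF, ACK and IN-followed-by-NAK pairs removed."""
    pending_in = None  # an IN line held back until we see what follows it
    for line in lines:
        kind = packet_type(line)
        if kind in ("SOF", "ACK"):
            continue
        if pending_in is not None:
            held, pending_in = pending_in, None
            if kind == "NAK":
                continue  # drop both the IN line and this NAK line
            yield held
        if kind == "IN":
            pending_in = line
        else:
            yield line
    if pending_in is not None:
        yield pending_in  # the file ended on an IN line; keep it
```

Since an open file iterates line by line, something like out.writelines(filter_packets(open("capture.txt"))) would stream the capture through without loading it all at once (capture.txt is a placeholder file name).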
 
Sorry I have to disagree. If Perl is the answer today, then the question is wrong.

Python is probably not optimal for this task, but the other suggestions sound like easy solutions for one-off processing.
 
Thanks Michael,

It made me remember how I was doing it earlier, which reduced the data down to some reasonable size data...

The other hack I have used, that gets me a long ways there is to use a Linux window and use Grep to only extract out those lines that have ,DATA in them and the one line preceding it:

For example, with the last run I did, the raw file for something like 20+ seconds of capturing USB was 3,007,936 lines long 8) I opened up an Ubuntu command line window on Windows 10 and then
cd'd to my desktop (where I stored the file), and did:
Code:
grep -B1 ",DATA" USB_Capture_WIndows_XBoxOne_Pair_packets2.txt > u2
Now u2 looks like:
Code:
5.905441174000000,SETUP,0x04,0x00,,,0x05
5.905444340000000,DATA0,,,,0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00,0xE5AE
--
5.905691968000000,IN,0x04,0x00,,,0x05
5.905695042000000,DATA1,,,,,0x0000
--
5.906197022000000,SETUP,0x04,0x00,,,0x05
5.906200270000000,DATA0,,,,0x20 0x00 0x00 0x00 0x00 0x00 0x04 0x00,0x2CBF
--
5.906445216000000,OUT,0x04,0x00,,,0x05
5.906448466000000,DATA1,,,,0x59 0x0C 0x01 0x00,0xD42C
--
5.906940440000000,IN,0x04,0x00,,,0x05
5.906943546000000,DATA1,,,,,0x0000
--
5.907186586000000,IN,0x04,0x01,,,0x13
5.907189670000000,DATA1,,,,0x0E 0x04 0x01 0x59 0x0C 0x00,0xE5DB
--
5.907439662000000,SETUP,0x04,0x00,,,0x05
5.907442912000000,DATA0,,,,0x20 0x00 0x00 0x00 0x00 0x00 0x08 0x00,0x2CBA
--
So it has the extra -- lines in it... But in total it has 2066 lines, which is a bit easier to deal with...
Search/Replace of "--\n" with "" now 1378 lines

Do a search and replace: ".*\b(IN|OUT|SETUP),0x[0-9A-F]*,(.*)\n.*DATA[0-1],,,(.*\n)" to "$1,$2,$3"
(and another search and replace of ,,, with , to remove some extra commas)
And now down to 689 lines, which is a little more manageable:
Code:
SETUP,0x00,0x05,,0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00,0xE5AE
IN,0x00,0x05,0x0000
SETUP,0x00,0x05,,0x20 0x00 0x00 0x00 0x00 0x00 0x04 0x00,0x2CBF
OUT,0x00,0x05,,0x59 0x0C 0x01 0x00,0xD42C
IN,0x00,0x05,0x0000
IN,0x01,0x13,,0x0E 0x04 0x01 0x59 0x0C 0x00,0xE5DB
SETUP,0x00,0x05,,0x20 0x00 0x00 0x00 0x00 0x00 0x08 0x00,0x2CBA
OUT,0x00,0x05,,0x01 0x04 0x05 0x33 0x8B 0x9E 0x05 0x00,0xF376
IN,0x00,0x05,0x0000
IN,0x01,0x13,,0x0F 0x04 0x00 0x01 0x01 0x04,0x485F
SETUP,0x00,0x05,,0x20 0x00 0x00 0x00 0x00 0x00 0x0A 0x00,0x4CBB
OUT,0x00,0x05,,0x0B 0x20 0x07 0x01 0x12 0x00 0x12 0x00 0x00 0x00,0xF030
IN,0x00,0x05,0x0000
IN,0x01,0x13,,0x0E 0x04 0x01 0x0B 0x20 0x00,0xF466
SETUP,0x00,0x05,,0x20 0x00 0x00 0x00 0x00 0x00 0x05 0x00,0xBCBE
OUT,0x00,0x05,,0x0C 0x20 0x02 0x01 0x00,0x6E60
IN,0x00,0x05,0x0000
IN,0x01,0x13,,0x0E 0x04 0x01 0x0C 0x20 0x00,0x35D7
IN,0x01,0x13,,0x3E 0x1A 0x02 0x01 0x00 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x0E 0x02 0x01 0x1A,0xB0F1
IN,0x01,0x13,,0x0A 0xFF 0x4C 0x00 0x10 0x05 0x01 0x1C 0x33 0x40 0xB3 0xC2,0x370C
IN,0x01,0x13,,0x3E 0x2B 0x02 0x01 0x03 0x01 0x03 0xC3 0x57 0x1F 0x0A 0x45 0x1F 0x1E 0xFF 0x06,0x7FCE
IN,0x01,0x13,,0x00 0x01 0x09 0x20 0x02 0x75 0x42 0xB2 0x9D 0x7E 0xD9 0x74 0x95 0x66 0x12 0x0B,0x514E
IN,0x01,0x13,,0xA1 0x80 0x6D 0x2F 0x28 0x71 0xC9 0x9B 0xA3 0x98 0xED 0xB2 0xD6,0x017B
IN,0x01,0x13,,0x3E 0x1A 0x02 0x01 0x00 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x0E 0x02 0x01 0x1A,0x4F7A
IN,0x01,0x13,,0x0A 0xFF 0x4C 0x00 0x10 0x05 0x03 0x1C 0x6E 0xEA 0xF2 0xBC,0x298E
IN,0x01,0x13,,0x3E 0x0C 0x02 0x01 0x04 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x00 0xBC,0x4802
IN,0x01,0x13,,0x3E 0x2B 0x02 0x01 0x03 0x01 0x03 0xC3 0x57 0x1F 0x0A 0x45 0x1F 0x1E 0xFF 0x06,0x7FCE
IN,0x01,0x13,,0x00 0x01 0x09 0x20 0x02 0x75 0x42 0xB2 0x9D 0x7E 0xD9 0x74 0x95 0x66 0x12 0x0B,0x514E
IN,0x01,0x13,,0xA1 0x80 0x6D 0x2F 0x28 0x71 0xC9 0x9B 0xA3 0x98 0xED 0xB2 0xD8,0xC5FA
IN,0x01,0x13,,0x3E 0x1A 0x02 0x01 0x00 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x0E 0x02 0x01 0x1A,0xB0F1
IN,0x01,0x13,,0x0A 0xFF 0x4C 0x00 0x10 0x05 0x01 0x1C 0x33 0x40 0xB3 0xC1,0x364C
IN,0x01,0x13,,0x3E 0x0C 0x02 0x01 0x04 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x00 0xC1,0xBC24
IN,0x01,0x13,,0x3E 0x28 0x02 0x01 0x03 0x00 0x7D 0x2B 0x22 0x27 0x97 0xC0 0x1C 0x1B 0xFF 0x75,0xB3BA
IN,0x01,0x13,,0x00 0x42 0x04 0x01 0x80 0x60 0xC0 0x97 0x27 0x22 0x2B 0x7D 0xC2 0x97 0x27 0x22,0x5ECB
IN,0x01,0x13,,0x2B 0x7C 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0xC6,0x2B8E
IN,0x01,0x13,,0x3E 0x1A 0x02 0x01 0x00 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x0E 0x02 0x01 0x1A,0x4F7A
IN,0x01,0x13,,0x0A 0xFF 0x4C 0x00 0x10 0x05 0x03 0x1C 0x6E 0xEA 0xF2 0xBF,0x28CE
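
The grep and search/replace steps above could also be done in one streaming pass. The Python below is only a sketch under assumptions: it relies on the seven-field layout of the capture lines shown earlier (the type in field 2, with fields 4 and 7 of the IN/OUT/SETUP line and fields 6 and 7 of the DATA line being the ones kept), and merge_pairs is a made-up name. It writes a uniform five-column line, so it does not reproduce the exact comma counts of the listing above.

```python
TOKENS = {"IN", "OUT", "SETUP"}
DATA = {"DATA0", "DATA1"}


def merge_pairs(lines):
    """Join each IN/OUT/SETUP line with the DATA0/DATA1 line right after it."""
    prev = None  # fields of the previous line if it was IN/OUT/SETUP
    for line in lines:
        fields = line.rstrip("\r\n").split(",")
        kind = fields[1] if len(fields) > 1 else ""
        if kind in DATA and prev is not None and len(prev) >= 7 and len(fields) >= 7:
            # type, 4th and 7th field of the token line, then payload and checksum
            yield ",".join([prev[1], prev[3], prev[6], fields[5], fields[6]]) + "\n"
        # Anything other than IN/OUT/SETUP (SOF, NAK, "--", ...) resets the pairing.
        prev = fields if kind in TOKENS else None
```

Because every other line type simply resets the pairing, this can be run on the raw capture as well as on the grep output.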

Which, if you take it into Excel, looks something like:
Code:
SETUP	0x00	0x05		0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00
IN	0x00	0x05	0x0000	
SETUP	0x00	0x05		0x20 0x00 0x00 0x00 0x00 0x00 0x04 0x00
OUT	0x00	0x05		0x59 0x0C 0x01 0x00
IN	0x00	0x05	0x0000	
IN	0x01	0x13		0x0E 0x04 0x01 0x59 0x0C 0x00
SETUP	0x00	0x05		0x20 0x00 0x00 0x00 0x00 0x00 0x08 0x00
OUT	0x00	0x05		0x01 0x04 0x05 0x33 0x8B 0x9E 0x05 0x00
IN	0x00	0x05	0x0000	
IN	0x01	0x13		0x0F 0x04 0x00 0x01 0x01 0x04
SETUP	0x00	0x05		0x20 0x00 0x00 0x00 0x00 0x00 0x0A 0x00
OUT	0x00	0x05		0x0B 0x20 0x07 0x01 0x12 0x00 0x12 0x00 0x00 0x00
IN	0x00	0x05	0x0000	
IN	0x01	0x13		0x0E 0x04 0x01 0x0B 0x20 0x00
SETUP	0x00	0x05		0x20 0x00 0x00 0x00 0x00 0x00 0x05 0x00
OUT	0x00	0x05		0x0C 0x20 0x02 0x01 0x00
IN	0x00	0x05	0x0000	
IN	0x01	0x13		0x0E 0x04 0x01 0x0C 0x20 0x00
IN	0x01	0x13		0x3E 0x1A 0x02 0x01 0x00 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x0E 0x02 0x01 0x1A
IN	0x01	0x13		0x0A 0xFF 0x4C 0x00 0x10 0x05 0x01 0x1C 0x33 0x40 0xB3 0xC2
IN	0x01	0x13		0x3E 0x2B 0x02 0x01 0x03 0x01 0x03 0xC3 0x57 0x1F 0x0A 0x45 0x1F 0x1E 0xFF 0x06
IN	0x01	0x13		0x00 0x01 0x09 0x20 0x02 0x75 0x42 0xB2 0x9D 0x7E 0xD9 0x74 0x95 0x66 0x12 0x0B
IN	0x01	0x13		0xA1 0x80 0x6D 0x2F 0x28 0x71 0xC9 0x9B 0xA3 0x98 0xED 0xB2 0xD6
IN	0x01	0x13		0x3E 0x1A 0x02 0x01 0x00 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x0E 0x02 0x01 0x1A
IN	0x01	0x13		0x0A 0xFF 0x4C 0x00 0x10 0x05 0x03 0x1C 0x6E 0xEA 0xF2 0xBC
IN	0x01	0x13		0x3E 0x0C 0x02 0x01 0x04 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x00 0xBC
IN	0x01	0x13		0x3E 0x2B 0x02 0x01 0x03 0x01 0x03 0xC3 0x57 0x1F 0x0A 0x45 0x1F 0x1E 0xFF 0x06
IN	0x01	0x13		0x00 0x01 0x09 0x20 0x02 0x75 0x42 0xB2 0x9D 0x7E 0xD9 0x74 0x95 0x66 0x12 0x0B
IN	0x01	0x13		0xA1 0x80 0x6D 0x2F 0x28 0x71 0xC9 0x9B 0xA3 0x98 0xED 0xB2 0xD8
IN	0x01	0x13		0x3E 0x1A 0x02 0x01 0x00 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x0E 0x02 0x01 0x1A
IN	0x01	0x13		0x0A 0xFF 0x4C 0x00 0x10 0x05 0x01 0x1C 0x33 0x40 0xB3 0xC1
IN	0x01	0x13		0x3E 0x0C 0x02 0x01 0x04 0x01 0xA4 0x66 0x77 0xC3 0x69 0x7B 0x00 0xC1
IN	0x01	0x13		0x3E 0x28 0x02 0x01 0x03 0x00 0x7D 0x2B 0x22 0x27 0x97 0xC0 0x1C 0x1B 0xFF 0x75
IN	0x01	0x13		0x00 0x42 0x04 0x01 0x80 0x60 0xC0 0x97 0x27 0x22 0x2B 0x7D 0xC2 0x97 0x27 0x22
IN	0x01	0x13		0x2B 0x7C 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0xC6
IN	0x01	0x13		0x3E 0x1A 0x02 0x01 0x00 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x0E 0x02 0x01 0x1A
IN	0x01	0x13		0x0A 0xFF 0x4C 0x00 0x10 0x05 0x03 0x1C 0x6E 0xEA 0xF2 0xBF
IN	0x01	0x13		0x3E 0x0C 0x02 0x01 0x04 0x01 0x70 0xA1 0xD0 0x2B 0x7E 0x71 0x00 0xBF
IN	0x01	0x13		0x3E 0x1A 0x02 0x01 0x00 0x01 0xA2 0x96 0x4B 0x65 0x85 0x5C 0x0E 0x02 0x01 0x1A
IN	0x01	0x13		0x0A 0xFF 0x4C 0x00 0x10 0x05 0x03 0x18 0xA1 0x98 0x48 0xBE
Where the lines that start with IN are the ones I mentioned, where I look at the first data byte to map it to which event happened. And for the OUT lines to endpoint 0, the two bytes at the start make up the command we are sending to the BT device.
 