Feasibility of using Teensy 3.1 for a telescope autoguider (processing small images)

SamoSmr
Hello everyone,

I have some experience with arduino but I am new to Teensy, just ordered some boards to play around :)

What I have in mind for a project is a stand-alone telescope autoguider (this means it should be small and light, and the battery has to last at least a few hours - so no BeagleBone, etc.). What this device does is control a telescope based on small changes in the position of a star on an image. It can actually be a very small image, since only small changes of up to 5 pixels are expected.
Once I can get the position of a star on an image, the other calculations and communications are easy.

I am trying to figure out if the Teensy is powerful enough to take images and do some convolution on very small images.
The CMOS sensor I want to use is the Aptina MT9M001, 1280x1024 with 10-bit parallel output (I am limited to this one since it has to be monochrome and high sensitivity - 10 bit is also a big plus). The minimum clock frequency is 1 MHz (1 frame/s, which I would accept as useful, but it would be better if I could get 5 fps). I would probably use a 128x128 OLED display for user interfacing - black should be really black.

The idea is to read the sensor and bin the data in blocks of 8x8 pixels as it is read into memory (using only the lower 8-9 bits to avoid integer overflow), to get a 128x160 resolution preview.
How fast would the Teensy be able to read the parallel data and store it at the corresponding position in the data array, based on the current pixel readout position (i.e. divide the pixel number by 8 to get the correct position in the 128x160 binned array and add the read value to the one already in memory)?
Simultaneously it has to run the clock for the CMOS with a 45-55% duty cycle at at least 1 MHz - 5 MHz would be nice :)
Also, it should memorize the position of the pixel with the highest signal (it could find it later from RAM if that is not possible). And of course show the image on the display.

I guess that if this is possible then the project is feasible and I can manage to go further - select an area of the image and buffer only a 128x128 window of single pixels from the CMOS as a "zoom in", based on the preview image.
Once the exact area around a target star is selected, it is sufficient if only ~16x16 pixels around the target star are buffered and deconvoluted to determine the position with sub-pixel accuracy, and this is then done continuously.

As I mentioned earlier, if it works this far, then other things like image calibration, communication, control, etc. are simple and should not be a problem (this part is something I could easily program now on an Arduino).

Thanks,
Samo
 
My first guess would be that what you are trying is possible, but certainly not trivial. The basic problem is that the Teensy is in the class of devices that handles audio fine but is missing the hardware to throw around pixels from a camera easily.

So at 1 FPS and 96 MHz you have 96 million cycles to chew through that frame. Reading the parallel data in should work, though I'm unsure if there are 10-bit-wide ports; certainly a single 8-bit read should be possible. Anyway, there need to be 1,310,720 of those reads every second, and suddenly those 96 million cycles aren't looking so good. That leaves about 73 cycles per pixel to process those pixels into bins, followed by storing the 128*160 image, which needs ~20K of RAM depending on how you fold it. You have 64 KB of RAM to work with, but need room for averaging, screen memory and anything else going on.
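
In rough numbers (a back-of-envelope sketch only, everything rounded):
Code:
// Back-of-envelope budget - all numbers are rough assumptions, nothing measured
const uint32_t cpu_hz        = 96000000;                 // Teensy 3.1 at 96 MHz
const uint32_t pixels_frame  = 1280UL * 1024UL;          // 1,310,720 sensor reads per frame
const uint32_t frames_s      = 1;                        // target frame rate
const uint32_t cycles_pixel  = cpu_hz / (pixels_frame * frames_s); // ~73 cycles to read + bin one pixel
const uint32_t preview_bytes = 160UL * 128UL;            // ~20 KB if one byte per bin, double for 16-bit bins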

Any of those numbers could be out by a factor of 10, but at first pass I'd say it is possible, though it's going to require some serious work, and you run the risk of hitting a wall where this CPU just has no more to give and you still don't get reasonable results. It may be possible to use the DMA engine to manage the memory loading while the CPU runs the averaging process. It all comes down to how familiar you already are with working down at the bare metal of a microcontroller and how much time you have to make this work. If you can make it work it will be a very elegant solution.

If this were being done professionally, it's the sort of nail that's normally hammered with an FPGA, which digests the pixel flow into something a microcontroller can handle.
 
Thanks a lot for the reply. I suspected that this would be at the limit of what is possible. I guess the devices that do this usually include an FPGA to handle the image data... but they cost ~500 USD as well.

I did a quick test on my Arduino Mega 2560 to see how long this averaging data to RAM would take on an Arduino.
I simulated adding a byte for each of the 131072 pixels of a 512x256 image, binned 8x8 into a 64x32 array, so exactly 10x smaller than what I would need for my goal.

It took 1.4 s and each bin correctly summed up to 64x2.
Doesn't look so good.
Would the Teensy process this more than 12x faster (8-bit vs. 32-bit processor), or does only the clock frequency count?


Code:
byte input = 2;
int data[64][32] = {0};
long time1 = micros(); 
    for (long i=0; i < 131072; i++) {
          data[(i%512)/8][i/4096] += input;
    }
long time2 = micros();
Serial.println(time2-time1);
 
Code:
byte input = 2;
int data[64][32] = {0};
long time1 = micros(); 
    for (long i=0; i < 131072; i++) {
          data[(i%512)/8][i/4096] += input;
    }
long time2 = micros();
Serial.println(time2-time1);

Hm.
This runs in 0 us. Perhaps the gcc-arm compiler is too good and optimizes the whole loop away.
That code is not good as a benchmark :)

Anyway, "int" is 32-bit on ARM, so you want to write int16_t data[64][32] = {0}; to get the same size on a 32-bit CPU.
Then, I could imagine that something like this (did you mean that?)
Code:
for ( int x=0; x < 64; x++) {
    for (int y=0; y < 32; y++) {
          data[x][y] += input;
    }
}
is much better.

And indeed:
Modifying int16_t data[64][32] = {0}; to volatile int16_t data[64][32] = {0};
with your loop now gives 17784 us, which is about 17.8 ms = 0.0178 seconds.
But don't use volatile here in your real program! I could imagine that it prevents the compiler from doing further optimizations!
My version of the loop takes 241 us = 0.241 ms = 0.000241 seconds...

But hey, on the other hand 64*32 is much less than 131072, so I'm not sure what you want...

Edit:
p.s. for 64*64(=4096) it's 474 us
 
And why not run-length compress the incoming data on the fly (telefax-like)? Set up a threshold and discard everything below it (background noise). As soon as you read values above the threshold, store just the x and y coordinates of the first point and of the last point where you are above the threshold. Discard again everything below the threshold until the next rise above it. That way you get a set of coordinate points which circumscribe the light spots and allows the topology to be interpreted with much less data. I think that for tracking a light spot, a differentiated treatment of the luminosity is not necessary.
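
Something like this is what I mean - just a rough sketch of the idea; the threshold, line width and run limit are made-up example values:
Code:
// Rough sketch: per image line, keep only the start/end x of runs above a threshold.
// The caller knows the line number (y); THRESHOLD, LINE_WIDTH and MAX_RUNS are examples.
const uint16_t THRESHOLD  = 100;
const uint16_t LINE_WIDTH = 1280;
const uint8_t  MAX_RUNS   = 16;

struct Run { uint16_t xStart; uint16_t xEnd; };

uint8_t findRuns(const uint16_t *line, Run *runs) {
  uint8_t count = 0;
  bool inRun = false;
  for (uint16_t x = 0; x < LINE_WIDTH; x++) {
    if (line[x] > THRESHOLD) {
      if (!inRun) {                        // rising edge: open a new run
        inRun = true;
        runs[count].xStart = x;
      }
      runs[count].xEnd = x;                // keep extending the current run
    } else if (inRun) {                    // falling edge: close the run
      inRun = false;
      if (++count >= MAX_RUNS) break;      // everything beyond MAX_RUNS is discarded
    }
  }
  if (inRun && count < MAX_RUNS) count++;  // a run that reaches the end of the line
  return count;                            // number of runs found in this line
}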
 
Hm.
This runs in 0 us. Perhaps the gcc-arm compiler is too good and optimizes the whole loop away.
That code is not good as a benchmark :)

Anyway, "int" is 32-bit on ARM, so you want to write int16_t data[64][32] = {0}; to get the same size on a 32-bit CPU.
Then, I could imagine that something like this (did you mean that?)
Code:
for ( int x=0; x < 64; x++) {
    for (int y=0; y < 32; y++) {
          data[x][y] += input;
    }
}
is much better.

This is not what I meant. The camera reads out pixels in lines, but if I want to bin the pixels together I need to bin pixels that are physically close. This means that, in the case of a 512x256 frame, data points 1-8 (line 1), 513-520 (line 2), 1025-1032 (line 3), ..., 3585-3592 (line 8) - 64 points in total - have to be added to the same (first) integer address in RAM. This then yields 64x32 integers in the end.
I think the loop I wrote does allocate the data correctly in memory, but I have no experience with what is more or less efficient:

Code:
for (long i=0; i < 131072; i++) {
          data[(i%512)/8][i/4096] += input;
    }


Maybe this would work better (sorry for still using Arduino variables)? :)
Code:
for (int n=0; n<256; n++) { //go through all image lines
  for (int i=0; i < 64; i++) { //scan the line 64x8 = 512 times
    for (int j=0; j < 8; j++)  {
      data[i][n/8] += input;  //divide the line number with 8 to add same pixel position for 8 lines in a row to the same memory position 
    }
  }
}
 
And why not run-length compress the incoming data on the fly (telefax-like)? Set up a threshold and discard everything below it (background noise). As soon as you read values above the threshold, store just the x and y coordinates of the first point and of the last point where you are above the threshold. Discard again everything below the threshold until the next rise above it. That way you get a set of coordinate points which circumscribe the light spots and allows the topology to be interpreted with much less data. I think that for tracking a light spot, a differentiated treatment of the luminosity is not necessary.

Thank you for the suggestion, but this wouldn't work for practical reasons. I still need to see the image in some form for focusing and centering.
I need differentiated treatment of the luminosity at least for the 16x16 pixels around the star, since I plan to deconvolute the data to achieve sub-pixel determination of the light spot's position (this is a must for the device to work).
 
Maybe this would work better (sorry for still using Arduino variables)? :)

int is OK and best for the loop counter.


Code:
for (int n=0; n<256; n++) { //go through all image lines
  for (int i=0; i < 64; i++) { //scan the line 64x8 = 512 times
    for (int j=0; j < 8; j++)  {
      data[i][n/8] += input;  //divide the line number with 8 to add same pixel position for 8 lines in a row to the same memory position 
^^^^^^^^^^^ j is not used here...
 
Just a short update.

This is slow, 475 ms to process 1,310,720 pixels:

Code:
uint8_t data [160][128] = {0};
    for (int i=0; i < 1310720; i++) {   // 1280x1024 = 1,310,720 pixels
          data[(i%1280)/8][i/10240] += GPIOD_PDIR & 0xFF;   // bin column = (i%1280)/8, bin row = i/(1280*8)
    }

Much faster, just under 200ms:
Code:
uint8_t data [160][128] = {0};
for (int n=0; n < 1024; n++) { 
   for (int i=0; i < 160; i++) { 
    for (int j=0; j < 8; j++)  { 
      data[i][n/8] += GPIOD_PDIR & 0xFF;  //divide the line number with 8 to add same pixel position for 8 lines in a row to the same memory position 
    }
   }
  }

I used an LCD to display the "image in the buffer" and by manually changing the input bits on the GPIOD I could see that the code seems to work.

So from the processing point of view this seems to be feasible.
The next task is to figure out how to trigger when to read the GPIOD port. The CMOS sensor has a pixel clock, and the data is valid after the falling edge. Does this require interrupts to run the line "data[i][n/8] += GPIOD_PDIR & 0xFF;", or could it be done just by polling the clock until it falls to 0 and then reading the data?
 
FPGA

Thanks a lot for the reply. I suspected that this would be at the limit of what is possible. I guess the devices that do this usually include an FPGA to handle the image data... but they cost ~500 USD as well.

Actually you can get a pretty nice FPGA for about $75 USD. I've been using this one since I backed it on Kickstarter a couple of years ago:

https://embeddedmicro.com/mojo-v3.html

I've done FFT on sounds in real time with it. I think it might work for your application.
 
So, good news! My idea is working! I managed to display the image from the CMOS on a 128x160 LCD! The 1280x1024 image was binned 8x8 as it was read out, down to 160x128. The maximum clock frequency that worked was 4 MHz (one whole image is ~1,600,000 clock counts). I ran the LCD at 24 MHz, but displaying the buffer pixel by pixel (Adafruit GFX library), plus the time to calculate the correct RGB value, took 250 ms per frame, so I get about 2 FPS in the end. Good enough!

The critical part of the code for reading and binning data:

Code:
uint16_t data [160][128] = {0}; 

for (int n=0; n < 1024; n++) { //go through all image lines
  while (digitalReadFast(1) == 0) {} //wait for line valid signal   
   for (int i=0; i < 160; i++) { //scan the line 160x8 = 1280 times
    for (int j=0; j < 8; j++)  {
      while (digitalReadFast(12) == 1) {} //wait for valid data (on falling pixel clock)
      data[i][n/8] += GPIOD_PDIR & 0xFF;  //divide the line number with 8 to add same pixel position for 8 lines in a row to the same memory position ; read GPIOD
//      while (digitalReadFast(12) == 0) {} //if teensy is too fast (CMOS clock 2 MHz or less) this is needed
    }
  }
}



Actually you can get a pretty nice FPGA for about $75 USD.

With $500 I meant the device I intend to build. $75 would be too expensive for what I want.
 
Photo of the displayed "video". Note that high intensities overflowed, so there are some bands in the image, because I divided the data by 4 (instead of 64 for the 8x8 pixels). With normal levels, only the light bulb could be seen on the LCD. The image is linear, so gamma correction would be needed to display a more natural image.
[Attached image: 20151113_224742.jpg]
 
I am slowly progressing on this project. I managed to get hold of the settings registers in the CMOS detector (two-wire serial), so I can set gain, exposure, region of interest, etc.
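
The register writes themselves are nothing special, roughly along these lines (the device address and register number below are only placeholders - take the real values from the MT9M001 datasheet; the registers are 16 bit):
Code:
// Rough sketch of a 16-bit register write over the two-wire interface.
// CAM_ADDR and REG_EXPOSURE are placeholders - check the MT9M001 datasheet for the real values.
#include <Wire.h>

const uint8_t CAM_ADDR     = 0x5D;   // example 7-bit device address (placeholder)
const uint8_t REG_EXPOSURE = 0x09;   // example register number (placeholder)

void writeCamReg(uint8_t reg, uint16_t value) {
  Wire.beginTransmission(CAM_ADDR);
  Wire.write(reg);            // register address
  Wire.write(value >> 8);     // high byte first
  Wire.write(value & 0xFF);   // low byte
  Wire.endTransmission();
}

// e.g. writeCamReg(REG_EXPOSURE, 1000);  // placeholder value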

It is also possible to set a very useful "test pattern" mode, where pixels are output alternately with values 0 and 1023, so I can see how well my timing works. There was some noise (randomly reading the same pixel twice or skipping a pixel), but I found that when I switched from the computer's USB to a phone charger power supply, the data was read out perfectly.

But there was another problem: the CMOS would hang up and stop responding after about 30 s of power-on time, meaning serial communication was not possible either. I have two CMOS detectors, and while the other one seemed to work for some time, it started doing the same after a few tries. The CMOS would stop working whether it was reading out an image or just outputting the test pattern. Also, before it stopped working, the data became noisy even on the phone charger power supply.

I think this has something to do with the power supply or with the shape of the clock signal (which is, after all, the only input to the CMOS). The CMOS digital and analog supplies are decoupled with capacitors and inductors, so I don't think that should be a problem.
For the clock, I will check it with an oscilloscope, and if it is not shaped correctly I plan to shape it with a Schmitt trigger, or just an inverter (is this a correct way to do it?).
About the power supply, the CMOS needs a maximum of 140 mA. Can the Teensy 3.2 provide this via the 3.3V pin, or was it a bad idea to power it from there?

Thanks if anybody is able to provide some help :)
 
Figured out that some interrupts occur at regular intervals and cause the Teensy to skip reading the parallel data.
Disabling global interrupts during frame readout from the CMOS solves the problem.
With this improvement I get stable operation reading pixels at 3 MHz.
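
In practice it is just this around the readout (readFrame() here only stands for the readout/binning loop from my earlier post):
Code:
noInterrupts();   // nothing (SysTick, USB, ...) can delay the pixel reads now
readFrame();      // the readout/binning loop shown earlier (placeholder name)
interrupts();     // re-enable everything; millis() will have lost ~0.3 s during the readout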

Is it really bad if I disable global interrupts for ~0.3 s? I don't expect the Teensy to do anything else in this time, other than storing the pixel data to memory.
 
Just a few tips here.

1) Use a DMA channel triggered by the pixel clock to move the data around. You'll be able to get up to ~12 MHz with this setup while keeping your CPU usage at 0%.

2) Once you have the data, average it into a buffer. Once this is complete, swap the pointer of the array the Adafruit library displays from to the averaged buffer you just calculated (this is called double buffering, by the way - see the sketch after the next tip). This helps twofold: one, you don't have to copy data around twice, and two, the screen will have no tearing.

Don't use the Adafruit drawPixel command for mass data movement. It is incredibly, incredibly slow.
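
Roughly what I mean by the pointer swap (just a sketch, sized for your 160x128 preview):
Code:
// Double buffering sketch: average into one buffer while the display reads the other,
// then swap the pointers - nothing is copied twice and the screen never shows a
// half-updated frame (tearing). Two 8-bit 160x128 buffers = ~40 KB, so it only just fits.
uint8_t bufferA[160][128];
uint8_t bufferB[160][128];

uint8_t (*fillBuf)[128]    = bufferA;   // the averaging loop writes into this one
uint8_t (*displayBuf)[128] = bufferB;   // pushColor / the display code reads from this one

void swapBuffers() {
  uint8_t (*tmp)[128] = fillBuf;
  fillBuf    = displayBuf;
  displayBuf = tmp;
}

// After a frame has been averaged into fillBuf: call swapBuffers(), then push
// displayBuf to the LCD while the next frame is averaged into fillBuf.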

-----

I run most of my programs with global interrupts disabled and a while(1) loop in my loop() function. This will stop millis() from working, as well as USB communication. That shouldn't be an issue for what you want, though. There's also a way to change interrupt priority, which may be a less blunt way of resolving the issue. SysTick and the Adafruit library are the only interrupts I can think of that you'll have running; SysTick is the highest priority interrupt by default.


EDIT:
Overclock your Teensy to crunch those numbers a little faster. See here for how to enable overclocking. I tend to run most of my projects at 120/144 MHz. Try both 120 and 144, as they have different bus speeds and 120 can sometimes yield better results, especially when using SPI.

Have a look at what Paul said as well:
There's a "FASTRUN" feature that lets you put your speed critical functions into RAM. It rarely makes much difference at normal speeds, but at 144 or 168 it can have a much more dramatic effect.
FASTRUN can be hit and miss for me but it's always worth a try.

Just wrap your function up like this:
Code:
uint16_t data [160][128] = {0}; 

FASTRUN void crunchNumbers() {
for (int n=0; n < 1024; n++) { //go through all image lines
  while (digitalReadFast(1) == 0) {} //wait for line valid signal   
   for (int i=0; i < 160; i++) { //scan the line 160x8 = 1280 times
    for (int j=0; j < 8; j++)  {
      while (digitalReadFast(12) == 1) {} //wait for valid data (on falling pixel clock)
      data[i][n/8] += GPIOD_PDIR & 0xFF;  //divide the line number with 8 to add same pixel position for 8 lines in a row to the same memory position ; read GPIOD
//      while (digitalReadFast(12) == 0) {} //if teensy is too fast (CMOS clock 2 MHz or less) this is needed
    }
  }
 }
}

The loop above can also be unrolled using pragma directives; this removes the loops and makes the code linear. It is faster because the "branch back to the start of the loop if..." instruction that occurs on each iteration is an extra instruction that has to be executed. Unrolling the loop will take up significantly more program memory though, as I'm sure you can imagine.
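
By hand, the innermost j loop would unroll to something like this (sketch only - the same wait-and-read statement written out 8 times in place of the loop):
Code:
// Unrolled inner loop: 8 reads per bin written out explicitly, no loop counter, no branch between them.
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;
while (digitalReadFast(12) == 1) {}  data[i][n/8] += GPIOD_PDIR & 0xFF;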
 
Thank you so much for very useful feedback!

1) I was already thinking along these lines, but I have no clue whatsoever how to buffer parallel data with DMA triggered by an external clock. As you can see, I could not figure out any better triggering than using digitalReadFast until the pin changes. :)

2) First part: this might be a problem, because there is not enough RAM to get the whole frame into memory; I have to average on the fly to fit the whole image into a small enough buffer. I now also stream 240x240 "video" without averaging, which fits nicely on a 320x240 display and just leaves some space in RAM.
Second part: I got completely lost there... I don't see where I am copying data twice - when passing data to the LCD using tft.pushColor(data[x][y])? I don't know what you mean by screen tearing. Unfortunately, I don't yet understand where the benefits of using pointers are, and when to use them (but I know what they are and how they are - in theory - used). :)

Don't use the Adafruit drawPixel command for mass data movement. It is incredibly, incredibly slow.

Thanks, I figured that out already :) It is ridiculously slow; now I use pushColor - it is fast enough that it displays the image before the CMOS is ready to read out the next frame.

And about interrupts - I may need millis() if I need longer exposures than the internal registers of the camera can handle.
 
Sorry SamoSmr, as you quite rightly pointed out, there isn't enough RAM to DMA the entire frame in.
I guess you could set up a DMA channel to transfer constantly but trigger an interrupt every 8 bytes. You could then average these in your interrupt, but I doubt this will give you much, if any, speed increase (a rough, untested sketch of the idea is below).
The 6-8.7 MHz suggested as a maximum DMA speed sounds about right to me. I've actually had it running at 12 MHz, but I was doing some pretty awkward stuff by firing 4 DMAs in series.
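
Very roughly, and completely untested, the idea could look something like this with the DMAChannel library. It assumes the pixel clock is wired to pin 11 (Port C) and configured to raise a DMA request on its falling edge, and that the 8 data bits are still on Port D:
Code:
// Untested sketch: DMA copies one byte from Port D into an 8-byte circular buffer on every
// falling edge of the pixel clock, and fires an interrupt each time the buffer wraps (every 8 pixels).
#include <DMAChannel.h>

DMAChannel dma;
volatile uint8_t pixelGroup[8];       // 8 raw pixels, averaged in the interrupt
volatile uint32_t groupsDone = 0;

void onGroup() {
  dma.clearInterrupt();
  // sum/average pixelGroup[0..7] into the binned array here
  groupsDone++;
}

void setupPixelDMA() {
  // Pixel clock on pin 11 (PTC6): raise a DMA request on the falling edge (IRQC = 2)
  CORE_PIN11_CONFIG = PORT_PCR_MUX(1) | PORT_PCR_IRQC(2);

  dma.source(*(volatile uint8_t *)&GPIOD_PDIR);           // low byte of the Port D input register
  dma.destinationBuffer(pixelGroup, sizeof(pixelGroup));  // wraps back to the start after each major loop
  dma.triggerAtHardwareEvent(DMAMUX_SOURCE_PORTC);
  dma.interruptAtCompletion();                            // i.e. after every 8 transfers
  dma.attachInterrupt(onGroup);
  dma.enable();
}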

I actually use a small CPLD (the 5M40ZE64N) for things like this. In this circumstance I would have it pull in 8 bytes, average them, and then push them out to the Teensy - effectively downsampling for you. This is handy, as these CPLDs can run upwards of 100 MHz in parallel and only cost £1 here in the UK.

As something more immediate you can try, I'd be interested to hear how FASTRUN, overclocking and loop unrolling work out for you.

In response to one of your older statements: good job on using pushColor, I didn't realise you were using it.
 
I tried overclocking the Teensy.
At 96 MHz it works very nicely with a 4 MHz pixel clock, and at 120 MHz I could run it stably at 5.5 MHz.
FASTRUN didn't seem to help much, and I haven't tried unrolling the loops.

Here is an oscilloscope snapshot of the clock from the Teensy and the pixel clock from the CMOS camera at 4 MHz. I guess I should be happy with the wave shape.
[Attached image: IMG_1034.jpg]

Reading the image directly without averaging, I am limited more by the communication with the LCD (at a 4 MHz pixel clock, reading 240x240 pixels plus overhead should give very roughly 40 FPS). What I get is: https://www.youtube.com/watch?v=TwoObHovqq4

To use a larger area of the sensor I now average 4x4 pixels, so I read 960x960 pixels and display them at 240x240. With this I get about 2-3 FPS. Good enough for my purpose!

I also transferred some images over the serial port (as hex data) and reconstructed the images.

When I find more time I will work on writing the algorithm that determines the centroid of a bright spot in the image.
 
Another update here...progressing slowly with the project since there is a general lack of free time :)

I added more functions to the Teensy + CMOS camera project, such as auto exposure, four different modes (8x8 binning, 4x4 binning, 2x2 binning, no binning), and auto-selection of the highest signal intensity part of the image... Also, I implemented a center of mass calculation to determine the position of a light source with sub-pixel accuracy -- the most important thing needed for the Teensy-based camera to be used as an autoguider.

In a 128x128 pixel image, the highest intensity pixel is selected. When imaging stars, this is most likely the brightest star. This approach should work unless there are some very hot pixels around; in that case an additional check needs to be done to see whether it is just a single pixel with high intensity or a real star image.

Around the selected pixel an area of 16x16 pixels is selected and the center of mass is calculated. The center of mass gives the object position with sub-pixel precision - actually, the precision is better than 1/10 of a pixel. Exactly what I was aiming for!
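
The calculation itself is just an intensity-weighted average over the 16x16 window. Roughly like this (simplified sketch, not my exact code - win[][] stands for the 16x16 cut-out around the brightest pixel):
Code:
// Simplified center-of-mass sketch over a 16x16 window around the brightest pixel.
// win[y][x] stands for the cut-out pixel values; returns sub-pixel coordinates inside the window.
void centerOfMass(const uint16_t win[16][16], float &posX, float &posY) {
  uint32_t sum = 0, sumX = 0, sumY = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++) {
      uint16_t v = win[y][x];
      sum  += v;
      sumX += (uint32_t)v * x;
      sumY += (uint32_t)v * y;
    }
  }
  if (sum == 0) { posX = posY = 0; return; }   // avoid division by zero on an empty window
  posX = (float)sumX / sum;                    // weighted mean column (sub-pixel)
  posY = (float)sumY / sum;                    // weighted mean row (sub-pixel)
}
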
See it on the video:
https://www.youtube.com/watch?v=_mYYzRN3w94

Posx and Posy are the calculated centers of mass (I missed something - Posy should be 1 unit higher, but that is irrelevant when looking at relative positions); Mcol and Mrow are the pixels with maximum intensity.
 
Impressive! I've seen various people suggest things like this, but you've made quite a bit of progress here. Is this just the Aptina MT9M001 sensor and a Teensy 3.1 going to an LCD display, and nothing else?
 
Impressive! I've seen various people suggest things like this, but you've made quite a bit of progress here. Is this just the Aptina MT9M001 sensor and a Teensy 3.1 going to an LCD display, and nothing else?

Yes, only the MT9M001, a Teensy 3.2 and the LCD!
The capabilities of the Teensy are impressive; it just lacks RAM if I wanted to do more. 256K would be perfect for working with 320x240 images!


And... as I mentioned earlier, hot pixels are indeed a problem when I need exposures of a few seconds. So now I determine the maximum signal intensity from the selected pixel plus its 4 adjacent pixels, which avoids selecting hot pixels as the target object.
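
In essence it is just this (simplified sketch, not my exact code - img[][] stands for the preview image):
Code:
// Simplified sketch: find the brightest spot using the pixel plus its 4 direct neighbours,
// so a single hot pixel cannot win against a real (spread-out) star image.
// img[y][x] stands for the 128x128 preview; the outermost rows/columns are skipped for simplicity.
void findBrightestSpot(const uint16_t img[128][128], int &mRow, int &mCol) {
  uint32_t best = 0;
  mRow = mCol = 1;
  for (int y = 1; y < 127; y++) {
    for (int x = 1; x < 127; x++) {
      uint32_t s = img[y][x] + img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1];
      if (s > best) { best = s; mRow = y; mCol = x; }
    }
  }
}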
 