
Overview

The DARPA Shredder Challenge was a project to kick-start the development of software techniques for reconstructing shredded documents recovered from a war zone. This is DARPA's archive; the Internet Wayback Machine may do better. On a whim I decided to give the Shredder Challenge a solo shot. Ultimately the team "All Your Shreds Are Belong To Us" won the $50k prize.

The objective was to go from scanned input files like this (but at higher resolution):
sc input file example

To this (not my results):
all your shreads soln example


Approach

The top-level approach was the same as assembling a puzzle: in stages, assemble groups of pieces into ever larger chunks.  I assumed each puzzle would take a slightly different approach, so the code I developed would exist mostly as a library.  The solution for each puzzle would be computed by running a sequence of programs, customized for that puzzle, with each program reading input files from the previous stage(s) and writing its results into its own filesystem directory.  I integrated human guidance by having the software present its 'best' guesses for a particular fit and having a human select the proper fits.
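
As a rough sketch of what one stage in that kind of pipeline looks like (the directory layout, the "pieces.txt" name, and the arguments here are hypothetical, not my actual stage programs):

    // Hypothetical skeleton of one pipeline stage: read the previous stage's
    // directory, do this stage's work, write results into this stage's own directory.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
      if (argc != 3) {
        std::cerr << "usage: stage <input_dir> <output_dir>\n";
        return 1;
      }
      const std::string inDir = argv[1];   // results of the previous stage(s)
      const std::string outDir = argv[2];  // this stage's own results directory

      // "pieces.txt" is an assumed name for a piece directory file listing one
      // piece image filename per line.
      std::ifstream dir(inDir + "/pieces.txt");
      if (!dir) { std::cerr << "missing piece directory file\n"; return 1; }

      std::vector<std::string> pieceFiles;
      for (std::string line; std::getline(dir, line); )
        if (!line.empty()) pieceFiles.push_back(inDir + "/" + line);

      // ... this stage's actual work on pieceFiles goes here ...

      // Leave results where the next stage (or a human) can pick them up.
      std::ofstream out(outDir + "/results.txt");
      out << pieceFiles.size() << " pieces processed\n";
      return 0;
    }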

Finding shreds

The first step was to break the input files into individual puzzle shreds.  First I used ImageMagick to scale and sharpen the input file.  The sharpened image file was then loaded and processed by custom code.  In that code I specified a grid.  From each grid point, a flood fill would search outward for all pixel colors that were deemed to be a shred (i.e. not magenta), then continue several pixels beyond this to capture some background-color pixels.  The found piece was written to a rectangular image file with a background filled with cyan for distinctiveness.

This is an example piece. I saved all image files as .ppm files for ease of programming (and lossless handling).
example piece

To prevent finding the same piece multiple times, the removed pixels were 'colored' as the source background before starting the shred search again from another grid location.
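
A condensed sketch of that extraction step; the pixel/image types, the magenta test, and the missing background margin are simplified assumptions, not the original code:

    // Sketch of the grid-seeded flood fill that pulls one shred out of the scan.
    #include <queue>
    #include <utility>
    #include <vector>

    struct RGB { unsigned char r, g, b; };

    struct Image {
      int w = 0, h = 0;
      std::vector<RGB> px;
      RGB& at(int x, int y) { return px[y * w + x]; }
    };

    // The scan background is (roughly) magenta; anything else is shred material.
    static bool isBackground(const RGB& p) {
      return p.r > 200 && p.g < 80 && p.b > 200;  // hypothetical thresholds
    }

    // Flood fill outward from grid point (sx, sy); returns the shred's pixel
    // coordinates, or nothing if the seed itself lands on background.
    std::vector<std::pair<int, int>> floodShred(Image& img, int sx, int sy) {
      std::vector<std::pair<int, int>> found;
      if (isBackground(img.at(sx, sy))) return found;

      std::vector<char> seen(img.w * img.h, 0);
      std::queue<std::pair<int, int>> q;
      q.push({sx, sy});
      seen[sy * img.w + sx] = 1;

      while (!q.empty()) {
        auto [x, y] = q.front();
        q.pop();
        found.push_back({x, y});
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        for (int i = 0; i < 4; ++i) {
          int nx = x + dx[i], ny = y + dy[i];
          if (nx < 0 || ny < 0 || nx >= img.w || ny >= img.h) continue;
          if (seen[ny * img.w + nx]) continue;
          if (isBackground(img.at(nx, ny))) continue;  // the real code also kept a
          seen[ny * img.w + nx] = 1;                   // small background margin here
          q.push({nx, ny});
        }
      }
      return found;
    }

The returned pixels would then be copied into a cyan-filled rectangle of their own and recolored to magenta in the source scan before the next grid point was tried, as described above.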

Machine Fitting

I chose to perform an exhaustive brute-force fitting technique, where piece 'A' would be rotated and displaced over piece 'B' in relatively small pixel and angle increments.  Each fit was scored heuristically.  In the example above, many yellow pixels from 'A' covering yellow pixels in 'B' would be deemed a bad fit, but a fit where the yellow pixels only overlap magenta pixels would be good.  The best solutions from the first fitting pass would then go through an optimization process where even smaller rotations and pixel displacements would be tried.  The highest-scoring solutions from the optimized fits would be retained and sorted.
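
To make the search and scoring concrete, here is a toy version; the color thresholds, weights, and step sizes are illustrative guesses, not my actual heuristics:

    // Toy version of the brute-force fit search: slide and rotate piece A over
    // piece B in small steps and score each placement.
    #include <cmath>
    #include <vector>

    struct RGB { unsigned char r, g, b; };
    struct Image {
      int w = 0, h = 0;
      std::vector<RGB> px;
      const RGB& at(int x, int y) const { return px[y * w + x]; }
    };

    static bool isPaper(const RGB& p) {            // yellow-ish shred pixel
      return p.r > 150 && p.g > 150 && p.b < 120;  // hypothetical thresholds
    }
    static bool isBackground(const RGB& p) {       // magenta fill around the shred
      return p.r > 200 && p.g < 80 && p.b > 200;
    }

    struct Fit { double score; int dx, dy; double angle; };

    // Score one placement: A's paper landing on B's paper is penalized (pieces
    // overlap); A's paper landing on B's background fill is rewarded.
    double scoreFit(const Image& A, const Image& B, int dx, int dy, double angle) {
      double score = 0, c = std::cos(angle), s = std::sin(angle);
      for (int y = 0; y < A.h; ++y)
        for (int x = 0; x < A.w; ++x) {
          if (!isPaper(A.at(x, y))) continue;
          // rotate A's pixel about A's center, then translate onto B
          double rx = c * (x - A.w / 2.0) - s * (y - A.h / 2.0) + B.w / 2.0 + dx;
          double ry = s * (x - A.w / 2.0) + c * (y - A.h / 2.0) + B.h / 2.0 + dy;
          int bx = (int)rx, by = (int)ry;
          if (bx < 0 || by < 0 || bx >= B.w || by >= B.h) continue;
          if (isPaper(B.at(bx, by)))           score -= 4.0;  // bad: pieces overlap
          else if (isBackground(B.at(bx, by))) score += 1.0;  // good: A fills B's gap
        }
      return score;
    }

    // Coarse exhaustive pass; the best candidates would then be refined with
    // smaller displacement and angle increments, sorted, and retained.
    std::vector<Fit> coarseSearch(const Image& A, const Image& B) {
      const double kTwoPi = 6.283185307179586;
      std::vector<Fit> candidates;
      for (double ang = 0; ang < kTwoPi; ang += 0.05)   // ~3 degree steps
        for (int dy = -B.h; dy <= B.h; dy += 4)
          for (int dx = -B.w; dx <= B.w; dx += 4)
            candidates.push_back({scoreFit(A, B, dx, dy, ang), dx, dy, ang});
      return candidates;  // caller sorts by score and keeps the top few
    }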

I decided early on (probably mistakenly) that I should not do any sort of initial pre-rotation of the pieces for alignment, because it looked like this would not help with later puzzles.  This made the range of fits that needed to be tried massive and the overall fitting computation very expensive.

To 'solve' the computation problem, I coded up a distributed computing system in which a small workgroup of desktop PCs would each perform the fitting operation on a pair of pieces and report back the best solutions.
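
The fit between any two pieces is independent of every other pair, which is what made farming the work out easy.  A minimal sketch of the dispatch logic, with threads standing in for the remote desktop machines (the real system talked to the PCs over the network):

    // Sketch of the work distribution: each pair of pieces is an independent job,
    // so a master hands out (i, j) pairs and collects the best fits reported back.
    #include <cstddef>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <utility>
    #include <vector>

    struct Fit { int pieceA, pieceB; double score; };

    // Placeholder for the expensive fitting pass between two pieces.
    Fit fitPair(int a, int b) { return {a, b, 0.0}; }

    int main() {
      const int numPieces = 8;  // tiny example
      std::vector<std::pair<int, int>> jobs;
      for (int i = 0; i < numPieces; ++i)
        for (int j = i + 1; j < numPieces; ++j)
          jobs.push_back({i, j});  // n*(n-1)/2 pairwise jobs

      std::mutex m;
      std::size_t next = 0;  // index of the next unclaimed job
      std::vector<Fit> results;

      auto worker = [&]() {
        while (true) {
          std::pair<int, int> job;
          {
            std::lock_guard<std::mutex> lock(m);
            if (next >= jobs.size()) return;
            job = jobs[next++];
          }
          Fit f = fitPair(job.first, job.second);  // the slow part, done remotely
          std::lock_guard<std::mutex> lock(m);
          results.push_back(f);
        }
      };

      std::vector<std::thread> pool;
      for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
      for (auto& t : pool) t.join();

      std::cout << "collected " << results.size() << " pair results\n";
      return 0;
    }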

Human Guidance

After the Machine Fitting process, the resulting database would be loaded into the XGass program, which allowed a piece to be presented and selected.  By hitting the space bar (or 'p' for the previous fit), a machine-computed fit could be tried and displayed in the overall puzzle.  If the human deemed that candidate fit 'good', it could be 'added' to the display, and its position and angle could then be tweaked further interactively.
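
The control logic behind that interaction is just candidate cycling over the precomputed fits for the selected piece.  A schematic sketch (the Fit fields and the class here are made up for illustration; the real program's drawing went through libGx/libImg):

    // Schematic of the XGass candidate cycling: space steps forward through the
    // machine-ranked fits for the selected piece, 'p' steps back, and an accepted
    // fit is handed off to be placed and then tweaked by hand.
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Fit { int otherPiece; int dx, dy; double angle; double score; };

    class FitBrowser {
     public:
      explicit FitBrowser(std::vector<Fit> ranked) : fits_(std::move(ranked)) {}

      // Called on a keystroke while a piece is selected; returns the candidate
      // fit the caller should draw over the assembled puzzle.
      const Fit* onKey(char key) {
        if (fits_.empty()) return nullptr;
        if (key == ' ')
          cur_ = (cur_ + 1) % fits_.size();                 // next fit
        else if (key == 'p')
          cur_ = (cur_ + fits_.size() - 1) % fits_.size();  // previous fit
        return &fits_[cur_];
      }

      // Human judged the displayed candidate good: return it so it can be added
      // to the display and its position/angle nudged interactively.
      Fit accept() const { return fits_[cur_]; }

     private:
      std::vector<Fit> fits_;  // best machine fits for this piece, sorted by score
      std::size_t cur_ = 0;
    };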

XGass screenshot

In this case the center piece with the slightly lighter magenta border has been selected, and the piece to the right is being test-fit. XGass was implemented quick and dirty with libGx for the UI and libImg for software image and transparency rendering.

Project Log / Timeline

These are basically my raw notes: ~100 hrs total over ~2 weeks.

~12hrs first day able to find individual paper pieces from sample image and break into individual files

~7hrs next day. splitting code into multiple programs. 1st is yesterday's code that finds pieces and outputs them as individual files. now also output a piece directory file
2nd code reads the directory file and loads the fragments. started on a fitter algorithm and successfully implemented the rotate function (to rotate fragment images) as part of the fitter.

~9hrs 3rd day. made fitter generate reasonable results between pairs of pieces. worked on performance improvement. dropped single pass of fitter between two pieces on problem #1 from ~70min to ~8min through inlining & other optimizations. 

~5hrs 4th day. got most of distributed computing implementation done.

~6.5 hrs 5th day. got distributed computing working. modified fitter to use coarse grid to select 'interesting' cases then use higher resolution on those few. experimented with 1/2 and 1/4 resolution on p1.

~4hrs 6th day. small code cleanups. found hw problem in cluster machine. simplified cluster startup and monitoring wrt this problem. recognized need in problem 1 to retain higher resolution. worked/thought about how to assemble groups of pieces and how to organize the code and data to do this.

~5hrs 7th day. worked toward idea to operate in an interactive 'tree' mode. each step finds the best larger pieces from the previous step's data. the output from each step can be hand guided as an input to the next step. brought more 'cluster' machines online. wrote p2.cc p3.cc and most of p4.cc

~8hrs 8th day. worked on fitter to try and judge the quality of the fit between pieces by considering not only the distance matched in terms of shape, but also the color match across the fit of the writing. first implementation was horribly expensive. looked for reason why and discovered it was just horribly expensive. No easy tuning can be done.

~4hrs 9th day did some more work to organize output in tree structure. worked to find best solutions in a range of pieces connecting and disconnecting from remote servers. not good fit results

~7hrs 10th day. re-approaching problem 1 scaled to 1/4 to reduce computational time. not able to get good fits as 1st choice so tweaked score calculations. sort of fixed bugs in reconnecting to remote machines (race condition left). loosened interesting fit criteria because I thought I was discarding interesting pieces (then went back and tightened up criteria when fits did not improve). finally found typo in Scoring code and added algorithmic enhancement to get back to ok/sane results. sorting pieces to allow limiting search between those that have a lot of 'other' colors. (trying to reduce n^2)

~10.5hrs 11th day. must improve fitter. now if there is color in one piece and no related color on the other piece, a penalty is applied.
  l1-4 increased contrast of paper.
  l1-5 added new pixel type (paper core)
  l1-6 more properly configured paper color thresholds via false_color -> this is why more advanced fitting (50% overlap) was giving s*** results
  should rerun with 99% sharpness
  must pre-compute pixel type and store in alpha to speed all routines in Fitter.
  should add cost for background pixels in fitting window.
  must implement fit solution assembler.

~3hrs 12th day worked on gui of interactive placement tool that leverages fit data. (xgass) continued letting l1-6 run to completion

~14hrs 13th day implemented huge chunk of interactive tool xgass continued letting l1-6 run to completion

~8hrs 14th day. added features to gass & fixed bugs l1-6 finished running 2810:17 min of compute time for each machine
  then worked to implement ideas above to speed up fitter (really just storing pixel type in alpha).
  alpha change yielded 2x speedup. results in l1-1. noted typo bug in fitter
  fixed bug. also increased desired # of results per piece pair to 100 from 5 in l1-2. i can spare the ram?
     results in l1-2 -> no huge difference in results -> some change in order.
  made fitter change to HardFit adding a cost for gaps between pieces. results in l1-3. not much better & judged some really  bad ones
     as better than some really good ones. some of best fits judged worst?!
  increased window size to 10 pixels. results in l1-4 *** much better *** but many similar solutions perhaps discarding other good ones.

??? hrs 15th day. implemented discarding all but best solutions 'close' to one another. should let more varied interesting solutions through fitter.

Evaluation

After the end of the timeline I let the code keep cranking on puzzle 1.  The main problem with my approach was computational time.  I began looking at moving to the Amazon cloud, or even spending the same money on surplus machines from Craigslist.  At this point I was disappointed with my filtered results, and looking at the cost of computing/electricity, I decided to drop my effort.

Postscript

Now, several years later and writing up these results, I got my software up and running on a much faster machine than I had at the time (I was on some obsolete Athlon MPs then; now a single Xeon W3565).  I ran this code on shreds at 2x the resolution I used originally (>4x slower), and crunching through puzzle 1 took more than a month on a single workstation, yielding the results in the XGass screenshot above.  The fitter code needs tweaking, as it gives lots of obviously bad results mixed in with some pretty good guesses.  I think my big mistake at the time was not pre-filtering the pieces to make them at least all vertical, if not looking for the characteristic notch left by the shredder and orienting all the pieces in a given direction.  This would have dropped the Fitter effort by an order of magnitude and would have worked much better overall.  I also think this sort of calculation is perfect for a GPU, but at the time I was grossly unfamiliar with that technology.

Competition Files and Winners


Winning Teams Submissions

Honorable Mentions