Sir, there's a cat in your mirror dimension (lcamtuf.substack.com)
463 points by zdw 14 days ago | 71 comments



In most photos with a recognizable subject, spectral energy will be concentrated around the origin (the upper left corner) as it is here

https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

The same is true for the DCT of the woman. Meanwhile, the subject of a photo is typically located towards the frame's center. This helps minimize interference between the space and frequency domain data in the composite, thus preserving kitty's expression when the transform is inverted

https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_...

(and vice versa for the woman)


And this is only true of the DCT - 2D Fourier transforms of images usually concentrate the data near the center of the image.


that's sort of true and sort of false. here the origin is plotted in the upper-left-hand corner, and in the 2d fft images you're used to looking at, it's plotted in the center instead. but you can plot the dct that way too, so it's sort of false

it's sort of true in that if you plot the standard 2d fft in this coordinate system, the data will be concentrated not in one corner of the image but in all four of them. the dct really is unusual in putting all the low-frequency stuff at positive frequencies instead of equally at positive and negative frequencies
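(a minimal sketch of that difference in python/numpy - not from the article; the random array is just a stand-in for any grayscale photo. the 2d fft leaves the low frequencies in all four corners unless you fftshift it so dc sits in the centre, while the dct keeps them in the top-left corner only)

    import numpy as np
    from scipy.fft import dctn

    img = np.random.rand(256, 256)                 # stand-in for a grayscale photo

    spec_fft = np.fft.fft2(img)                    # low frequencies land in all four corners
    spec_fft_shifted = np.fft.fftshift(spec_fft)   # conventional display: DC moved to the centre

    spec_dct = dctn(img, norm='ortho')             # low frequencies sit only in the top-left corner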


It makes me think of how the camera's lens focuses the light/image at the center of the sensor, so it would make sense that the data is also denser at the center, where the lens concentrates more light


So… I think you’re a bit confused about how lenses work and what they do (they don’t focus all the light into the middle, they focus light from one plane onto another one. They only focus light from the center of the frame onto the center of the image - that’s why it’s an image)

But… there is something interesting about what 'focusing' looks like in the frequency domain: the difference between the frequency-space transform of a sharply focused image and a blurred one - or of the same image focused at different focal planes - shows up as a predictable transformation in frequency space, which means you can apply transformations in frequency space that cause focus changes in the image domain, like a lens does.


your first paragraph is completely wrong. the lens concentrates collimated light parallel to its axis at its focal point, regardless of where it falls on the lens. (and, strictly speaking, only at a single wavelength.) collimated light coming from near-axial directions gets focused more or less to a point on more or less the focal plane. but light at a single point doesn't have a direction, being a wave. there is in fact a very profound connection between the action of a lens and the 2d fft; see my sibling comment for more details

your second paragraph is correct, and it is a special case of the convolution theorem; see https://en.wikipedia.org/wiki/Fourier_optics#The_2D_convolut...
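(a minimal sketch of that convolution-theorem special case, in python/numpy rather than anything from the linked page: blurring is multiplication by the blur kernel's transfer function in the frequency domain, so you can 'change focus' purely in frequency space. the gaussian transfer function and its width are made up for the example)

    import numpy as np

    img = np.random.rand(256, 256)                         # stand-in for a grayscale photo

    # gaussian low-pass transfer function over the image's frequency grid
    fy, fx = np.meshgrid(np.fft.fftfreq(256), np.fft.fftfreq(256), indexing='ij')
    transfer = np.exp(-(fx**2 + fy**2) / (2 * 0.02**2))

    # multiplying the spectrum by the transfer function == convolving with a blur kernel
    blurred = np.fft.ifft2(np.fft.fft2(img) * transfer).real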


I don't think the idea that (idealized, camera) lenses focus light from distinct points in one plane (or at infinity) onto distinct points in another plane is 'completely wrong', but I'm open to being educated on my error.

A lens focuses light parallel to its axis onto its focal point; it focuses parallel light coming in off-axis to other points on the focal plane.

Alternatively, and equivalently, it focuses divergent light coming from common points on planes closer than infinity, onto matching points on other planes behind its focal plane.


Lenses bring parallel rays of light (alternatively, light from infinitely far away) to the focal point. They don’t bring idealized points to points.

One consequence is you can’t use lenses to bring anything to a temperature higher than the temperature of the source light. For example you can’t use lenses + moonlight to light things on fire.

Here’s an HN thread going into the physics of it: https://news.ycombinator.com/item?id=18736700


yes, as it happens, the image on the focal plane of the camera resulting from light coming from a particular direction is in fact the 2d fourier transform of the spatial distribution of that light at the lens. this property has been used to build optical-computing military machine vision systems using spatial light modulators since the 01980s, because of some other useful properties of the fourier transform, that spatial shifts become phase shifts, so you can look for a target image everywhere in an image at once. as far as i know, these systems have never made it past the prototype stage

see https://en.wikipedia.org/wiki/Fourier_optics#Fourier_transfo...


> is in fact the 2d fourier transform of the spatial distribution of that light at the lens. this property has been used to build optical-computing military machine vision systems

Amazing. Do you have any links/references about those systems and how they should work in theory?


yes, see the link above

What are you talking about?


it's the same fundamental effect


Considering the title of the article, this comment had me thinking of something supernatural. Took me way too long to realize what it was talking about...


Clarifying, the specter of a hidden animal will usually take the form of a diffuse sparkle or blur, typically hovering off to the person's side and somewhat above them, and as a result when carried through to the "other side" cannot possess what remains of the person in that domain (because they are returned to the origin in turn).


I'm a little bit slow with all this stuff, can somebody confirm this is the process:

a) take photo of woman and photo of cat

b) DCT cat into the frequency domain

c) composite the frequency domain cat into the visual image of the woman

d) if you DCT the composite image, you get the cat back? (or more specifically, you get the visual cat and the frequency domain woman composited; but the visual cat dominates)


Yep, that's it.
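A minimal sketch of those four steps in Python with NumPy/SciPy (my own reconstruction, not the article's MATLAB code; the random arrays stand in for the two grayscale photos, and a real composite would rescale the spectrum so it stays in displayable pixel range):

    import numpy as np
    from scipy.fft import dctn, idctn

    woman = np.random.rand(512, 512)            # a) stand-in for the photo of the woman
    cat = np.random.rand(512, 512)              #    stand-in for the photo of the cat

    cat_spectrum = dctn(cat, norm='ortho')      # b) DCT the cat into the frequency domain
    composite = woman + cat_spectrum            # c) composite the spectrum onto the woman's pixels

    revealed = idctn(composite, norm='ortho')   # d) inverse transform of the composite:
                                                #    revealed == idctn(woman) + cat, i.e. the visible
                                                #    cat plus the woman's transform as faint
                                                #    "dust" concentrated near the top-left corner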


Does that mean DCT(DCT(image)) == image?

Yeah, that's it - well, strictly it's the inverse DCT that takes you back, but the round trip is exact.
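A quick check of the round trip (an illustrative sketch in Python/SciPy using the orthonormal DCT):

    import numpy as np
    from scipy.fft import dctn, idctn

    img = np.random.rand(64, 64)                       # stand-in for any image

    roundtrip = idctn(dctn(img, norm='ortho'), norm='ortho')
    print(np.allclose(roundtrip, img))                 # True: DCT then inverse DCT is exact

    twice = dctn(dctn(img, norm='ortho'), norm='ortho')
    print(np.allclose(twice, img))                     # False: the DCT is not literally its own inverse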

From what I remember from some student project many years ago, this technique is the basis for robust digital watermarking for any kind of signals, be it images or audio.

Of course, the main application is detecting copyrighted material even after the signal has been heavily processed (e.g. ripped or cam'd movies, as provided via JPEG 2000).

If anyone in the movie industry can provide some more technical details, I’m all ears!


I once tested a watermarking system (Digimarc?) and found that while it was robust against all sorts of noise and scaling, it failed with even a 1% rotation of the image. I wonder if it was a Fourier Transform based algorithm.


A great example of the time-frequency (or space-frequency, in this case) duality of Fourier transforms. The math of the FT doesn't care about the "direction" you're going for the transform, so functions that look similar in time/frequency will have similar FTs in the frequency/time space.

In this case, embedding the frequency plot of the cat in the space plot of the woman means that the FT of the woman will cause the cat to appear, and vice versa.


It's a very cool and interesting steganographic application! Want to hide an illicit image inside an innocent image? Just convert it to frequency domain and composite it onto the other image. As long as the viewer knows how to transform it back, you have a covert way to send images that is potentially hard to detect.


It would be hard to detect if the other party didn’t know what to look for, but easy if they did.

If you combined your hidden image with a one-time-pad it should be indistinguishable from noise, right? And noise would be expected in a lossily compressed image. I wonder if anyone has done that. It seems like we’d probably never know unless they told us!
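A minimal sketch of that one-time-pad step in Python/NumPy (illustrative only; the pad here comes from a seeded PRNG purely so the example is reproducible, whereas a real one-time pad would be truly random and shared out of band):

    import numpy as np

    hidden = np.random.randint(0, 256, (128, 128), dtype=np.uint8)  # stand-in for the image to hide

    rng = np.random.default_rng(seed=42)          # a real pad would be truly random, used once
    pad = rng.integers(0, 256, hidden.shape, dtype=np.uint8)

    payload = hidden ^ pad      # uniform-looking noise to anyone without the pad; embed this instead
    recovered = payload ^ pad   # the recipient XORs with the same pad to undo it
    assert np.array_equal(recovered, hidden)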


There were worries after 9/11 that terrorists were using stego to plot attacks, posting their messages “hidden in plain sight” inside images on public websites.

Someone (Niels Provos?) did a pretty thorough search and analysis of images on eBay and came up with nothing. Apparently it was just post-9/11 paranoia.


A similar fun trick was used by Aphex Twin (and others) to make a weird face appear in the audio spectrogram of one of his tracks: https://news.ycombinator.com/item?id=8509105


MetaSynth has been around since the late 90s and combines time (samples) and frequency (image) transforms of audio with Photoshop-style filters of the images.

https://uisoftware.com/metasynth/


love this, venetian snares too. thanks for confirming haha, i wasn't sure how they did it! cool memories =) thx! didn't know which one it was from aphex twin. these guys are magicians :D


That post just gets better all the way through.

I can't believe I never realized the frequency domain can be used for image compression. It's so obvious after seeing it. Is that how most image compression algorithms work? Just wipe out the quieter parts of the frequency domain?


Yep, this is how MP3, Ogg Vorbis, and JPEG all work. The weights for which frequencies to keep are, presumably, chosen based on some psychoacoustic model, but the coarse description is literally throwing away high-order frequency information.
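A rough sketch of the JPEG-style step in Python/SciPy (illustrative only; the quantization table below is made up, whereas real JPEG tables are perceptually tuned): the 8x8 block's DCT coefficients are divided by a table that grows towards high frequencies and rounded, so most high-frequency coefficients become zero and compress away.

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.rand(8, 8) * 255              # stand-in for one 8x8 pixel block

    coeffs = dctn(block - 128, norm='ortho')        # level-shift, then 2D DCT of the block

    # toy quantization table: larger divisors (coarser steps) for higher frequencies
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing='ij')
    qtable = 16 + 8 * (u + v)

    quantized = np.round(coeffs / qtable)           # most high-frequency entries round to 0
    restored = idctn(quantized * qtable, norm='ortho') + 128   # lossy reconstruction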


> chosen based on some psychoacoustic model

Does audio encoding use a similar method of using matrices to pick which frequencies get thrown away? Some video encoders allow you to change the matrices so you can tweak them based on content.


Audio is one dimensional, so it doesn't use matrices but just arrays (called subbands).

And you can't get too hard into psychoacoustic coding, because people will play compressed audio through all kinds of speakers or EQs that will unhide everything you tried to hide with the psychoacoustics. But yes, it's similar.

(IIRC, the #1 mp3 encoder LAME was mostly tuned by listening to it on laptop speakers.)


I know one mix studio that has a large selection of monitors to listen to a mix through, ranging from the highest of high-end studio monitors to mid-level monitors, home bookshelf speakers, and even a collection of headphones and earbuds. So when you say "check it on whatever you have available", you have to be a bit more specific with this guy's setup.


DCT is also often used as a substep in more complex image (or video) compression algorithms. That is, first identify some sub-area of the image with a lot of detail, then apply DCT to that sub-area and keep more of the spectrum, then do the same for other areas and keep more or less of the spectrum. This is where the quantization parameters that you have seen for video compression algorithms affect the behavior.


You don't generally completely wipe the high frequencies, you just encode them with fewer bits.


Images are not truly bandlimited, which means they can't be perfectly represented in the frequency domain, so instead there's a compromise where smaller blocks of them are encoded with a mix of frequency domain and spatial domain predictors. But that's the biggest part of it, yes.

Most of the problem is sharp edges. These take an infinite number of frequencies to represent (= Nyquist theorem), so leaving some out gets you blurriness or ringing artifacts.

The other reason is that bandlimited signals infinitely repeat, but realistic images don't - whatever's on the left side of a photo doesn't necessarily predict anything about whatever's on the right side.


A real image isn't, but a digital image built up from pixels certainly is band limited. A sharp edge will require contributions from components across the whole spectrum that can be supported on a matrix the size of the image, the highest of which is actually called the Nyquist frequency.

Not quite. You can tell this isn't true because there are many common images (game graphics, text, pixel art) where upscaling them with a sinc filter obviously produces a visually "wrong" image (blurry or ringing etc), whereas you can reconstruct them at a higher resolution "as intended" with something nonlinear (nearest neighbor interpolation, OCR, emulator filters like scale2x). That means the image contains information that doesn't work like a bandlimited signal does.

You could say MIDI is sort of like that for audio but it's used a lot less often.


I thought the image transform was conceptually done on a grid of infinitely-repeating copies of the image in the spatial domain?


Yes, or by extending the pixels on the edge out forever. The question is which one is more effective for compression; it turns out doing that for individual blocks rather than the entire image is better.

(With mirroring things could happen like the left edge of the image leaking into the right, and that'd be weird.)


How are images not bandlimited? They don't get brighter than 255, 255, 255 or darker than 0,0,0


Bandlimited means limited in the frequency domain, not the spatial domain.

(Also, video is actually worse - only 16-235. Good thing there's HDR now.)


There is more to it. Often the idea isn't just that you throw away frequencies, but also that data with less variance can be encoded more efficiently. And it's not just that high-frequency info is noise; it also tends to be smaller in magnitude.


I remember seeing some video where they did a FT of an audio sample and then just used mspaint to remove some frequency component and transformed back to the audio / time domain.

Something along those lines anyway.


JPEG 2000 is even weirder. That's a wavelet transform. If you truncate a JPEG 2000 file, you can still recover a lower resolution image. At some file length, the image goes to greyscale, as the color information disappears.

How is that weird? That seems like a feature.

If the cat were more focused in the upper left, I don't think this demo would work as well. DCT will have lots of high magnitude low frequency components which will drown out the cat if it is near the top left.


Also the fact that JPEG throws away a lot of data without us noticing is hardly a discovery, rather the stated purpose of the compression algorithm.


One interesting thing is that in the quantum description of position and frequency (i.e. position and momentum if you account for hbar), it is not possible to cram two different functions into one in this way because functions that differ by a position-dependent phase are different quantum states.


Is there a way to reliably scrub an image of any possible hidden watermarks that can be created like this?


"Reliably" is a difficult word. If you understand how a specific watermark works, then yes, absolutely. If you want a fully general method that counters every possible thing you might come across... well. That's hard.

"Imperceptible" watermarks work by altering detail humans don't notice or pay attention to. So your scrubber would need to reliably remove or change all such detail. Removing such detail is absolutely something we can do - the article mentions one way, other commenters make other suggestions, and also lossy image compression in general works by losing exactly such details from the compressed image so there's that as well.

But /reliably/ get rid of /everything/, so you can be /completely certain/ no watermarks encoded in ways imperceptible to a human can possibly be left, without knowledge of the specific watermarks you want to remove or at least a way to test for their presence? You're looking at some drastic technique, in the realm of "theoretically possible but impractical"; e.g. one way might be to hand the image to a human artist, commission them to paint a copy, scan that in and use that.

Note how, in the article, it's still possible to pick out the cat even as the JPEG compression level increases. If someone found a way to avoid encoding that information without degrading the original image in ways noticeable to human observers, we'd all be all over that, because it would give us a way to make image files even smaller than we can now.

This is an active area of research, precisely because it is key to getting better compression for sound and video to better understand how humans perceive things, what they notice and what they do not, so that we can reliably avoid storing information that humans will not notice the absence of / changes to, while still storing everything humans do notice. It is possible that we will one day have a complete enough understanding of human perception to make some kind of general guarantees here. But that day is not today, and tomorrow doesn't look good either.


Of course. The first image of the blog post shows that you can "paint over" the largely unused area and not lose much of your original image. The hidden watermarks make use of this unused area, so you can just paint over that area with blank data in order to "scrub" any hidden watermarks.
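A minimal sketch of that "paint over" idea in Python/SciPy (my own illustration, not the article's code; note it also throws away genuine fine detail along with anything hidden out there):

    import numpy as np
    from scipy.fft import dctn, idctn

    img = np.random.rand(512, 512)           # stand-in for the possibly watermarked photo

    coeffs = dctn(img, norm='ortho')
    keep = 128                                # keep only the low-frequency corner
    coeffs[keep:, :] = 0                      # blank out everything outside it
    coeffs[:, keep:] = 0

    scrubbed = idctn(coeffs, norm='ortho')   # back to the spatial domain, minus the "unused" area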


I'm pretty sure you could also layer the cat noise evenly over the image without significantly damaging the woman. The DCT puts all the important information top left, but there is nothing stopping you from adding a step to distribute that information across the whole image, or from using another transform that doesn't have the same concentration effect.

Well sure, but whatever extra step you use to encode you will also need as an extra step to decode.

This article makes the case that "steganography" should be renamed to "catography"


“Stegatography” can be an even more appropriate choice if you speak one of the languages born around the European side of the Mediterranean sea.



It's clearly a startup in Los Gatos. Herding is the premium add-on.


How is the DCT of the two images done here, exactly? Clearly 8x8 tiles like in JPEG are not used, otherwise the similar blurry background tiles would still look similar in the DCT composite. Are the 2D DCT basis functions not a thing in this case?


The 8x8 is just a choice made for JPEG; the DCT can be done for any m x n array (or m x n x k x ...). Here the full image is transformed.

Can someone please ELI5 for me?

I don't understand how the cat is encoded in the image that has both woman and cat. I assume the visible pixels are in some way slightly altered to encode the cat?


There's a magical math operation called the DCT (discrete cosine transform) which can turn things into dust (frequency domain) and back (spatial domain). So you DCT a woman and you get woman-dust. If you DCT the woman-dust you get the woman back.

So what you do is DCT a cat to get cat-dust and sprinkle it on the woman. It's hard to see the cat-dust but if you look really closely you can see it (upper left corner of the image). We now have a dusty woman.

Then you DCT the dusty woman and get a dusty cat! Look in the upper left and you can see the woman-dust. Apply the DCT again to this image and we're back to the dusty woman.

Just apply DCT all day long to swap between a dusty cat and a dusty woman!

You must be wondering why this works. It's due to the properties of dust and human perception. When we DCT the woman and cat, you'll notice most of the dust is in the upper left corner. That's where all the heavy dust is. It's fine to lose the lighter dust further out, or even add more dust out there, since most of the weight is in the upper left; the DCT will get you close enough.


@toast0 answered your question here: https://news.ycombinator.com/item?id=40357927


"DCT cat into the frequency domain" is not really a "ELI5" level explanation.


i know nothing of this stuff, but it reminds me of aphex twin and venetian snares encoding images into their sounds. is that a similar thing somehow? i think for venetian snares the track was something like song for my cat. if you'd use certain tools, the frequencies would show a picture of a cat.

edit: venetian snares was an album, songs about my cats. you can find it on youtube, unsure if i can link it.


Yes it's a similar thing, as mentioned in the article. The spectrogram is the frequency-domain, your wave file is the time domain.


Does anyone know how to do this in Octave or some other free software rather than Matlab?

In case anyone else wants to know the solution is:

    >> woman = imread("~/Downloads/woman-with-cat.png");
    >> colormap('gray');
    >> imagesc(woman, [0 255]);
    >> pkg load signal
    >> cat = dct2(woman);
    >> pkg load image
    >> imagesc(imsmooth(cat, "Gaussian", 1), [-4 4]);

dct2 is unimplemented in the image package but exists in the signal package.


I wonder whether this was inspired by https://xkcd.com/26/. (I'm guessing not since there's no mention of it. But it's a nice coincidence.)


this reminds me of Hough transformations.


>In MATLAB, you can do the following:

Why



