33 comments

  • fxtentacle 12 hours ago

    FYI the images are not generated based on the WiFi data. The WiFi data is used as additional conditioning for a regular diffusion image generation model. So what that means is the WiFi measurements are used for determining which objects to place where in the image, but the diffusion model will then fill in any "knowledge gaps" with randomly generated (but visually plausible) data.

    • jstanley 11 hours ago

      I'm confused about how it gets things like the floor colour and clothing colour correct.

      It seems like they might be giving it more information besides the WiFi data, or else maybe training it on photos of the actual person in the actual room, in which case it's not obvious how well it would generalise.

      • Aurornis 7 hours ago

        > I'm confused about how it gets things like the floor colour and clothing colour correct.

        The model was trained on the room.

        It would produce images of the room even without any WiFi data input at all.

        The WiFi is used as a modulator on the input to the pre-trained model.

        It’s not actually generating an image of the room from only WiFi signals.

      • f_devd 10 hours ago

        This is what GP alludes to: the original dataset has many similar reference images (i.e. the common mode is the same), and the LatentCSI model is tasked with reconstructing the correct specific instance (or a similarly plausible image in the case of the test/validation set).

      • gblargg 6 hours ago

        It wouldn't generalize at all. As far as I can tell, the Wi-Fi is just differentiating among a small set of possible object placements/orientations within that fixed space, and the model then modifies the photos it was trained on accordingly.

    • esrh 9 hours ago

      Think of it as an img2img stable diffusion process, except instead of starting with an image you want to transform, you start with CSI.

      The encoder itself is trained on latent embeddings of images in the same environment with the same subject, so it learns visual details (that are preserved through the original autoencoder; this is why the model can't overfit on, say, text or faces).
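
      As a concrete sketch of that training step (hypothetical and heavily simplified: only the 1992-dim CSI input and the 4x64x64 latent size come from the paper, the tiny MLP and plain MSE loss are just illustrative):

          import torch
          import torch.nn as nn

          CSI_DIM = 1992              # CSI values per capture on the 160 MHz link (from the paper)
          LATENT_SHAPE = (4, 64, 64)  # Stable Diffusion latent for a 512x512 image

          class CSIEncoder(nn.Module):
              """Stands in for the VAE's image encoder: one CSI vector -> one SD latent."""
              def __init__(self):
                  super().__init__()
                  self.net = nn.Sequential(
                      nn.Linear(CSI_DIM, 2048), nn.ReLU(),
                      nn.Linear(2048, 4 * 64 * 64),
                  )
              def forward(self, csi):
                  return self.net(csi).view(-1, *LATENT_SHAPE)

          encoder = CSIEncoder()
          opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

          # One training step on a paired batch: CSI captures plus the VAE latents of the
          # photos taken at the same moment (random stand-ins here so the sketch runs as-is).
          csi = torch.randn(8, CSI_DIM)
          target_latent = torch.randn(8, *LATENT_SHAPE)
          loss = nn.functional.mse_loss(encoder(csi), target_latent)
          opt.zero_grad(); loss.backward(); opt.step()

      Because the regression target lives in latent space, the encoder never has to reproduce pixel-level detail; the frozen SD decoder handles that.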

  • jychang 12 hours ago

    The image examples from the paper are absolutely insane.

    Is this just extremely overfitted?

    Is there a way for us to test this? Even if the model isn't open source, I'd pay $1 to take a capture from my wifi card on my linux box, upload it to the researchers, and have them generate a picture so I could see if it's accurate.

    • tylervigen 9 hours ago

      That is not how it works. The images of the room are included in the generative model's training data. The wifi is "just" helping identify the locations of objects in the room.

      If you uploaded a random room to the model without retraining it, you wouldn't get anything as accurate as the images in the paper.

    • RicDan 12 hours ago

      Yeah this seems too insane to be true. I understand that wifi signal strength etc. is heavily impacted by the contents of a room, but even so it seems farfetched that there is enough information in its distortion to lead to these results.

      • esrh 8 hours ago

        A lot of wifi sensing results that have high-dimensional outputs are usually using wideband links... your average wifi connection uses 20MHz of bandwidth and is transmitting on 48 spaced out frequencies. In the paper, we use 160MHz with effectively 1992 input data points. This still isn't enough to predict a 3x512x512 image well enough, which motivated predicting 4x64x64 latent embeddings instead.
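
        To put rough numbers on that (nothing here beyond arithmetic on the figures above):

            csi_inputs    = 1992           # data points per capture on the 160 MHz link
            pixel_target  = 3 * 512 * 512  # 786,432 values for the full RGB image
            latent_target = 4 * 64 * 64    # 16,384 values for the SD latent
            print(pixel_target / latent_target)  # -> 48.0, i.e. ~50x fewer outputs to regress from ~2k inputs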

        The more space you take up in the frequency domain, the higher your resolution in the time domain is. Wifi sensing results that detect heart rate or breathing, for example, use even larger bandwidth, to the point where it'd be more accurate to call them radars than wifi access points.

  • esrh 9 hours ago

    This is my paper (first author).

    I think the results here are much less important and surprising than what some people seem to be thinking. To summarize the core of the paper, we took stable diffusion (which is a 3-part system of an encoder, u-net, decoder), and replaced the encoder to use WiFi data instead of images. This gives you two advantages: you get text-based guidance for free, and the encoder model can be smaller. The smaller model combined with the semantic compression from the autoencoder gives you better (SOTA resolution) results, much faster.
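
    In other words, the inference side stays essentially a stock img2img pass, just run on the predicted latent instead of an encoded photo. A simplified sketch with off-the-shelf diffusers components (no classifier-free guidance; the model id, strength, and 0.18215 latent scale are the usual SD 1.5 defaults, not anything taken from the repo):

        import torch
        from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
        from transformers import CLIPTextModel, CLIPTokenizer

        model_id = "runwayml/stable-diffusion-v1-5"
        vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
        unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
        scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
        tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
        text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

        latent = torch.randn(1, 4, 64, 64)   # stand-in for the CSI encoder's predicted latent

        # Text guidance comes along for free because everything downstream is unchanged.
        prompt = "a person standing in an office"
        ids = tokenizer(prompt, padding="max_length",
                        max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
        with torch.no_grad():
            text_emb = text_encoder(ids)[0]

            # img2img-style refinement: partially noise the predicted latent, then denoise it.
            scheduler.set_timesteps(50)
            strength = 0.3                                        # fraction of the schedule to re-run
            timesteps = scheduler.timesteps[int(50 * (1 - strength)):]
            noisy = scheduler.add_noise(latent, torch.randn_like(latent), timesteps[:1])
            for t in timesteps:
                noise_pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
                noisy = scheduler.step(noise_pred, t, noisy).prev_sample
            image = vae.decode(noisy / 0.18215).sample            # (1, 3, 512, 512), roughly in [-1, 1]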

    I noticed a lot of discussion about how the model can possibly be so accurate. It wouldn't be wrong to consider the model overfit, in the sense that the visual details of the scene are moved from the training data to the model weights. These kinds of models are meant to be trained & deployed in a single environment. What's interesting about this work is that learning the environment well has become really fast because the output dimension is smaller than image space. In fact, it's so fast that you can basically do it in real time... you turn on a data collection node in a new environment and can train a model from scratch, online, that gets decent results with at least a little bit of interesting generalization in ~10min. I'm presenting a demonstration of this at Mobicom 2025 next month in Hong Kong.

    What people call "WiFi sensing" is now mostly CSI (channel state information) sensing. When you transmit a packet on many subcarriers (frequencies), the CSI represents how the data on each frequency changed during transmission. So, CSI is inherently quite sensitive to environmental changes.
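
    For anyone who hasn't looked at raw CSI before, a toy picture of what one capture looks like in code (purely illustrative, not the capture pipeline from the repo):

        import numpy as np

        # CSI ~ the per-subcarrier complex channel response H[k] = Y[k] / X[k]: how much each
        # frequency was attenuated and phase-rotated between transmitter and receiver.
        n_subcarriers = 512   # illustrative; the 160 MHz setup above yields ~2k values per capture
        tx = np.exp(2j * np.pi * np.random.rand(n_subcarriers))   # known training symbols
        fading = (np.random.rayleigh(1.0, n_subcarriers)
                  * np.exp(2j * np.pi * np.random.rand(n_subcarriers)))  # random per-subcarrier channel
        rx = tx * fading
        csi = rx / tx                                   # what the chipset reports, one vector per packet
        amplitude, phase = np.abs(csi), np.angle(csi)   # the features typically fed to sensing models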

    I want to point out something that most everybody working in the CSI sensing/general ISAC space seems to know: generalization is hard and most definitely unsolved for any reasonably high-dimensional sensing problem (like image generation and to some extent pose estimation). I see a lot of fearmongering online about wifi sensing killing privacy for good, but in my opinion we're still quite far off.

    I've made the project's code and some formatted data public since this paper is starting to pick up some attention: https://github.com/nishio-laboratory/latentcsi

    • phh 9 hours ago

      Is there a survey of the SoTA of what can be achieved with CSI sensing that you would recommend?

      What is available at the low level? Are researchers using SDRs, or are there common wifi chips that properly report CSI? Do most people feed in the CSI of literally every packet, or is it sampled?

  • equinox_nl 12 hours ago

    I'm highly skeptical about this paper just because the resulting images are in color. How the hell would the model even infer that from the input data?

    • anthonj 10 hours ago

      It is an overfitted model that uses WiFi data as hints for generation:

      "We consider a WiFi sensing system designed to monitor indoor environments by capturing human activity through wireless signals. The system consists of a WiFi access point, a WiFi terminal, and an RGB camera that is available only during the training phase. This setup enables the collection of paired channel state information (CSI) and image data, which are used to train an image generation model"

    • dtj1123 9 hours ago

      This is largely guesswork, but I think what's happening is this: the training set contains images of a small number of rooms taken from specific camera angles with only that individual standing in them, plus the associated wifi signal data. The model then learns to predict the posture of the individual given the wifi signal data, outputting the prediction as a colour image. Given that the background doesn't vary across images, the model learns to predict it consistently, with accurate colours etc.

      The interesting part of the whole setup is that the wifi signal seems to contain the information required to predict the posture of the individual to a reasonably high degree of accuracy, which is actually pretty cool.

    • orbital-decay 12 hours ago

      That's just a diffusion model (Stable Diffusion 1.5) with a custom encoder that uses CSI measurements as input. So apparently the answer is it's all hallucinated.

      • pftburger 11 hours ago

        Right, but it's hallucinating the right colours, which to me feels like some data is leaking somewhere. Because there's no way wifi sees colours.

        • HeatrayEnjoyer 4 hours ago

          Different materials and dyes have different dialectical properties. These examples are probably confabulation but I'm sure it's possible in principle.

          • plorg 4 hours ago

            Assuming you mean dielectric, but I do like the idea that different colors are different arguments in conflict with each other.

        • moffkalast 11 hours ago

          Well, perhaps it can: a 2.4GHz antenna is just a very red lightbulb. Maybe material absorption correlates, though it would be a long shot?

          • jstanley 11 hours ago

            You can't even pick colour out of infra-red-illuminated night time photography. There's no way you can pick colour out of WiFi-illuminated photography.

          • AngryData 10 hours ago

            There would be some correlation between an object's visual color and its spectrum at other EM frequencies, since many objects' colors come from the same dyes or pigment materials, but it seems pretty unlikely to be reliable at all across a range of different objects, materials, and dyes, because there is no universal RGB dye or pigment set we rely upon. You can make the same red color many different ways, but each material will have different spectral "colors" outside the visual range. Even something as simple as black plastic can be completely transparent in other parts of the spectrum, like the PS3 was to infrared. Structural colors would probably be impossible to discern, though I don't think too many household objects have structural colors unless you've got a stuffed bird or fish on the wall.

          • steinvakt2 11 hours ago

            If it sees the shape of a fire extinguisher, the diffusion model will "know" it should be red. But that's not all that's going on here. Hair color etc. seems impossible to guess, right? To be fair, I haven't actually read the paper, so maybe they explain this.

            • defraudbah 11 hours ago

              downvoted until you read the paper

    • meindnoch 9 hours ago

      The model was trained on images of that particular room, from that particular angle. It can only generate images of that particular room.

  • brcmthrowaway an hour ago

    So the application of this work is... surveillance. Why are there people working in this space?

  • nntwozz 10 hours ago

    One step closer to The Light of Other Days.

    "When a brilliant, driven industrialist harnesses the cutting edge of quantum physics to enable people everywhere, at trivial cost, to see one another at all times: around every corner, through every wall, into everyone's most private, hidden, and even intimate moments. It amounts to the sudden and complete abolition of human privacy--forever."

    • nashashmi 9 hours ago

      So privacy is a mathematical function of cost, capability, control, and reach?

  • nashashmi 9 hours ago

    Where is the color info coming from? It can’t come from WiFi. Is that being fed in using a photo?

  • malux85 11 hours ago

    PSA: If you publish a paper that talks about high-resolution images, can you please include at least one high-resolution image.

    I know that is a subjective metric, but by anyone's measure a 4x4 matrix of postage-stamp-sized images is not high resolution.

    • mistercow 10 hours ago

      1. “High resolution” in this kind of context is generally relative to previous work.

      2. “Postage stamp sized” is not a resolution. Zoom in on them and you’ll see that they’re quite crisp.

    • amagasaki 10 hours ago

      The HTML version has much larger images