Can mushrooms be mapped?
I wanted to write up a post summarizing what I’ve been working on for the past few years for my Master’s thesis in Geography. Whenever I meet up with old friends or family, I just end up with confused faces by the end of my explanation. So here’s to hoping that writing it out in a more casual way will help me explain it better in the future, but at the very least I hope you enjoy learning more about my methods and process 🙂
Part One: Background and Species
The initial idea to do a project on mushrooms didn’t come until a few months into my grad school experience. I had originally been considering a project on rare plants in the Red River Gorge area, built around a more traditional plant community survey. However, at the same time I was becoming more interested in the mushrooms growing in the woods around Miami’s campus. This new interest naturally started to meld with my geography-oriented mind, and I started thinking about how a distribution study could even be done on a mushroom.
There are a lot of practical problems with mapping mushrooms. The traditional approach to a distribution analysis is a field survey, which is conducted to identify the presence or absence of a species in a predefined area. This detailed information on presence and absence at known locations can then be used to make projections. This approach works well for plants since they’re stationary and relatively long lived. Mushrooms, on the other hand, spend the majority of their lives underground, and the timing of their fruiting is largely dependent upon weather conditions. So scouting an area may not yield any mushrooms, but you can’t say for certain that they’re not there; it could be a case of arriving at the wrong time.
With these limitations in mind, I started looking around to see if anyone else had tried to map mushrooms before and what strategies they used to overcome these hurdles. There were only a handful of papers online, and they all made use of herbarium datasets. Having worked in a herbarium before, I was excited to find a new way to apply the data I’d had some hand in creating. Unfortunately, herbarium records in the US aren’t as thorough as in Europe, where I’d seen most of the prior studies, so I wouldn’t get as many points as I would have liked. To get more points, I ended up using citizen science data. In particular, the points were gathered from the iNaturalist platform, an app that lets users upload photos of observations that are already geocoded and that has a built-in method for ranking the quality of each observation. In my opinion, adding this component ended up being one of the cooler parts of the project and helped me create something more interesting.
With this basic idea in mind, I started to dive deeper into the specifics of a project like this. For one thing, I knew I’d be limited by the number of points in the databases, even with the citizen science data included. That meant finding a species that was common enough, or one which was highly sought after. I also knew that, since I’d be incorporating citizen science data, I’d want a species with a very distinctive appearance so it could be easily identified by most naturalists. I went searching through the MyCoPortal, an online repository of mycological specimens, to find something that fit the bill. At first I was searching for species in Wisconsin and Michigan, but the databases didn’t have nearly enough usable specimens. After looking over several other areas, I landed on Washington and Oregon for my study area and the Pacific Golden Chanterelle (Cantharellus formosus) for my target species.
Washington and Oregon both have a history of mushroom foraging, where the Pacific Golden Chanterelle is one of the most highly sought after culinary mushrooms and the state mushroom of Oregon. The cultural and economic significance of the species meant that there were plenty of records within the MyCoPortal database and across citizen science platforms and I could really begin working on the project.
Part Two: Modeling
Having found a suitable species and location, I went on to model the distributions. I’ll try not to bore you with the details, but I’ll give a brief overview of how it was done. In total, between the citizen science and herbarium records, I got around 400 points. This was somewhat disappointing, since the herbarium dataset started with 1,600 points. Unfortunately, most of these were from old records and didn’t contain an exact location, but hopefully this will only improve in time. So after cleaning the data and getting a dataset containing only the points I could use, I had to decide how to map them.
The method I chose was to use the herbarium points to model the distribution, and the citizen science points to assess the accuracy. At the time I thought this would lead to more statistically robust results by having an independent dataset for the accuracy check, but in hindsight it may have been better to combine the datasets and use a random split of 20% of the data to assess the quality. That approach would have given me more points to map with and therefore a more complete picture of the distributions. Either way, for my thesis I went with the method of keeping two separate datasets to get the “gold standard” of species distribution modeling.
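To make the alternative concrete, here’s a minimal sketch of what a random 80/20 split of a combined dataset could look like. It’s written in Python purely for illustration and isn’t the code from my thesis; the function name split_points, the fixed seed, and the made-up coordinates are all assumptions.

```python
import random

def split_points(points, test_fraction=0.2, seed=42):
    """Shuffle occurrence points and hold out a fraction for accuracy checks."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(points)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

# Hypothetical combined dataset of (longitude, latitude, source) tuples.
combined = [
    (-123.0 + i * 0.01, 45.0 + i * 0.005, "herbarium" if i % 4 == 0 else "inaturalist")
    for i in range(400)
]

train, test = split_points(combined)
print(len(train), len(test))   # 320 80
```

The trade-off is that the testing points are no longer independent of the training data’s collection process, which is exactly what the two-dataset approach was trying to avoid.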
I was also interested in accounting for spatial bias. One of the important aspects of traditional sampling techniques is the ability to sample data in a systematic way. The data I worked with, on the other hand, was collected opportunistically, so there could be biases present. In particular, I hypothesized that there would be more observations in easier-to-access places, like near a road or by population centers. In order to counteract this, I made use of an effort variable (CITE). This is essentially just a way to approximate the “effort” spent on a particular collection. To create this layer I gathered information on population density, distance to major roadways, road density, terrain ruggedness, and locations of herbariums. I then ran a principal components analysis (PCA) on the layers, which produces a single layer that is a composite of the most important features of the initial layers. With this proxy layer for effort I could then weight the points in the model to reflect the reality that a lack of observations doesn’t mean a place isn’t habitat, and I could compare the models with and without the effort variable to see how it impacted performance. Unfortunately, because of the software packages I used for the modeling, I couldn’t incorporate a weighting scheme in two of the models: random forests and maxent.
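If you’re wondering how several layers collapse into one, here’s a rough sketch of the idea in plain Python. It standardizes each layer and takes the first principal component using power iteration, which is a simple stand-in and not necessarily how a GIS or stats package computes it. The six grid cells, their values, and the layer names are invented for illustration, and the herbarium-location layer is left out to keep it short.

```python
import math

def standardize(column):
    """Rescale one layer to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_principal_component(layers, iterations=200):
    """Return (loadings, scores) for the first principal component.

    `layers` maps a layer name to a list of values, one per grid cell.
    """
    names = list(layers)
    z = [standardize(layers[name]) for name in names]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the standardized layers.
    corr = [
        [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(k)]
        for i in range(k)
    ]
    # Power iteration converges on the dominant eigenvector (the loadings).
    # Starting from the first layer keeps that layer's loading non-negative.
    vec = [1.0] + [0.0] * (k - 1)
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Each cell's score is the loading-weighted sum of its standardized values.
    scores = [sum(vec[i] * z[i][cell] for i in range(k)) for cell in range(n)]
    return dict(zip(names, vec)), scores

# Hypothetical values for six grid cells, ordered from urban to remote.
cells = {
    "population_density": [900, 650, 300, 120, 40, 10],
    "road_distance_km": [0.2, 0.5, 1.5, 4.0, 9.0, 15.0],
    "road_density": [8.0, 6.5, 4.0, 2.0, 1.0, 0.3],
    "ruggedness": [5, 8, 20, 35, 60, 80],
}

loadings, effort = first_principal_component(cells)
for name, weight in loadings.items():
    print(f"{name}: {weight:+.2f}")
```

In this toy example, cells near roads and towns end up with high scores and remote, rugged cells with low ones, so the composite behaves like an accessibility index that can stand in for collection effort.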
The herbarium points, environmental characteristics, and effort variable were then used to train four statistical models: a generalized linear model (GLM), maxent, random forests (RF), and an artificial neural network (ANN). These models were also coupled with pseudo-absence points. These were chosen randomly throughout the study area, with the constraint that pseudo-absence points should be placed away from areas with known presences. This procedure is a bit contentious, because to some extent you’re dictating where the data points will be and possibly having a big impact on the final output. In this case, though, I felt it was justified, since mushrooms can occupy a broad range within their habitat, forming vast mycorrhizal connections underground. The Pacific Golden Chanterelle is also known to range on the west side of the Cascade mountains, and this targeted pseudo-absence point selection helped to better reflect this.
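Here’s a small sketch of that constrained pseudo-absence selection, again in Python for illustration only. The 20 km buffer, the rough bounding box for Washington and Oregon, the three presence points, and the function names are all assumptions I made up for the example, not the settings from my thesis.

```python
import math
import random

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def sample_pseudo_absences(presences, bbox, n_points, min_distance_km,
                           seed=0, max_tries=100_000):
    """Draw random points in `bbox` that stay clear of every known presence."""
    rng = random.Random(seed)
    min_lon, min_lat, max_lon, max_lat = bbox
    absences = []
    tries = 0
    while len(absences) < n_points and tries < max_tries:
        tries += 1
        lon = rng.uniform(min_lon, max_lon)
        lat = rng.uniform(min_lat, max_lat)
        # Reject any candidate that falls inside the buffer around a presence.
        if all(haversine_km(lon, lat, p_lon, p_lat) >= min_distance_km
               for p_lon, p_lat in presences):
            absences.append((lon, lat))
    return absences

# Hypothetical presence points and a rough bounding box (lon/lat degrees).
presences = [(-123.5, 45.5), (-122.0, 47.0), (-123.9, 44.2)]
study_bbox = (-124.8, 42.0, -116.5, 49.0)

pseudo_absences = sample_pseudo_absences(presences, study_bbox,
                                         n_points=50, min_distance_km=20)
print(len(pseudo_absences))
```

Drawing uniformly in longitude and latitude isn’t strictly equal-area, and a real workflow would sample inside the study area polygon instead of a box, but the rejection step is the part that matters here: the bigger the buffer, the more you’re steering where the absences land.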
Part Three: Results
Putting all the pieces together, I was able to create the following distribution maps. These first four show the four models with the effort variable. Visually assessing the results, we can see they have similar trends. Abundance is highest in the Coast and Cascade mountains, with less abundance in the North and in the heavily populated valley areas. These overall trends match up well with local knowledge about the species, but which of the modeling procedures models it best when compared to the iNaturalist points?
To assess which model performed best, I compared the distribution maps to the known locations of the iNaturalist points. I considered a couple of different statistics when doing this and settled on the true skill statistic (TSS), as a measure of a single model’s performance, and the equitable threat score (ETS), which takes into account the random chance that a point is correct and allows for comparisons between models. By comparing the ETS stats below we can see some clear trends. For one, the RF model performed best, getting a near perfect score. We can also see a clear distinction between the models that made use of the effort variable and those that didn’t. I’ll talk more about my interpretation of these results in the next section.
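Both statistics come from the same four counts: hits (presence predicted and observed), false alarms (presence predicted but not observed), misses, and correct negatives. TSS is sensitivity plus specificity minus one, and ETS subtracts the number of hits you’d expect by chance from the usual threat score. Here’s a minimal Python sketch using the standard formulas; the counts are invented, and I’m assuming the testing set includes absences or pseudo-absences alongside the presences.

```python
def confusion_counts(predicted, observed):
    """Tally (hits, false alarms, misses, correct negatives) from 0/1 lists."""
    hits = false_alarms = misses = correct_negatives = 0
    for p, o in zip(predicted, observed):
        if p and o:
            hits += 1
        elif p and not o:
            false_alarms += 1
        elif o:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives

def true_skill_statistic(hits, false_alarms, misses, correct_negatives):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = hits / (hits + misses)
    specificity = correct_negatives / (correct_negatives + false_alarms)
    return sensitivity + specificity - 1

def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """ETS = (hits - chance hits) / (hits + false alarms + misses - chance hits)."""
    total = hits + false_alarms + misses + correct_negatives
    hits_random = (hits + false_alarms) * (hits + misses) / total
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)

# Hypothetical counts for one model checked against 100 testing points.
counts = (40, 10, 10, 40)
tss = true_skill_statistic(*counts)
ets = equitable_threat_score(*counts)
print(round(tss, 3), round(ets, 3))   # 0.6 0.429
```

A score of 0 on either statistic means the model did no better than chance, and 1 means every point was classified correctly.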
Now let’s examine the effectiveness of the effort variable. Visually comparing the models with and without the effort variable, we can see some major differences. The projections without the effort variable encompass a much greater area. The effort variable was included to limit the projections and stop them from overfitting areas of low effort, and in that respect its inclusion was a huge success. We can see clearly that the valley areas and areas of greater population are emphasized far less in the effort variable models.
Part Four: Things Learned and Areas to Improve
Overall, the project proved to be successful in a number of different areas. As one of the first studies of its type, it serves as a good proof of concept for the techniques and methodology used. Effectively mapping distribution on such a large scale showcased the utility of herbarium records and citizen science data that are readily available online. And the use of the effort variable to better control unsystematically obtained data was very successful. That being said, there are a number of things I would change if I were to undertake this again.
Going after the theoretical “gold standard” of modeling was an important aspect of the project, and it meant keeping the datasets independent from each other. However, as discussed in Part Two, this approach may have limited the effectiveness of the models. An easy improvement for future projects would be to combine the datasets. That way the models have a much larger swath of data to train on, likely leading to better model results. Another easy improvement would have been to limit the area of interest more intensively. Since it’s already known that the species is located to the west of the Cascade mountains, using the political boundaries to limit the analysis extent included a large swath of area known to be uninhabitable for the species. Cutting out this uninhabitable area would likely have produced better figures, as the absence points wouldn’t have needed to cover such a large area. I was curious what implementing this change would look like, so I did a quick rerun of the models and got the figures below.
Another thing I’d have liked to implement better is the effort variable. While its inclusion was largely successful, there are two aspects that could be improved. One is obvious: I would have loved to use the effort variable on all four models. Unfortunately, the package I was using in R limited me to using it on only two. Being able to keep the analysis consistent would have really improved the interpretability of the results. The other issue is more involved and goes back to the decision to separate the datasets. I only discovered this late in the project, but the two datasets have different trends with respect to the effort of the observations. The herbarium points are on average further from roads and population centers and therefore represent more effort, whereas the majority of iNaturalist points are located within 1-5 km of a major road. The models were trained on the higher-effort herbarium points and then run through the effort weighting scheme, which lessened the effect of low-effort points. Testing their accuracy against the lower-effort iNaturalist points therefore made the results harder to interpret. Essentially, the models that made use of the effort variable had their accuracy scores penalized by this mismatch, which could explain the large difference in accuracy scores between the models with and without the effort variable. This issue could have been resolved by combining the datasets before running the models.
Despite these shortcomings, the project serves as a good starting point for a low cost distribution analysis of this type, and it also leads to some interesting questions. For example, what are the direct applications of the distribution analysis done on Cantharellus formosus? Well, as noted earlier, the Pacific Golden Chanterelle is a very valuable economic resource for the region, and because the species can’t be cultivated, every year there are mushroom hunters who go out and find it. Understanding the distribution of the species can then give an idea of where to expect these mushroom hunters. Knowing this ahead of time could inform policy decisions, like creating more trails in the area to deter people from going off trail or stationing rangers there to stop overharvesting.
Understanding the current distributions can also help with protecting the species from human caused environmental issues like climate change or deforestation. The current distribution maps could be used to identify large habitat zones and prevent overexploitation. By running a similar distribution analysis, but projecting the results onto the conditions expected under climate change predictions, we can get an idea of how the distribution may change in the future. Under current climate change scenarios the Pacific Northwest is expected to get warmer and drier with more persistent wildfires, and understanding how this may impact the species in the area is key to protecting this important natural resource.