All in One View
Content from AI-Assisted Coding
Last updated on 2026-04-30
Overview
Questions
Objectives
The introduction of ChatGPT in late 2022 was a watershed moment for using natural language to communicate with computers. ChatGPT and rival models like Claude and Gemini provide human-like answers in response to increasingly complex prompts. These large language models (LLMs) have come to dominate conversations about the future of knowledge, art, and tech.
What is a large language model?
A full explanation of how LLMs work is beyond the scope of this lesson, but a simplified description may be useful. In short, ChatGPT, Claude, and Gemini are all transformer models. When a user submits a prompt, these models split the prompt into tokens, which are then converted into numerical representations called vectors.
- A token is a fragment of data representing part of a word, image, or other data object
- A vector is a numerical representation of the token
The vectorized prompt is then passed through a series of transformers, each of which examines and transfers information between elements of the prompt. As the prompt makes its way through the transformers, the model refines its interpretation, ultimately using this output to predict the tokens that make up the response.
The process of generating an output from a prompt is called inference.
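The prompt-to-tokens-to-vectors pipeline described above can be illustrated with a toy sketch. This is not how a real model works: production tokenizers use learned subword vocabularies, and real vectors come from trained embedding matrices. The hash-based vectorize below is purely a stand-in to show the shape of the data.

```python
# Toy illustration of the prompt -> tokens -> vectors pipeline.
# NOT a real LLM: the tokenizer and "embeddings" here are fakes.
import hashlib

def tokenize(prompt: str) -> list[str]:
    # Real tokenizers split text into subword fragments; for
    # illustration we simply split on whitespace.
    return prompt.lower().split()

def vectorize(token: str, dims: int = 4) -> list[float]:
    # Stand-in for an embedding lookup: derive a few numbers
    # in [0, 1] from a hash of the token.
    digest = hashlib.sha256(token.encode()).digest()
    return [b / 255 for b in digest[:dims]]

tokens = tokenize("Please provide coordinates for Springfield Township")
vectors = [vectorize(t) for t in tokens]
print(len(tokens), len(vectors[0]))  # prints "6 4"
```

A real transformer would then pass these vectors through its layers to predict the response tokens; that step is far beyond a short sketch.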
Challenges using LLMs
LLMs can be a useful tool but have significant limitations. Some risks associated with using LLMs include:
- Responses may be inaccurate. LLMs famously hallucinate, that is, provide plausible but incorrect responses. Hallucinations are believed to be intrinsic to how these models are built and trained. Models can also inherit biases in their training data and do not reliably return exact text. Certain hallucinations in coding pose security risks, which we will discuss later.
- Responses are non-deterministic. The same prompt may give different responses at different times, posing a challenge for reproducibility.
- Information submitted to an LLM may be disclosed to other entities. Prompts may be added to the model and shared with other users. Paid accounts may protect submitted data, but LLMs can still leak information in other ways, for example, by revealing chat histories in response to an adversarial prompt. Disclosure of information to an LLM may also have intellectual property implications. As a rule, do not submit sensitive or confidential data to an LLM unless your organization has explicitly approved it for such use.
- Over-reliance on LLMs may degrade associated cognitive skills. Using an LLM to learn a new skill may reduce independent performance and persistence.
- Recent changes to billing practices may increase cost for complex workflows. Technology companies have subsidized the cost of using LLMs, particularly for heavy users, and a shift to usage-based billing is intended to bring user payments in line with compute costs.
- The process of training LLMs raises serious ethical concerns. Training and using an LLM requires significant resources. Legal questions about how these models were trained persist. Large models require vast quantities of data and have frequently been found to have been trained on unlicensed, copyrighted materials. They may return copyrighted material without disclosing it.
Follow organizational policies
Many organizations have formal AI use policies. Familiarize yourself with expectations around AI in your organization and discipline before using it in research.
Knowing about these risks does not make you immune to them. As one example, the primary AI reporter for Ars Technica, a prominent technology blog, was recently fired for publishing hallucinated quotes generated by an AI tool despite being intimately familiar with these issues. Likewise, an ever-increasing number of lawyers are being disciplined for citing invented cases. Like journalists and lawyers, researchers are responsible for the accuracy of their work and must be vigilant in how they use AI and document its contributions to their research.
General strategies
- Verify information provided by the LLM
- Generate code, not information
- Test results against validated data
Best practices for open science apply whenever you publish:
- Retain and archive copies of your original data
- Test code against multiple datasets, not just your exact data
- Publish associated datasets in suitable repositories
- Provide instructions for installation, testing, and use
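The "test results against validated data" strategy can be sketched in a few lines. The reference and generated records below are hypothetical; in practice the reference set would be a small sample of hand-checked georeferences.

```python
# Sketch: compare LLM-generated coordinates against a small,
# hand-validated reference set. All records here are hypothetical.
reference = {"Springfield Township, NJ": (40.7260, -74.3505)}
generated = {"Springfield Township, NJ": (40.7259, -74.3506)}

TOLERANCE = 0.01  # degrees, roughly 1 km at this latitude

def within_tolerance(ref, gen, tol=TOLERANCE):
    # Accept the generated point if both coordinates fall within
    # the tolerance of the reference point.
    return abs(ref[0] - gen[0]) <= tol and abs(ref[1] - gen[1]) <= tol

for locality, ref_coords in reference.items():
    ok = within_tolerance(ref_coords, generated[locality])
    print(locality, "OK" if ok else "MISMATCH")
```

A degree-based tolerance is a crude first pass; a proper check would use a true distance calculation and the record's stated uncertainty.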
Content from Use Case: Georeferencing
Overview
Questions
- What is georeferencing?
- What are some challenges with georeferencing?
- How can LLMs be used to address these challenges?
Objectives
- Describe georeferencing
Georeferencing is the process of converting textual locality data to coordinates. It is a critical workflow in long-lived natural history collections where the majority of specimens were collected before GPS existed. Researchers often filter by locality when searching for specimens, and specimens without coordinates may not be used as frequently. However, georeferencing is a time-consuming process, and guidelines are complex. Despite the availability of several map-based tools designed to speed up georeferencing (like GEOLocate and GeoPick), only x% of NMNH specimen records have been georeferenced as of 2025.
Some researchers have suggested that LLMs can be used to georeference collections at scale. Recent work has found that LLM approaches can be as accurate as manual georeferences at a much lower cost in both money and time.
LLMs and digitization
This workflow starts from specimens that have already been added to a database. However, a large fraction of the NMNH collection has not been digitized at all. LLMs are a promising tool for digitization efforts. For example, many models are adept at reading and extracting data from handwritten text, even cursive.
The remainder of this lesson will use a chat-based approach to georeference locality strings with the goal of adding the coordinates in the primary specimen database. (A real-world implementation would use an application programming interface, or API, which would allow us to query the LLM programmatically, but we’ve opted to use the more familiar chat interface here.) This approach has several things to recommend it:
- Georeferencing is time consuming and often tedious
- Recent work suggests that LLM georeferences are comparable to those done by people
- Geospatial data is relatively easy to validate to a degree where we can be confident that a point is at least in the right area
- Both the NMNH collection database and common data standards provide dedicated fields to annotate georeferences
But it also raises some red flags.
Challenge
What are some risks associated with using an LLM for this workflow?
The primary risk is recording/publishing inaccurate data. Some sources of error may include:
- Gross errors in coordinates resulting from confabulations or other similar errors. These may include fabricated coordinates and unreasonable uncertainty radii.
- Inconsistent results in response to prompts from the same locality
- Researchers may be hesitant to use LLM-generated data
More generally, risks might include running insecure code.
Challenge
How can these risks be mitigated?
- Annotate georeferences so that users know that AI has been used. Be aware that:
- Researchers accessing data in bulk may not bother to read those annotations
- Researchers may omit AI georeferences if they are clearly flagged
- Validate the georeferences, for example, by comparing them to manual georeferences, checking the coordinates against administrative boundaries, and checking the consistency of coordinates returned by similar prompts.
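Comparing an AI georeference to a manual one boils down to a distance calculation. A minimal sketch using the haversine formula (the coordinates below are illustrative):

```python
# Sketch: measure the distance between an AI-generated georeference
# and a manual one using the haversine formula.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters on a spherical Earth (R = 6371 km).
    r = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

ai = (40.725962, -74.350546)   # LLM georeference (illustrative)
manual = (40.7266, -74.3495)   # manual georeference (illustrative)
dist = haversine_m(*ai, *manual)
print(f"{dist:.0f} m apart")
```

If the separation is small relative to the record's coordinateUncertaintyInMeters, the two georeferences are effectively in agreement.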
Content from Prompting
Overview
Questions
Objectives
A prompt is a series of instructions submitted to an LLM to generate a response. Prompt engineering is the creation of natural-language prompts that coax the desired output from an LLM. The most familiar way to submit a prompt to an LLM is via a chat interface, which proceeds as a conversation.
We will be working from the Etherpad and an LLM chat for this part of the lesson.
Consider the locality string Springfield Township, NJ.
Prompt the LLM to georeference this locality as follows:
Please provide coordinates for Springfield Township, NJ
Copy the response into the Etherpad for this lesson.
Challenge
Read over the responses in the Etherpad. What do you notice about them?
Here are some common attributes of responses:
- Multiple localities match this string. Most responses include two localities, but some may list three or more.
- By default, the response is given as text, which makes it challenging to extract coordinates.
- Different responses may contain different coordinates.
Are there any outlier responses?
The initial response is promising but difficult to evaluate quantitatively. We can update the prompt to produce output that is easier for a script to parse.
Please provide coordinates for Springfield Township, NJ. Include all localities matching this locality string. Output coordinates as JSON including the following keys for each match: country, stateProvince, county, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, georeferenceRemarks, and sourceURL. The response should only include the JSON.
Darwin Core
The field names used in this prompt are mostly from Darwin Core, a widely used natural history data standard. NMNH uses Darwin Core to share its data with aggregators like the Global Biodiversity Information Facility (GBIF).
We can also supply structured data:
Please provide coordinates for the following locality: stateProvince: NJ municipality: Springfield Township locality: 100 m SW of Hobart Ave and Beacon Rd Include all localities matching this locality string. Output coordinates as JSON including the following keys for each match: country, stateProvince, county, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, georeferenceRemarks, and sourceURL. The response should only include the JSON object.
Create a file called coords.json. Open it with VS Code or another text editor. Copy the JSON from the last response into the file and save. Please also copy the JSON to the Etherpad.
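Before loading LLM output like this into a specimen database, it is worth a quick sanity check. The sketch below assumes the structure requested in the prompt (a JSON list of match objects with the Darwin Core-style keys named above); the sample record is illustrative.

```python
# Sketch: sanity-check the JSON returned by the LLM before using it.
import json

REQUIRED = {"country", "stateProvince", "county", "decimalLatitude",
            "decimalLongitude", "geodeticDatum",
            "coordinateUncertaintyInMeters", "georeferenceRemarks",
            "sourceURL"}

def check_matches(text: str) -> list[dict]:
    matches = json.loads(text)
    if isinstance(matches, dict):  # tolerate a single object
        matches = [matches]
    for m in matches:
        missing = REQUIRED - m.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if not (-90 <= m["decimalLatitude"] <= 90
                and -180 <= m["decimalLongitude"] <= 180):
            raise ValueError("coordinates out of range")
    return matches

sample = ('[{"country": "United States", "stateProvince": "New Jersey", '
          '"county": "Union", "decimalLatitude": 40.725962, '
          '"decimalLongitude": -74.350546, "geodeticDatum": "WGS84", '
          '"coordinateUncertaintyInMeters": 120, '
          '"georeferenceRemarks": "", "sourceURL": ""}]')
print(len(check_matches(sample)))  # prints "1"
```

A check like this catches malformed output and gross coordinate errors, but not plausible-looking hallucinations; those require the validation strategies discussed earlier.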
Content from Assisted Coding
Overview
Questions
Objectives
- Understand the difference between traditional coding, assisted coding, and vibe coding
- Use an LLM to create a Python script to map coordinates to counties
- Introduce the geopandas package and geospatial concepts
- Read through generated code to understand how it works
This lesson will use LLM-assisted coding to create a Python script
that we can use to assess the coordinates in the
coords.json file.
Coding styles
- Traditional coding: Low trust. Coder writes out their code manually, referring to documentation or forums if they get stuck.
- Assisted coding: Medium trust. Coder consults with an LLM to write blocks of code that the coder then reviews and integrates into the larger application. They can ask the LLM follow-up questions to better understand the code and test blocks as they go to ensure that code is working correctly.
- Vibe coding: High trust, resource intensive. Coder relies on the LLM to write most or all of their code, even an entire application. They review output and provide the LLM with additional prompts to modify functionality but mostly do not touch the code itself.
Using LLMs shifts the focus of the coder from writing code to reading and testing code. LLMs can be very useful for understanding what code is doing, but be careful: some studies suggest that over-reliance on LLMs may reduce persistence and independent performance. Working through problems is critical to learning how to write and read code.
Code generated using assisted methods must be vetted before being run. Risks of running unvalidated code include:
- Accidental deletion of files. Functions like os.unlink() or shutil.rmtree() can delete files or entire directories. Opening a file in write mode will delete its contents.
- Cybersquatting attacks. Generated code may include hallucinated package names, which can be used by adversarial actors to install malicious software in an attack known as slopsquatting.
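One lightweight mitigation is a static scan of generated code for destructive calls before running it. The sketch below uses Python's ast module; it is a coarse first pass, not a security guarantee (for example, it misses aliased imports).

```python
# Sketch: flag potentially destructive calls in generated code
# before executing it. A coarse static check, not a sandbox.
import ast

RISKY = {("os", "unlink"), ("os", "remove"), ("shutil", "rmtree")}

def risky_calls(source: str) -> list[str]:
    found = []
    for node in ast.walk(ast.parse(source)):
        # Match calls of the form module.function(...), e.g. shutil.rmtree(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and (node.func.value.id, node.func.attr) in RISKY):
            found.append(f"{node.func.value.id}.{node.func.attr}")
    return found

print(risky_calls("import shutil\nshutil.rmtree('data')"))  # prints "['shutil.rmtree']"
```

Running generated code in a scratch directory or container is a more robust complement to a scan like this.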
Earlier in the lesson, we considered some ways we might vet coordinates returned by an LLM. Possibilities included:
- Using a map to check each set of coordinates
- Comparing coordinates to existing specimens with similar locality information
- Checking whether the coordinates fall in the expected administrative division
We will work on the third option here.
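At its core, the county check is a point-in-polygon test. Libraries like geopandas handle this robustly against real administrative boundaries; the ray-casting sketch below just shows the underlying idea, using a square as a stand-in for a county polygon.

```python
# Sketch: ray-casting point-in-polygon test. Real workflows use
# geopandas against actual county boundaries; the square below is
# a stand-in polygon for illustration.
def point_in_polygon(lon, lat, polygon):
    # Cast a ray east from the point and count edge crossings;
    # an odd count means the point is inside the polygon.
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A 1-degree square covering part of northern New Jersey.
square = [(-75.0, 40.0), (-74.0, 40.0), (-74.0, 41.0), (-75.0, 41.0)]
print(point_in_polygon(-74.35, 40.73, square))  # prints "True"
```

Real county boundaries are far more complex than this square, which is why the generated code in the next challenge leans on geospatial libraries rather than hand-rolled geometry.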
Challenge
Prompt the LLM to write Python code to determine which US county a set of coordinates is in, then answer the following questions:
- Can you follow the code returned by the LLM?
- What concepts are unfamiliar to you?
- How can we improve the prompt?
Remember: You can use the LLM itself to ask about unfamiliar concepts.
Concepts that commonly occur in the code returned for this prompt but that are not covered in the Python lesson include:
- Python objects like classes, functions, and __main__
- Geospatial concepts like shapefiles, coordinate reference systems, spatial indexes, and spatial joins
- External libraries like geopandas, shapely, and pyogrio
Because generative AI is non-deterministic, this list is not comprehensive.
How can we improve this prompt?
Introduction to geopandas
geopandas is a geospatial library based on
pandas. It allows us to draw maps and perform geospatial
analyses (like calculating distances and areas) using similar syntax to
pandas.
Geospatial analysis is an enormous topic. This overview will be limited to concepts that are likely to appear in the generated code.
Let’s load the JSON file we created in the previous lesson. First
we’ll use the read_json() method to load the JSON file as a
DataFrame:
OUTPUT
| | country | stateProvince | county | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | georeferenceRemarks | sourceURL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | New Jersey | Union | 40.725962 | -74.350546 | WGS84 | 120 | Locality described as 100 m southwest of the i… | https://www.bing.com/maps?cp=40.726597~-74.349… |
Now we’ll create a GeoDataFrame from the
DataFrame:
PYTHON
geodf = gpd.GeoDataFrame(
df,
geometry=gpd.points_from_xy(df["decimalLongitude"], df["decimalLatitude"]),
crs=4326,
)
geodf
OUTPUT
| | country | stateProvince | county | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | georeferenceRemarks | sourceURL | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | New Jersey | Union | 40.725962 | -74.350546 | WGS84 | 120 | Locality described as 100 m southwest of the i… | https://www.bing.com/maps?cp=40.726597~-74.349… | POINT (-74.35055 40.72596) |
A coordinate reference system (CRS) is used to measure locations on or near the Earth’s surface. Components of a spatial reference include:
- An ellipsoid that approximates the shape of the Earth
- A point of origin (for example, the Prime Meridian)
- A unit (typically degrees or meters)
- Axes and order
Different CRS are suited to different tasks. Some are worldwide, while others are optimized for specific regions. Common CRS include:
- WGS84 (EPSG:4326) (worldwide, used by GPS)
- NAD83 (EPSG:XXXX) (North America)
The main thing to know here is that the CRS must be the same
when comparing datasets. Changing from one coordinate system to
another is referred to as projection. Use the to_crs()
method to project a GeoDataFrame to another CRS. There are
many ways to specify the new CRS, but the easiest is by EPSG code:
"epsg:4326" or 4326: