All in One View
Content from AI-Assisted Coding
Last updated on 2026-04-30
Overview
Questions
Objectives
The introduction of ChatGPT in late 2022 was a watershed moment for using natural language to communicate with computers. ChatGPT and rival models like Claude and Gemini provide human-like answers in response to increasingly complex prompts. These large language models (LLMs) have come to dominate conversations about the future of knowledge, art, and tech.
What is a large language model?
A full explanation of how LLMs work is beyond the scope of this lesson, but a simplified description may be useful. In short, ChatGPT, Claude, and Gemini are all transformer models. When a user submits a prompt, these models split the prompt into tokens, which are then converted into numerical representations called vectors.
- A token is a fragment of data representing part of a word, image, or other data object
- A vector is a numerical representation of the token
The vectorized prompt is then passed through a series of transformers, each of which examines and transfers information between elements of the prompt. As the prompt makes its way through the transformers, the model refines its interpretation, ultimately using this output to predict the tokens that make up the response.
The process of generating an output from a prompt is called inference.
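The prompt-to-tokens-to-vectors pipeline described above can be illustrated with a toy sketch. This is not how a real model works: production tokenizers use learned subword vocabularies, and real vectors come from trained embedding matrices. The hash-based vectorize below is purely a stand-in to show the shape of the data.

```python
# Toy illustration of the prompt -> tokens -> vectors pipeline.
# NOT a real LLM: the tokenizer and "embeddings" here are fakes.
import hashlib

def tokenize(prompt: str) -> list[str]:
    # Real tokenizers split text into subword fragments; for
    # illustration we simply split on whitespace.
    return prompt.lower().split()

def vectorize(token: str, dims: int = 4) -> list[float]:
    # Stand-in for an embedding lookup: derive a few numbers
    # in [0, 1] from a hash of the token.
    digest = hashlib.sha256(token.encode()).digest()
    return [b / 255 for b in digest[:dims]]

tokens = tokenize("Please provide coordinates for Springfield Township")
vectors = [vectorize(t) for t in tokens]
print(len(tokens), len(vectors[0]))  # prints "6 4"
```

A real transformer would then pass these vectors through its layers to predict the response tokens; that step is far beyond a short sketch.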
Challenges using LLMs
LLMs can be a useful tool but have significant limitations. Some risks associated with using LLMs include:
- Responses may be inaccurate. LLMs famously hallucinate, that is, provide plausible but incorrect responses. Hallucinations are believed to be intrinsic to how these models are built and trained. Models can also inherit biases in their training data and do not reliably return exact text. Certain hallucinations in coding pose security risks, which we will discuss later.
- Responses are non-deterministic. The same prompt may give different responses at different times, posing a challenge for reproducibility.
- Information submitted to an LLM may be disclosed to other entities. Prompts may be added to the model and shared with other users. Paid accounts may protect submitted data, but LLMs can still leak information in other ways, for example, by revealing chat histories in response to an adversarial prompt. Disclosure of information to an LLM may also have intellectual property implications. As a rule, do not submit sensitive or confidential data to an LLM unless your organization has explicitly approved it for such use.
- Over-reliance on LLMs may degrade associated cognitive skills. Using an LLM to learn a new skill may reduce independent performance and persistence.
- Recent changes to billing practices may increase cost for complex workflows. Technology companies have subsidized the cost of using LLMs, particularly for heavy users, and a shift to usage-based billing is intended to bring user payments in line with compute costs.
- The process of training LLMs raises serious ethical concerns. Training and using an LLM requires significant resources. Legal questions about how these models were trained persist. Large models require vast quantities of data and have frequently been found to have been trained on unlicensed, copyrighted materials. They may return copyrighted material without disclosing it.
Follow organizational policies
Many organizations have formal AI use policies. Familiarize yourself with expectations around AI in your organization and discipline before using it in research.
Knowing about these risks does not make you immune to them. As one example, the primary AI reporter for Ars Technica, a prominent technology blog, was recently fired for publishing hallucinated quotes generated by an AI tool despite being intimately familiar with these issues. Likewise, an ever-increasing number of lawyers are being disciplined for citing invented cases. Like journalists and lawyers, researchers are responsible for the accuracy of their work and must be vigilant in how they use AI and document its contributions to their research.
General strategies
- Verify information provided by the LLM
- Generate code, not information
- Test results against validated data
Best practices for open science apply whenever you publish:
- Retain and archive copies of your original data
- Test code against multiple datasets, not just your exact data
- Publish associated datasets in suitable repositories
- Provide instructions for installation, testing, and use
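The "test results against validated data" strategy can be sketched in a few lines. The reference and generated records below are hypothetical; in practice the reference set would be a small sample of hand-checked georeferences.

```python
# Sketch: compare LLM-generated coordinates against a small,
# hand-validated reference set. All records here are hypothetical.
reference = {"Springfield Township, NJ": (40.7260, -74.3505)}
generated = {"Springfield Township, NJ": (40.7259, -74.3506)}

TOLERANCE = 0.01  # degrees, roughly 1 km at this latitude

def within_tolerance(ref, gen, tol=TOLERANCE):
    # Accept the generated point if both coordinates fall within
    # the tolerance of the reference point.
    return abs(ref[0] - gen[0]) <= tol and abs(ref[1] - gen[1]) <= tol

for locality, ref_coords in reference.items():
    ok = within_tolerance(ref_coords, generated[locality])
    print(locality, "OK" if ok else "MISMATCH")
```

A degree-based tolerance is a crude first pass; a proper check would use a true distance calculation and the record's stated uncertainty.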
Content from Use Case: Georeferencing
Overview
Questions
- What is georeferencing?
- What are some challenges with georeferencing?
- How can LLMs be used to address these challenges?
Objectives
- Describe georeferencing
Georeferencing is the process of converting textual locality data to coordinates. It is a critical workflow in long-lived natural history collections where the majority of specimens were collected before GPS existed. Researchers often filter by locality when searching for specimens, and specimens without coordinates may not be used as frequently. However, georeferencing is a time-consuming process, and guidelines are complex. Despite the availability of several map-based tools designed to speed up georeferencing (like GEOLocate and GeoPick), only x% of NMNH specimen records have been georeferenced as of 2025.
Some researchers have suggested that LLMs can be used to georeference collections at scale. Recent work has found that LLM approaches can be as accurate as manual georeferences at a much lower cost in both money and time.
LLMs and digitization
This workflow starts from specimens that have already been added to a database. However, a large fraction of the NMNH collection has not been digitized at all. LLMs are a promising tool for digitization efforts. For example, many models are adept at reading and extracting data from handwritten text, even cursive.
The remainder of this lesson will use a chat-based approach to georeference locality strings with the goal of adding the coordinates in the primary specimen database. (A real-world implementation would use an application programming interface, or API, which would allow us to query the LLM programmatically, but we’ve opted to use the more familiar chat interface here.) This approach has several things to recommend it:
- Georeferencing is time consuming and often tedious
- Recent work suggests that LLM georeferences are comparable to those done by people
- Geospatial data is relatively easy to validate to a degree where we can be confident that a point is at least in the right area
- Both the NMNH collection database and common data standards provide dedicated fields to annotate georeferences
But it also raises some red flags.
Challenge
What are some risks associated with using an LLM for this workflow?
The primary risk is recording/publishing inaccurate data. Some sources of error may include:
- Gross errors in coordinates resulting from confabulations or other similar errors. These may include fabricated coordinates and unreasonable uncertainty radii.
- Inconsistent results in response to prompts from the same locality
- Researchers may be hesitant to use LLM-generated data
More generally, risks might include running insecure code.
Challenge
How can these risks be mitigated?
- Annotate georeferences so that users know that AI has been used. Be aware that:
- Researchers accessing data in bulk may not bother to read those annotations
- Researchers may omit AI georeferences if they are clearly flagged
- Validate the georeferences, for example, by comparing them to manual georeferences, checking the coordinates against administrative boundaries, and checking the consistency of coordinates returned by similar prompts.
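Comparing an AI georeference to a manual one boils down to a distance calculation. A minimal sketch using the haversine formula (the coordinates below are illustrative):

```python
# Sketch: measure the distance between an AI-generated georeference
# and a manual one using the haversine formula.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters on a spherical Earth (R = 6371 km).
    r = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

ai = (40.725962, -74.350546)   # LLM georeference (illustrative)
manual = (40.7266, -74.3495)   # manual georeference (illustrative)
dist = haversine_m(*ai, *manual)
print(f"{dist:.0f} m apart")
```

If the separation is small relative to the record's coordinateUncertaintyInMeters, the two georeferences are effectively in agreement.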
Content from Prompting
Overview
Questions
Objectives
A prompt is a series of instructions submitted to an LLM to generate a response. Prompt engineering is the creation of natural-language prompts that coax the desired output from an LLM. The most familiar way to submit a prompt to an LLM is via a chat interface, which proceeds as a conversation.
We will be working from the Etherpad and an LLM chat for this part of the lesson.
Consider the locality string Springfield Township, NJ.
Prompt the LLM to georeference this locality as follows:
Please provide coordinates for Springfield Township, NJ
Copy the response into the Etherpad for this lesson.
Challenge
Read over the responses in the Etherpad. What do you notice about them?
Here are some common attributes of responses:
- Multiple localities match this string. Most responses include two localities, but some may list three or more.
- By default, the response is given as text, which makes it challenging to extract coordinates.
- Different responses may contain different coordinates.
Are there any outlier responses?
The initial response is promising but difficult to evaluate quantitatively. We can update the prompt to produce output that is easier for a script to parse.
Please provide coordinates for Springfield Township, NJ. Include all localities matching this locality string. Output coordinates as JSON including the following keys for each match: country, stateProvince, county, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, georeferenceRemarks, and sourceURL. The response should only include the JSON.
Darwin Core
The field names used in this prompt are mostly from Darwin Core, a widely used natural history data standard. NMNH uses Darwin Core to share its data with aggregators like the Global Biodiversity Information Facility (GBIF).
We can also supply structured data:
Please provide coordinates for the following locality: stateProvince: NJ municipality: Springfield Township locality: 100 m SW of Hobart Ave and Beacon Rd Include all localities matching this locality string. Output coordinates as JSON including the following keys for each match: country, stateProvince, county, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, georeferenceRemarks, and sourceURL. The response should only include the JSON object.
Create a file called coords.json. Open it with VS Code or another text editor. Copy the JSON from the last response into the file and save. Please also copy the JSON to the Etherpad.
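Before loading LLM output like this into a specimen database, it is worth a quick sanity check. The sketch below assumes the structure requested in the prompt (a JSON list of match objects with the Darwin Core-style keys named above); the sample record is illustrative.

```python
# Sketch: sanity-check the JSON returned by the LLM before using it.
import json

REQUIRED = {"country", "stateProvince", "county", "decimalLatitude",
            "decimalLongitude", "geodeticDatum",
            "coordinateUncertaintyInMeters", "georeferenceRemarks",
            "sourceURL"}

def check_matches(text: str) -> list[dict]:
    matches = json.loads(text)
    if isinstance(matches, dict):  # tolerate a single object
        matches = [matches]
    for m in matches:
        missing = REQUIRED - m.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if not (-90 <= m["decimalLatitude"] <= 90
                and -180 <= m["decimalLongitude"] <= 180):
            raise ValueError("coordinates out of range")
    return matches

sample = ('[{"country": "United States", "stateProvince": "New Jersey", '
          '"county": "Union", "decimalLatitude": 40.725962, '
          '"decimalLongitude": -74.350546, "geodeticDatum": "WGS84", '
          '"coordinateUncertaintyInMeters": 120, '
          '"georeferenceRemarks": "", "sourceURL": ""}]')
print(len(check_matches(sample)))  # prints "1"
```

A check like this catches malformed output and gross coordinate errors, but not plausible-looking hallucinations; those require the validation strategies discussed earlier.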
Content from Assisted Coding
Overview
Questions
Objectives
- Understand the difference between traditional coding, assisted coding, and vibe coding
- Use an LLM to create a Python script to map coordinates to counties
- Introduce the geopandas package and geospatial concepts
- Read through generated code to understand how it works
This lesson will use LLM-assisted coding to create a Python script
that we can use to assess the coordinates in the
coords.json file.
Coding styles
- Traditional coding: Low trust. Coder writes out their code manually, referring to documentation or forums if they get stuck.
- Assisted coding: Medium trust. Coder consults with an LLM to write blocks of code that the coder then reviews and integrates into the larger application. They can ask the LLM follow-up questions to better understand the code and test blocks as they go to ensure that code is working correctly.
- Vibe coding: High trust, resource intensive. Coder relies on the LLM to write most or all of their code, even an entire application. They review output and provide the LLM with additional prompts to modify functionality but mostly do not touch the code itself.
Using LLMs shifts the focus of the coder from writing code to reading and testing code. LLMs can be very useful for understanding what code is doing, but be careful: some studies suggest that over-reliance on LLMs may reduce persistence and independent performance. Working through problems is critical to learning how to write and read code.
Code generated using assisted methods must be vetted before being run. Risks of running unvalidated code include:
- Accidental deletion of files. Functions like os.unlink() or shutil.rmtree() can delete files or entire directories. Opening a file in write mode will delete its contents.
- Cybersquatting attacks. Generated code may include hallucinated package names, which can be used by adversarial actors to install malicious software in an attack known as slopsquatting.
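One lightweight mitigation is a static scan of generated code for destructive calls before running it. The sketch below uses Python's ast module; it is a coarse first pass, not a security guarantee (for example, it misses aliased imports).

```python
# Sketch: flag potentially destructive calls in generated code
# before executing it. A coarse static check, not a sandbox.
import ast

RISKY = {("os", "unlink"), ("os", "remove"), ("shutil", "rmtree")}

def risky_calls(source: str) -> list[str]:
    found = []
    for node in ast.walk(ast.parse(source)):
        # Match calls of the form module.function(...), e.g. shutil.rmtree(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and (node.func.value.id, node.func.attr) in RISKY):
            found.append(f"{node.func.value.id}.{node.func.attr}")
    return found

print(risky_calls("import shutil\nshutil.rmtree('data')"))  # prints "['shutil.rmtree']"
```

Running generated code in a scratch directory or container is a more robust complement to a scan like this.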
Earlier in the lesson, we considered some ways we might vet coordinates returned by an LLM. Possibilities included:
- Using a map to check each set of coordinates
- Comparing coordinates to existing specimens with similar locality information
- Checking whether the coordinates fall in the expected administrative division
We will work on the third option here.
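At its core, the county check is a point-in-polygon test. Libraries like geopandas handle this robustly against real administrative boundaries; the ray-casting sketch below just shows the underlying idea, using a square as a stand-in for a county polygon.

```python
# Sketch: ray-casting point-in-polygon test. Real workflows use
# geopandas against actual county boundaries; the square below is
# a stand-in polygon for illustration.
def point_in_polygon(lon, lat, polygon):
    # Cast a ray east from the point and count edge crossings;
    # an odd count means the point is inside the polygon.
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A 1-degree square covering part of northern New Jersey.
square = [(-75.0, 40.0), (-74.0, 40.0), (-74.0, 41.0), (-75.0, 41.0)]
print(point_in_polygon(-74.35, 40.73, square))  # prints "True"
```

Real county boundaries are far more complex than this square, which is why the generated code in the next challenge leans on geospatial libraries rather than hand-rolled geometry.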
Challenge
Prompt the LLM to write Python code to determine which US county a set of coordinates is in, then answer the following questions:
- Can you follow the code returned by the LLM?
- What concepts are unfamiliar to you?
- How can we improve the prompt?
Remember: You can use the LLM itself to ask about unfamiliar concepts.
Concepts that commonly occur in the code returned for this prompt but that are not covered in the Python lesson include:
- Python objects like classes, functions, and __main__
- Geospatial concepts like shapefiles, coordinate reference systems, spatial indexes, and spatial joins
- External libraries like geopandas, shapely, and pyogrio
Because generative AI is non-deterministic, this list is not comprehensive.
How can we improve this prompt?
Introduction to geopandas
geopandas is a geospatial library based on
pandas. It allows us to draw maps and perform geospatial
analyses (like calculating distances and areas) using similar syntax to
pandas.
Geospatial analysis is an enormous topic. This overview will be limited to concepts that are likely to appear in the generated code.
Let’s load the JSON file we created in the previous lesson. First
we’ll use the read_json() method to load the JSON file as a
DataFrame:
OUTPUT
| | country | stateProvince | county | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | georeferenceRemarks | sourceURL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | New Jersey | Union | 40.725962 | -74.350546 | WGS84 | 120 | Locality described as 100 m southwest of the i… | https://www.bing.com/maps?cp=40.726597~-74.349… |
Now we’ll create a GeoDataFrame from the
DataFrame:
PYTHON
geodf = gpd.GeoDataFrame(
df,
geometry=gpd.points_from_xy(df["decimalLongitude"], df["decimalLatitude"]),
crs=4326,
)
geodf
OUTPUT
| | country | stateProvince | county | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | georeferenceRemarks | sourceURL | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | New Jersey | Union | 40.725962 | -74.350546 | WGS84 | 120 | Locality described as 100 m southwest of the i… | https://www.bing.com/maps?cp=40.726597~-74.349… | POINT (-74.35055 40.72596) |
A coordinate reference system (CRS) is used to measure locations on or near the Earth’s surface. Components of a spatial reference include:
- An ellipsoid that approximates the shape of the Earth
- A point of origin (for example, the Prime Meridian)
- A unit (typically degrees or meters)
- Axes and order
Different CRS are suited to different tasks. Some are worldwide, while others are optimized for specific regions. Common CRS include:
- WGS84 (EPSG:4326) (worldwide, used by GPS)
- NAD83 (EPSG:XXXX) (North America)
The main thing to know here is that the CRS must be the same
when comparing datasets. Changing from one coordinate system to
another is referred to as projection. Use the to_crs()
method to project a GeoDataFrame to another CRS. There are
many ways to specify the new CRS, but the easiest is by EPSG code:
"epsg:4326" or 4326: