Chat JFK
The most recent 2025 release of the National Archives JFK Assassination Records had me wondering what the entire released collection looked like, so I found the JFK Assassination Bulk Download Files page and took a peek at the 35 or so ZIP files going back to July of 2017.
I had browsed through some of the individual PDF documents back in 2024, and had recently converted some PDF documents to Markdown for use in indices for vector search by LLM agents. Now my mind raced: what could we learn if we had an LLM searching through this content and taking notes?
Naturally, I arrived at combining these JFK files with the PDF conversion stack I had tried out, for great research justice. You can learn more by reading on.
Goal
My goal derived mainly from educational value and discovering interesting connections. To that end, I would retrieve the public domain data from the official source and transform (with the highest possible fidelity) the PDF data into Markdown formatted text data.
I could then use the textual data with LLM agentic workflows or maybe fine tune a new model and call it Chat JFK.
Process
My modest lab for this project consisted of a single Nvidia RTX 3090 Founders Edition GPU with 24GB VRAM combined with some open source software.
A range of open source projects exist that use vision models to assist with converting binary document formats to text format, which in turn becomes easier for an LLM agent to consume for other use cases.
I set out to try these solutions, and find the best fit for the PDF file data I needed to process.
PDF data
I tried a handful of projects, from Docling to Markitdown, but found that for this dataset, the MinerU project provided the most accuracy and capabilities for my lab environment.
I used a single server to fetch the ZIP files, extract them, run MinerU over them, and save the outputs. After splitting some folders to stay under the Hugging Face recommendation of fewer than 10000 entries per folder, I ended up with these folders.
jfk-pdf1 jfk2023c jfk201804c jfk20171103
jfk-pdf2 jfk2023d jfk201804d jfk20171109-1
jfk-pdf3 jfk2023e jfk201804e jfk20171109-2
jfk2021 jfk2023f jfk201804f jfk20171215a
jfk2022-1 jfk2025a jfk201804g jfk20171215b
jfk2022-2 jfk2025b jfk201804h jfk20171215c
jfk2023a jfk201804a jfk201804i jfk2017111710-1
jfk2023b jfk201804b jfk201804j jfk2017111710-2
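The folder splitting described above can be sketched as follows. This is a minimal illustration, not the script used for the dataset: the helper names `plan_splits` and `split_folder`, the `MAX_ENTRIES` constant, and the `-1`, `-2` suffix scheme (modeled on names like jfk2022-1 and jfk2022-2) are all assumptions.

```python
from pathlib import Path
import shutil

MAX_ENTRIES = 9_999  # assumption: stay strictly below 10000 entries per folder


def plan_splits(names, max_entries=MAX_ENTRIES):
    """Group sorted entry names into consecutive chunks of at most max_entries."""
    ordered = sorted(names)
    return [ordered[i:i + max_entries] for i in range(0, len(ordered), max_entries)]


def split_folder(src: Path, max_entries: int = MAX_ENTRIES) -> list[Path]:
    """Move the entries of src into sibling folders src-1, src-2, ... if src is too large."""
    names = [p.name for p in src.iterdir()]
    if len(names) <= max_entries:
        return [src]  # already small enough, leave it alone
    targets = []
    for index, chunk in enumerate(plan_splits(names, max_entries), start=1):
        target = src.with_name(f"{src.name}-{index}")  # hypothetical naming scheme
        target.mkdir(exist_ok=True)
        for name in chunk:
            shutil.move(str(src / name), str(target / name))
        targets.append(target)
    return targets
```

Calling `split_folder(Path("jfk2022"))` would leave a small folder untouched and otherwise move its sub-folders into `jfk2022-1`, `jfk2022-2`, and so on. The now empty source folder is left in place for manual cleanup.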
Within each folder exist sub-folders named for the converted PDF file. For
example, the first sub-folder in jfk-pdf1 is 104-10162-10011, named for the
original 104-10162-10011.pdf file.
If we peek within that folder with tree:
tree -L 1 jfk-pdf1/104-10162-10011/hybrid_auto
We find a hybrid_auto folder that holds all the artifacts generated by a
MinerU conversion.
jfk-pdf1/104-10162-10011/hybrid_auto
├── 104-10162-10011_content_list.json # Content list metadata
├── 104-10162-10011_content_list_v2.json # Version 2 content list additions
├── 104-10162-10011_layout.pdf # Layout analysis highlighted PDF
├── 104-10162-10011.md # Markdown of text content
├── 104-10162-10011_middle.json # Intermediate processing results
├── 104-10162-10011_model.json # Model inference results
├── 104-10162-10011_origin.pdf # Original PDF
└── images
The layout analysis PDF shows the highlighting and identification performed by the vision model.
This fragment directly from 104-10162-10011_layout.pdf shows green and red
highlighting of layout along with numerical annotation of text elements.

The images folder holds any image areas of the PDF saved as JPG files.
Unusual files
One thing about data science and data wrangling is that the science can at times be wonky, and in this case a perfect capture of high fidelity information from the source was unfortunately not possible. I made sure to note in the dataset card that the project does not assure 100% coverage of the source material.
Here are some anecdotal examples of a breakdown in the process which resulted in missing at least these 8 files:
- 104-10072-10314.pdf
- 104-10077-10401.pdf
- 104-10120-10100.pdf
- 104-10142-10133.pdf
- 104-10172-10167.pdf
- 176-10036-10101.pdf
- 104-10177-10088.pdf
- 104-10171-10131.pdf
Oddly enough, MinerU refuses to process these files at all, with an error like this example:
Error: No supported documents found under 104-10072-10314.pdf
Some of them appear fine when viewed manually, but further investigation with PDF tools might shed light on the issue.
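As a first step for that kind of investigation, a quick structural check needs nothing beyond the Python standard library. The sketch below is only an assumption about where to start looking. It does not reproduce how MinerU decides what counts as a supported document, and the function name `pdf_quick_check` is hypothetical.

```python
from pathlib import Path


def pdf_quick_check(path: Path) -> dict:
    """Report basic structural markers that a well-formed PDF normally has."""
    data = path.read_bytes()
    return {
        "size_bytes": len(data),
        # The PDF header should appear near the very start of the file.
        "has_header": b"%PDF-" in data[:1024],
        # The end-of-file marker should appear near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # An /Encrypt entry suggests the file uses PDF encryption.
        "mentions_encrypt": b"/Encrypt" in data,
    }
```

Running it over the eight files listed above and comparing the results with a file that converted cleanly might narrow down whether the problem lies in the file structure or somewhere else.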
Then there was audio
The complete release includes 12.8 GB of zipped WAV audio across 6 files,
named jfk-wav1.zip through jfk-wav6.zip.
At first, I had interest in using the state of the art in open speech-to-text models to take a shot at transcribing the audio files to augment the dataset.
The tests ran with English as the specified language. I tried both Moonshine and Whisper Large-V3 on a file and the results sort of spoke for themselves.
They spoke in copious amounts of at times heavily obscenity-laced and totally incorrect transcription. Some repetition and overall hallucination occurred as well.
Then I tried actually listening to a couple files. While I expected pretty bad quality, the actual quality borders on worse than bad. At times a 60 Hz hum rides along with what turns out to be mixtures of English, Spanish, and Russian speech degraded to such poor quality that I could hardly make out any words myself.
These audio files would need significant wizardry to de-noise and clarify the sound to the point that one could listen and decipher the audio content long before asking the model to do so.
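Of the problems described above, only the 60 Hz hum has a textbook remedy: a narrow notch filter. The sketch below implements a standard biquad notch in pure Python, assuming the audio has already been decoded into a list of float samples. It would do nothing for the degraded speech itself, hum often carries harmonics at 120 Hz and above that need their own notches, and the function names and the `q` value of 30 are illustrative assumptions.

```python
import math


def notch_coefficients(freq_hz: float, sample_rate: float, q: float = 30.0):
    """Return normalized biquad (b, a) coefficients for a notch at freq_hz."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def apply_biquad(samples, b, a):
    """Filter samples with a direct form I biquad."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def remove_hum(samples, sample_rate: float, hum_hz: float = 60.0):
    """Attenuate a narrow band around hum_hz and leave other frequencies mostly intact."""
    b, a = notch_coefficients(hum_hz, sample_rate)
    return apply_biquad(samples, b, a)
```

A higher `q` makes the notch narrower, which removes less of the surrounding speech but takes longer to settle at the start of the recording.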
An experiment for another day? Yes, but not today and not for this project.
2 random MP3s please
Some of the ZIP files contain MP3 audio files mixed with the PDF files.
- jfk201804i.zip
  - 233-jfk-000443-tape1-s1.mp3
  - 233-jfk-000443-tape2-s2.mp3
The files contain usable audio recorded in 1976, and seem good to transcribe with a speech-to-text (STT) model. I gave that a go, and discovered that these models produce close but still imperfect results.
Check out this brief example output from Moonshine streaming-small:
[00:02:20] And?
[00:02:21] And which?
[00:02:23] How long did you live?
[00:02:25] For the company.
[00:02:26] About 15 years.
[00:02:28] All right, sir.
[00:02:30] And in 1963, the first time I saw him,
[00:02:33] Get with your booty station.
[00:02:36] Well, I came here from Washington, D.C.
[00:02:39] Hmm.
Hmm, indeed. “Get with your booty station.” got incorrectly transcribed from “what was your duty station?”
Even these higher quality audio recordings presented some challenges for the open STT models.
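One way to put a number on a miss like the duty station line is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation, not something taken from Moonshine or Whisper tooling, and the function names are assumptions.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and keep only word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)


# Three of the five reference words were substituted, so this prints 0.6.
print(word_error_rate("what was your duty station?", "Get with your booty station."))
```

Scoring a whole recording this way requires a trusted reference transcript, which for these tapes means a human listening and typing.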
This example output from Whisper Large-V3 is far more accurate:
All right.
And that is of the United States government.
Oh, definitely.
All right.
And at which, how long did you work for the company?
About 15 years.
All right, sir.
And in 1963, what was your duty station?
Well, I came here from Washington, D.C.
It must have been the 6th of June, if I'm not mistaken.
NOTE: The transcribed audio from the MP3 files does not presently appear
in the complete dataset. You can check the included SHASUMS.txt for every
included file and its SHA-256 hash.
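Checking local files against SHASUMS.txt can be sketched with `hashlib`. This assumes the file uses the common `sha256sum` line format of a hex digest followed by a path. That format is an assumption about SHASUMS.txt, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_shasums(text: str) -> dict[str, str]:
    """Map relative path -> expected hex digest (assumed sha256sum-style lines)."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading '*' before the file name.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def find_mismatches(root: Path, shasums: Path) -> list[str]:
    """Return paths that are missing or whose digest differs from the list."""
    bad = []
    for name, digest in parse_shasums(shasums.read_text()).items():
        target = root / name
        if not target.is_file() or sha256_of(target) != digest:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches(Path("."), Path("SHASUMS.txt"))` would mean every listed file exists and matches its recorded hash.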
Results
After a week and about a dozen runs on the PDF data, I had gradually processed all the files, including one false start where I failed to keep all the MinerU output. While you might feel tempted to just grab the Markdown files and run, they reference images extracted from the PDF, and the other JSON and PDF data has value for other LLM or ML use cases.
Total post-processing dataset:
- Files: 1689962
- Size: 227 GB
I strongly encourage you to preserve all output, and keep in mind that MinerU even keeps (with slight renaming) the original PDF, so you don’t really need to worry about the inputs hanging around.
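To see why the Markdown depends on the rest of the output, you can list the image links in a converted file and check whether their targets still exist. The sketch below assumes the Markdown uses standard inline image syntax with paths relative to the Markdown file, such as a link into the images folder. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches ![alt](target) and captures the target up to whitespace or ')'.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def image_references(markdown: str) -> list[str]:
    """Return every image target referenced in the Markdown text."""
    return IMAGE_LINK.findall(markdown)


def missing_images(markdown_file: Path) -> list[str]:
    """Return relative image targets that do not exist next to the Markdown file."""
    base = markdown_file.parent
    text = markdown_file.read_text(encoding="utf-8")
    return [
        ref for ref in image_references(text)
        if not ref.startswith(("http://", "https://")) and not (base / ref).is_file()
    ]
```

Any non-empty result from `missing_images` on a copied Markdown file signals that the images folder did not travel with it.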
A wild new Hugging Face dataset appears
After my experiment to convert these files, I put a copy of just the Markdown files on Hugging Face as the Chat JFK dataset.
Final Markdown dataset:
- Files: 71709
- Size: 1.56 GB
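Pulling only the Markdown out of the full MinerU output tree can be sketched as below. It assumes the batch/document/hybrid_auto layout shown earlier and writes a flatter batch/document.md layout, which matches the way the agent output later in this post cites files. The function name `collect_markdown` is hypothetical, and this is not the script used to publish the dataset.

```python
from pathlib import Path
import shutil


def collect_markdown(full_root: Path, out_root: Path) -> int:
    """Copy <batch>/<doc>/hybrid_auto/<doc>.md to <out_root>/<batch>/<doc>.md."""
    copied = 0
    for md in sorted(full_root.glob("*/*/hybrid_auto/*.md")):
        batch = md.relative_to(full_root).parts[0]  # e.g. "jfk-pdf1"
        target = out_root / batch / md.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(md, target)
        copied += 1
    return copied
```

The return value gives a quick count to compare against the number of converted documents.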
Check out the dataset card for more details.
The dataset represents raw data without any clean up whatsoever.
While the accuracy of the transcription of text from the PDF files to Markdown comes close to an accurate reproduction in some cases, some examples of hallucination most certainly appear in the dataset.
Always refer to the original document to confirm content before using it for anything important.
Use the dataset in creative ways
Build a Query Markup Documents index and search it
If you have a machine with at least 16GB, you can try these simple experiments with the fantastic on-device search from Query Markup Documents (QMD).
After you get the dataset, you can point qmd at it as a data source for a
collection that you can then search by name in fun ways with your favorite
LLM agent.
This example uses the Pi coding agent and the Gemma-4 31B IT model served by llama.cpp on my single 3090.
PROMPT:
Research the jfk collection for locations in Florida and generate a compact
summary of the significant locations and events or persons associated with
them. Write the summary as Markdown in the file JFK-FLORIDA.md and keep it
under 250 lines total.
RESPONSE:
The research into the JFK collection for locations in Florida has been
completed. Significant findings include:
- Tampa: Centered around Santo Trafficante Jr., a powerful LCN figure who had
connections to the Marcello brothers and played a role in CIA plots
against Fidel Castro.
- Miami: A hub for Havana-based activity and the site of a major 1967 incident
at Tamiami Airport, where MIRR leader Orlando Bosch's associates were
arrested with bombs and napalm intended for ships trading with Cuba.
- Jacksonville: The base for the United Florida Ku Klux Klan (UFKKK), which was
targeted by FBI counterintelligence efforts to remove leadership like
Boyd Hamby.
- Ft. Lauderdale: Associated with Jimmy Fazio and his Fireside Steak House,
as well as LCN-linked activity at nightclubs like Bachelors III.
The summary has been written to JFK-FLORIDA.md.
The complete JFK-FLORIDA.md content:
Significant Locations and Events in Florida (JFK Collection)
Tampa
- Santo Trafficante Jr.: Born in Tampa; head of the La Cosa Nostra (LCN) family in the Tampa area with control over various criminal activities across Florida.
- Marcello Connection: Carlos and Joseph Marcello (from New Orleans) visited Santo Trafficante at his Tampa home in early June 1966.
- CIA Operations: Santo Trafficante is noted as having played an “accompanying role” in the CIA-inspired plots to assassinate Fidel Castro.
Miami
- Tamiami Airport:
- On January 15, 1967, Bureau of Customs agents arrested three men preparing to depart with bombs, napalm, and small arms.
- The event is associated with Orlando Bosch, leader of the MIRR (Movimiento Insurreccional Revolucionario), and a conspiracy to bomb ships trading with Cuba.
- Havana Connection: Miami served as a primary hub for individuals traveling between Havana, Cuba, and the U.S., including counterrevolutionary contacts and intelligence assets.
- Informants: Notable informants based in Miami include William Eisenberg (MM T-2) and Joseph Merola (MM T-3).
Jacksonville
- United Florida Ku Klux Klan (UFKKK): Jacksonville was a primary base for the UFKKK.
- FBI Counterintelligence: The FBI utilized informants within the Klan to destabilize the organization, including efforts to effect the removal of Boyd Hamby as Grand Dragon of the United Klans of America, Realm of Florida.
- Other Regional Activity: Related Klan activity and meetings were recorded in Orlando, Plant City, Sebring, Sebastian, and Lakeland (e.g., Forrest Club #11).
Ft. Lauderdale & East Coast
- Jimmy Fazio: Operated the Fireside Steak House in Ft. Lauderdale. In December 1962, he was informed by a source that “New York ‘boys’” had been seen in Cocoa Beach.
- Bachelors III Nightclub: Located in Ft. Lauderdale; associated with Bobby Vannuchi and Joe Namath, and site of violence involving LCN figure Jimmy Capatorto.
Assessment: Aside from the copious passive voice and other writing style rule violations, the agent created pretty decent and educationally informative content that could further improve through more rigorous prompt and skill tuning and through the use of more powerful models.
Though one could presume that the accuracy of the resulting content exceeds what a zero shot prompt without the search context would produce, one should remain aware that this perceived accuracy also depends entirely on the quality of the conversion process from the source PDF files.
Let’s try another prompt about a specific individual.
PROMPT:
Research the jfk collection for all references to Frank STURGIS. Write the
summary as Markdown in the file STURGIS.md and reference each document
filename in the summary.
RESPONSE:
I have completed the research on Frank Sturgis within the JFK collection.
The findings were summarized in STURGIS.md. Key highlights include:
- Background: A former US Marine (First Raider Battalion) and multi-branch
veteran trained in unconventional warfare, piloting, and espionage.
- Intelligence Work: Operated as a triple agent; recruited by the State
Department/CIA to infiltrate Castro's inner circle. He served as
security chief for the Cuban Air Force and managed intelligence networks.
- Anti-Castro Ops: Participated in an anti-Castro leaflet raid over Havana
(1959) and proposed several assassination plots against Fidel Castro,
including a crossfire ambush at Campo Libertad.
- Controversies: FBI records describe him as a "braggart" and "liar," noting
that some of his paramilitary organizations were mere fronts for personal
gain. He later became infamous as a Watergate burglar.
- Strategic Impact: Recruited agents like Marie Lorenz, whose theft of maps
from Castro's office reportedly assisted the U.S. in locating Soviet missile
sites during the Cuban Missile Crisis.
The summary includes references to all relevant documents found during
the search.
The complete STURGIS.md content:
Frank Sturgis (Frank Fiorini)
Personal Background
Frank Anthony Sturgis, born in Norfolk, Virginia, also known as Frank Fiorini, Fred Frank Fiorini, Frank Attila, Fred Attila, and Frank Campbell. He had an extensive military background, serving in the U.S. Marine Corps during World War II (First Raider Battalion), and later in the Air Force, Army, and Navy Reserve. His training included judo, knife-fighting, jungle warfare, piloting aircraft, and operating small, fast ships.
Intelligence Career and CIA Involvement
Sturgis operated as a “triple agent” during the Cold War:
- Early Cuban Contact: Initially joined Fidel Castro’s army in 1957 to spy for Carlos Prio; he became a gunrunner for Castro, purchasing weapons from International Armament Corp (Interarmco).
- CIA Recruitment: He was recruited by Park F. Woolam of the State Department in Santiago de Cuba and subsequently worked for the CIA.
- Roles in Cuba: Served as security and intelligence chief for the Cuban Air Force under Major Pedro Diaz Lanz and supervised gambling casinos owned by American mobsters.
- Espionage Network: Sturgis claimed to have constructed an espionage network in Cuba that remained functional for years. He also recruited other agents, including Marie Lorenz, who stole secret documents and maps from Castro’s headquarters—information that reportedly helped the U.S. identify Soviet missile sites during the Cuban Missile Crisis.
Anti-Castro Activities and Assassination Plots
Sturgis was deeply involved in plots to overthrow foreign governments (including Cuba, Panama, Guatemala, Dominican Republic, and Haiti) and specifically targeted Fidel Castro:
- Assassination Proposals: He claimed that as early as February 1959, he suggested the assassination of Castro to CIA contacts. Proposed methods included a crossfire ambush at Campo Libertad and planting explosives in Castro’s office.
- Leaflet Raid: On October 21, 1959, Sturgis served as co-pilot for Pedro Diaz Lanz during an anti-Castro leaflet dropping raid over Havana.
- Paramilitary Groups: He founded the “International Brigade” and the “International Anti-Communist Brigade” (IACB). However, FBI investigations described him as a “braggart” and a “liar,” characterizing the IACB as a “paper organization” used by Sturgis for personal financial gain.
Legal and Public Notoriety
- Citizenship: In 1960, he faced expatriation hearings due to his service in the Cuban armed forces; his U.S. citizenship was restored in March 1961.
- Watergate: He later gained widespread notoriety as one of the burglars in the Watergate break-in.
- Public Claims: In a 1975 interview with the Miami Herald, Sturgis admitted to being a CIA operative from 1959 to 1968 and discussed his involvement in numerous assassination plots and conspiracies.
Referenced Documents
- jfk20171109-1/104-10096-10131.md (FBI background, aliases, and early activities)
- jfk2022-2/104-10221-10314.md (Miami Herald interview regarding CIA and assassination plots)
- jfk2025b/135-10001-10288.md (JFK Assassination System reference)
- jfk2022-2/docid-32105623.md (Agency file regarding testimony and tapes)
- jfk2023e/104-10221-10039.md (Personnel file starting Jan ‘69)
- jfk201804f/104-10256-10286.md (FBI report on contact via State Department)
- jfk20171109-1/104-10048-10438.md (Identification and birth records)
- jfk20171109-2/104-10221-10189.md (DOJ reference to aliases)
For one last example and idea with qmd again doing the search, a prompt
for “what happened on this day?” style retrieval of events.
PROMPT:
Research the JFK collection for recorded events that relate to the date
June 17 and summarize what you find with document references in Markdown
format, then save in the file JUNE17.md
RESPONSE:
JFK Collection Research: June 17
Research into the JFK collection identified the following recorded events relating to the date June 17:
June 17, 1963
- Gambling Raid Legal Proceedings: On the morning of June 17, 1963, all principals arrested in a gambling raid appeared before Trial Magistrate William Lokar of St. Marys County, Maryland. The hearings were subsequently postponed until the following morning, June 18.
- Grand Jury Reconvened: On the same day (June 17, 1963), Circuit Court Judge Phillip H. Dorsey reconvened the Grand Jury for the Circuit Court of St. Marys County, Maryland, to investigate all aspects of the gambling matter.
Document Reference:
qmd://jfk/jfk2017111710-2/docid-32331412.md (DocID #332745)
June 17, 1972
- AMSNAP-3 Relations: A memorandum authored by Clement E. Chen reviewed “Significant Factors Covering Station Relations with AMSNAP-3” specifically in the light of events that occurred on June 17, 1972.
Document Reference:
qmd://jfk/jfk2022-1/104-10096-10131.md (DocID #65389e)
The output lists the PDF document IDs so you can cross-reference them with the original PDF files.
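Mapping one of these qmd:// references back to the preserved original PDF can be sketched as follows. It assumes the reference has the form qmd://collection/batch/document.md, as in the output above, and that the full MinerU output tree with its `_origin.pdf` files is available locally. The function name `origin_pdf_for` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def origin_pdf_for(reference: str) -> PurePosixPath:
    """Turn qmd://<collection>/<batch>/<doc>.md into the relative path of the origin PDF."""
    parsed = urlparse(reference)
    if parsed.scheme != "qmd":
        raise ValueError(f"not a qmd reference: {reference}")
    # The path part looks like /<batch>/<doc>.md once the collection name is split off.
    batch, filename = PurePosixPath(parsed.path.lstrip("/")).parts
    doc_id = PurePosixPath(filename).stem
    return PurePosixPath(batch, doc_id, "hybrid_auto", f"{doc_id}_origin.pdf")


print(origin_pdf_for("qmd://jfk/jfk2017111710-2/docid-32331412.md"))
```

The printed relative path can then be joined to wherever the full output tree lives on disk.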
Summary
Rapid advances in local model and tooling technologies enable a surprising range of capabilities for even casual researchers or data scientists to examine the historical record with tools we could not even imagine just a decade ago.
Running such experiments in a modest local lab represents a massive shift in how people can begin to use LLMs with information regardless of its source format.